Exercises & Solutions Intro2DWH
2021
by
Dr. Hermann Völlinger and Others
Content
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 1
  Exercise E1.1*: Investigate the BI-Data Trends in 2021
  Exercise E1.2*: Investigate the catchwords: DWH, BI and CRM
  Exercise E1.3*: Compare two Data Catalogue Tools
  Exercise E1.4: First Experiences with KNIME Analytics Platform
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 2
  Exercise E2.1*: Compare 3 DWH Architectures
  Exercise E2.2*: Basel II and RFID
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 3
  Exercise E3.1: Overview about 4 Database Types
  Exercise E3.2: Build Join Strategies
  Exercise E3.3: Example of a Normalization
  Exercise E3.4: Example of a Normalization
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 4
  Exercise E4.1: Create SQL Queries
  Exercise E4.2: Build SQL for a STAR Schema
  Exercise E4.3*: Advanced Study about Referential Integrity
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 5
  Exercise E5.1: Compare ER and MDDM
  Exercise E5.2*: Compare Star and SNOWFLAKE
  Exercise E5.3: Build a Logical Data Model
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 6
  Exercise E6.1: ETL: SQL Loading of a Lookup Table
  Exercise E6.2*: Discover and Prepare
  Exercise E6.3: Data Manipulation and Aggregation using KNIME Platform
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 7
  Exercise E7.1*: Compare 3 ETL Tools
  Exercise E7.2: Demo of Datastage
  Exercise E7.3: Compare ETL and ELT Approach
  Exercise E7.4: ETL: SQL Loading of a Fact Table
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 8
  Exercise E8.1: Compare MOLAP to ROLAP
Solution:
Task: Prepare a report and present it next week; duration = 30 minutes (10 minutes for each area). Information sources are newspaper or magazine articles or books (see literature list); 3 students.
Theme: Trends or new developments in the following areas (project reports are also possible):
For an explanation of these 'catchwords' see also the slides of the lesson or search on the internet.
Optional: Also give an explanation of the synonyms, such as OLAP, OLTP, ETL, ERP, EAI.
Solution:
BI – Business Intelligence:
BI is the process of analysing the accumulated raw operational data and extracting meaningful information from it, in order to be able to make better business decisions on the basis of this integrated information.
BI is when business processes are optimized on the basis of the facts gained from the Data Warehouse.
Operational CRM:
Solutions for the automation / support of processing workflows with customers (online shop, call center, …).
Analytical CRM:
Solutions that draw on information from the Data Warehouse and are based on task-specific analyses (data mining).
Collaborative CRM:
Communication component that enables the interaction with the customer.
Insights are gained through collaboration with the customer. These can then be used to optimize business processes or to personalize the customer relationship.
The term OLAP summarizes technologies, i.e. methods as well as tools, that support the ad-hoc analysis of multidimensional data. The data can come from the Data Warehouse, from Data Marts or also from operational systems.
An ETL tool is responsible for deriving cleansed and possibly aggregated information as well as additional metadata from the operational data (real-time data).
Current trends:
1) Exploding data volume
• Strongest trend
• According to Gartner, the data volume in 2004 is expected to be 30 times as high as in 1999.
• Scalability
4) More end users
• BI and DWH systems must become more accessible
• Usability: "less is more"
6) Active DWH
• Competitive pressure → data must be available quickly
• Active DWHs are closely coupled to operational systems → very up-to-date and very detailed data
8) Outsourcing
• Initially applications + data; in the future also the information storage in the DWH
Task: Select two of the Data Catalog (DC) tools from the two “Market Study - DC” slides
and prepare a report about the functionality of these tools (2 Students, next week, duration =
20 minutes).
Hint: Information source is the internet. See also links in the “Market Study –DC” slides. See
also the directory “Supporting Material” in the Moodle of this lecture [DHBW-Moodle].
Solution:
Task: Install the tool and report about your first experiences. Give answers to the following
questions:
1. What can be done with the tool?
2. What are the features for Data-Management?
3. What are the features for Analytics and Data Science?
Information sources are the KNIME homepage ("KNIME | Open for Innovation") and the three documents mentioned in lesson DW01 (see the lesson notes).
KNIME Features:
Blend & Transform:
● Access data from different sources (e.g. databases, files, etc.)
● Merging of data from different data sources (adapting data if necessary)
● Prefabricated interfaces for various DBs and DWHs
● Interfaces are extensible
● Documentation of executed steps for better traceability
First, we import data using a JSON Reader node. Since KNIME keeps the processed data of each node in the node's context, the imported dataset is now present there.
This allows the user to view each step of the workflow and to trace which node transforms the data in which way. After importing the JSON data, we tell the import node to only represent the data matching a given JSON-Path.
Task: Compare the three DWH architectures (DW only, DM only and DW & DM) on the next slide. List the advantages and disadvantages and give a detailed explanation for them. Also find a fourth possible architecture (hint: a 'virtual' DWH).
(Rating table template from the task: one row per criterion (Criteria 2, Criteria 3, ...), a rating (++ / + / - / --) for each of the architectures, plus a short explanatory text.)
Solution:
Implementation costs
The implementation of a Data Warehouse with Data Marts is the most expensive solution,
because it is necessary to build the system including connections between Data Warehouse
and its Data Marts.
It is also necessary to build a second ETL which manages the preparation of data for the
Data Marts.
In case of implementing Data Marts or a Data Warehouse only, the ETL is only implemented
once. The costs may be almost the same in building one of these systems. The Data Marts
only require a little more hardware and network connections to the data sources. But due to the fact that building the ETL is the most expensive part, these costs may be relatively low.
The virtual Data Warehouse may have the lowest implementation costs, because existing applications and infrastructure are reused.
Administration costs
The Data Warehouse only solution is the best at minimizing the administration costs, due to the centralized design of the system. In this solution it is only necessary to manage a central system. Normally the client management is no problem when using web technology or a centralized client deployment, which should be standard in all mid-size to big enterprises. A central backup can cover all data of the Data Warehouse.
The solution with Data Marts only is more expensive, because of its decentralized design. There are higher costs in case of product updates or when maintaining the online connections, and you also have to back up each Data Mart by itself, depending on its physical location.
Also the process of filling a single Data Mart is critical. Errors during an update may cause loss of data. In case of an error during an update, the system administration must react at once.
Data Marts with a central Data Warehouse are more efficient, because all necessary data is
stored in a single place. When an error during an update of a Data Mart occurs, this is
normally no problem, because the data is not lost and can be recovered directly from the
Data Warehouse. It may also be possible to recover a whole Data Mart out of the Data
Warehouse.
The administration costs of a virtual Data Warehouse depend on the quality of the implementation. Problems with connections to the online data sources may cause users to ask for support, even if the problem was caused by a broken online connection or a failure in the online data source. End users may not be able to tell whether the data source or the application on their computer causes a problem.
Performance
A virtual Data Warehouse has the poorest performance overall. All data is retrieved at runtime directly from the data sources. Before data can be used, it must be converted for presentation. Therefore, a huge amount of time is spent on retrieving and converting data.
The Data Marts host information which is already optimized for the client applications. All data is stored in an optimal state in the database. Special indexes in the databases speed up information retrieval.
Implementation Time
The implementation of a Data Warehouse with its Data Marts takes the longest time,
because complex networks and transformations must be created. Creating Data Warehouse
only or Data Marts only should take almost the same amount of time. Most of the time is normally spent on creating the ETL (about 80%), so Data Warehouse only and Data Marts only should not differ much in this respect.
Implementing a Virtual Data Warehouse can be done very fast because of its simple
structure. It is not necessary to build a central database with all connectors.
Data Consistency
When using Data Warehouse or Data Mart technology a maximum consistency of data is
achieved.
All provided information is checked for validity and consistency. A virtual Data Warehouse may have problems with data consistency because all data is retrieved at runtime. When the data organization of the sources changes, the new data may still be consistent, but older data may no longer be represented in the current model.
Flexibility
A virtual Data Warehouse has the highest flexibility. It is possible to change the data preparation process very easily because only the clients are directly involved. There are nearly no components which depend on each other.
In a Data Warehouse only solution, flexibility is poor, because there may exist different types of clients that depend on the data model of the Data Warehouse. If it were necessary to change a particular part of the data model, intensive testing for compatibility with the existing applications must be done, or the client applications even have to be updated.
A solution with Data Marts, with or without a central Data Warehouse, has medium flexibility, because client applications normally use the Data Marts as their point of information. In case of a change in the central Data Warehouse or the data sources, it is only necessary to update the process of filling the Data Marts. In case of a change in the Data Marts, only the depending client applications are involved and not all client applications.
Data Consistency
Data consistency is poor in a virtual Data Warehouse, but it also depends on the quality of the process that gathers information from the sources.
Data Warehouses and Data Marts have very good data consistency, because the information stored in their databases has been checked during the ETL process.
Quality of information
The quality of information hardly depends on the quality of the data population process (ETL
process) and how good the information is processed and filtered before stored in the
Data Warehouse or presented to a user. Therefore, it is not possible to give a concrete
statement.
History
A virtual Data Warehouse has no history at all, because the values or information are retrieved at runtime. In this architecture it is not possible to store a history because no central database is present.
The other architectures provide a central place to store this information. The history provides a basis for analysing business processes and their effects, because it is possible to compare current information with information from the past.
Theme: Give a definition (5 minutes) and describe the impact of these new trends on Data Warehousing (10 minutes):
1. Basel II
2. RFID
Solution:
Agenda (source: IBM):
▪ Why the Basel accords?
▪ 2003: Basel II is taken into the strategy of the institutions
▪ 2004: Building up the DWH infrastructure
▪ 2005: Data collection + evaluation strategy
▪ 2006: ... Parallel run of Basel I + II
▪ 2007: Basel II becomes binding
▪ Tools, outlook
Large data volumes are needed for analysis; DWHs are needed by:
– Banks → customer rating
– Rating agencies → providing rating services
– Companies → an optimal financial situation reduces credit costs
Agenda of the presentation "Basel II & DWH" (Christian Schäfer, 28.10.2005):
• Securing stability in the financial sector
• The capital accord of 1988 (Basel I)
• From Basel I to Basel II
• Reasons for Basel II
• Rating of loans according to Basel II
• The bank wants to get to know us
• Effects of Basel II
• Challenges for Data Warehouse systems
A further (third) solution on Basel II and DWH can be found in the following presentation:
Basel II (diagram of the three pillars):
1. Minimum capital requirements (credit risk, market risk, operational risk)
2. Reviews by the banking supervision
3. Market discipline through disclosure obligations
Basel I: ≥ 8% equity capital
http://www.bundesbank.de/bankenaufsicht/bankenaufsicht_basel.php
Standards:
◦ Migration of legacy data
◦ Connection of additional data sources
◦ Quality controls
http://www.it-observer.com/data-management-challenges-basel-ii-readiness.html
http://www.facebook.com/topic.php?uid=25192258947&topic=5725&_fb_noscript=1
1. Basel II
2. Internal reporting
3. Analysis and evaluations
4. External reporting
Extended standards for disclosure and review
Increased capital requirements
Liquidity requirements
◦ Real-time monitoring
http://www.finextra.com/community/fullblog.aspx?blogid=4988
Hold freely available assets of high quality, which can also be sold in times of crisis; real time -> data quality challenge
http://www.information-management.com/news/data_risk_management_Basel-10018723-1.html
http://www.pwc.lu/en/risk-management/docs/pwc-basel-III-a-risk-management-perspective.pdf
Agenda: What is RFID → Application areas → RFID & Data Warehouse → Outlook
Data model: The data structures that are available at the logical level for describing data and their relationships with each other are collectively referred to as a data model.
Structural elements (hierarchical data model):
- object types
- hierarchical, unnamed relationships (the edges have no labels)
Result: tree structure
(Diagram: Lieferant (supplier) and Bauteil (part) are in an n:m relationship; this does not work in a single tree!)
Solution:
(Diagram: two separate tree structures, Lieferant over Bauteil and Bauteil over Lieferant)
Problem solution: pairing
Deviating from the strict hierarchical data model, additional logical accesses are introduced so that n:m relationships can be represented.
(Diagram: Lieferant and Bauteil linked via pairing)
Problem: prices that are stored as attributes of the additionally introduced object types B-L and L-B are still redundant.
1. The network model
Structural elements:
- object types
- hierarchical relationships (1:mc), which are called set types
(Diagram: owner type, 1 : mc set type, member type; example: Lieferant and Bauteil)
Object types can also be in a relationship with themselves, e.g. a part (Bauteil) can be a part of another part.
(Diagram: recursive set type K on Bauteil)
Virtual: hardware independence, i.e. the file organization makes no primary reference to the physical storage organization (e.g. cylinders and tracks of the magnetic disk).
- New storage space is created by "cellular splitting" if the space is not sufficient on insertion.
These techniques are also applied here to the storage of the data records themselves (primary data); a B+ tree whose leaves are chained is used as index, so that logically sequential processing in ascending and descending key order as well as (quasi-)direct access is possible.
Project Voldemort
→ Definition:
- relational DB model, 1970, by Codd
- data is stored in tables (relations) with a fixed number of columns and a flexible number of rows
- by distributing the information over individual tables, redundancies are avoided
- with key fields, links between the tables can be created

Example table (annotations in the original: row, column, field, tuple = complete data record):
ID | Name | Alter
12 | Meier | 23
13 | Müller | 45
14 | Bauer | 34

→ In a table there are no two tuples that have the same values for all attributes.
→ Primary key
= a column of the table whose values uniquely identify each data record of the table.
The value of a primary key field of a table must not occur more than once.
Each table can have only one primary key.
It can be composed of several data fields and must not be empty.
→ Foreign key
= a column of a table whose values refer to the primary key of another table.
A table can contain several foreign keys.
It can consist of several fields of the table, and it may be empty.
For each value of a foreign key there must be a corresponding value in the primary key of the corresponding table (integrity).
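As an illustration of these primary key / foreign key rules, a small SQL sketch. Apart from ID and Name from the example table above, all table and column names are illustrative assumptions; the age column is renamed to Age because ALTER is a reserved word in SQL:

-- Table with a primary key: unique, not empty, only one per table
CREATE TABLE Person (
    ID    INTEGER     NOT NULL PRIMARY KEY,
    Name  VARCHAR(30),
    Age   INTEGER
);

-- A second (hypothetical) table whose Person_ID column is a foreign key:
-- it may be NULL and may occur several times, but every non-NULL value
-- must exist as a primary key value in Person (referential integrity)
CREATE TABLE Exam_Result (
    Exam_No    INTEGER NOT NULL PRIMARY KEY,
    Person_ID  INTEGER,
    Grade      INTEGER,
    CONSTRAINT fk_person FOREIGN KEY (Person_ID) REFERENCES Person (ID)
);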
- Projection
→ Further rules of the relational database:
- Transactions must either be carried out completely or, in case of an abort, be rolled back completely.
- The user's access to the data must be independent of how the data is stored or how it is physically accessed.
- If the database administrator changes the physical structure, the user must not notice anything.
Dependencies
a) functionally dependent
For one attribute combination of A there is exactly one attribute combination of B.
B is functionally dependent on A: A -> B
c) transitively dependent
B depends on A and C depends on B: A -> B -> C
C must not be a key attribute and must not be contained in B.
Anomalies:
Prüfungsgeschehen (un-normalized, nested):
PNR | Fach       | Prüfer  | Student: MATNR, Name, Geb, Adr, Fachbereich, Dekan, Note
3   | Elektronik | Richter | 123456, Meier, 010203, Weg 1, Informatik, Wutz, 1
    |            |         | 124538, Schulz, 050678, Str 1, Informatik, Wutz, 2
Insert anomalies:
Where in this relation do you insert a student who has not yet taken part in any exam?
Delete anomalies:
When the student Pitt is deleted, the information about the dean of the Fachbereich BWL is also lost.
Update anomalies:
If a student who has taken part in several exams moves, the address change has to be carried out in several tuples.
Build all join strategies for the following tables SAMP_PROJECT and SAMP_STAFF:
i.e.
1. Cross Product
2. Inner Join
3. Outer Join
a. Left Outer Join
b. Right Outer Join
c. Full Outer Join
SAMP_PROJECT:
SAMP_STAFF:
Solution:
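The result tables of the original solution are given as screenshots. As an illustration, hedged SQL sketches of the five join strategies; the assumption that both tables share a NAME column which serves as the join column is mine:

-- 1. Cross product: every row of SAMP_PROJECT combined with every row of SAMP_STAFF
SELECT p.*, s.*
FROM SAMP_PROJECT p CROSS JOIN SAMP_STAFF s;

-- 2. Inner join: only the matching rows
SELECT p.*, s.*
FROM SAMP_PROJECT p INNER JOIN SAMP_STAFF s ON p.NAME = s.NAME;

-- 3a. Left outer join: all rows of SAMP_PROJECT, matching staff rows or NULLs
SELECT p.*, s.*
FROM SAMP_PROJECT p LEFT OUTER JOIN SAMP_STAFF s ON p.NAME = s.NAME;

-- 3b. Right outer join: all rows of SAMP_STAFF, matching project rows or NULLs
SELECT p.*, s.*
FROM SAMP_PROJECT p RIGHT OUTER JOIN SAMP_STAFF s ON p.NAME = s.NAME;

-- 3c. Full outer join: all rows from both tables
SELECT p.*, s.*
FROM SAMP_PROJECT p FULL OUTER JOIN SAMP_STAFF s ON p.NAME = s.NAME;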
Do the normalization steps 1NF, 2NF and 3NF on the following unnormalized table (show also the intermediate results):
Solution:
First normal form
➢ Only atomic attributes, i.e. elements of standard data types, and no lists, tables or similar complex structures
Prüfungsgeschehen (un-normalized, nested):
PNR | Fach       | Prüfer  | Student: MATNR, Name, Geb, Adr, Fachbereich, Dekan, Note
3   | Elektronik | Richter | 123456, Meier, 010203, Weg 1, Informatik, Wutz, 1
    |            |         | 124538, Schulz, 050678, Str 1, Informatik, Wutz, 2
4   | Informatik | Schwinn | 245633, Ich, 021279, Gas. 2, Informatik, Wutz, 1
    |            |         | 246354, Schulz, 050678, Str 1, Informatik, Wutz, 1
5   | TMS        | Müller  | 856214, Schmidt, 120178, Str 2, Informatik, Wutz, 3
    |            |         | 369852, Pitt, 140677, Gas. 1, BWL, Butz, 1
Prüfling
PNR MATNR Name Geb Adr Fachbereich Dekan Note
3 123456 Meier 010203 Weg 1 Informatik Wutz 1
3 124538 Schulz 050678 Str 1 Informatik Wutz 2
4 245633 Kunz 021279 Gas. 2 Informatik Wutz 1
4 124538 Schulz 050678 Str 1 Informatik Wutz 1
5 856214 Schmidt 120178 Str 2 Informatik Wutz 3
5 369852 Pitt 140677 Gas. 1 BWL Butz 1
Second normal form
Prüfling
PNR MATNR Name Geb Adr Fachbereich Dekan Note
3 123456 Meier 010203 Weg 1 Informatik Wutz 1
3 124538 Schulz 050678 Str 1 Informatik Wutz 2
4 245633 Kunz 021279 Gas. 2 Informatik Wutz 1
4 124538 Schulz 050678 Str 1 Informatik Wutz 1
5 856214 Schmidt 120178 Str 2 Informatik Wutz 3
5 369852 Pitt 140677 Gas. 1 BWL Butz 1
Recognizable: the data of the student (Name, Geb, Adr, Fachbereich, Dekan) depend only on MATNR and not on PNR; they are therefore not fully functionally dependent on the key.
The second normal form is created by eliminating the right-hand side of the partial dependency and copying the left-hand side.
Student
MATNR Name Geb Adr Fachbereich Dekan
123456 Meier 010203 Weg 1 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
245633 Kunz 021279 Gas. 2 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
856214 Schmidt 120178 Str 2 Informatik Wutz
369852 Pitt 140677 Gas. 1 BWL Butz
Prüfungsergebnis
PNR MATNR Note
3 123456 1
3 124538 2
4 245633 1
4 124538 1
5 856214 3
5 369852 1
➢ A relation R is in 2nd NF if it is in 1st NF and every non-prime attribute of R fully depends on every key of R (i.e. no attribute of the key is irrelevant).
➢ The problem of the anomalies has not yet been eliminated:
Insert anomaly: department (Fachbereich) data cannot be stored without an enrolled student.
Delete anomaly: department data disappears when the last student is deleted.
Update anomaly: a change of the dean has to be carried out in several places.
Third normal form
Student
MATNR Name Geb Adr Fachbereich Dekan
123456 Meier 010203 Weg 1 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
245633 Kunz 021279 Gas. 2 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
856214 Schmidt 120178 Str 2 Informatik Wutz
369852 Pitt 140677 Gas. 1 BWL Butz
Student
MATNR Name Geb Adr Fachbereich
123456 Meier 010203 Weg 1 Informatik
124538 Schulz 050678 Str 1 Informatik
245633 Kunz 021279 Gas. 2 Informatik
124538 Schulz 050678 Str 1 Informatik
856214 Schmidt 120178 Str 2 Informatik
369852 Pitt 140677 Gas. 1 BWL
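In the third normal form the transitive dependency MATNR -> Fachbereich -> Dekan is removed by moving the dean into a separate Fachbereich table (not reproduced above). A hedged SQL sketch of the resulting 3NF schema; the data types are assumptions, the table and column names are taken from the tables above:

CREATE TABLE Student (
    MATNR        INTEGER NOT NULL PRIMARY KEY,
    Name         VARCHAR(30),
    Geb          VARCHAR(10),
    Adr          VARCHAR(30),
    Fachbereich  VARCHAR(30)    -- references Fachbereich
);

CREATE TABLE Fachbereich (
    Fachbereich  VARCHAR(30) NOT NULL PRIMARY KEY,
    Dekan        VARCHAR(30)
);

CREATE TABLE Pruefungsergebnis (   -- "Prüfungsergebnis"
    PNR    INTEGER NOT NULL,
    MATNR  INTEGER NOT NULL,       -- references Student
    Note   INTEGER,
    PRIMARY KEY (PNR, MATNR)
);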
Do the normalization steps 1NF, 2NF and 3NF on the following un-normalized table (show also the intermediate results):
Prerequisites: Keys are PO# and Item#, SupName = Funct (Sup#) , Quant =
Funct (Item#,PO#) and $/Unit=Funct (Item#)
Solution to 3.4:
The table is not in First Normal Form (1NF) – there are "repeating row groups".
By adding the duplicated information from the first three rows to the empty row cells, we get five complete rows in this table, which have only atomic values. So we have First Normal Form (1NF).
.........
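The remaining steps are not reproduced as text. Based on the stated functional dependencies, a hedged outline of a possible 2NF/3NF decomposition; the assumption that each purchase order PO# determines its supplier (PO# -> Sup#, SupName) is mine:

-- 2NF: remove the partial dependencies Item# -> $/Unit and PO# -> (Sup#, SupName)
CREATE TABLE PO_Item (                 -- Quant depends on the full key (PO#, Item#)
    PO#    INTEGER NOT NULL,
    Item#  INTEGER NOT NULL,
    Quant  INTEGER,
    PRIMARY KEY (PO#, Item#)
);
CREATE TABLE Item (
    Item#     INTEGER NOT NULL PRIMARY KEY,
    UnitPrice DECIMAL(9,2)             -- $/Unit
);
CREATE TABLE PO (
    PO#      INTEGER NOT NULL PRIMARY KEY,
    Sup#     INTEGER,
    SupName  VARCHAR(30)               -- still transitive: PO# -> Sup# -> SupName
);

-- 3NF: remove the transitive dependency Sup# -> SupName
CREATE TABLE Supplier (
    Sup#     INTEGER NOT NULL PRIMARY KEY,
    SupName  VARCHAR(30)
);
-- ... and drop SupName from the PO table.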
Airport:
FID Name
MUC Muenchen
FRA Frankfurt
HAN Hannover
STU Stuttgart
MAN Mannheim
BER Berlin
Flight:
1. You get a list of airports which have no incoming flights (no arrivals). (6 points)
2. Create a report (view) Flights_To_Munich of all flights to Munich (arrival) with Flight-Number, Departure-Airport (full name) and Departure-Time as columns. (6 points)
3. Insert a new flight from BER to HAN at 17:30 with FNo 471. (4 points)
4. Change the FlightTime of Fno=181 to 10:35. (4 points)
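The solution to task 1 is not reproduced as text in the original. A hedged sketch, assuming the same table and column names as in the solutions below (airport(fid, name), flight(fno, from, to, time)):

Ad 1.:
select a.name
from airport a
where not exists (select * from flight f where f.to = a.fid)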
Ad 2.:
create view Flights_to_Munich2 as
select f.Fno as FNr, a.name as Dep_Airp, f.time as DepT
from flight f, airport a
where f.to='MUC' and a.fid=f.from
Ad3.:
insert into flight
values (471,'BER','HAN','17.30.00')
Ad4.:
update flight
set time = '10.35.00'
where Fno=181
Ad5 (optional):
select name as Departure_Airport, count (*) as Departure_Count
from airport, flight
where fid=from
group by name
union
select name as Departure_Airport, 0 as Departure_Count
from airport
where not exists (select * from flight where from=fid)
order by departure_count
DEPARTURE_AIRPORT DEPARTURE_COUNT
------------------------------ -------------------------------
Berlin 0
Frankfurt 0
Hannover 1
Mannheim 1
Stuttgart 1
Muenchen 2
6 record(s) selected.
**************************************************************************
Here is also a second solution (which is shorter) and gives the same results as above by
Stefan Seufert:
SELECT Name as Departure_Airport, count (Flight.From) as Departure_Count
FROM Airport LEFT OUTER JOIN Flight ON Airport.FID = Flight.From
GROUP BY Name
ORDER BY Departure_Count
The idea is that count(Field), in contrast to count(*), only counts the rows where the field is not NULL. Since the attribute in the count function comes from the flight table, only airports with actual departures are counted; all others get the value 0.
Star schema (diagram):
Fact table Sales_Fact: Prod_id, Time_id, Promo_id, Store_id, Dollar_Sales, Unit_Sales, Dollar_Cost, Cust_Count, …
Dimension Product: Prod_id, Brand, Subcategory, Category, Department, …
Dimension Time: Time_id, Fiscal_Period, Quarter, Month, Year, …
Dimension Store: Store_id, Name, Store_No, Store_Street, Store_City, …
Dimension Promotion: Promo_id, Promo_Name, Price_Reduct., …
Build the SQL such that the result is the following report, where the time condition is Fiscal_Period = '4Q95', so that we get the result table below. Why is this a typical DWH query (result table)?
By using the SQL Wizard (Design View) in the database Microsoft Access, we see the
following ‘Access SQL‘:
Solution with Standard SQL(for example with DB2) by loading the data
(flat files) into DB2:
First connect to database “Grocery”. Then create the necessary tables and load the data from
flat Files (*.txt Files) into the corresponding tables:
Load the data from the Sales_Fact.txt file by using the “Load Data” feature of the table
DB2ADMIN.Sales_Fact in the GROCERY database:
Do the same for the four dimension-tables: “Product”, “Time”, “Store” and “Promotion”.
CREATE TABLE "DB2ADMIN"."TIME" ("TIME_ID" INTEGER,
"DATE" varchar(20),"DAY_IN_WEEK" varchar(12),
"DAY_NUMBER_IN_MONTH" Double,
"DAY_NUMBER_OVERALL" Double,
"WEEK_NUMBER_IN_YEAR" Double,
"WEEK_NUMBER_OVERALL" Double,
"MONTH" Double, "QUARTER" int,
"FISCAL_PERIOD" varchar(4),"YEAR" int,
"HOLIDAY_FLAG" varchar(1))
ORGANIZE BY ROW
DATA CAPTURE NONE
IN "USERSPACE1"
COMPRESS NO;
Finally run the SQL to produce the result for the quarter “4Q95”:
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS
Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
AND t."FISCAL_PERIOD" = '4Q95'
GROUP BY p.BRAND
ORDER BY p.BRAND;
Alternative:
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
AND t.QUARTER = 4
AND t."YEAR" = 1995   -- restrict to 1995 so that QUARTER = 4 corresponds to '4Q95'
GROUP BY p.BRAND
ORDER BY p.BRAND;
Finally run the SQL to produce the result for both quarters "4Q95" and "4Q94":
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS
Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
AND (t."FISCAL_PERIOD" = '4Q95' OR t."FISCAL_PERIOD" = '4Q94')
GROUP BY p.BRAND
ORDER BY p.BRAND;
Alternative:
You just omit the selection of a specific quarter. In addition, you can create a view named "Sales_Per_Brand":
Create View "DB2ADMIN"."Sales_Per_Brand" AS
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS
Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
GROUP BY p.BRAND;
Remark: You also have to omit the ORDER BY clause to avoid an error in DB2. Nevertheless, the result is ordered by the brand name. See the resulting view:
Find explanations and arguments in DWH forums or articles about this theme on the internet or in the literature.
First SOLUTION:
Second SOLUTION:
(Presentation "RI in DWH" by Francois Tweer-Roller & Marco Rosin, 12/17/2013.)
Third Solution:
Definition
"Referential integrity is used in a DBMS to control the relationships between data objects."
Advantages
• Higher data quality: referential integrity helps to avoid errors.
• Faster development: referential integrity does not have to be re-implemented in every application.
• Fewer errors: referential integrity constraints defined once apply to all applications of the same database.
• More consistent applications: referential integrity is the same for all applications that access the same database.
Disadvantages
• Deletion problems due to the integrity constraints.
• Temporary deactivation of RI is needed for large data imports.
In my opinion, the realization of referential integrity is possible, but it involves a lot of effort and cost.
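As an illustration of this trade-off, a hedged sketch (DB2 LUW syntax assumed): in a DWH, a foreign key on the fact table can be declared as an informational constraint, so that it is not enforced during mass loads but can still be used by the optimizer. This assumes that the referenced dimension column carries a primary key or unique constraint:

ALTER TABLE "DB2ADMIN"."SALES_FACT"
  ADD CONSTRAINT fk_product
      FOREIGN KEY (PRODUCT_ID)
      REFERENCES "DB2ADMIN"."PRODUCT" (PRODUCT_ID)
      NOT ENFORCED                  -- RI is not checked by the DBMS
      ENABLE QUERY OPTIMIZATION;    -- but the optimizer may exploit it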
Compare ER Modelling (ER) with multidimensional data models (MDDM), like STAR or
SNOWFLAKE schemas (see appendix page):
Compare in the IBM Redbook "Data Modeling Techniques for DWH" (see the DWH lesson homepage) Chapter 6.3 for ER modeling and Chapter 6.4 for MDDM.
Build a list of advantages/disadvantages for each of these two concepts, in the form of a table:
ER Model: Criteria 1 (++), Crit. 2 (+), Crit. 3 (-), Crit. 4 (--)
MDDM Model: Criteria 5 (++), Crit. 6 (+), Crit. 7 (-), Crit. 8 (--)
Solution:
Solution:
Star schema The star schema logical design, unlike the entity-relationship model, is
specifically geared towards decision support applications. The design is intended to provide
very efficient access to information in support of a predefined set of business requirements.
A star schema is generally not suitable for general-purpose query applications.
A star schema consists of a central fact table surrounded by dimension tables, and is
frequently referred to as a multidimensional model. Although the original concept was to have
up to five dimensions as a star has five points, many stars today have more than five
dimensions.
The information in the star usually meets the following guidelines:
• A fact table contains numerical elements
• A dimension table contains textual elements
• The primary key of each dimension table is a foreign key of the fact table
• A column in one dimension table should not appear in any other dimension table
Snowflake schema The snowflake model is a further normalized version of the star schema.
When a dimension table contains data that is not always necessary for queries, too much data
may be picked up each time a dimension table is accessed.
To eliminate access to this data, it is kept in a separate table off the dimension, thereby
making the star resemble a snowflake. The key advantage of a snowflake design is improved
query performance. This is achieved because less data is retrieved and joins involve smaller,
normalized tables rather than larger, de-normalized tables.
The snowflake schema also increases flexibility because of normalization, and can possibly
lower the granularity of the dimensions. The disadvantage of a snowflake design is that it
increases both the number of tables a user must deal with and the complexities of some
queries.
For this reason, many experts suggest refraining from using the snowflake schema. With entity attributes spread over multiple tables, the same amount of information is available whether a single table or multiple tables are used.
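As an illustration, a hedged SQL sketch of snowflaking the Product dimension from the grocery star schema of Exercise E4.2; all table and column names beyond the ones shown there are assumptions:

-- Star: one de-normalized Product dimension
CREATE TABLE PRODUCT (
    PRODUCT_ID   INTEGER NOT NULL PRIMARY KEY,
    BRAND        VARCHAR(30),
    SUBCATEGORY  VARCHAR(30),
    CATEGORY     VARCHAR(30),
    DEPARTMENT   VARCHAR(30)
);

-- Snowflake: the hierarchy levels are moved into separate, normalized tables
CREATE TABLE PRODUCT_SF (
    PRODUCT_ID      INTEGER NOT NULL PRIMARY KEY,
    BRAND           VARCHAR(30),
    SUBCATEGORY_ID  INTEGER            -- references SUBCATEGORY
);
CREATE TABLE SUBCATEGORY (
    SUBCATEGORY_ID  INTEGER NOT NULL PRIMARY KEY,
    SUBCATEGORY     VARCHAR(30),
    CATEGORY_ID     INTEGER            -- references CATEGORY
);
CREATE TABLE CATEGORY (
    CATEGORY_ID     INTEGER NOT NULL PRIMARY KEY,
    CATEGORY        VARCHAR(30),
    DEPARTMENT      VARCHAR(30)
);

A query on the snowflake then joins these smaller, normalized tables instead of reading the one wide dimension table.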
A star schema is a dimensional structure in which a single fact is surrounded by a single circle
of dimensions; any dimension that is multileveled is flattened out into a single dimension. The
star schema is designed for direct support of queries that have an inherent dimension-fact
structure.
The primary justification for using the star is performance and understandability. The
simplicity of the star has been one of its attractions. While the star is generally considered to
be the better performing structure, that is not always the case. In general, one should select a
star as first choice where feasible. However, there are some conspicuous exceptions. The
remainder of this response will address these situations.
First, some technologies such as MicroStrategy require a snowflake and others like Cognos require the star. This is significant.
Second, some queries naturally lend themselves to a breakdown into fact and dimension. Not
all do. Where they do, a star is generally a better choice.
Third, there are some business requirements that just cannot be represented in a star. The
relationship between customer and account in banking, and customer and policy in Insurance,
cannot be represented in a pure star because the relationship across these is many-to-many.
You really do not have any reasonable choice but to use a snowflake solution. There are many
other examples of this. The world is not a star and cannot be force fit into it.
Fourth, a snowflake should be used wherever you need greater flexibility in the
interrelationship across dimension levels and components. The main advantage of a
snowflake is greater flexibility in the data.
Fifth, let us take the typical example of Order data in the DW. A dimensional designer would not bat an eyelash in collapsing the Order Header into the Order Item. However, consider this. Say there are 25 attributes common to the Order and that belong to the Order Header. You sell consumer products. A typical delivery can average 50 products. So you have 25 attributes with a ratio of 1:50. In this case, it would be grossly cumbersome to collapse the header data into the Line Item data as in a star; you would be introducing a lot of redundancy into a huge fact table of, say, more than 2 billion rows. By the way, the Walmart model, which is one of the most famous of all time, does not collapse Order Header into Order Item. However, if you are a video store, with few attributes describing the transaction, and an average ratio of 1:2, it would be best to collapse the two.
Sixth, take the example of changing dimensions. Say your dimension, Employee, consists of
some data that does not change (or if it does you do not care, i.e., Type 1) and some data that
does change (Type 2). Say also that there are some important relationships to the employee
data that does not change (always getting its current value only), and not to the changeable
data. The dimensional modeler would always collapse the two creating a Slowly Changing
Dimension, Type 2. This means that the Type 1 is absorbed into the Type 2. In some cases I
have worked on, it has caused more trouble than it was worth to collapse in this way. It was
far better to split the dimension into Employee (type 1) and Employee History (type 2).
Thereby, in such more complex history situations, a snowflake can be better.
Seventh, whether the star schema is more understandable than the snowflake is entirely subjective. I have personally worked on several data warehouses where the user community complained that in the star, because everything was flattened out, they could not understand the hierarchy of the dimensions. This was particularly the case when the dimension had many columns.
Finally, it would be nice to quit the theorizing and run some tests. So I did. I took a data
model with a wide customer dimension and ran it as a star and as a snowflake. The customer
dimension had many attributes. We used about 150MM rows. I split the customer dimension
into three tables, related 1:1:1. The result was that the snowflake performed faster. Why?
Because with the wide dimension, the DBMS could fit fewer rows into a page. DBMSs read
by pre-fetching data and with the wide rows it could pre-fetch less each time than with the
skinnier rows. If you do this make sure you split the table based on data usage. Put data into
each piece of the 1:1:1 that is used together.
What is the point of all this? I think it is unwise to pre-determine what is the best solution. A
number of important factors come into play and these need to be considered. I have worked to
provide some of that thought-process in this response.
Conditions: Each article can be delivered by one or more suppliers. Each supplier delivers 1 to 10 articles. An order consists of 2 to 10 articles. Each article can appear only once on an order form, but you can order more than one piece of an article. Each order is placed by a customer. A customer can have more than one order (no limit).
Good customers get a discount ("Rabatt"). The number of articles in the store should also be saved. It is not important who the supplier of the article is. For each object we need a technical key for identification.
Task: Create a Logical ER model. Model the necessary objects and the relations between
them. Define the attributes and the keys. Use the following notation:
First Solution:
Solution:
……
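The original solutions are given as ER diagrams. As an illustration only, a hedged relational sketch of one possible model; all table, column and key names are assumptions derived from the conditions above:

CREATE TABLE Customer (
    Cust_ID   INTEGER NOT NULL PRIMARY KEY,   -- technical key
    Name      VARCHAR(30),
    Discount  DECIMAL(4,2)                    -- "Rabatt" for good customers
);
CREATE TABLE Article (
    Art_ID    INTEGER NOT NULL PRIMARY KEY,
    Name      VARCHAR(30),
    Stock     INTEGER                         -- number of articles in the store
);
CREATE TABLE Supplier (
    Sup_ID    INTEGER NOT NULL PRIMARY KEY,
    Name      VARCHAR(30)
);
-- n:m relationship article <-> supplier (1..n suppliers per article, 1..10 articles per supplier)
CREATE TABLE Article_Supplier (
    Art_ID    INTEGER NOT NULL,
    Sup_ID    INTEGER NOT NULL,
    PRIMARY KEY (Art_ID, Sup_ID)
);
CREATE TABLE "Order" (
    Order_ID  INTEGER NOT NULL PRIMARY KEY,
    Cust_ID   INTEGER NOT NULL                -- each order is placed by one customer
);
-- Order position: each article at most once per order, with an ordered quantity
CREATE TABLE Order_Item (
    Order_ID  INTEGER NOT NULL,
    Art_ID    INTEGER NOT NULL,
    Quantity  INTEGER,
    PRIMARY KEY (Order_ID, Art_ID)
);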
In the lecture to this chapter we have seen 3 steps: “Discover”, “Prepare” and ”Transform” for
a successful data population strategy.
Please present examples of two tools for the first two steps. Show details like functionality, price/costs, special features, strong features, weak points, etc.
You can use the examples from the lecture or show new tools which you found on the internet or know from your current business.
Solution (SS2021):
Solution (WS2021):
Workflow "Data Manipulation and Aggregation":
… table is the input table and the CSV file provides the translation. According to the configuration, the Cell Replacer adds a new Sentiment Rating column with the appropriate integer values to the input table. However, it can also be specified that the string values are replaced by the numbers in the same column, without creating a new column.
Import: Next, customer data (e.g. income, gender, etc.) is read in from a CSV file.
Activity 1 - Part 2: With the next node, Column Filter, one or more columns can be filtered out of a table. Here the columns Sentiment Rating and Web Activity are to be excluded.
Import: Finally, an Excel table is read in with the Excel Reader. In it you can see which
customer has purchased which investment product. In the next step, these individual data are
merged.
In the 3rd part of the workflow, join operations take place; they are used to merge different tables. For this purpose, KNIME offers the possibility to configure the join operation (node → Configure).
Joiner Settings: you can select one or several of 3 options (include in output):
- Matching rows (inner join): only the rows that are contained in both original tables are included in the common table.
- Left unmatched rows (left outer join): additionally includes the rows from the left table that have no match in the right table; the columns coming from the right table are filled with missing values for these rows.
- Right unmatched rows (right outer join): additionally includes the rows from the right table that have no match in the left table; the columns coming from the left table are filled with missing values for these rows.
Column selection: there is a possibility to select the desired columns (manually, with the
RegEx or by the data type).
The first Joiner merges the tables from Part 1 and Part 2. The second Joiner uses the result of the first Joiner and the data that results after the filtering in Part 2. The third Joiner composes its result from the data of the second Joiner and from the Excel Reader, which provides the "Products" table.
After that the data flows into the next phase, where the received data can be manipulated
again.
The "Duplicate Row Filter" node identifies duplicate rows. It can either remove all duplicate rows from the input table and keep only the unique and selected rows, or it can mark the rows with additional information about their duplication status. In the configuration it is possible to specify in which columns the node should search for duplicates.
Using the String Manipulation node you can perform different operations on strings. This can
also be set in configurations.
The Table Manipulator node allows multiple column transformations to be performed on any number of input tables, such as renaming, filtering, rearranging and changing the type of the input columns. If there is more than one input table, the node concatenates all input rows into a single result table. If the input tables contain the same row identifiers, the node can either create new row identifiers or append the index of the input table to the original row identifier of the corresponding input table.
The last node "GroupBy" groups the rows of a table according to the unique values in the
selected group columns. For each unique set of values of the selected group column, one row
is created. The remaining columns are aggregated based on the specified aggregation settings.
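Not part of the original solution, but as an illustration: the GroupBy node corresponds to a SQL GROUP BY with aggregation functions. A hedged sketch on an assumed customer table (only the income and gender attributes are mentioned above; the table name is an assumption):

-- One output row per unique value of the group column (here: Gender);
-- the remaining columns are aggregated (average income, row count).
SELECT Gender,
       AVG(Income) AS Avg_Income,
       COUNT(*)    AS Cust_Count
FROM   Customer_Data
GROUP  BY Gender;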
Show the highlights and build a Strengths/Weaknesses Diagram for the following three ETL Tools. Use the information from the internet:
Solution - Informatica:
Strengths:
• Unified user interface
• Very comprehensive
• Optimized solutions
Weaknesses:
• Faulty sorting
• Filters are ignored (when copying)
• Problems when importing XML
Sources:
http://www.informatica.com/de/products/enterprise-data-integration/powercenter/
http://etl-tools.info/informatica/informatica-weaknesses.html
http://www.irix.ch/cms/upload/pdf/publications/ETL_Tools_01_2007.pdf
http://de.wikipedia.org/wiki/Informatica
Prepare and run the guided tour „Offload Data Warehousing to Hadoop by using DataStage”
Use IBM® InfoSphere® DataStage® to load Hadoop and use YARN to manage DataStage
workloads in a Hadoop cluster (a registered IBM Cloud Id is needed!):
https://www.ibm.com/cloud/garage/dte/producttour/offload-data-warehousing-hadoop-using-datastage
Solution (SS2021):
See the execution of this demo in the IBM Cloud in the following video:
https://cloud.anjomro.de/s/72knGpN3oPsKitM
…
….
……………..
Write a SQL script such that you get the following content of the target table:
Select
  Case SAMPLTBC.sales.city
  Case
    When Month(SAMPLTBC.sales.transdate) = 01 then 1
    ….
    When Month(SAMPLTBC.sales.transdate) = 12 then 12
  End As Time_Id
  Case
    When Year(SAMPLTBC.sales.transdate) = 1997 then 1
    When Year(SAMPLTBC.sales.transdate) = 1996 then 3
    Else 2
  End As Scenario_Id
  ….
See screenshot:
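Since the complete script is only shown as a screenshot, here is a hedged sketch of how such a load could look. Only the month and year mappings above are taken from the original fragment; the target table name, the city mapping and the remaining column list are assumptions:

-- Hedged sketch: load the target table from SAMPLTBC.sales,
-- deriving the ID columns with CASE expressions
INSERT INTO target_table (City_Id, Time_Id, Scenario_Id, Sales_Amount)   -- names assumed
SELECT
  CASE SAMPLTBC.sales.city            -- simple CASE on the city name (mapping assumed)
    WHEN 'Stuttgart' THEN 1
    WHEN 'Berlin'    THEN 2
    ELSE 99
  END AS City_Id,
  CASE
    WHEN MONTH(SAMPLTBC.sales.transdate) = 1  THEN 1
    -- ... one WHEN branch per month ...
    WHEN MONTH(SAMPLTBC.sales.transdate) = 12 THEN 12
  END AS Time_Id,
  CASE
    WHEN YEAR(SAMPLTBC.sales.transdate) = 1997 THEN 1
    WHEN YEAR(SAMPLTBC.sales.transdate) = 1996 THEN 3
    ELSE 2
  END AS Scenario_Id,
  SAMPLTBC.sales.amount               -- measure column assumed
FROM SAMPLTBC.sales;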
First Solution:
Data volume:
  ROLAP: + > 50 GB possible, low expansion factor (low aggregation rate)
  MOLAP: - not > 50 GB, expansion factor too big (high aggregation rate)
Dimensions:
  ROLAP: + > 10 possible (depends only on the DBMS)
  MOLAP: - bad performance for > 10 (due to the high aggregation rate)
Query performance:
  ROLAP: (+ when querying single tables), - when joining many tables
  MOLAP: + when using highly aggregated data (- when using low aggregated data)
Update flexibility:
  ROLAP: + update during operation possible, + fast and flexible
  MOLAP: - the cube has to be rebuilt completely each time (partly correct, depends on the calculation rules), - operation has to be stopped
Query complexity:
  ROLAP: + complex, dynamic queries possible (impact on query performance)
  MOLAP: - only standard queries that the cube is built for are possible (but combinations are possible)
Price:
  ROLAP: + cheaper, simpler SQL-based front-ends are sufficient (but more performance needed)
  MOLAP: - expensive, costly front-end tool necessary
➔ ROLAP is superior to MOLAP for many criteria. As most data marts today are bigger than 50 GB, ROLAP is in many cases the better choice due to performance and storage reasons.
Second Solution:
MOLAP: - access only to the data of the cube; - cubes have to be rebuilt on updates.
ROLAP: - queries take longer because they are more complex; + updates are also possible during operation.
Task: Describe the following Data Mining techniques. Search for this information on the internet, e.g. Wikipedia or other knowledge portals:
• Clustering
• Classification
• Associations
Solution:
Task: Describe the following Data Mining techniques. Search for this information on the internet, e.g. Wikipedia or other knowledge portals:
• Sequential Patterns
• Value Prediction
• Similar Time Sequences
Solution:
(Solution slides "Exercise DWH Lecture 9" by Martin Mory; keywords from the slides: Konsumverhalten (consumer behaviour), Radial Basis Functions.)
Solution:
To 1.:
Solution: Only the two association rules (Bier=>Milch) and (Milch=>Bier) have support
and confidence >=50%.
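For reference, the general definitions of support and confidence for an association rule A ⇒ B (not specific to the data of this exercise):

\[
\mathrm{support}(A \Rightarrow B) = \frac{|\{\,T \in D : A \cup B \subseteq T\,\}|}{|D|},
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
\]

where D is the set of all transactions T. Since support is symmetric in A and B, the two rules (Bier ⇒ Milch) and (Milch ⇒ Bier) always have the same support and can only differ in their confidence.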
Solution (WS2021):
Hint: Follow the instructions given in the KNIME workflow “KNIME Analytics
Platform for Data Scientists – Basics (04. Data Mining – solution)” - see image
below:
Solution:
………….
Solution:
……
Solution:
……..
Solution:
Task: Homework for 2 Persons: Rebuild the KNIME Workflow (use given
solution) for Image-Classification and give technical explanations to the solution
steps.
Hint: Follow the instructions given in the KNIME workflow “L4-DL
Introduction to Deep Learning/Session4/Solutions (Image Classification MNIST
Solution)” - see image below:
Solution:
……..
References