Exercises & Solutions Intro2DWH
2021
by
Dr. Hermann Völlinger and Others
Content
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 1
  Exercise E1.1*: Investigate the BI-Data Trends in 2021
  Exercise E1.2*: Investigate the catchwords: DWH, BI and CRM
  Exercise E1.3*: Compare two Data Catalogue Tools
  Exercise E1.4: First Experiences with KNIME Analytics Platform
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 2
  Exercise E2.1*: Compare 3 DWH Architectures
  Exercise E2.2*: Basel II and RFID
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 3
  Exercise E3.1: Overview about 4 Database Types
  Exercise E3.2: Build Join Strategies
  Exercise E3.3: Example of a Normalization
  Exercise E3.4: Example of a Normalization
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 4
  Exercise E4.1: Create SQL Queries
  Exercise E4.2: Build SQL for a STAR Schema
  Exercise E4.3*: Advanced Study about Referential Integrity
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 5
  Exercise E5.1: Compare ER and MDDM
  Exercise E5.2*: Compare Star and SNOWFLAKE
  Exercise E5.3: Build a Logical Data Model
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 6
  Exercise E6.1: ETL: SQL Loading of a Lookup Table
  Exercise E6.2*: Discover and Prepare
  Exercise E6.3: Data Manipulation and Aggregation using KNIME Platform
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 7
  Exercise E7.1*: Compare 3 ETL Tools
  Exercise E7.2: Demo of Datastage
  Exercise E7.3: Compare ETL and ELT Approach
  Exercise E7.4: ETL: SQL Loading of a Fact Table
Exercises (+Solutions) to DHBW Lecture Intro2DWH – Chapter 8
  Exercise E8.1: Compare MOLAP to ROLAP
Solution:
Task: Prepare a report and present it next week; duration = 30 minutes (10 minutes for each area). Information sources are newspaper or magazine articles or books (see literature list); 3 students.
Theme: Trends or new developments in the following areas (project reports are also possible):
For an explanation of these 'catchwords' see also the slides of the lesson or search on the internet.
Optional: Also give an explanation of the synonyms, such as OLAP, OLTP, ETL, ERP, EAI.
Solution:
BI – Business Intelligence:
BI is the process of analysing the accumulated raw operational data and extracting meaningful information from it, in order to be able to make better business decisions on the basis of this integrated information.
BI is when business processes are optimized on the basis of the facts gained from the Data Warehouse.
Operational CRM:
Solutions for the automation / support of processing workflows with customers (online shop, call center, …).
Analytical CRM:
Solutions that draw on information from the Data Warehouse and are based on task-specific analyses (data mining).
Collaborative CRM:
Communication component that enables the interaction with the customer.
Insights are gained through collaboration with the customer. These can then be used to optimize business processes or to personalize the customer relationship.
The term OLAP summarizes technologies, i.e. methods as well as tools, that support the ad-hoc analysis of multidimensional data. The data can come from the Data Warehouse, from Data Marts or also from operational systems.
An ETL tool is responsible for deriving cleansed and possibly aggregated information as well as additional metadata from the operational data (real-time data).
Current trends:
1) Exploding data volume
• Strongest trend
• According to Gartner, the data volume in 2004 is expected to be 30 times as high as in 1999.
• Scalability
4) More end users
• BI and DWH systems must become more accessible
• Usability: "less is more"
6) Active DWH
• Competitive pressure → data must be available quickly
• Active DWHs are closely coupled to operational systems → very up-to-date and very detailed data
8) Outsourcing
• Initially applications + data; in the future also the information storage in the DWH
Task: Select two of the Data Catalog (DC) tools from the two “Market Study - DC” slides
and prepare a report about the functionality of these tools (2 Students, next week, duration =
20 minutes).
Hint: Information source is the internet. See also links in the “Market Study –DC” slides. See
also the directory “Supporting Material” in the Moodle of this lecture [DHBW-Moodle].
Solution:
Task: Install the tool and report about your first experiences. Give answers to the following
questions:
1. What can be done with the tool?
2. What are the features for Data-Management?
3. What are the features for Analytics and Data Science?
Information sources are the KNIME homepage ("KNIME | Open for Innovation") and the three documents mentioned in lesson DW01 (see the lesson notes).
KNIME Features:
Blend & Transform:
● Access data from different sources (e.g. databases, files, etc.)
● Merging of data from different data sources (adapting data if necessary)
● Prefabricated interfaces for various DBs and DWHs
● Interfaces are extensible
● Documentation of executed steps for better traceability
First, we import data using a JSON Reader node. Since KNIME keeps the processed data of each node in the node's context, the imported dataset is now present there.
This allows the user to view each step of the workflow and to trace which node transforms the data in which way. After importing the JSON data, we tell the import node to only represent the data matching a given JSON-Path.
Task: Compare the three DWH architectures (DW only, DM only and DW & DM) on the next slide. List the advantages and disadvantages and give a detailed explanation for them. Also find a fourth possible architecture (hint: a 'virtual' DWH).
(Rating table template from the task: one row per criterion (Criteria 2, Criteria 3, ...), a rating (++ / + / - / --) for each of the architectures, plus a short explanatory text.)
Solution:
Implementation costs
The implementation of a Data Warehouse with Data Marts is the most expensive solution,
because it is necessary to build the system including connections between Data Warehouse
and its Data Marts.
It is also necessary to build a second ETL which manages the preparation of data for the
Data Marts.
In case of implementing Data Marts or a Data Warehouse only, the ETL is only implemented
once. The costs may be almost the same in building one of these systems. The Data Marts
only require a little more hardware and network connections to the data sources. But due to the fact that building the ETL is the most expensive part, these costs may be relatively low.
The virtual Data Warehouse may have the lowest implementation costs, because existing applications and infrastructure are reused.
Administration costs
The Data Warehouse only solution is the best at minimizing the administration costs, due to the centralized design of the system. In this solution it is only necessary to manage a central system. Normally the client management is no problem when using web technology or a centralized client deployment, which should be standard in all mid-size to big enterprises. A central backup can cover all data of the Data Warehouse.
The solution with Data Marts only is more expensive, because of its decentralized design. There are higher costs in case of product updates or when maintaining the online connections, and you also have to back up each Data Mart by itself, depending on its physical location.
Also the process of filling a single Data Mart is critical. Errors during an update may cause loss of data. In case of an error during an update, the system administration must react at once.
Data Marts with a central Data Warehouse are more efficient, because all necessary data is
stored in a single place. When an error during an update of a Data Mart occurs, this is
normally no problem, because the data is not lost and can be recovered directly from the
Data Warehouse. It may also be possible to recover a whole Data Mart out of the Data
Warehouse.
The administration costs of a virtual Data Warehouse depend on the quality of the implementation. Problems with connections to the online data sources may cause users to ask for support, even if the problem was caused by a broken online connection or a failure in the online data source. End users may not be able to tell whether the data source or the application on their computer causes a problem.
Performance
A virtual Data Warehouse has the poorest performance overall. All data is retrieved at runtime directly from the data sources. Before data can be used, it must be converted for presentation. Therefore, a huge amount of time is spent on retrieving and converting data.
The Data Marts host information which is already optimized for the client applications. All data is stored in an optimal state in the database. Special indexes in the databases speed up information retrieval.
Implementation Time
The implementation of a Data Warehouse with its Data Marts takes the longest time,
because complex networks and transformations must be created. Creating Data Warehouse
only or Data Marts only should take almost the same amount of time. Most of the time is normally spent on creating the ETL (about 80%), so Data Warehouse only and Data Marts only should not differ much in this respect.
Implementing a Virtual Data Warehouse can be done very fast because of its simple
structure. It is not necessary to build a central database with all connectors.
Data Consistency
When using Data Warehouse or Data Mart technology a maximum consistency of data is
achieved.
All provided information is checked for validity and consistency. A virtual Data Warehouse may have problems with data consistency because all data is retrieved at runtime. When the data organization of the sources changes, the new data may still be consistent, but older data may no longer be represented in the current model.
Flexibility
A virtual Data Warehouse has the highest flexibility. It is possible to change the data preparation process very easily because only the clients are directly involved. There are nearly no components which depend on each other.
In a Data Warehouse only solution, flexibility is poor, because there may exist different types of clients that depend on the data model of the Data Warehouse. If it were necessary to change a particular part of the data model, intensive testing for compatibility with the existing applications must be done, or the client applications even have to be updated.
A solution with Data Marts, with or without a central Data Warehouse, has medium flexibility, because client applications normally use the Data Marts as their point of information. In case of a change in the central Data Warehouse or the data sources, it is only necessary to update the process of filling the Data Marts. In case of a change in the Data Marts, only the depending client applications are involved and not all client applications.
Data Consistency
Data consistency is poor in a virtual Data Warehouse, but it also depends on the quality of the process that gathers information from the sources.
Data Warehouses and Data Marts have very good data consistency, because the information stored in their databases has been checked during the ETL process.
Quality of information
The quality of information hardly depends on the quality of the data population process (ETL
process) and how good the information is processed and filtered before stored in the
Data Warehouse or presented to a user. Therefore, it is not possible to give a concrete
statement.
History
A virtual Data Warehouse has no history at all, because the values or information are retrieved at runtime. In this architecture it is not possible to store a history because no central database is present.
The other architectures provide a central place to store this information. The history provides a basis for analysing business processes and their effects, because it is possible to compare current information with information from the past.
Theme: Give a definition (5 minutes) and describe the impact of these new trends on Data Warehousing (10 minutes):
1. Basel II
2. RFID
Solution:
Agenda (source: IBM):
▪ Why the Basel accords?
▪ 2003: Basel II is taken into the strategy of the institutions
▪ 2004: Building up the DWH infrastructure
▪ 2005: Data collection + evaluation strategy
▪ 2006: ... Parallel run of Basel I + II
▪ 2007: Basel II becomes binding
▪ Tools, outlook
Large data volumes are needed for analysis; DWHs are needed by:
– Banks → customer rating
– Rating agencies → providing rating services
– Companies → an optimal financial situation reduces credit costs
Agenda of the presentation "Basel II & DWH" (Christian Schäfer, 28.10.2005):
• Securing stability in the financial sector
• The capital accord of 1988 (Basel I)
• From Basel I to Basel II
• Reasons for Basel II
• Rating of loans according to Basel II
• The bank wants to get to know us
• Effects of Basel II
• Challenges for Data Warehouse systems
A further (third) solution on Basel II and DWH can be found in the following presentation:
Basel II (diagram of the three pillars):
1. Minimum capital requirements (credit risk, market risk, operational risk)
2. Reviews by the banking supervision
3. Market discipline through disclosure obligations
Basel I: ≥ 8% equity capital
http://www.bundesbank.de/bankenaufsicht/bankenaufsicht_basel.php
Standards:
◦ Migration of legacy data
◦ Connection of additional data sources
◦ Quality controls
http://www.it-observer.com/data-management-challenges-basel-ii-readiness.html
http://www.facebook.com/topic.php?uid=25192258947&topic=5725&_fb_noscript=1
1. Basel II
2. Internal reporting
3. Analysis and evaluations
4. External reporting
Extended standards for disclosure and review
Increased capital requirements
Liquidity requirements
◦ Real-time monitoring
http://www.finextra.com/community/fullblog.aspx?blogid=4988
Hold freely available assets of high quality, which can also be sold in times of crisis; real time -> data quality challenge
http://www.information-management.com/news/data_risk_management_Basel-10018723-1.html
http://www.pwc.lu/en/risk-management/docs/pwc-basel-III-a-risk-management-perspective.pdf
Agenda: What is RFID → Application areas → RFID & Data Warehouse → Outlook
Data model: The data structures that are available at the logical level for describing data and their relationships with each other are collectively referred to as a data model.
Structural elements (hierarchical data model):
- object types
- hierarchical, unnamed relationships (the edges have no labels)
Result: tree structure
(Diagram: Lieferant (supplier) and Bauteil (part) are in an n:m relationship; this does not work in a single tree!)
Solution:
(Diagram: two separate tree structures, Lieferant over Bauteil and Bauteil over Lieferant)
Problem solution: pairing
Deviating from the strict hierarchical data model, additional logical accesses are introduced so that n:m relationships can be represented.
(Diagram: Lieferant and Bauteil linked via pairing)
Problem: prices that are stored as attributes of the additionally introduced object types B-L and L-B are still redundant.
1. The network model
Structural elements:
- object types
- hierarchical relationships (1:mc), which are called set types
(Diagram: owner type, 1 : mc set type, member type; example: Lieferant and Bauteil)
Object types can also be in a relationship with themselves, e.g. a part (Bauteil) can be a part of another part.
(Diagram: recursive set type K on Bauteil)
Virtual: hardware independence, i.e. the file organization makes no primary reference to the physical storage organization (e.g. cylinders and tracks of the magnetic disk).
- New storage space is created by "cellular splitting" if the space is not sufficient on insertion.
These techniques are also applied here to the storage of the data records themselves (primary data); a B+ tree whose leaves are chained is used as index, so that logically sequential processing in ascending and descending key order as well as (quasi-)direct access is possible.
Project Voldemort
→ Definition:
- relational DB model, 1970, by Codd
- data is stored in tables (relations) with a fixed number of columns and a flexible number of rows
- by distributing the information over individual tables, redundancies are avoided
- with key fields, links between the tables can be created

Example table (annotations in the original: row, column, field, tuple = complete data record):
ID | Name | Alter
12 | Meier | 23
13 | Müller | 45
14 | Bauer | 34

→ In a table there are no two tuples that have the same values for all attributes.
→ Primary key
= a column of the table whose values uniquely identify each data record of the table.
The value of a primary key field of a table must not occur more than once.
Each table can have only one primary key.
It can be composed of several data fields and must not be empty.
→ Foreign key
= a column of a table whose values refer to the primary key of another table.
A table can contain several foreign keys.
It can consist of several fields of the table, and it may be empty.
For each value of a foreign key there must be a corresponding value in the primary key of the corresponding table (integrity).
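As an illustration of these primary key / foreign key rules, a small SQL sketch. Apart from ID and Name from the example table above, all table and column names are illustrative assumptions; the age column is renamed to Age because ALTER is a reserved word in SQL:

-- Table with a primary key: unique, not empty, only one per table
CREATE TABLE Person (
    ID    INTEGER     NOT NULL PRIMARY KEY,
    Name  VARCHAR(30),
    Age   INTEGER
);

-- A second (hypothetical) table whose Person_ID column is a foreign key:
-- it may be NULL and may occur several times, but every non-NULL value
-- must exist as a primary key value in Person (referential integrity)
CREATE TABLE Exam_Result (
    Exam_No    INTEGER NOT NULL PRIMARY KEY,
    Person_ID  INTEGER,
    Grade      INTEGER,
    CONSTRAINT fk_person FOREIGN KEY (Person_ID) REFERENCES Person (ID)
);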
- Projection
→ Further rules of the relational database:
- Transactions must either be carried out completely or, in case of an abort, be rolled back completely.
- The user's access to the data must be independent of how the data is stored or how it is physically accessed.
- If the database administrator changes the physical structure, the user must not notice anything.
Dependencies
a) functionally dependent
For one attribute combination of A there is exactly one attribute combination of B.
B is functionally dependent on A: A -> B
c) transitively dependent
B depends on A and C depends on B: A -> B -> C
C must not be a key attribute and must not be contained in B.
Anomalies:
Prüfungsgeschehen (un-normalized, nested):
PNR | Fach       | Prüfer  | Student: MATNR, Name, Geb, Adr, Fachbereich, Dekan, Note
3   | Elektronik | Richter | 123456, Meier, 010203, Weg 1, Informatik, Wutz, 1
    |            |         | 124538, Schulz, 050678, Str 1, Informatik, Wutz, 2
Insert anomalies:
Where in this relation do you insert a student who has not yet taken part in any exam?
Delete anomalies:
When the student Pitt is deleted, the information about the dean of the Fachbereich BWL is also lost.
Update anomalies:
If a student who has taken part in several exams moves, the address change has to be carried out in several tuples.
Build all join strategies for the following tables SAMP_PROJECT and SAMP_STAFF:
i.e.
1. Cross Product
2. Inner Join
3. Outer Join
a. Left Outer Join
b. Right Outer Join
c. Full Outer Join
SAMP_PROJECT:
SAMP_STAFF:
Solution:
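The result tables of the original solution are given as screenshots. As an illustration, hedged SQL sketches of the five join strategies; the assumption that both tables share a NAME column which serves as the join column is mine:

-- 1. Cross product: every row of SAMP_PROJECT combined with every row of SAMP_STAFF
SELECT p.*, s.*
FROM SAMP_PROJECT p CROSS JOIN SAMP_STAFF s;

-- 2. Inner join: only the matching rows
SELECT p.*, s.*
FROM SAMP_PROJECT p INNER JOIN SAMP_STAFF s ON p.NAME = s.NAME;

-- 3a. Left outer join: all rows of SAMP_PROJECT, matching staff rows or NULLs
SELECT p.*, s.*
FROM SAMP_PROJECT p LEFT OUTER JOIN SAMP_STAFF s ON p.NAME = s.NAME;

-- 3b. Right outer join: all rows of SAMP_STAFF, matching project rows or NULLs
SELECT p.*, s.*
FROM SAMP_PROJECT p RIGHT OUTER JOIN SAMP_STAFF s ON p.NAME = s.NAME;

-- 3c. Full outer join: all rows from both tables
SELECT p.*, s.*
FROM SAMP_PROJECT p FULL OUTER JOIN SAMP_STAFF s ON p.NAME = s.NAME;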
Do the normalization steps 1NF, 2NF and 3NF on the following unnormalized table (show also the intermediate results):
Solution:
First normal form
➢ Only atomic attributes, i.e. elements of standard data types, and no lists, tables or similar complex structures
Prüfungsgeschehen (un-normalized, nested):
PNR | Fach       | Prüfer  | Student: MATNR, Name, Geb, Adr, Fachbereich, Dekan, Note
3   | Elektronik | Richter | 123456, Meier, 010203, Weg 1, Informatik, Wutz, 1
    |            |         | 124538, Schulz, 050678, Str 1, Informatik, Wutz, 2
4   | Informatik | Schwinn | 245633, Ich, 021279, Gas. 2, Informatik, Wutz, 1
    |            |         | 246354, Schulz, 050678, Str 1, Informatik, Wutz, 1
5   | TMS        | Müller  | 856214, Schmidt, 120178, Str 2, Informatik, Wutz, 3
    |            |         | 369852, Pitt, 140677, Gas. 1, BWL, Butz, 1
Prüfling
PNR MATNR Name Geb Adr Fachbereich Dekan Note
3 123456 Meier 010203 Weg 1 Informatik Wutz 1
3 124538 Schulz 050678 Str 1 Informatik Wutz 2
4 245633 Kunz 021279 Gas. 2 Informatik Wutz 1
4 124538 Schulz 050678 Str 1 Informatik Wutz 1
5 856214 Schmidt 120178 Str 2 Informatik Wutz 3
5 369852 Pitt 140677 Gas. 1 BWL Butz 1
Second normal form
Prüfling
PNR MATNR Name Geb Adr Fachbereich Dekan Note
3 123456 Meier 010203 Weg 1 Informatik Wutz 1
3 124538 Schulz 050678 Str 1 Informatik Wutz 2
4 245633 Kunz 021279 Gas. 2 Informatik Wutz 1
4 124538 Schulz 050678 Str 1 Informatik Wutz 1
5 856214 Schmidt 120178 Str 2 Informatik Wutz 3
5 369852 Pitt 140677 Gas. 1 BWL Butz 1
Recognizable: the data of the student (Name, Geb, Adr, Fachbereich, Dekan) depend only on MATNR and not on PNR; they are therefore not fully functionally dependent on the key.
The second normal form is created by eliminating the right-hand side of the partial dependency and copying the left-hand side.
Student
MATNR Name Geb Adr Fachbereich Dekan
123456 Meier 010203 Weg 1 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
245633 Kunz 021279 Gas. 2 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
856214 Schmidt 120178 Str 2 Informatik Wutz
369852 Pitt 140677 Gas. 1 BWL Butz
Prüfungsergebnis
PNR MATNR Note
3 123456 1
3 124538 2
4 245633 1
4 124538 1
5 856214 3
5 369852 1
➢ A relation R is in 2nd NF if it is in 1st NF and every non-prime attribute of R fully depends on every key of R (i.e. no attribute of the key is irrelevant).
➢ The problem of the anomalies has not yet been eliminated:
Insert anomaly: department (Fachbereich) data cannot be stored without an enrolled student.
Delete anomaly: department data disappears when the last student is deleted.
Update anomaly: a change of the dean has to be carried out in several places.
Third normal form
Student
MATNR Name Geb Adr Fachbereich Dekan
123456 Meier 010203 Weg 1 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
245633 Kunz 021279 Gas. 2 Informatik Wutz
124538 Schulz 050678 Str 1 Informatik Wutz
856214 Schmidt 120178 Str 2 Informatik Wutz
369852 Pitt 140677 Gas. 1 BWL Butz
Student
MATNR Name Geb Adr Fachbereich
123456 Meier 010203 Weg 1 Informatik
124538 Schulz 050678 Str 1 Informatik
245633 Kunz 021279 Gas. 2 Informatik
124538 Schulz 050678 Str 1 Informatik
856214 Schmidt 120178 Str 2 Informatik
369852 Pitt 140677 Gas. 1 BWL
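In the third normal form the transitive dependency MATNR -> Fachbereich -> Dekan is removed by moving the dean into a separate Fachbereich table (not reproduced above). A hedged SQL sketch of the resulting 3NF schema; the data types are assumptions, the table and column names are taken from the tables above:

CREATE TABLE Student (
    MATNR        INTEGER NOT NULL PRIMARY KEY,
    Name         VARCHAR(30),
    Geb          VARCHAR(10),
    Adr          VARCHAR(30),
    Fachbereich  VARCHAR(30)    -- references Fachbereich
);

CREATE TABLE Fachbereich (
    Fachbereich  VARCHAR(30) NOT NULL PRIMARY KEY,
    Dekan        VARCHAR(30)
);

CREATE TABLE Pruefungsergebnis (   -- "Prüfungsergebnis"
    PNR    INTEGER NOT NULL,
    MATNR  INTEGER NOT NULL,       -- references Student
    Note   INTEGER,
    PRIMARY KEY (PNR, MATNR)
);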
Do the normalization steps 1NF, 2NF and 3NF on the following un-normalized table (show also the intermediate results):
Prerequisites: Keys are PO# and Item#, SupName = Funct (Sup#) , Quant =
Funct (Item#,PO#) and $/Unit=Funct (Item#)
Solution to 3.4:
The table is not in First Normal Form (1NF) – there are "repeating row groups".
By adding the duplicated information from the first three rows to the empty row cells, we get five complete rows in this table, which have only atomic values. So we have First Normal Form (1NF).
.........
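The remaining steps are not reproduced as text. Based on the stated functional dependencies, a hedged outline of a possible 2NF/3NF decomposition; the assumption that each purchase order PO# determines its supplier (PO# -> Sup#, SupName) is mine:

-- 2NF: remove the partial dependencies Item# -> $/Unit and PO# -> (Sup#, SupName)
CREATE TABLE PO_Item (                 -- Quant depends on the full key (PO#, Item#)
    PO#    INTEGER NOT NULL,
    Item#  INTEGER NOT NULL,
    Quant  INTEGER,
    PRIMARY KEY (PO#, Item#)
);
CREATE TABLE Item (
    Item#     INTEGER NOT NULL PRIMARY KEY,
    UnitPrice DECIMAL(9,2)             -- $/Unit
);
CREATE TABLE PO (
    PO#      INTEGER NOT NULL PRIMARY KEY,
    Sup#     INTEGER,
    SupName  VARCHAR(30)               -- still transitive: PO# -> Sup# -> SupName
);

-- 3NF: remove the transitive dependency Sup# -> SupName
CREATE TABLE Supplier (
    Sup#     INTEGER NOT NULL PRIMARY KEY,
    SupName  VARCHAR(30)
);
-- ... and drop SupName from the PO table.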
Airport:
FID Name
MUC Muenchen
FRA Frankfurt
HAN Hannover
STU Stuttgart
MAN Mannheim
BER Berlin
Flight:
1. You get a list of airports which have no incoming flights (no arrivals). (6 points)
2. Create a report (view) Flights_To_Munich of all flights to Munich (arrival) with Flight-Number, Departure-Airport (full name) and Departure-Time as columns. (6 points)
3. Insert a new flight from BER to HAN at 17:30 with FNo 471. (4 points)
4. Change the FlightTime of Fno=181 to 10:35. (4 points)
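The solution to task 1 is not reproduced as text in the original. A hedged sketch, assuming the same table and column names as in the solutions below (airport(fid, name), flight(fno, from, to, time)):

Ad 1.:
select a.name
from airport a
where not exists (select * from flight f where f.to = a.fid)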
Ad 2.:
create view Flights_to_Munich2 as
select f.Fno as FNr, a.name as Dep_Airp, f.time as DepT
from flight f, airport a
where f.to='MUC' and a.fid=f.from
Ad3.:
insert into flight
values (471,'BER','HAN','17.30.00')
Ad4.:
update flight
set time = '10.35.00'
where Fno=181
Ad5 (optional):
select name as Departure_Airport, count (*) as Departure_Count
from airport, flight
where fid=from
group by name
union
select name as Departure_Airport, 0 as Departure_Count
from airport
where not exists (select * from flight where from=fid)
order by departure_count
DEPARTURE_AIRPORT DEPARTURE_COUNT
------------------------------ -------------------------------
Berlin 0
Frankfurt 0
Hannover 1
Mannheim 1
Stuttgart 1
Muenchen 2
6 record(s) selected.
**************************************************************************
Here is also a second solution (which is shorter) and gives the same results as above by
Stefan Seufert:
SELECT Name as Departure_Airport, count (Flight.From) as Departure_Count
FROM Airport LEFT OUTER JOIN Flight ON Airport.FID = Flight.From
GROUP BY Name
ORDER BY Departure_Count
The idea is that count(Field), in contrast to count(*), only counts the rows where the field is not NULL. Since the attribute in the count function comes from the flight table, only airports with actual departures are counted; all others get the value 0.
Star schema (diagram):
Fact table Sales_Fact: Prod_id, Time_id, Promo_id, Store_id, Dollar_Sales, Unit_Sales, Dollar_Cost, Cust_Count, …
Dimension Product: Prod_id, Brand, Subcategory, Category, Department, …
Dimension Time: Time_id, Fiscal_Period, Quarter, Month, Year, …
Dimension Store: Store_id, Name, Store_No, Store_Street, Store_City, …
Dimension Promotion: Promo_id, Promo_Name, Price_Reduct., …
Build the SQL such that the result is the following report, where the time condition is Fiscal_Period = '4Q95', so that we get the result table below. Why is this a typical DWH query (result table)?
By using the SQL Wizard (Design View) in the database Microsoft Access, we see the
following ‘Access SQL‘:
Solution with Standard SQL(for example with DB2) by loading the data
(flat files) into DB2:
First connect to database “Grocery”. Then create the necessary tables and load the data from
flat Files (*.txt Files) into the corresponding tables:
Load the data from the Sales_Fact.txt file by using the “Load Data” feature of the table
DB2ADMIN.Sales_Fact in the GROCERY database:
Do the same for the four dimension-tables: “Product”, “Time”, “Store” and “Promotion”.
CREATE TABLE "DB2ADMIN"."TIME" ("TIME_ID" INTEGER,
"DATE" varchar(20),"DAY_IN_WEEK" varchar(12),
"DAY_NUMBER_IN_MONTH" Double,
"DAY_NUMBER_OVERALL" Double,
"WEEK_NUMBER_IN_YEAR" Double,
"WEEK_NUMBER_OVERALL" Double,
"MONTH" Double, "QUARTER" int,
"FISCAL_PERIOD" varchar(4),"YEAR" int,
"HOLIDAY_FLAG" varchar(1))
ORGANIZE BY ROW
DATA CAPTURE NONE
IN "USERSPACE1"
COMPRESS NO;
Finally run the SQL to produce the result for the quarter “4Q95”:
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS
Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
AND t."FISCAL_PERIOD" = '4Q95'
GROUP BY p.BRAND
ORDER BY p.BRAND;
Alternative:
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
AND t.QUARTER = 4
AND t."YEAR" = 1995   -- restrict to 1995 so that QUARTER = 4 corresponds to '4Q95'
GROUP BY p.BRAND
ORDER BY p.BRAND;
Finally run the SQL to produce the result for both quarters "4Q95" and "4Q94":
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS
Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
AND (t."FISCAL_PERIOD" = '4Q95' OR t."FISCAL_PERIOD" = '4Q94')
GROUP BY p.BRAND
ORDER BY p.BRAND;
Alternative:
You just omit the selection of a specific quarter. In addition, you can create a view named "Sales_Per_Brand":
Create View "DB2ADMIN"."Sales_Per_Brand" AS
SELECT p.BRAND AS Brand, Sum(s.DOLLAR_SALES) AS Dollar_Sales, Sum(s.UNIT_SALES) AS
Unit_Sales
FROM "DB2ADMIN"."SALES_FACT" s, "DB2ADMIN"."PRODUCT" p, "DB2ADMIN"."TIME" t
WHERE p.PRODUCT_ID = s.PRODUCT_ID
AND s.TIME_ID = t.TIME_ID
GROUP BY p.BRAND;
Remark: You also have to omit the ORDER BY clause to avoid an error in DB2. Nevertheless, the result is ordered by the brand name. See the resulting view:
Find explanations and arguments in DWH forums or articles about this theme on the internet or in the literature.
First SOLUTION:
Second SOLUTION:
(Presentation "RI in DWH" by Francois Tweer-Roller & Marco Rosin, 12/17/2013.)
Third Solution:
Definition
"Referential integrity is used in a DBMS to control the relationships between data objects."
Advantages
• Higher data quality: referential integrity helps to avoid errors.
• Faster development: referential integrity does not have to be re-implemented in every application.
• Fewer errors: referential integrity constraints defined once apply to all applications of the same database.
• More consistent applications: referential integrity is the same for all applications that access the same database.
Disadvantages
• Deletion problems due to the integrity constraints.
• Temporary deactivation of RI is needed for large data imports.
In my opinion, the realization of referential integrity is possible, but it involves a lot of effort and cost.
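As an illustration of this trade-off, a hedged sketch (DB2 LUW syntax assumed): in a DWH, a foreign key on the fact table can be declared as an informational constraint, so that it is not enforced during mass loads but can still be used by the optimizer. This assumes that the referenced dimension column carries a primary key or unique constraint:

ALTER TABLE "DB2ADMIN"."SALES_FACT"
  ADD CONSTRAINT fk_product
      FOREIGN KEY (PRODUCT_ID)
      REFERENCES "DB2ADMIN"."PRODUCT" (PRODUCT_ID)
      NOT ENFORCED                  -- RI is not checked by the DBMS
      ENABLE QUERY OPTIMIZATION;    -- but the optimizer may exploit it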
Compare ER Modelling (ER) with multidimensional data models (MDDM), like STAR or
SNOWFLAKE schemas (see appendix page):
Compare in the IBM Redbook "Data Modeling Techniques for DWH" (see the DWH lesson homepage) Chapter 6.3 for ER modeling and Chapter 6.4 for MDDM.
Build a list of advantages/disadvantages for each of these two concepts, in the form of a table:
ER Model: Criteria 1 (++), Crit. 2 (+), Crit. 3 (-), Crit. 4 (--)
MDDM Model: Criteria 5 (++), Crit. 6 (+), Crit. 7 (-), Crit. 8 (--)
Solution:
Solution:
Star schema The star schema logical design, unlike the entity-relationship model, is
specifically geared towards decision support applications. The design is intended to provide
very efficient access to information in support of a predefined set of business requirements.
A star schema is generally not suitable for general-purpose query applications.
A star schema consists of a central fact table surrounded by dimension tables, and is
frequently referred to as a multidimensional model. Although the original concept was to have
up to five dimensions as a star has five points, many stars today have more than five
dimensions.
The information in the star usually meets the following guidelines:
• A fact table contains numerical elements
• A dimension table contains textual elements
• The primary key of each dimension table is a foreign key of the fact table
• A column in one dimension table should not appear in any other dimension table
Snowflake schema The snowflake model is a further normalized version of the star schema.
When a dimension table contains data that is not always necessary for queries, too much data
may be picked up each time a dimension table is accessed.
To eliminate access to this data, it is kept in a separate table off the dimension, thereby
making the star resemble a snowflake. The key advantage of a snowflake design is improved
query performance. This is achieved because less data is retrieved and joins involve smaller,
normalized tables rather than larger, de-normalized tables.
The snowflake schema also increases flexibility because of normalization, and can possibly
lower the granularity of the dimensions. The disadvantage of a snowflake design is that it
increases both the number of tables a user must deal with and the complexities of some
queries.
For this reason, many experts suggest refraining from using the snowflake schema. With entity attributes spread over multiple tables, the same amount of information is available whether a single table or multiple tables are used.
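As an illustration, a hedged SQL sketch of snowflaking the Product dimension from the grocery star schema of Exercise E4.2; all table and column names beyond the ones shown there are assumptions:

-- Star: one de-normalized Product dimension
CREATE TABLE PRODUCT (
    PRODUCT_ID   INTEGER NOT NULL PRIMARY KEY,
    BRAND        VARCHAR(30),
    SUBCATEGORY  VARCHAR(30),
    CATEGORY     VARCHAR(30),
    DEPARTMENT   VARCHAR(30)
);

-- Snowflake: the hierarchy levels are moved into separate, normalized tables
CREATE TABLE PRODUCT_SF (
    PRODUCT_ID      INTEGER NOT NULL PRIMARY KEY,
    BRAND           VARCHAR(30),
    SUBCATEGORY_ID  INTEGER            -- references SUBCATEGORY
);
CREATE TABLE SUBCATEGORY (
    SUBCATEGORY_ID  INTEGER NOT NULL PRIMARY KEY,
    SUBCATEGORY     VARCHAR(30),
    CATEGORY_ID     INTEGER            -- references CATEGORY
);
CREATE TABLE CATEGORY (
    CATEGORY_ID     INTEGER NOT NULL PRIMARY KEY,
    CATEGORY        VARCHAR(30),
    DEPARTMENT      VARCHAR(30)
);

A query on the snowflake then joins these smaller, normalized tables instead of reading the one wide dimension table.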
A star schema is a dimensional structure in which a single fact is surrounded by a single circle
of dimensions; any dimension that is multileveled is flattened out into a single dimension. The
star schema is designed for direct support of queries that have an inherent dimension-fact
structure.
The primary justification for using the star is performance and understandability. The
simplicity of the star has been one of its attractions. While the star is generally considered to
be the better performing structure, that is not always the case. In general, one should select a
star as first choice where feasible. However, there are some conspicuous exceptions. The
remainder of this response will address these situations.
First, some technologies such as MicroStrategy require a snowflake and others like Cognos require the star. This is significant.
Second, some queries naturally lend themselves to a breakdown into fact and dimension. Not
all do. Where they do, a star is generally a better choice.
Third, there are some business requirements that just cannot be represented in a star. The
relationship between customer and account in banking, and customer and policy in Insurance,
cannot be represented in a pure star because the relationship across these is many-to-many.
You really do not have any reasonable choice but to use a snowflake solution. There are many
other examples of this. The world is not a star and cannot be force fit into it.
Fourth, a snowflake should be used wherever you need greater flexibility in the
interrelationship across dimension levels and components. The main advantage of a
snowflake is greater flexibility in the data.
Fifth, let us take the typical example of Order data in the DW. A dimensional designer would not bat an eyelash in collapsing the Order Header into the Order Item. However, consider this. Say there are 25 attributes common to the Order and that belong to the Order Header. You sell consumer products. A typical delivery can average 50 products. So you have 25 attributes with a ratio of 1:50. In this case, it would be grossly cumbersome to collapse the header data into the Line Item data as in a star; you would be introducing a lot of redundancy into a huge fact table of, say, more than 2 billion rows. By the way, the Walmart model, which is one of the most famous of all time, does not collapse Order Header into Order Item. However, if you are a video store, with few attributes describing the transaction, and an average ratio of 1:2, it would be best to collapse the two.
Sixth, take the example of changing dimensions. Say your dimension, Employee, consists of
some data that does not change (or if it does you do not care, i.e., Type 1) and some data that
does change (Type 2). Say also that there are some important relationships to the employee
data that does not change (always getting its current value only), and not to the changeable
data. The dimensional modeler would always collapse the two creating a Slowly Changing
Dimension, Type 2. This means that the Type 1 is absorbed into the Type 2. In some cases I
have worked on, it has caused more trouble than it was worth to collapse in this way. It was
far better to split the dimension into Employee (type 1) and Employee History (type 2).
Thereby, in such more complex history situations, a snowflake can be better.
Seventh, whether the star schema is more understandable than the snowflake is entirely subjective. I have personally worked on several data warehouses where the user community complained that in the star, because everything was flattened out, they could not understand the hierarchy of the dimensions. This was particularly the case when the dimension had many columns.
Finally, it would be nice to quit the theorizing and run some tests. So I did. I took a data
model with a wide customer dimension and ran it as a star and as a snowflake. The customer
dimension had many attributes. We used about 150MM rows. I split the customer dimension
into three tables, related 1:1:1. The result was that the snowflake performed faster. Why?
Because with the wide dimension, the DBMS could fit fewer rows into a page. DBMSs read
by pre-fetching data and with the wide rows it could pre-fetch less each time than with the
skinnier rows. If you do this make sure you split the table based on data usage. Put data into
each piece of the 1:1:1 that is used together.
What is the point of all this? I think it is unwise to pre-determine what is the best solution. A
number of important factors come into play and these need to be considered. I have worked to
provide some of that thought-process in this response.
Conditions: Each article can be delivered by one or more suppliers. Each supplier delivers 1 to 10 articles. An order consists of 2 to 10 articles. Each article can appear only once on an order form, but you can order more than one piece of an article. Each order is placed by a customer. A customer can have more than one order (no limit).
Good customers get a discount ("Rabatt"). The number of articles in the store should also be saved. It is not important who the supplier of the article is. For each object we need a technical key for identification.
Task: Create a Logical ER model. Model the necessary objects and the relations between
them. Define the attributes and the keys. Use the following notation:
First Solution:
Solution:
……
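The original solutions are given as ER diagrams. As an illustration only, a hedged relational sketch of one possible model; all table, column and key names are assumptions derived from the conditions above:

CREATE TABLE Customer (
    Cust_ID   INTEGER NOT NULL PRIMARY KEY,   -- technical key
    Name      VARCHAR(30),
    Discount  DECIMAL(4,2)                    -- "Rabatt" for good customers
);
CREATE TABLE Article (
    Art_ID    INTEGER NOT NULL PRIMARY KEY,
    Name      VARCHAR(30),
    Stock     INTEGER                         -- number of articles in the store
);
CREATE TABLE Supplier (
    Sup_ID    INTEGER NOT NULL PRIMARY KEY,
    Name      VARCHAR(30)
);
-- n:m relationship article <-> supplier (1..n suppliers per article, 1..10 articles per supplier)
CREATE TABLE Article_Supplier (
    Art_ID    INTEGER NOT NULL,
    Sup_ID    INTEGER NOT NULL,
    PRIMARY KEY (Art_ID, Sup_ID)
);
CREATE TABLE "Order" (
    Order_ID  INTEGER NOT NULL PRIMARY KEY,
    Cust_ID   INTEGER NOT NULL                -- each order is placed by one customer
);
-- Order position: each article at most once per order, with an ordered quantity
CREATE TABLE Order_Item (
    Order_ID  INTEGER NOT NULL,
    Art_ID    INTEGER NOT NULL,
    Quantity  INTEGER,
    PRIMARY KEY (Order_ID, Art_ID)
);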
In the lecture to this chapter we have seen 3 steps: “Discover”, “Prepare” and ”Transform” for
a successful data population strategy.
Please present examples of two tools for the first two steps. Show details like functionality, price/costs, special features, strong features, weak points, etc.
You can use the examples from the lecture or show new tools which you found on the internet or know from your current business.
Solution (SS2021):
Solution (WS2021):
Workflow "Data Manipulation and Aggregation":
… table is the input table and the CSV file provides the translation. According to the configuration, the Cell Replacer adds a new Sentiment Rating column with the appropriate integer values to the input table. However, it can also be specified that the string values are replaced by the numbers in the same column, without creating a new column.
Import: Next, customer data (e.g. income, gender, etc.) is read in from a CSV file.
Activity 1 - Part 2: With the next node, Column Filter, one or more columns can be filtered out of a table. Here the columns Sentiment Rating and Web Activity are to be excluded.
Import: Finally, an Excel table is read in with the Excel Reader. In it you can see which
customer has purchased which investment product. In the next step, these individual data are
merged.
In the 3rd part of the workflow, join operations take place; they are used to merge different tables. For this purpose, KNIME offers the possibility to configure the join operation (node → Configure).
Joiner Settings: you can select one or several of 3 options (include in output):
- Matching rows (inner join): only the rows that are contained in both original tables are included in the common table.
- Left unmatched rows (left outer join): additionally includes the rows from the left table that have no match in the right table; the columns coming from the right table are filled with missing values for these rows.
- Right unmatched rows (right outer join): additionally includes the rows from the right table that have no match in the left table; the columns coming from the left table are filled with missing values for these rows.
Column selection: there is a possibility to select the desired columns (manually, with the
RegEx or by the data type).
The first Joiner merges the tables from Part 1 and Part 2. The second Joiner uses the result of the first Joiner and the data that results after the filtering in Part 2. The third Joiner composes its result from the data of the second Joiner and from the Excel Reader, which provides the "Products" table.
After that the data flows into the next phase, where the received data can be manipulated
again.
The "Duplicate Row Filter" node identifies duplicate rows. It can either remove all duplicate rows from the input table and keep only the unique and selected rows, or it can mark the rows with additional information about their duplication status. In the configuration it is possible to specify in which columns the node should search for duplicates.
Using the String Manipulation node you can perform different operations on strings. This can
also be set in configurations.
The Table Manipulator node allows multiple column transformations to be performed on any number of input tables, such as renaming, filtering, rearranging and changing the type of the input columns. If there is more than one input table, the node concatenates all input rows into a single result table. If the input tables contain the same row identifiers, the node can either create new row identifiers or append the index of the input table to the original row identifier of the corresponding input table.
The last node "GroupBy" groups the rows of a table according to the unique values in the
selected group columns. For each unique set of values of the selected group column, one row
is created. The remaining columns are aggregated based on the specified aggregation settings.
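Not part of the original solution, but as an illustration: the GroupBy node corresponds to a SQL GROUP BY with aggregation functions. A hedged sketch on an assumed customer table (only the income and gender attributes are mentioned above; the table name is an assumption):

-- One output row per unique value of the group column (here: Gender);
-- the remaining columns are aggregated (average income, row count).
SELECT Gender,
       AVG(Income) AS Avg_Income,
       COUNT(*)    AS Cust_Count
FROM   Customer_Data
GROUP  BY Gender;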
Show the highlights and build a Strengths/Weaknesses Diagram for the following three ETL Tools. Use the information from the internet:
Solution - Informatica:
Strengths:
• Unified user interface
• Very comprehensive
• Optimized solutions
Weaknesses:
• Faulty sorting
• Filters are ignored (when copying)
• Problems when importing XML
Sources:
http://www.informatica.com/de/products/enterprise-data-integration/powercenter/
http://etl-tools.info/informatica/informatica-weaknesses.html
http://www.irix.ch/cms/upload/pdf/publications/ETL_Tools_01_2007.pdf
http://de.wikipedia.org/wiki/Informatica
Prepare and run the guided tour „Offload Data Warehousing to Hadoop by using DataStage”
Use IBM® InfoSphere® DataStage® to load Hadoop and use YARN to manage DataStage
workloads in a Hadoop cluster (a registered IBM Cloud Id is needed!):
https://www.ibm.com/cloud/garage/dte/producttour/offload-data-warehousing-hadoop-using-datastage
Solution (SS2021):
See the execution of this demo in the IBM Cloud in the following video:
https://cloud.anjomro.de/s/72knGpN3oPsKitM
…
….
……………..
Write a SQL script such that you get the following content of the target table:
Select
  Case SAMPLTBC.sales.city
  Case
    When Month(SAMPLTBC.sales.transdate) = 01 then 1
    ….
    When Month(SAMPLTBC.sales.transdate) = 12 then 12
  End As Time_Id
  Case
    When Year(SAMPLTBC.sales.transdate) = 1997 then 1
    When Year(SAMPLTBC.sales.transdate) = 1996 then 3
    Else 2
  End As Scenario_Id
  ….
See screenshot:
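Since the complete script is only shown as a screenshot, here is a hedged sketch of how such a load could look. Only the month and year mappings above are taken from the original fragment; the target table name, the city mapping and the remaining column list are assumptions:

-- Hedged sketch: load the target table from SAMPLTBC.sales,
-- deriving the ID columns with CASE expressions
INSERT INTO target_table (City_Id, Time_Id, Scenario_Id, Sales_Amount)   -- names assumed
SELECT
  CASE SAMPLTBC.sales.city            -- simple CASE on the city name (mapping assumed)
    WHEN 'Stuttgart' THEN 1
    WHEN 'Berlin'    THEN 2
    ELSE 99
  END AS City_Id,
  CASE
    WHEN MONTH(SAMPLTBC.sales.transdate) = 1  THEN 1
    -- ... one WHEN branch per month ...
    WHEN MONTH(SAMPLTBC.sales.transdate) = 12 THEN 12
  END AS Time_Id,
  CASE
    WHEN YEAR(SAMPLTBC.sales.transdate) = 1997 THEN 1
    WHEN YEAR(SAMPLTBC.sales.transdate) = 1996 THEN 3
    ELSE 2
  END AS Scenario_Id,
  SAMPLTBC.sales.amount               -- measure column assumed
FROM SAMPLTBC.sales;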
First Solution:
Data volume:
  ROLAP: + > 50 GB possible, low expansion factor (low aggregation rate)
  MOLAP: - not > 50 GB, expansion factor too big (high aggregation rate)
Dimensions:
  ROLAP: + > 10 possible (depends only on the DBMS)
  MOLAP: - bad performance for > 10 (due to the high aggregation rate)
Query performance:
  ROLAP: (+ when querying single tables), - when joining many tables
  MOLAP: + when using highly aggregated data (- when using low aggregated data)
Update flexibility:
  ROLAP: + update during operation possible, + fast and flexible
  MOLAP: - the cube has to be rebuilt completely each time (partly correct, depends on the calculation rules), - operation has to be stopped
Query complexity:
  ROLAP: + complex, dynamic queries possible (impact on query performance)
  MOLAP: - only standard queries that the cube is built for are possible (but combinations are possible)
Price:
  ROLAP: + cheaper, simpler SQL-based front-ends are sufficient (but more performance needed)
  MOLAP: - expensive, costly front-end tool necessary
➔ ROLAP is superior to MOLAP for many criteria. As most data marts today are bigger than 50 GB, ROLAP is in many cases the better choice due to performance and storage reasons.
Second Solution:
MOLAP: - access only to the data of the cube; - cubes have to be rebuilt on updates.
ROLAP: - queries take longer because they are more complex; + updates are also possible during operation.
Task: Describe the following Data Mining techniques. Search for this information on the internet, e.g. Wikipedia or other knowledge portals:
• Clustering
• Classification
• Associations
Solution:
Task: Describe the following Data Mining techniques. Search for this information on the internet, e.g. Wikipedia or other knowledge portals:
• Sequential Patterns
• Value Prediction
• Similar Time Sequences
Solution:
(Solution slides "Exercise DWH Lecture 9" by Martin Mory; keywords from the slides: Konsumverhalten (consumer behaviour), Radial Basis Functions.)
Solution:
To 1.:
Solution: Only the two association rules (Bier=>Milch) and (Milch=>Bier) have support
and confidence >=50%.
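For reference, the general definitions of support and confidence for an association rule A ⇒ B (not specific to the data of this exercise):

\[
\mathrm{support}(A \Rightarrow B) = \frac{|\{\,T \in D : A \cup B \subseteq T\,\}|}{|D|},
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
\]

where D is the set of all transactions T. Since support is symmetric in A and B, the two rules (Bier ⇒ Milch) and (Milch ⇒ Bier) always have the same support and can only differ in their confidence.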
Solution (WS2021):
Hint: Follow the instructions given in the KNIME workflow “KNIME Analytics
Platform for Data Scientists – Basics (04. Data Mining – solution)” - see image
below:
Solution:
………….
Solution:
……
Solution:
……..
Solution:
Task: Homework for 2 Persons: Rebuild the KNIME Workflow (use given
solution) for Image-Classification and give technical explanations to the solution
steps.
Hint: Follow the instructions given in the KNIME workflow “L4-DL
Introduction to Deep Learning/Session4/Solutions (Image Classification MNIST
Solution)” - see image below:
Solution:
……..
References