Exam - 1: October 5, 2016 Exam - 2: November 23, 2016 Quiz - 2: October 26, 2016 Quiz - 3: November 9, 2016


MGS 657

Lecture 5: September 28th, 2016


New Dates:

Exam 1: October 5th, 2016
Exam 2: November 23rd, 2016
Quiz 2: October 26th, 2016
Quiz 3: November 9th, 2016

FINAL PROJECT -

Prepare Project Plan (Budget and Timeline)


Assign Project Leaders and Responsibilities
Perform Needs Analysis
Pre-interview research
Obtain a copy of all existing OLTP reports
Obtain a copy of all current ad hoc queries
Get a wish list
Interviewee selection
Interview questions development
Define Success
Define Failure
Define Goals
Determine current Key Business issues
Determine if any benchmarks / targets have been defined
Determine important internal and external drivers of the system
Determine the current time for the issue resolution and % that get resolved
Define all Core Processes of the System
Define Inputs/Outputs of each process
Determine what data is generated / gathered in each business unit
Determine data flow between processes and business units
Determine if the business unit uses any external data
Define Bus Matrix

Define Stakeholders Matrix for each process


Define Priorities
Define Standard Reports
Define KPIs
Determine what parameters should be known to identify a problem
Determine what parameters should be known to fix a problem
Determine methodology to estimate the cost of fixing a problem
Determine methodology to estimate the damage from not fixing a problem
Determine if it is possible to predict future values
Determine how the stakeholders make the decisions
Determine what new information would significantly change the decision making
Determine what the recent disasters / crises were
Talk to IT to determine the data availability and quality
Data Warehouse Design
Determine BUS
Define Conceptual Schema
The FOUR STEPS are:
Select The Business Process
Declare Grain
Identify The Dimensions
Identify The Facts
Determine Access Rights / Privileges
Determine Extraction (ETL) method for each OLTP source, the extraction frequency, and how much historical data to load
Determine Data Staging Areas
Clean the data, Conform the data
Determine what Transformation is required for each element (Compute KPIs and other)
Define Loading process
Define Dimension Rollup Categories in the Dimension Tables (Reporting Tags)
Determine what kind of drilldown would be necessary
Dashboard Design
Define Dashboards (Follow design principles)
Define Information Portals (user / domain specific)
Define Scorecards (the effective strategy execution framework)
Data Marts
Create user specific Data Marts
Data Validation and Data Audits
Determine Data Validation Methodology
Create Data Audits
Test and Release to the Core Group
Get feedback from the core group
Make modifications and re-release

Data Maintenance
Prepare data Maintenance schedule
ROI Measures
Determine methods of estimating ROI
Generate new standards
Modify the existing processes
Establish methods to identify and fix problems
Establish new performance targets
Compare your goals with the industry (important)

Dealing with slowly changing dimensions.

Type 1: Overwrite. A value is replaced. E.g., in the Salesperson dimension the Department changes.
In that case we simply overwrite the old value with the new one; no history is kept.

Type 2: Add new Row.

Add a new row and additional attributes.

SalesID  SalesPersonCode  SalesPersonName  Department   EffectiveDate  ExpiryDate  CurrentRowFlag
1001     DR76321          John Smith       Electronics  01/01/2012     03/31/2015  Expired
1032     DR76321          John Smith       Houseware    04/01/2015     12/31/2020  Current

Type 3: Add New Attribute.

Preserve the previous value

SalesID  SalesPersonCode  SalesPersonName  Department  PreviousDepartment
1001     DR76321          John Smith       Houseware   Electronics
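
The first three types can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production ETL routine: the function names are hypothetical, rows are plain dictionaries, and 12/31/2020 is used as the open-ended expiry date only because the tables above use it.

```python
from datetime import date, timedelta

FAR_FUTURE = date(2020, 12, 31)  # open-ended expiry, as in the tables above


def scd_type1(row, new_department):
    """Type 1: overwrite the attribute; the old value is lost."""
    return {**row, "Department": new_department}


def scd_type2(rows, code, new_department, change_date, new_sales_id):
    """Type 2: expire the current row and append a new row with a new key."""
    result, current = [], None
    for r in rows:
        if r["SalesPersonCode"] == code and r["CurrentRowFlag"] == "Current":
            current = r
            # The old row ends the day before the change takes effect.
            result.append({**r,
                           "ExpiryDate": change_date - timedelta(days=1),
                           "CurrentRowFlag": "Expired"})
        else:
            result.append(r)
    if current is None:
        raise KeyError(code)
    result.append({**current,
                   "SalesID": new_sales_id,
                   "Department": new_department,
                   "EffectiveDate": change_date,
                   "ExpiryDate": FAR_FUTURE,
                   "CurrentRowFlag": "Current"})
    return result


def scd_type3(row, new_department):
    """Type 3: keep the previous value in an extra attribute."""
    return {**row,
            "PreviousDepartment": row["Department"],
            "Department": new_department}


base = {"SalesID": 1001, "SalesPersonCode": "DR76321",
        "SalesPersonName": "John Smith", "Department": "Electronics"}
history = scd_type2(
    [{**base, "EffectiveDate": date(2012, 1, 1),
      "ExpiryDate": FAR_FUTURE, "CurrentRowFlag": "Current"}],
    "DR76321", "Houseware", date(2015, 4, 1), 1032)
```

Type 1 and Type 3 change a single row, while Type 2 grows the dimension by one row per change, which is why it needs the effective/expiry dates and the current-row flag.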

Type 4: Add Mini Dimension (Rapidly changing)


PatientKey  PatientAge  #Visits  TotalBilling
1           19-25       Low      < 20,000
2           19-25       Low      < 30,000
3           19-25       Low      < 50,000
4           19-25       Med      < 60,000
5           19-25       Med      < 70,000
6           19-25       Med      < 80,000

Fact Table

MedicalRecKey  PatientKey  DateKey  ProviderKey  .. Facts..
3421           3           245      780
3421           4           246      780
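
A minimal sketch of how a fact row might pick up its mini-dimension key. It assumes that the fact table's PatientKey column points at the banded rows of the mini dimension and that the load process has already worked out the patient's bands; the function names are hypothetical.

```python
# Mini-dimension rows from the table above:
# key -> (PatientAge band, #Visits band, TotalBilling band)
MINI_DIM = {
    1: ("19-25", "Low", "< 20,000"),
    2: ("19-25", "Low", "< 30,000"),
    3: ("19-25", "Low", "< 50,000"),
    4: ("19-25", "Med", "< 60,000"),
    5: ("19-25", "Med", "< 70,000"),
    6: ("19-25", "Med", "< 80,000"),
}

# Reverse lookup used at load time: band combination -> mini-dimension key.
KEY_BY_PROFILE = {profile: key for key, profile in MINI_DIM.items()}


def make_fact_row(medical_rec_key, date_key, provider_key, profile):
    """Build a fact row whose PatientKey is the mini-dimension key for the profile."""
    return {
        "MedicalRecKey": medical_rec_key,
        "PatientKey": KEY_BY_PROFILE[profile],
        "DateKey": date_key,
        "ProviderKey": provider_key,
    }


# The same patient on two consecutive dates: the bands changed between visits,
# so the second fact row simply points at a different mini-dimension row.
visit_1 = make_fact_row(3421, 245, 780, ("19-25", "Low", "< 50,000"))
visit_2 = make_fact_row(3421, 246, 780, ("19-25", "Med", "< 60,000"))
```

Nothing in the main patient dimension has to change when the bands move, which is the point of splitting rapidly changing attributes into a mini dimension.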

Type 5: Add Mini Dimension and Type 1 Outrigger


PatientKey  PatientAge  #Visits  TotalBilling  MedicalCondition
1           19-25       Low      < 20,000      Stable
2           19-25       Low      < 30,000      Stable
3           19-25       Low      < 50,000
4           19-25       Med      < 60,000
5           19-25       Med      < 70,000
6           19-25       Med      < 80,000

Fact Table

MedicalRecKey  PatientKey  DateKey  ProviderKey  .. Facts..
3421           3           245      780
3421           4           246      780

Patient Dimension

MedicalRecKey  CurrentPatientKey
3421

Type 6: Add Type 1 attribute to Type 2 Dimension


SalesID  SPCode   SPName      HistDept     CurrDept   EffectiveDate  ExpiryDate  CurrentRowFlag
1001     DR76321  John Smith  Electronics  Furniture  01/01/2012     03/31/2014  Expired
1032     DR76321  John Smith  Houseware    Furniture  04/01/2014     12/31/2014  Expired
1045     DR76321  John Smith  Furniture    Furniture  01/01/2015     12/31/2020  Current
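
The Type 6 table can be reproduced with a short sketch: each change adds a Type 2 row and also overwrites CurrDept on every row for that salesperson (the Type 1 part). The function name is hypothetical and rows are plain dictionaries; 12/31/2020 is the open-ended expiry date used in the table.

```python
from datetime import date, timedelta

FAR_FUTURE = date(2020, 12, 31)  # open-ended expiry, as in the table above


def scd_type6_change(rows, sp_code, new_dept, change_date, new_sales_id):
    """Apply one department change to a Type 6 dimension."""
    out, template = [], None
    for r in rows:
        if r["SPCode"] != sp_code:
            out.append(r)
            continue
        updated = {**r, "CurrDept": new_dept}  # Type 1: overwrite on all rows
        if r["CurrentRowFlag"] == "Current":
            template = r
            updated["ExpiryDate"] = change_date - timedelta(days=1)
            updated["CurrentRowFlag"] = "Expired"
        out.append(updated)
    if template is None:
        raise KeyError(sp_code)
    # Type 2: the new current row records the new historical value.
    out.append({**template,
                "SalesID": new_sales_id,
                "HistDept": new_dept,
                "CurrDept": new_dept,
                "EffectiveDate": change_date,
                "ExpiryDate": FAR_FUTURE,
                "CurrentRowFlag": "Current"})
    return out


rows = [{"SalesID": 1001, "SPCode": "DR76321", "SPName": "John Smith",
         "HistDept": "Electronics", "CurrDept": "Electronics",
         "EffectiveDate": date(2012, 1, 1), "ExpiryDate": FAR_FUTURE,
         "CurrentRowFlag": "Current"}]
rows = scd_type6_change(rows, "DR76321", "Houseware", date(2014, 4, 1), 1032)
rows = scd_type6_change(rows, "DR76321", "Furniture", date(2015, 1, 1), 1045)
```

After the two changes, HistDept keeps the value that was true for each period while CurrDept reads Furniture on every row, so facts can be grouped either way.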

Type 7: Dual Type 1 and Type 2 Dimensions


SalesID  SalesPersonCode  SalesPersonName  Department   EffectiveDate  ExpiryDate  CurrentRowFlag
1001     DR76321          John Smith       Electronics  01/01/2012     03/31/2015  Expired
1032     DR76321          John Smith       Houseware    04/01/2015     12/31/2020  Current

SalesID  SalesPersonCode  SalesPersonName  CurrentDepartment
1001     DR76321          John Smith       Houseware
1032     DR76321          John Smith       Houseware

The fact table will contain keys to both of these tables.
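
A small sketch of how the two tables might be used together. The fact row, its key column names (HistSalesKey, CurrSalesKey) and the amount are invented for illustration; only the dimension values come from the tables above.

```python
# Type 2 rows from the first table above (dates omitted for brevity).
TYPE2_ROWS = [
    {"SalesID": 1001, "SalesPersonCode": "DR76321",
     "Department": "Electronics", "CurrentRowFlag": "Expired"},
    {"SalesID": 1032, "SalesPersonCode": "DR76321",
     "Department": "Houseware", "CurrentRowFlag": "Current"},
]


def build_current_view(type2_rows):
    """Derive the Type 1 table: each SalesID maps to the person's current department."""
    current = {r["SalesPersonCode"]: r["Department"]
               for r in type2_rows if r["CurrentRowFlag"] == "Current"}
    return {r["SalesID"]: current[r["SalesPersonCode"]] for r in type2_rows}


HISTORY = {r["SalesID"]: r["Department"] for r in TYPE2_ROWS}
CURRENT = build_current_view(TYPE2_ROWS)

# Hypothetical fact row carrying a key to each table.
fact = {"HistSalesKey": 1001, "CurrSalesKey": 1001, "SaleAmount": 250.0}

as_was = HISTORY[fact["HistSalesKey"]]  # department when the sale happened
as_is = CURRENT[fact["CurrSalesKey"]]   # department the salesperson is in now
```

The same sale can then be reported under the historical department or under the current one, depending on which key the query joins through.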

ETL (Extract, Transform, Load):


Extract
Connectivity to the Host (ODBC, SQL, Oracle, etc.)
Source Format
Internal Schema
Security / Access
Method (File / Stream)

Transform
Type Conversion (ASCII to Binary; SQL Date to Normal Date; Binary to ASCII; Number to Code (0 to F and 1 to M))
Data Separation (Full Name to First Name and Last Name; Date to Year, Month and Day; CSZ to City, State and Zip)
Standardization (ZIP+4 to ZIP; SSN with Dashes; P.O. Box to PO BOX; Phone# as (xxx) xxx-xxxx)
Functions (Date to Qtr [assuming no Date Dimension]; All Upper or All Lower; Date to Fiscal Date; Age from Date of Birth)
Derived Measures / Dimensions (% Total; Rank; Quartile; Decile; New Flags etc.)
Format Conversions (YYYYMODA to Mo/Da/Yr)
Scrubbing (Removal of Jr., Sr., Dr., MD etc.; part of standardization; Business Rules)
Add New Fields ($DATE$, $ABSREC$)
Layout (Horizontal or Vertical)
Missing Values (Unknown; Undefined; N/A etc.)
Sign Association (Charge and Quantity)
Chronology of Data (Admit and Discharge Dates; Birth Date and Service Dates; Service Date and Current Date; Start and End Dates)
Value Association (Line Charge and Unit Charge)
One Column Value dictates the destination (TXCODE: PAY or ADJ; WAGETYPE: REG or OVER or SICK or HOLIDAY etc.)
Key Value Normalization / Standardization (BILLNGID and ACCOUNT_NO, e.g. MMS-PATNO)
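
Several of the transformations listed above are one-line functions in practice. The sketch below is illustrative only: the function names are hypothetical, and the input layouts (a two-part full name, "City, ST ZIP", a four-digit year in the output date) are assumptions rather than anything the notes specify.

```python
import re
from datetime import date

SEX_CODE = {0: "F", 1: "M"}  # Number to Code (0 to F and 1 to M)


def split_full_name(full_name):
    """Data separation: 'First Last' -> (first, last)."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()


def split_csz(csz):
    """Data separation: 'City, ST 12345' -> (city, state, zip)."""
    city, _, rest = csz.partition(",")
    state, zip_code = rest.split()
    return city.strip(), state, zip_code


def zip5(zip_code):
    """Standardization: ZIP+4 -> ZIP."""
    return zip_code.split("-")[0]


def format_phone(raw):
    """Standardization: any 10-digit phone number -> (xxx) xxx-xxxx."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {raw!r}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def quarter(d):
    """Function: Date to Qtr (calendar quarters)."""
    return (d.month - 1) // 3 + 1


def age_on(birth, as_of):
    """Function: Age from Date of Birth, in completed years."""
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    return as_of.year - birth.year - (0 if had_birthday else 1)


def yyyymmdd_to_mdy(text):
    """Format conversion: 'YYYYMODA' -> 'Mo/Da/Yr'."""
    return f"{text[4:6]}/{text[6:8]}/{text[0:4]}"
```

Keeping each rule as a small named function makes it easy to apply the same standardization to every source that feeds the warehouse.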

Chronology of Data Records


Data Merging (eg. DAILYENC, New+Old)
Create Records (Accrual Payroll 3 pay periods 50%)
Data Population through some logic (Payments to Charges File; Expected Payment in ECMC 3 ways; Assign Unit Number based upon START-END Dates and Days since Admission; DTUNTBEG=DTADM and DTUNTEND=DTDISCH)
ICD9_1, ICD9_2, ICD9_3, ICD9_4 not
Local vs From Support Table
Sex:: Male -> M, Female -> F
Address: SA1*SA2 -> SA1, SA2
Range Checking
DTUNTBEG and DTUNTEND (based upon DTADM, DTDSCH)
CONTMARG, NETPROFIT, COLRATIO
CLINICLOC for INPAT
Populating Cost based upon Date Range
Translation of Code to TEXT eg. 1 = CHARGES
Based upon PAYCODE post AMT to Different columns
Evaluate NEW Flags
ETL can make up for the deficiencies in the BI tool. Eg. PRVYRTTL
Chgtrans - Encounter file (Aggregated by PATNUM)
Encounter - Chgtrans Denormalization
14 Points Error Checking
Encounter - Professional
DlyEncounter -- Encounter (New, Old, Modified)
ProcDiag:: Priority 1 = PRIMDIAG
Payroll Data:: Create Accrual Records
Split According to WAGETYPE
Format Converters:: EDI, XML, CSV, XLSX, DAO, SQL, Oracle, SalesForce, Google, etc.
PCS Call Details to # of Calls and Time spent on calls (OLTP vs OLAP Different objectives)
Linked List for ECMC (L1, L2, L3, L4 etc.)
G/L Data most consolidated
ECMC:: Expected Payments for the undischarged patients
Problem: Collection or the Volume?
Nursing Home:: RUG Assignment Logic
BASKET ANALYSIS:
Each Record in a file is of different type. ROSWEL: A2DB2 files
Multiple Accounts: PATNUM -> MRN (BILLTCUST, SHIPTOCSZ)
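
A few of the rules noted above (code translation, address splitting, chronology checks, and posting an amount to a column chosen by a code) can be sketched as follows. The function names and the output column names PAYMENT and ADJUSTMENT are hypothetical; the "Unknown" default follows the missing-values item in the Transform list.

```python
from datetime import date

SEX_CODES = {"Male": "M", "Female": "F"}


def encode_sex(value):
    """Sex:: Male -> M, Female -> F; anything else is treated as a missing value."""
    return SEX_CODES.get(value, "Unknown")


def split_address(value):
    """Address: SA1*SA2 -> (SA1, SA2)."""
    sa1, _, sa2 = value.partition("*")
    return sa1, sa2


def check_chronology(admit, discharge):
    """Chronology of data: a discharge date may not precede the admit date."""
    return discharge >= admit


def route_transaction(record):
    """Post AMT to a different column depending on TXCODE (PAY or ADJ)."""
    out = {"PAYMENT": 0.0, "ADJUSTMENT": 0.0}
    if record["TXCODE"] == "PAY":
        out["PAYMENT"] = record["AMT"]
    elif record["TXCODE"] == "ADJ":
        out["ADJUSTMENT"] = record["AMT"]
    else:
        raise ValueError(f"unexpected TXCODE: {record['TXCODE']!r}")
    return out
```

Records that fail a check such as `check_chronology` would normally be sent to error processing rather than loaded.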

DASHBOARD_FACT_TABLE.xlsx

Loading

Fact and Dimension Tables


Frequency (Daily, Monthly, Quarterly etc.)
Mode (Total or Incremental; Change?)
Validation
Error Processing
Staging
De-normalization
Destination (single or multiple)
Selected Columns
Duplicates
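
The loading concerns above (incremental mode, validation, error processing, duplicates) can be combined in one small routine. This is a sketch under assumptions: the target table is modelled as a dictionary keyed by a hypothetical `id` column, and the required columns are chosen only for illustration.

```python
def incremental_load(target, batch, key="id", required=("id", "amount")):
    """Append new rows to target; reject rows that fail validation or repeat a key.

    target: dict mapping key value -> row (stands in for the destination table)
    batch:  list of row dicts extracted since the last load
    Returns (number of rows loaded, list of (row, reason) rejections).
    """
    loaded, rejected = 0, []
    for row in batch:
        if any(row.get(col) is None for col in required):
            rejected.append((row, "missing value"))  # validation failure
        elif row[key] in target:
            rejected.append((row, "duplicate"))      # already loaded earlier
        else:
            target[row[key]] = row
            loaded += 1
    return loaded, rejected


table = {1: {"id": 1, "amount": 10}}
batch = [{"id": 1, "amount": 10},
         {"id": 2, "amount": None},
         {"id": 3, "amount": 7}]
loaded, rejected = incremental_load(table, batch)
```

A total (full) load would instead rebuild the target from scratch, so the duplicate check is only needed in incremental mode.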

DIMENSIONAL MODELLING
How do we know that the dimensions we are considering form a complete set?
Degrees of freedom.
Impact of the external factors.
DIMENSIONS are the CONSTRAINTS put on the system behavior.
The objective is to understand each constraint and its interaction with the others.

What are the alternatives?


Data Vault Data Modeling or Common Foundational Data Integration Model

Relating to the process.


# of Checkout Lanes
Store Hours
Promotions
Etc.
Knowing that the delay is a function of the above variables.
Entity Relationship Diagram (ER).
Store      -- Carries            -- Products
Customer   -- Buys               -- Products
Vendors    -- Supplies           -- Products
Store      -- Services           -- Customer
Store      -- Has                -- Employees
Employees  -- Provide Service to -- Customers
Products   -- Are Purchased by   -- Customers
Vendors    -- Generate           -- Accounts Payable
..
..
Relating only to Inputs and Outputs may not be optimal.

Need Actionable information.

Customer Experience at the retail store


Reasons for Customer Dissatisfaction
Costs More
No Coupons / Discounts / Sales / Loyalty Cards
Out of Stock
Limited Choices or Your Favorites Not Available
Check-out Delays
Items Hard to Find
Limited Customer Service
Store Not Clean
Limited Parking or Not Convenient
Bad Lighting
Poor Store Layout
Not Enough Space between Aisles
Price Discrepancies
Expired / Damaged Goods or Packaging
Mislabeled Items
No Express Check-out Lines
Restricted Return Policy
Unsafe Pavement Conditions
Poorly Functioning Carts
Minimum Requirement for Credit Card
Intimidating Decor
Goods are Not Fresh
Proper Size
These may be conflicting criteria.
What is the overall customer satisfaction index?
Let us consider Check-out Delays
One approach would be to survey the customers (Yes / No). The overall result (% not happy) will
determine the success. This is an opinion poll.
Unless a measure is owned by an entity, it cannot be improved. E.g., noise in the class: the noise needs
to be associated with someone before the conditions can change. The opinion poll above does not
have that association. The association has to be established with something that is directly responsible
for the outcome; otherwise the measure is non-actionable. That something is called a DIMENSION.
Whom does the measure belong to?
Customer
Lane
Cashier
Date
Start Time
Inventory Item

This is called DIMENSIONAL MODELLING.


The FOUR STEPS are:
Select The Business Process (Which Activity? - Checkout)
Declare Grain (What Data Needed? - Every Customer Checkout)
Identify The Dimensions (Who does the measure belong to?)
Identify The Facts (What characterizes the problem? - Total Checkout Time)
Fact or Measure: Total Checkout Time
Dimensions:
CUSTOMER_ID
LANE_NUM
CASHIER_ID
TRANS_DATE
START_TIME
END_TIME
NUM_ITEMS
METHOD_OF_PAYMENT
COUPONS_USED
PRICE_CHECK?
BARCODE_LOCATION
SCANNING_ISSUES / BARCODE_QUALITY
CUSTOMER_LOADING_DELAY_REASONS
WARRANTY_ISSUES
CUSTOMER_PRODUCT_INQUIRY
CUSTOMER_DECISION_TO_BUY
EMPLOYEE_YEARS_OF_EXPERIENCE
SIZE_OF_PRODUCT
WEIGHT_OF_PRODUCT
PRODUCT_PACKAGING
CHECKOUT_READINESS (Weighed?)
MANAGER_ASSISTANCE_REQUIRED
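
To make the checkout example concrete, the sketch below derives the measure (Total Checkout Time) from START_TIME and END_TIME and averages it by any chosen dimension. The three sample rows are invented for illustration, and times are assumed to be same-day `HH:MM:SS` strings.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"


def checkout_seconds(row):
    """Measure: total checkout time in seconds for one customer checkout."""
    start = datetime.strptime(row["START_TIME"], TIME_FORMAT)
    end = datetime.strptime(row["END_TIME"], TIME_FORMAT)
    return (end - start).total_seconds()


def average_by(rows, dimension):
    """Average checkout time for each member of the given dimension."""
    totals = defaultdict(lambda: [0.0, 0])  # member -> [sum of seconds, count]
    for row in rows:
        bucket = totals[row[dimension]]
        bucket[0] += checkout_seconds(row)
        bucket[1] += 1
    return {member: total / count for member, (total, count) in totals.items()}


# Hypothetical fact rows at the declared grain: one row per customer checkout.
checkouts = [
    {"LANE_NUM": 1, "CASHIER_ID": "C1", "START_TIME": "10:00:00", "END_TIME": "10:03:00"},
    {"LANE_NUM": 1, "CASHIER_ID": "C2", "START_TIME": "10:05:00", "END_TIME": "10:10:00"},
    {"LANE_NUM": 2, "CASHIER_ID": "C1", "START_TIME": "10:01:00", "END_TIME": "10:02:00"},
]

by_lane = average_by(checkouts, "LANE_NUM")
by_cashier = average_by(checkouts, "CASHIER_ID")
```

The same measure gives different answers depending on which dimension owns it, which is what makes the delay actionable: a slow lane and a slow cashier call for different fixes.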

DIMENSIONS

MEASURES

A dimension represents a set of attributes.


A member of a dimension represents a unique tuple value of the attributes.
Dimensional attributes are associated with a single or multiple measures.

A function transforms the value(s) of some fields into a measure.


For any parameter, unless a Baseline, a Unit of Measure, and Bounds (norms) are established, we
cannot determine whether that parameter is functioning properly.
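
One way to picture this: a parameter only becomes interpretable once it is paired with a record of its unit, baseline, and bounds. The class name and the numbers below are hypothetical, chosen to continue the checkout-time example.

```python
from dataclasses import dataclass


@dataclass
class Norm:
    """Unit of measure, baseline, and acceptable bounds for one parameter."""
    unit: str
    baseline: float
    lower: float
    upper: float

    def assess(self, value):
        """Classify a reading against the bounds."""
        if value < self.lower:
            return "below bounds"
        if value > self.upper:
            return "above bounds"
        return "within bounds"

    def deviation(self, value):
        """Signed distance of a reading from the baseline, in the norm's unit."""
        return value - self.baseline


# Hypothetical norm: checkout time with a 3-minute baseline, acceptable from 1 to 5 minutes.
checkout_norm = Norm(unit="seconds", baseline=180.0, lower=60.0, upper=300.0)
```

A reading of 420 is meaningless on its own; against this norm it is 240 seconds over baseline and outside the upper bound.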
Consider
PRODUCT (Attributes: Unit Price, Unit Cost, Promotion, Size, Color, Vendor, Features, Storage Location,
etc.)
STORE (Attributes: Location, Size, # of Employees, Product Mix, Manager, etc.)
DEPARTMENT (Attributes: Chairman, Members Qualification, Reputation, etc.)

How many DIMENSIONS should we consider?


It depends upon where the problem lies. One needs the ability to explore different domains until the
problem is fully diagnosed, e.g. a medical decision tree. Someone owns the problem; we just need to
determine who that is. Saying there is 10% more pollution in California is an example: even though
the owner has been identified, the dimensionality, or grain, is at a very high level.

RETAIL STORE
External factors:


Price of raw Material
Availability of Products/Raw Material
Demand for the Products
Internal factors:
Selling Price
Promotion
Store Bias vs. Customer Bias
Q: Do you sell what the customer wants (most common; may mean low profitability), OR
do you make the customer buy something that is more desirable to the store (higher profitability)?
How do you do that? How do you influence customer behavior?

Sale (i.e. Selling Price, Discounts), Loss leaders (Product A vs Product B), Dispersion
Advertising / Promotion / Coupons
Loyalty (Frequent Flyer)
Financing (Credit) Lease payments
Repackaging eg. BJ

file:///C:/Apl/PINPT_PREMIER/DIMN_AGGR.HTM

DW Creation - PRIMARY OBJECTIVES:: SIMPLICITY AND PERFORMANCE


Consider OLTP. Need report of sales by States.
Concept of Data Marts
Performance: Create once use many times.
Say 10m records and 20 measures: 200m calculations, x 30 days -> 6,000m calculations.
Typical users use 4 measures, say 10 times a month -> 40m calculations actually used.
Go to the DW for the most frequently used measures and go to the OLTP for the less frequently used ones.
Can this logic be used the other way? DW -> Data Mart
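
The arithmetic behind these figures, written out. The reading of the 40m figure as one pass over the 4 measures actually used (4 x 10m records, computed once and reused) is an interpretation, not something the notes state explicitly.

```python
records = 10_000_000         # 10m records
measures_precomputed = 20    # measures computed for every record
days = 30                    # recomputed daily for a month

per_day = records * measures_precomputed  # 200m calculations
per_month = per_day * days                # 6,000m calculations

measures_used = 4                         # measures typical users actually look at
used = records * measures_used            # 40m calculations (assumed: one pass, reused)

print(f"precomputed per month: {per_month:,}")
print(f"actually used:         {used:,}")
print(f"ratio:                 {per_month // used}x")
```

Under that reading, most of the precomputed work is never looked at, which is the argument for keeping only the frequently used measures in the DW.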
Q: Where would the simplicity come from?
Ans:
Speak User Language (no need to know data codes),
Non-Technical (no syntax),
Self Reliant (intuitive),
Little or no non-standard Transformation of the Data (no functions involved),
Related directly to the output (desired elements are present in input),
Fewer steps are involved,
Process is visual.
Q: Which of these can be addressed by the data organization, and which are part of the BI tool?
A: DATA: Speak User Language,
Little or no non-standard Transformation of the data,
Related directly to the output
BI Tool: Non-Technical,
Self Reliant,
Fewer steps are involved,
Process is visual
Q: Which ones can be either? Which of the ETL tasks can be shifted to the BI tool?
A: Speak User Language (if not handled at the data end, the user will have to know the data values while
using the BI tool),
Fewer steps are involved
Q: Which of these can the ETL process address, and which are part of the BI tool?

Reasons why the host data in its current format is unusable for the query:
Integration of different data elements may be required for a single analysis
The end user may need to know the internal data schema, i.e. how the data is stored
Data may need to be remapped
Security issues
It is optimized for single-record access, i.e. for operational efficiency: it quickly records a
transaction
It is highly normalized
The measures needed for strategic analysis are time consuming to compute on
demand
It is difficult for a user to differentiate the system's internal-use data from the rest
The historical data may not be available for trend analysis
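
The "sales by state" report mentioned earlier illustrates two of these reasons: the normalized OLTP tables need a join, and the aggregate is recomputed on every request unless it is stored. The sketch below uses Python's built-in `sqlite3` module; the table names, column names, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized OLTP-style tables: the state lives on the customer, not on the order.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'NY'), (2, 'CA'), (3, 'NY');
INSERT INTO orders VALUES (10, 1, 100.0), (11, 2, 50.0), (12, 3, 25.0), (13, 1, 75.0);

-- Built once by the load process, then read many times by report users.
CREATE TABLE sales_by_state AS
SELECT c.state AS state, SUM(o.amount) AS total_sales
FROM orders o
JOIN customer c ON c.customer_id = o.customer_id
GROUP BY c.state;
""")

# The report query needs no join and no knowledge of the OLTP schema.
report = dict(conn.execute("SELECT state, total_sales FROM sales_by_state"))
conn.close()
```

The end user only ever sees `sales_by_state`, which speaks the user's language and hides the internal schema.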

Issues:
