Nanyang Business School BC3406 Business Analytics Consulting Data Hackathon
Project Report
Group 02
2.3. Pricing
The recommended price for each bundle should lie between the minimum price (assuming a 50%
margin on all SKUs in the bundle) and the maximum price (the sum of the original prices of all
SKUs) to optimize profit. Refer to Appendix A4.3. for a detailed explanation.
2.4. Place
This table summarizes the recommended channels for the bundles. Refer to Appendix A4.4.
for the justifications.
Bundle                                            SKU                 Proposed Channel
1a. [Skin Balancing + Skin Perfect (SP) Bundle]   2010, 1150, 1350    BC, FP
1b. [Resist + SP Bundle]                          2010, 7780, 7830    Online, FP
2. [Best-selling items Set]                       7770, 5700, 2010    Online, FP
3. [High-Value Deal]                              7790, 8010, 7820    BC, Online
4. [Sample Set]                                   1159, 1359, 3409    Online
Figure 3-1. is a heatmap of Paula’s Choice sales performance across Singapore.
The top five performing districts are 19, 23, 18, 22 and 15. District 19 has the highest zip code
count, number of transactions and total sales. The district’s total sales account for 13.39%
of all online sales. This amount is about 1.7 times that of District 23 and substantially
higher than that of all other districts.
Based on the analysis (refer to Appendix B. for detailed analysis), the recommended
location for Paula’s Choice new flagship store is District 19 – Punggol. The main deciding
factors are 1) a large existing customer base, 2) a population profile that fits Paula’s Choice
target audience, 3) high sales performance, indicating strong revenue potential, and
4) extensive development plans for the Punggol region.
To fit the brand image of mass prestige, it would be suitable for Paula’s Choice to rent
a retail space in a shopping mall. This would allow the company to expose
its brand name and products to more consumers. Based on location, traffic flow
and mall image, the recommendation is to open the flagship store in Waterway Point.
Conveniently located just above Punggol MRT/LRT Station and near the bus interchange,
Waterway Point is a popular shopping destination in the northeast. Despite being in the
heartlands, the mall houses established brands such as H&M and Uniqlo. Hence, the
atmosphere of the mall fits the affordable prestige image of Paula’s Choice.
Given the mall’s high footfall (Toh, 2016), rent at Waterway Point is slightly higher
than at other malls in District 19. Although the rent (~$16.00 - $40.00 psf) is
higher, the revenue generating potential for Paula’s Choice is also higher. Hence,
Waterway Point would be a suitable and feasible place for the new flagship store.
4. Forecasting Analysis
4.1. Forecasting Models
To develop forecasting models for the different SKUs, products in each channel (online and
physical stores) were clustered based on their characteristics (refer to Appendix C4. for
elaboration), and forecasting models were generated and evaluated for each cluster (refer to
Appendix C5.). The table below summarizes the forecasting model selected for each cluster.
Channel / Cluster    Model Selected    Channel / Cluster         Model Selected
Online / 1           5MLR              Brick-and-mortar / 1      1-Year SMA
Online / 2           4MLR              Brick-and-mortar / 2      5MLR
Online / 3           2SMA              Brick-and-mortar / 3      1-Year SMA
Online / 4           1-Year SMA        Brick-and-mortar / 4      5MLR
Online / 5           4MLR              Brick-and-mortar / 5      5MLR
Online / 6           5MLR              Brick-and-mortar / 6      4MLR
Online / 7           2SMA              Brick-and-mortar / 7      2SMA
                                       Brick-and-mortar / 8      3SMA
Figure 4-1. – Selected Forecasting Models
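As a rough illustration of the simple moving average (SMA) models named in the table, the sketch below assumes that a label such as 2SMA denotes the mean of the two most recent monthly observations and that 1-Year SMA denotes the mean of the last twelve months. These readings of the labels, the function name and the sample figures are assumptions for illustration only; they are not the models fitted in Appendix C5.

```python
from statistics import mean


def sma_forecast(history, window):
    """Forecast the next period as the mean of the last `window` observations."""
    if window <= 0 or window > len(history):
        raise ValueError("window must be between 1 and len(history)")
    return mean(history[-window:])


# Hypothetical monthly unit sales for one cluster (not taken from the dataset).
monthly_units = [120, 135, 128, 140, 150, 145, 160, 155, 170, 165, 180, 175]

two_month = sma_forecast(monthly_units, 2)   # assumed meaning of "2SMA"
one_year = sma_forecast(monthly_units, 12)   # assumed meaning of "1-Year SMA"
```

A short window reacts quickly to recent changes, while a twelve-month window smooths out seasonality, which may be why different clusters were assigned different windows.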
4.2. SKUs Recommended for Discontinuation
79 SKUs were recommended for discontinuation due to having zero sales in 2016 (Appendix
C3.), and an additional 55 SKUs were recommended for discontinuation based on low
forecasted demand (Appendix C5.9), making a total of 134 SKUs recommended for
discontinuation.
4.3. Sales Forecast for February 2017 to January 2018
Using the models selected for each cluster, and taking into account SKUs recommended for
discontinuation, a forecast of the following year’s sales was generated as per the figure below
(further discussion in Appendix C5.10).
[Figure: Sales forecast by channel. Online: $778,402 to $1,833,615 (+135.6%). Brick-and-Mortar: $424,334 to $282,121 (-33.5%).]
Appendix
Appendix A – Bundling Analysis and Marketing Strategies
A1. Additional Notes
1. Refer to glossary in Appendix D for explanation of the technical terms used in this section
onwards. The metrics used to determine the proposed SKU sales performance in this report
are:
a. ‘Yearly’ gross sales & quantity sold
i. Although gross sales and quantity sold are highly correlated, both
metrics are used in our analysis to give our client a fuller picture of
SKU sales performance
b. ‘Yearly’ gross sales contribution and quantity sold contribution
c. Average monthly gross sales & quantity sold are not considered because the relative
sales performance of each SKU would be exactly the same as under the ‘Yearly’ metrics
2. The dataset “December Orders - Online _ Store.csv” was not used because:
a. Firstly, it does not provide gross sales data. As such, fair comparison cannot be
drawn when looking at both gross sales and quantity sold metrics.
b. Secondly, there is no information on transaction ID, so market basket analysis cannot
be conducted with this dataset.
c. Thirdly, there are no significant changes to the quantity performance of the SKUs
in the proposed bundles even after including the additional quantity data from the
dataset (See Appendix for more details)
d. As such, the ‘yearly’ timeframe in this analysis does not include the additional
December data from the dataset “December Orders - Online _ Store.csv”
3. The proposed bundles 1a and 1b are part of the currently offered Advanced Kit. However,
these bundles focus more strongly on the complementary functions of their three SKUs,
whereas an Advanced Kit contains about five SKUs.
4. The support for all the proposed bundles in the market basket analysis is very low in terms
of transaction count. However, given the extremely granular nature of the SKU dataset,
a transaction count of even 5 is arguably significant. Still, to compensate for the low
support of the proposed bundles, most of the proposed SKUs have very high demand,
as indicated by their respective sales performance.
5. The key metrics for the proposed SKUs were compared against the scenario in which the
additional dataset "December Orders - Online _ Store.csv" is used. This is to confirm
that the additional dataset "December Orders - Online _ Store.csv" does not have any
significant effect on the relative performance of the SKUs.
A2. Dataset
A2.1. Data Cleaning for Physical Stores: BC and FP
1. Dataset used: “FULL YEAR ITEMS SALE BC.csv” and “FULL YEAR ITEMS SALE
FP.csv”
2. Data Cleaning:
a. Standardize “Date” column due to inconsistent format from raw dataset
i. Convert date and time from PT to SGT
b. Delete “Time Zone” column due to identical values for all rows
c. Delete “Tax” column due to identical values for all rows
d. Remove “S$” from “Gross Sales”, “Discounts” and “Net Sales” columns
i. To transform text to numeric (accounting) format
e. Isolate refunded transactions
i. Move into new sheet, “Refund”
ii. Refunded transactions are not applicable in bundling analysis
f. For items without assigned SKU
i. Assign item = “Shine Stopper” with price point name = “Sample” to have
SKU = “3601”
ii. Isolate item = "Custom Amount"
1. Move into new sheet, “Blanks”
2. Items are not assigned a new SKU because customized transactions are
not useful or relevant in bundling analysis
iii. For FULL YEAR ITEMS SALE BC data, additional steps are taken:
1. 1 row for item “Calm Cleanser” is isolated to the “Blanks” sheet
a. This item is also not assigned any SKU because its SKU cannot
be mapped, and a single item will not significantly
affect the subsequent analysis
2. 9 rows for item, “Skin Balancing Simple Kit (4 items)” are mapped
and assigned SKU = “4090”
3. 4 rows for item, “Skin Balancing Super Kit (7 items)” are mapped
and assigned SKU = “4600”
4. SKUs for both items, “Skin Balancing Super Kit (7
items)” and “Skin Balancing Simple Kit (4 items)”, are assigned by
mapping the corresponding items from the dataset “Line Item Orders
– Online.csv”. The mapping is based on three factors:
a. Price
b. Item/Product Name
c. Number of items in the kit
Refer to A2.2 for a detailed, step-by-step illustration of how the
mapping is done.
3. Resulting dataset is renamed as “FULL YEAR ITEMS SALE BC_Cleaned” and “FULL
YEAR ITEMS SALE FP_Cleaned” respectively.
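Steps (b) to (d) of the data cleaning could be sketched as follows, assuming each CSV row has been loaded as a dictionary (for example with csv.DictReader). The column names come from the steps above; the helper names and the sample rows are hypothetical, and the original cleaning may well have been done by hand in a spreadsheet.

```python
def parse_sgd(text):
    """Convert a currency string such as 'S$1,234.50' into a float."""
    return float(text.replace("S$", "").replace(",", "").strip())


def constant_columns(rows):
    """Return the columns whose value is identical in every row."""
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]


def clean(rows, money_cols=("Gross Sales", "Discounts", "Net Sales"),
          drop_candidates=("Time Zone", "Tax")):
    # Only drop a candidate column if it really is constant across all rows.
    to_drop = [c for c in constant_columns(rows) if c in drop_candidates]
    cleaned = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in to_drop}
        for col in money_cols:
            new_row[col] = parse_sgd(new_row[col])
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample rows, not taken from the actual dataset.
sample = [
    {"Item": "Item A", "Time Zone": "Pacific Time", "Tax": "S$0.00",
     "Gross Sales": "S$48.00", "Discounts": "S$0.00", "Net Sales": "S$48.00"},
    {"Item": "Item B", "Time Zone": "Pacific Time", "Tax": "S$0.00",
     "Gross Sales": "S$1,088.00", "Discounts": "-S$4.80", "Net Sales": "S$1,083.20"},
]
cleaned = clean(sample)
```

Checking that a column is constant before dropping it guards against deleting a column that only looks redundant in a quick visual scan.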
A2.2. Mapping Explanation
Mapping “Skin Balancing Super Kit (7 items)” and “Skin Balancing Simple Kit (4 items)”
1. Originally, no SKU is assigned for the items, “Skin Balancing Simple Kit (4 items)” and
“Skin Balancing Super Kit (7 items)” in the Dataset: “FULL YEAR ITEMS SALE BC” as
shown:
2. Using the dataset, “Line Item Orders – Online.csv” (see following screenshot) as a
reference, the items can be mapped easily due to the resemblance between the names and
the price.
3. In addition, based on the information from Paula’s Choice online store, it is shown that
“Skin Balancing Essential Kit” has 4 items and “Skin Balancing Advanced Kit” has 7 items.
Hence, “Skin Balancing Essential Kit” is mapped to “Skin Balancing Simple Kit (4 items)”
and “Skin Balancing Advanced Kit” is mapped to “Skin Balancing Super Kit (7 items)”.
4. Combining this information with the information on item/product name and price, the item,
“Skin Balancing Simple Kit (4 items)” is mapped and assigned SKU = “4090” and “Skin
Balancing Super Kit (7 items)” is mapped and assigned SKU = “4600”.
A2.3. Data Cleaning for Online Store
1. Dataset used: “Line Item Orders – Online.csv”
2. Data cleaning:
a. Delete redundant columns:
i. “Order Status” due to identical values for all rows
ii. “Region” since it is not needed for analysis
iii. “City” since it is not needed for analysis
iv. “Manufacturer” since this column is empty
v. “Qty. Invoiced” is similar to “Qty. Ordered”
vi. “Qty. Shipped” is ignored as stated by client
vii. “Qty refunded” is not needed
viii. “Tax”, “Tax Invoice”, “Refunded to Total Margin” columns are all empty
ix. “Total Incl. Tax” is similar to “Total”
x. “Invoiced Incl. Tax” is similar to “Invoiced”
b. Split “Order Date” into “Order Date” and “Order Time” for easier data analysis
3. Resulting dataset is renamed as “Line Item Orders - Online_Cleaned.csv”
A2.4. Data Consolidation
1. Consolidate the three cleaned datasets, “FULL YEAR ITEMS SALE BC_Cleaned”,
“FULL YEAR ITEMS SALE FP_Cleaned” and “Line Item Orders - Online_Cleaned.csv”
into a centralized dataset or database, named “BC + FP + Online.csv”
2. Because the datasets differ in the number of variables and in variable naming,
further data processing has to be done. (Refer to the figure below)
a. ‘Channel’ column was created to differentiate the channels of the transactions:
online, BC and FP stores.
b. To standardize the newly mapped items from the dataset “FULL YEAR ITEMS SALE
BC_Cleaned”, the item names are changed as follows:
i. “Skin Balancing Simple Kit (4 items)” with assigned SKU = “4090” is
renamed as “Skin Balancing Essential Kit”
ii. “Skin Balancing Super Kit (7 items)” with assigned SKU = “4600” is
renamed as “Skin Balancing Advanced Kit”
A3. Methodology
A3.1. Market Basket Analysis
1. Further data processing is needed to ensure that the consolidated dataset is compatible with
SAS Enterprise Miner for Market Basket Analysis.
a. Standardize and recode “Transaction ID” to purely numeric code, “Transaction
ID_Coded” as follows:
b. Remove rows with 0 gross sales, 0 discounts and 0 net sales as they are not needed
in market basket analysis
c. Remove rows with ‘Discontinued’ category as they are not needed in market basket
analysis
d. Remove rows whose SKU is itself a bundle (i.e. a SKU comprising at least two
SKUs), because they should not be considered for association and it
is not fair to compare a single SKU with a ‘bundled’ SKU. This can be done by
removing rows containing:
i. “kit” or “set”
ii. SKU 4980 – Power Couple: Clinical 1% Retinol + Resist Oil Booster
iii. SKU 4930 – Power Couple: Resist C15 + Skin Balancing Serum
iv. SKU 4920 – Power Couple: Resist C15 + Skin Recovery Serum
v. SKU 4910 – Power Couple: C15 + SA Serum
vi. SKU 4890 – Power Couple: Resist C15 + Ultra-light Serum
vii. SKU 4830 – Power Couple: Resist C15 + Pure Radiance
viii. SKU 4820 – Power Couple: Resist C15 + Wrinkle Repair Retinol
3. To run the dataset in SAS Enterprise Miner, the dataset “For SAS_3.csv” has to be
converted into SAS format, “A3.sas7bdat” using base SAS with the following code:
4. After conversion, the dataset, “A3.sas7bdat” is then created in SAS Enterprise Miner.
5. The “Market Basket” node is then added and linked to the dataset node “A3” in order to
run the analysis.
6. Before running the analysis, the following constraints for the market basket node were
set:
Refer to Figure A3-1. for the explanation of the constraints.
Constraints What is it for? What is set? Why?
7. The Market Basket Analysis was run and the subsequent result is shown in the rule window.
Sample of the result is shown below.
8. The results from this analysis are neither conclusive nor sufficient for recommending which
SKUs to place in a bundle. As such, in addition to SAS Enterprise Miner, Microsoft Excel
was used to descriptively analyze the sales performance of the SKUs using the dataset
“For SAS_3.csv”.
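The row filters in step 1 (a to d) can be sketched as below. This is only an illustration under assumptions: the field names "Item", "SKU" and "Category" are hypothetical, rows are assumed to hold numeric sales values already, and the substring test for "kit" or "set" is as crude as the rule stated above. The Power Couple SKU numbers are those listed in step 1(d).

```python
# SKUs listed above as bundles in their own right (Power Couple products).
POWER_COUPLE_SKUS = {"4980", "4930", "4920", "4910", "4890", "4830", "4820"}


def is_bundle(row):
    """True if the row's item is itself a bundle of several SKUs."""
    name = row["Item"].lower()
    return "kit" in name or "set" in name or row["SKU"] in POWER_COUPLE_SKUS


def prepare_for_basket_analysis(rows):
    kept = []
    codes = {}  # original transaction ID -> purely numeric code
    for row in rows:
        if row["Gross Sales"] == 0 and row["Discounts"] == 0 and row["Net Sales"] == 0:
            continue  # step b: rows with no sales values at all
        if row["Category"] == "Discontinued":
            continue  # step c
        if is_bundle(row):
            continue  # step d
        code = codes.setdefault(row["Transaction ID"], len(codes) + 1)  # step a
        kept.append({**row, "Transaction ID_Coded": code})
    return kept


def make_row(txn, sku, item, category, gross):
    """Build a hypothetical row for the example below."""
    return {"Transaction ID": txn, "SKU": sku, "Item": item, "Category": category,
            "Gross Sales": gross, "Discounts": 0, "Net Sales": gross}


rows = [
    make_row("BC-001", "2010", "Exfoliant", "Category A", 48),
    make_row("BC-001", "4090", "Skin Balancing Essential Kit", "Category B", 120),
    make_row("ON-17", "1350", "Toner", "Category B", 0),
    make_row("ON-17", "1150", "Cleanser", "Category B", 30),
    make_row("BC-001", "4980", "Power Couple", "Category C", 99),
    make_row("BC-001", "1350", "Toner", "Category B", 35),
    make_row("FP-9", "9999", "Old item", "Discontinued", 20),
]
prepared = prepare_for_basket_analysis(rows)
```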
A4. Analysis of Results
Before diving in depth to the results and the discussions of the proposed bundles, it will be
helpful to refer to the following figure (Figure A4-1.) summarizing the results and discussions
for the proposed bundles.
Note: With respect to the rules discussed below, in general, the rule SKU A & SKU B → SKU
C means that customers who purchase SKU A and SKU B together are likely to also purchase
SKU C.
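For reference, the support count and confidence quoted for each rule follow the standard association rule definitions, sketched below. The function name and the baskets are hypothetical; the actual figures in this appendix come from the SAS Enterprise Miner output.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support count, support, confidence) for antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    # Confidence: share of baskets containing the antecedent that also
    # contain the consequent.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return n_both, support, confidence


# Hypothetical baskets; each set holds the SKUs of one transaction.
baskets = [
    {"2010", "1150", "1350"},
    {"2010", "1150"},
    {"2010", "1150", "7770"},
    {"2010", "1350"},
    {"1150", "1350", "2010"},
]
count, support, confidence = rule_metrics(baskets, {"2010", "1150"}, {"1350"})
```

In this toy example, four baskets contain both SKU 2010 and SKU 1150, and two of those also contain SKU 1350, so the rule has a support count of 2 and a confidence of 50%.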
A4.1. Insights
In addition to the justifications mentioned, the proposed bundles also apply an interesting
insight drawn from the analysis, which is: customers mostly purchased items of the same
pack size when purchasing multiple items. As such, note that all the proposed SKUs in each
of the bundles have the same size, i.e., they are all either regular in size or sample in size. In
fact, from our analysis, 99% of all baskets with at least 3 SKUs contain SKUs with the same
pack size.
[Chart: No. of Rules. Same pack size: 99%; different pack sizes: 1%.]
Justifications
1. SKU 2010 is the top selling SKU
SKU 2010 ranked 1st, ‘100th’ percentile for the year 2016 in terms of both gross
revenue and quantity. This makes it more likely for customers to purchase the
bundle.
2. SKU 1350 also contributes significantly
SKU 1350 contributes 2.12% to total gross sales and 2.26% to total
quantity for year 2016 (excl. Dec).
3. SKU 1150 has a much lower contribution to total gross sales and quantity sold
than SKU 2010 and 1350
SKU 1150 contributes only 0.87% to total gross sales and 0.93% to total quantity
sold for year 2016. Still, it has potential to increase since it is within the top 80th
percentile. Thus, bundling SKU 1150 with SKU 2010 and SKU 1350 leverages
the strong sales performance of SKU 2010 and 1350 as a halo effect to drive more
demand for SKU 1150.
4. The rule SKU 2010 & 1150 → 1350 has a reasonable confidence of about 30%
Given that this analysis is done at the most granular SKU level, a confidence of
30% is reasonable. The rule also has the highest support count, 12, among all the rules
involving SKU 2010 with min. 3 items.
A4.2.2 Bundle 1b
Justifications
1. SKU 2010 is the top selling SKU
Similarly, SKU 2010, being the top selling SKU for the year 2016 in terms of both
gross revenue and quantity sold, will help to improve sales performance for the
other two SKUs, 7780 and 7830, which sell significantly less than SKU
2010.
2. SKU 7780 and 7830 contribute less than SKU 2010 but have high
potential to contribute more.
Both SKUs are within the top 90th percentile and have high potential to contribute to
total gross sales and quantity.
- SKU 7780 contributes 1.88% to total gross sales and 1.81% to total quantity
- SKU 7830 contributes 1.47% to total gross sales and 1.48% to total quantity
In terms of both gross sales and quantity sold at the category level, SKU 7780 and
7830 belong to the 2nd largest category contributor - Resist Oily. Resist Oily
category contributes 16.04% to the total category gross sales and 15.25% to the
total category quantity sold.
iv. Within the Resist Oily category, SKU 7780 and 7830 are also among
the top key drivers of the category’s gross sales and
quantity sold.
3. The association between SKU 2010, 7780 and 7830 has a reasonable confidence
level of about 25%, given that this analysis is done at the most granular SKU
level.
If complementary effects were to be taken into account, it also means that the
confidence of 25% should actually be higher than it currently is.
4. Just like bundle 1a, all three SKUs are highly complementary to each
other. SKU 2010 needs to be used after a cleanser and a toner. They should be used
in the order: Cleanser → Toner → Exfoliants
A4.2.3 Bundle 2
Justifications
1. Amongst all SKUs, SKU 2010 and 7770 ranked first and second respectively in
terms of both gross sales and quantity sold
SKU 2010 and 7770 can provide a halo effect to further drive sales performance for
the lower-performing SKU 5700.
2. SKU 5700 is one of the best-selling body exfoliant lotions
Since SKU 5700 is one of the best-selling body lotions, within the top 90th percentile,
all three SKUs can be considered best-selling products in a bundle.
3. 2nd highest confidence of about 45% amongst all association rules related to
SKU 2010
a. Note: The rule with the highest confidence (46%) is not selected because the SKUs in
the rule make no business sense. That rule, 1150→ 1560→ 2010, suggests
selling a bundle of cleanser, exfoliant and moisturizer. However, based on
the usage instructions for SKU 2010, which is an exfoliant, SKU 2010 must be
used after a cleanser and a toner. As such, for the bundle to be effective, both
a cleanser and a toner must be present to complement SKU 2010.
In this case, SKU 1150 is a cleanser but SKU 1560 is a
moisturizer instead of a toner. Hence, complementary effects will be sub-optimal
for this bundle.
A4.2.4 Bundle 3
Justifications
Note: Though SKU 7790 and 7820 might conflict with each other, as
suggested by the FAQ in the online store, the conflict is
addressed by the different frequency of use of these two SKUs. Based on the
SKU description, SKU 7790 is used weekly while SKU 7820 is used daily.
Therefore, SKU 7790 has a large complementary role to play in leading to the
purchase of SKU 7820 despite the fact that SKU 7790 performs similar functions to
SKU 7820. Coupled with the fact that SKU 8010 is performing well (see below),
the high confidence of 83.33% is well-justified.
2. SKU 7820 and 8010 are both performing well in terms of gross sales and
reasonably, quantity sold
Both SKUs performed well. However, SKU 8010 falls short of SKU 7820 in terms
of quantity sold. This is probably because SKU 8010 has a more premium price of
$88, compared to $48 for SKU 7820.
As such, for a more premium SKU, a contribution of 1.04% of total quantity sold
can reasonably be considered significant.
From this comparison, we can see that SKU 8010 (Left) has a slightly higher rating than
SKU 7870 (Right) but more importantly, for the ‘Options’, the SKU 8010 size of 1 oz is
double the SKU 7870 size of 0.5 oz. Therefore, despite being pricier, SKU 8010 will
be perceived as having more value by customers browsing the products. This probably
explains why the data shows that SKU 8010 is recommended instead of SKU 7870.
4. High-value transaction
SKU 8010 is a premium product, this will drive total sales revenue to a larger
extent.
A4.2.5 Bundle 4
Justifications
1. Sample bundle to help promote less popular Skin Balancing category
At the category level, the Skin Balancing category is the fifth-largest contributor to total
gross sales and the fourth-largest to quantity sold, suggesting that this category is not very
popular among customers but has great potential to increase sales. As such,
sample-sized SKUs for this category allow customers to try this new
category first, and customers who like it will likely purchase this
collection in future.
2. Original pack size SKUs 1150, 1151, 1350 and 3400 are the key driver SKUs for
the Skin Balancing category. Their respective sample SKUs have higher potential to
sell.
SKU 1159, 1359 and 3409 are samples of original SKUs 1150, 1151, 1350 and 3400,
which already contribute significantly within the Skin Balancing category.
This suggests that the three sample SKUs stand a higher chance of being sold
than the other samples in the same category.
1. The following chart, Figure A4-3. shows the price range of the proposed bundles, i.e. the
minimum price, the maximum price and the proposed discounted price at 15%, 17% and
20% of each respective bundle.
2. The maximum price is the sum of the price of each of the three SKUs in the bundle
3. The minimum price is 50% of the maximum price as we are assuming 50% margin
4. The discounted price at 17% is derived from taking a simple average of all the discounts
given in all the past transactions.
a. A discount range is proposed, somewhat arbitrarily, using 17% as the central
benchmark, with 15% as the lower bound and 20% as the upper bound.
5. Therefore, the recommended price of each bundle should:
a. First, lie between the minimum price and the maximum price in order to optimize
sales profit, and
b. Second, reflect Paula’s Choice willingness to give discounts and the amount
of consumer surplus; it is up to Paula’s Choice to decide whether to follow the
proposed discount range of 15 – 20%.
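The price range described in points 2 to 4 can be worked through in a few lines. The sketch below is an illustration only: the function name is hypothetical, the first two prices are the $88 and $48 quoted for SKU 8010 and SKU 7820 in A4.2.4, and the third price ($30) is an invented placeholder, since the price of SKU 7790 is not stated in this report.

```python
def bundle_price_range(sku_prices, margin=0.50, discounts=(0.15, 0.17, 0.20)):
    """Return (minimum price, maximum price, discounted prices) for a bundle."""
    max_price = sum(sku_prices)              # sum of the original SKU prices
    min_price = max_price * (1 - margin)     # 50% of the maximum at a 50% margin
    discounted = {d: round(max_price * (1 - d), 2) for d in discounts}
    return min_price, max_price, discounted


# $88 and $48 are quoted in A4.2.4; $30 is a hypothetical placeholder price.
min_price, max_price, discounted = bundle_price_range([88.0, 48.0, 30.0])
```

With these inputs the maximum price is $166 and the minimum is $83, so every discounted price in the 15% to 20% range stays well above the minimum.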
A4.4. Justifications for “Place”
1. The dataset “For SAS_3.csv” is used
2. Remove rows with month = December due to incomplete data.
3. Since gross sales is highly correlated with quantity, only gross sales will be used in the
subsequent analysis.
4. This stacked bar chart shows the proposed SKU gross sales breakdown by channel,
supported by the subsequent figure, Figure A4-5., showing the absolute figures.
[Figure: stacked bar chart of each proposed SKU's gross sales share by channel (BC, FP, Online) for SKUs 1150, 1159, 1350, 1359, 2010, 3409, 5700, 7770, 7780, 7790, 7820, 7830 and 8010.]
5. This information is then translated at the bundle level as shown:
[Table columns: Bundle; SKU; SKU gross sales breakdown by Channel (BC, FP, Online); Dominant Channel (+/- 2%); Dominant Bundle Channel (At least 2 Dominant Channel)]
[Figure: stacked bar chart of each channel's gross sales share by SKU, for the channels BC, FP and Online.]
Channel gross sales breakdown by SKU (S$)
SKU BC FP Online
8010 9768 1232 9557.8
7830 4680 1260 5780.8
7820 12288 1920 12517.8
7790 3315 990 4102.32
7780 7128 982 6813.8
7770 20582 5165 17445.22
5700 3645 1530 3971
3409 115.5 37.5 151.5
2010 22274 7095 24043.22
1359 34.5 24 69
1350 7786 1666 7352.8
1159 57 13.5 75
1150 3978 850 2093.7
Figure A4-8. – Absolute figures for Gross Sales Breakdown by SKU
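The percentage breakdown by channel for each SKU can be recomputed from the absolute figures in Figure A4-8., as in the sketch below. The gross sales values are copied from that figure; the function and variable names are hypothetical.

```python
# Gross sales (S$) per SKU as (BC, FP, Online), copied from Figure A4-8.
gross_sales = {
    "8010": (9768, 1232, 9557.8),
    "7830": (4680, 1260, 5780.8),
    "7820": (12288, 1920, 12517.8),
    "7790": (3315, 990, 4102.32),
    "7780": (7128, 982, 6813.8),
    "7770": (20582, 5165, 17445.22),
    "5700": (3645, 1530, 3971),
    "3409": (115.5, 37.5, 151.5),
    "2010": (22274, 7095, 24043.22),
    "1359": (34.5, 24, 69),
    "1350": (7786, 1666, 7352.8),
    "1159": (57, 13.5, 75),
    "1150": (3978, 850, 2093.7),
}


def channel_shares(bc, fp, online):
    """Return each channel's share of one SKU's gross sales, in percent."""
    total = bc + fp + online
    return {"BC": 100 * bc / total, "FP": 100 * fp / total, "Online": 100 * online / total}


shares = {sku: channel_shares(*values) for sku, values in gross_sales.items()}
top_channel = {sku: max(s, key=s.get) for sku, s in shares.items()}
```

For example, SKU 1150 works out to roughly 57.47% BC, 12.28% FP and 30.25% Online, which makes BC its largest channel.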
7. Based on Figure A4-9., the most appropriate bundles for FP store are bundle 1a, 1b and 2
since they contribute, on average, significantly more than bundle 3 and 4.
8. To summarize the appropriate channel for the proposed bundles, refer to Figure A4-10.
[Figure A4-10: table of Bundle, SKU and Proposed Channel.]
- End of Appendix A -
Appendix B – Flagship Store Positioning Analysis
Currently, Paula’s Choice has two physical retail stores in Singapore, Beauty Collective in
Novena Square 2 and Front Porch in Tanjong Pagar Plaza. With the progress and growth of the
company, Paula’s Choice would like to consolidate their current retail operations and open a
brand new flagship store in a new location.
B1. Dataset
The dataset used in the analysis is “Line Item Orders - Online”, the sales transaction
data for Paula’s Choice online store. The online store data is used because it contains
customers’ addresses and zip codes, which allow customer locations to be analysed.
Note: Due to the limitations of the data provided, an assumption is made that the behaviour of
the online customers would be similar to retail stores customers, i.e. customers who purchase
Paula’s Choice products online will also buy them physically at a retail store if it is located
near them.
In order to proceed with the analysis, the dataset has to be further reorganised and cleaned.
Refer to Appendix A2.3. for the first round of data cleaning for “Line Item Orders – Online”.
B1.1. Data Reorganisation
As the dataset is based on line item orders (Figure B1-1.), it would not be meaningful to use it
as it is. Hence, the number of transactions and the total sales are
aggregated at the zip code level. Figure B1-2. shows an example of the reorganised dataset.
This allows comparison and analysis of the sales data by location. Upon consolidation,
the total number of unique zip codes (instances) is 1,699.
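The aggregation to zip code level amounts to counting distinct orders and summing sales per zip code. A minimal sketch follows, assuming each line item carries an order identifier, a zip code and a line total; the field names "order_id", "zip" and "total" and the sample records are hypothetical.

```python
from collections import defaultdict


def aggregate_by_zip(line_items):
    """Count distinct orders and sum sales for each zip code."""
    orders = defaultdict(set)
    sales = defaultdict(float)
    for item in line_items:
        orders[item["zip"]].add(item["order_id"])   # one order may span many lines
        sales[item["zip"]] += item["total"]
    return {z: {"txn": len(orders[z]), "total_sales": round(sales[z], 2)}
            for z in orders}


# Hypothetical line items, not taken from the dataset.
line_items = [
    {"order_id": "A1", "zip": "820001", "total": 48.0},
    {"order_id": "A1", "zip": "820001", "total": 33.0},
    {"order_id": "B7", "zip": "820001", "total": 88.0},
    {"order_id": "C3", "zip": "530002", "total": 25.5},
]
by_zip = aggregate_by_zip(line_items)
```

Counting distinct order identifiers rather than rows matters here, because one order usually spans several line items.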
Variable Definition/Explanation
Zip Code An assigned number to indicate a specific location in Singapore
Sector Postal sector as defined by the Urban Redevelopment Authority (URA).
District Postal district as defined by the URA. (Based on postal sector)
No. of Txn Number of transactions (orders) that occurred in that zip code
Total Sales Total sales amount transacted based in that zip code
Figure B1-3. – Data Definitions
2. Some zip codes are currently not reflected in Google Maps. As such, there
were some difficulties in using software that bases its location functions on Google Maps.
To work around this issue, the affected zip codes were replaced with zip codes in the same
postal sector and district. Hence, the accuracy and reliability of the analysis were not
impacted by this change.
After cleaning, the total number of instances was reduced from 1,699 to 1,688.
B2. Methodology
The methodology behind the quantitative analysis is based on the evaluation of the sales
performance of the 28 districts. Figure B2-1. displays the details of the 28 districts of Singapore.
The sales performance will be analysed using three metrics: 1) number of transactions, 2) total
sales and 3) average sales. Figure B2-2. shows the sales performance for all twenty-eight
districts. A strong sales performance in a particular district would imply that Paula’s Choice
has a strong customer base there. Therefore, due to the higher existing demand, it would
be suitable for Paula’s Choice to open a flagship store in that district.
Figure B2-3. displays the correlation coefficients of the variables. The correlation
coefficients support the intuitive relationship between the number of zip codes, transactions
and total sales, i.e. a higher number of zip codes is associated with a higher number of
transactions and higher sales.
Figure B2-4. shows the statistical summary of the metrics used to evaluate the district
performance. Only twenty-seven districts were analysed as District 24 (Lim Chu Kang, Tengah)
has no sales records and is currently an unsuitable location for the new store. Thus, it was
omitted.
Postal District   Postal Sector   General Location
1 01, 02, 03, 04, 05, 06 Raffles Place, Cecil, Marina, People's Park
2 07, 08 Anson, Tanjong Pagar
3 14, 15, 16 Queenstown, Tiong Bahru
4 09, 10 Telok Blangah, Harbourfront
5 11, 12, 13 Pasir Panjang, Hong Leong Garden, Clementi New Town
6 17 High Street, Beach Road (part)
7 18, 19 Middle Road, Golden Mile
8 20, 21 Little India
9 22, 23 Orchard, Cairnhill, River Valley
10 24, 25, 26, 27 Ardmore, Bukit Timah, Holland Road, Tanglin
11 28, 29, 30 Watten Estate, Novena, Thomson
12 31, 32, 33 Balestier, Toa Payoh, Serangoon
13 34, 35, 36, 37 Macpherson, Braddell
14 38, 39, 40, 41 Geylang, Eunos
15 42, 43, 44, 45 Katong, Joo Chiat, Amber Road
16 46, 47, 48 Bedok, Upper East Coast, Eastwood, Kew Drive
17 49, 50, 81 Loyang, Changi
18 51, 52 Simei, Tampines, Pasir Ris
19 53, 54, 55, 82 Serangoon Garden, Hougang, Punggol
20 56, 57 Bishan, Ang Mo Kio
21 58, 59 Upper Bukit Timah, Clementi Park, Ulu Pandan
22 60, 61, 62, 63, 64 Jurong
23 65, 66, 67, 68 Hillview, Dairy Farm, Bukit Panjang, Choa Chu Kang
24 69, 70, 71 Lim Chu Kang, Tengah
25 72, 73 Kranji, Woodgrove, Woodlands
26 77, 78 Upper Thomson, Springleaf
27 75, 76 Yishun, Sembawang
28 79, 80 Seletar
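The table above can be turned into a lookup from zip code to postal district. The sketch below relies on the fact that the first two digits of a six-digit Singapore postal code are the postal sector; the sector lists are copied from the table, while the function name and the example zip codes are hypothetical.

```python
# Postal district -> postal sectors, copied from the table above.
DISTRICT_SECTORS = {
    1: ["01", "02", "03", "04", "05", "06"], 2: ["07", "08"],
    3: ["14", "15", "16"], 4: ["09", "10"], 5: ["11", "12", "13"],
    6: ["17"], 7: ["18", "19"], 8: ["20", "21"], 9: ["22", "23"],
    10: ["24", "25", "26", "27"], 11: ["28", "29", "30"],
    12: ["31", "32", "33"], 13: ["34", "35", "36", "37"],
    14: ["38", "39", "40", "41"], 15: ["42", "43", "44", "45"],
    16: ["46", "47", "48"], 17: ["49", "50", "81"], 18: ["51", "52"],
    19: ["53", "54", "55", "82"], 20: ["56", "57"], 21: ["58", "59"],
    22: ["60", "61", "62", "63", "64"], 23: ["65", "66", "67", "68"],
    24: ["69", "70", "71"], 25: ["72", "73"], 26: ["77", "78"],
    27: ["75", "76"], 28: ["79", "80"],
}
SECTOR_TO_DISTRICT = {sector: district
                      for district, sectors in DISTRICT_SECTORS.items()
                      for sector in sectors}


def district_of(zip_code):
    """Return the postal district for a zip code, or None if the sector is unknown."""
    # zfill restores leading zeros lost when zip codes are stored as numbers.
    sector = str(zip_code).zfill(6)[:2]
    return SECTOR_TO_DISTRICT.get(sector)
```

For instance, district_of("820001") returns 19 because sector 82 belongs to District 19, and a numeric input such as 18956 is padded to "018956" and mapped to District 1.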
District   Zip Code Count   No. of Txn   Total Sales ($)   Average Sales ($)
1 54 133 15709.41 118.12
2 11 24 2035.75 84.82
3 51 99 13542.55 136.79
4 31 70 8915.23 127.36
5 89 161 18116.78 112.53
6 4 4 693.33 173.33
7 16 25 2810.46 112.42
8 12 27 2773.68 102.73
9 56 99 11833.39 119.53
10 80 134 17835.90 133.10
11 27 45 6545.18 145.45
12 62 106 12208.52 115.17
13 26 37 3821.39 103.28
14 60 117 11714.36 100.12
15 75 162 20150.32 124.38
16 74 163 17804.96 109.23
17 12 21 2713.76 129.23
18 142 258 26279.21 101.86
19 223 451 47247.82 104.76
20 80 139 15108.05 108.69
21 44 96 11398.91 118.74
22 135 263 25184.55 95.76
23 158 279 27370.62 98.10
24 0 0 0 0
25 50 115 10112.53 87.94
26 13 28 2530.80 90.39
27 76 131 13829.49 105.57
28 27 42 4665.92 111.09
Figure B2-2. – District Sales Performance
Zip Code Count   No. of Txn   Total Sales   Average Sales
Zip Code Count 1
No. of Txn 0.990665386 1
Total Sales 0.9795191 0.990396968 1
Average Sales -0.007439569 -0.014677583 0.050189827 1
Figure B2-3. – Correlation Coefficient of Variables
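A coefficient of this kind can be reproduced from the district figures in Figure B2-2. with the Pearson formula. The sketch below uses the zip code counts and transaction counts of the 27 districts with sales (District 24 omitted); whether the original calculation excluded District 24 is an assumption, and the function name is hypothetical.

```python
from math import sqrt

# Zip code count and no. of transactions per district, in district order
# (1 to 28, omitting District 24), copied from Figure B2-2.
zip_counts = [54, 11, 51, 31, 89, 4, 16, 12, 56, 80, 27, 62, 26, 60, 75, 74,
              12, 142, 223, 80, 44, 135, 158, 50, 13, 76, 27]
txn_counts = [133, 24, 99, 70, 161, 4, 25, 27, 99, 134, 45, 106, 37, 117, 162,
              163, 21, 258, 451, 139, 96, 263, 279, 115, 28, 131, 42]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_zip_txn = pearson(zip_counts, txn_counts)
```

A value close to 1 indicates that districts with more customer zip codes also record more transactions, which is the relationship Figure B2-3. reports.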
Zip Code Count   No. of Txn   Total Sales   Average Sales
Mean 62.52 119.59 13072.33 113.72
Median 54 106 11833.39 111.09
Mode 12 99 #N/A #N/A
Standard Deviation 51.74892817 100.3586872 10308.16668 19.0722347
Skewness 1.475975039 1.598893007 1.447734681 1.194203885
Range 219 447 46554.49 88.51
Minimum 4 4 693.33 84.82
Maximum 223 451 47247.82 173.33
Sum 1688 3229 352952.87 3070.50
Count 27 27 27 27
Figure B2-4. – Statistical Summary of Variables
B3. Analysis of Results
Based on total sales, the top five performing districts are 19, 23, 18, 22 and 15 (Figure B3-1. –
Group A). District 19 has the highest zip code count, number of transactions and total sales.
The district’s total sales account for 13.39% of all online sales. This amount
is about 1.7 times that of District 23 and substantially higher than that of all other districts.
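The ranking and the share quoted for District 19 can be checked directly against the total sales column of Figure B2-2. The sketch below copies those totals (District 24 is left out because it has no sales); the variable names are hypothetical.

```python
# Total sales ($) per district, copied from Figure B2-2. (District 24 has none).
total_sales = {
    1: 15709.41, 2: 2035.75, 3: 13542.55, 4: 8915.23, 5: 18116.78,
    6: 693.33, 7: 2810.46, 8: 2773.68, 9: 11833.39, 10: 17835.90,
    11: 6545.18, 12: 12208.52, 13: 3821.39, 14: 11714.36, 15: 20150.32,
    16: 17804.96, 17: 2713.76, 18: 26279.21, 19: 47247.82, 20: 15108.05,
    21: 11398.91, 22: 25184.55, 23: 27370.62, 25: 10112.53, 26: 2530.80,
    27: 13829.49, 28: 4665.92,
}

# Districts ordered from highest to lowest total sales.
ranked = sorted(total_sales, key=total_sales.get, reverse=True)
top_five = ranked[:5]

# District 19's share of all online sales, and its size relative to District 23.
share_19 = 100 * total_sales[19] / sum(total_sales.values())
ratio_19_to_23 = total_sales[19] / total_sales[23]
```

With these figures, top_five comes out as Districts 19, 23, 18, 22 and 15, and District 19 accounts for about 13.39% of total sales.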
Judging by average sales, the top five performers are Districts 6, 11, 3, 10 and 17 (Figure B3-2.
– Group B). However, upon further observation, average sales might not be the
best metric to evaluate sales performance.
Firstly, a high average sales figure could be misleading, as in the case of District 6. Although
it has the highest average sales of $173.33, its total number of transactions is only 4. It would
not be feasible or profitable to open a store there when total sales are very low.
Page | 37
Secondly, total sales could be more an important metric as it shows the revenue generating
potential and attractiveness of the district. This information could be potentially used to
determine if the new store could breakeven and earn a profit.
Lastly, Districts 19, 23, 18, 22 and 15 have higher zip code counts and numbers of transactions.
This implies that the customer base in those districts is larger than in others. As the positive
correlation between the number of customers and sales has been established earlier, having a
store in those districts might mean achieving higher sales.
The districts in Group A are more of the heartlands or residential estates in Singapore. This is
strategically in line with Paula’s Choice mass prestige strategy whereby the firm aims to deliver
quality products to the masses. Hence, by locating the store in districts in Group A, the firm
would be able to reach out to more consumers and further increase their customer base.
Furthermore, the difference in average sales between the two groups of districts appears to be
around the average price of a regular product. For example, the difference between the average
sales of Districts 19 and 11 is $40.69.
Figure B3-3. visualises the findings on a heatmap. The heatmap indicates the top 5 performing
districts by total sales. The appearance of a hotspot in District 1 is due to the concentration of
sales transactions in a small number of areas. Despite this concentration, District 1 does not
bring in a higher amount of total sales. Hence, it is not considered for the flagship store.
Therefore, by using total sales as the deciding factor, Districts 19, 23 and 22 are shortlisted for
further analysis.
B3.1. Option 1 – District 19
The first option is District 19. It is the general location of Serangoon Garden, Hougang and
Punggol, which are situated in the northeast of Singapore.
Pros
• District 19 has the highest total sales and the largest customer base among all districts.
Hence, there is a higher probability of achieving high sales and success if the store is
located there.
• It has the largest number of Singapore residents between the ages of 15-34 years old
among the three shortlisted districts. (Figure B3-4. to B3-6.)
o Total: 192,520
o Males: 93,780
o Females: 98,750
• The district has a relatively young population with only 5-10% of residents aged 65 years
and over (Figure B3-7.). Hence, the population there fits the target audience (18-35 years
old) that Paula’s Choice is focusing on. This could also explain the high sales figure in the
district.
• The district also continually attracts a younger population with its developments. For
example, the new Safra Club in Punggol (Chin, 2014) and Compass One mall in Seng
Kang (Baker, 2016) are designed to cater to young families. Furthermore, more public
housing, which is attractive to young couples, is being planned and built in the district
(CNA, 2016).
• Punggol is a rising and developing estate with increasing infrastructure and amenities. It
has been announced that Punggol North is to be developed into Singapore’s first “Enterprise
District” (Ng, 2017). This development would help create more offices and working
spaces in the area. Thus, it might generate more retail traffic to the district.
• The district is serviced mainly by the North East MRT Line. Serangoon is also an interchange
connecting the North East and Circle Lines. Furthermore, Punggol might be a terminus
for the future Cross Island Line, increasing accessibility to the area (CNA, 2016).
Cons
• It is not a centralized location and consumers from other regions of Singapore might find
it inconvenient to travel to the northeast.
• Punggol is a relatively young town. Though there are many future developments planned,
it will still require some time for them to be completed.
B3.2. Option 2 – District 23
The second option is District 23. It is the general location of Hillview, Dairy Farm, Bukit
Panjang and Choa Chu Kang, which are situated in the northwest of Singapore.
Pros
• The Integrated Transport Hub in Bukit Panjang is planned to open in 2017
(CNA, 2015). With an improved transport hub providing accessibility and convenience,
more people might be willing to visit the area. Hence, this could potentially bring new
retail traffic to Bukit Panjang.
• District 23 has the second largest number of Singapore residents between the ages of
15-34 years old among the three shortlisted districts. (Figure B3-4. to B3-6.)
o Total: 136,290
o Males: 68,340
o Females: 67,950
• Development plans for Tengah District, located beside Choa Chu Kang, have been
announced. Touted to be as big as Bishan and expected to have 55,000 homes (Yeo,
2016), the new estate could provide a large number of new customers.
Thus, it might be lucrative to establish a flagship store in this district.
• The district is serviced by the Downtown Line and North South Line.
Cons
• It is not a centralized location and consumers from other regions of Singapore might find
it inconvenient to travel to the northwest.
• The retail scene is smaller than that of Districts 19 and 22:
o District 19: Nex, Waterway Point, CompassOne
o District 22: JEM, Westgate, JCube, Jurong Point 1&2, Big Box
o District 23: Lot 1, Bukit Panjang Plaza, Hillion Mall, West Mall
• Compared to Districts 19 and 22, few major future developments are planned for
the district.
B3.3. Option 3 – District 22
The third option is District 22. It is the general location of Jurong, which is located in the
western part of Singapore. Although District 22 is ranked 4th in terms of total sales, it is selected
over District 18 (Simei, Tampines, Pasir Ris) due to the future development plans of the district.
Pros
• The Jurong Lake District is planned to be transformed into the second Central
Business District of Singapore (Ng and Chua, 2016). Furthermore, plans for
infrastructure and amenity developments in the district have also been announced
(Lim, 2016). Thus, these future developments could bring tremendous business
prospects.
• Jurong East is the future terminus for the Singapore-Malaysia High Speed Rail (Heng,
2016). Hence, there might be increased retail traffic from both local and foreign
commuters. As Paula’s Choice does have customers in neighboring countries, there could
be potential benefits in setting up the flagship store in Jurong.
• The development of Tengah District, which is situated north of Jurong District, could also
be a potential source of customers for the store.
• District 22 is mainly serviced by the East West Line. Jurong East is also an interchange
connecting the East West and North South Lines.
Cons
• It is not a centralized location and consumers from other regions of Singapore might find
it inconvenient to travel to the west.
• District 22 has the lowest number of Singapore residents between the ages of 15-34 years
old among the three shortlisted districts. (Figure B3-4. to B3-6.)
o Total: 100,480
o Males: 50,240
o Females: 50,240
Figure B3-4. – Total Singapore Residents aged 15-34 years old (SingStat, 2016)
Figure B3-5. – Male Singapore Residents aged 15-34 years old (SingStat, 2016)
Figure B3-6. – Female Singapore Residents aged 15-34 years old (SingStat, 2016)
Figure B3-7. – Proportion of Resident Population Aged 65 Years. June 2016 (SingStat, 2016)
B4. Recommendation – Punggol
After evaluating the three options, the recommendation for the location of Paula’s Choice new
flagship store will be in District 19 due to the strong potential of the area (Figure B4-1.).
[Figure B4-1. – Large Existing Customer Base | Population Profile fits Target Audience |
Potential for High Revenue | Extensive Future Developments]
- End of Appendix B -
Appendix C – Forecasting Analysis
With a portfolio comprising over 200 SKUs, Paula’s Choice needs to be able to forecast future
sales based on historical data, in order to make good decisions regarding inventory
management. Having clear projections will allow the company to know when to replenish
stocks at different distribution channels, when to make orders for products, as well as which
products to continue or discontinue selling.
Since the online and brick-and-mortar channels perform differently and are managed separately,
each channel was analysed separately. Furthermore, given the small contribution of the FP
outlet to total brick-and-mortar sales (~10%), the sales of both outlets were combined into a
single dataset for analysis.
C1. Methodology
C2. Limitations
C2.1. Clustering Analysis
Due to the sheer number of SKUs carried by Paula’s Choice, it was prohibitively time- and
resource-consuming to analyse each product separately. Hence, clustering analysis was done
to group products into clusters with similar characteristics for forecasting model generation.
This may reduce the accuracy of the final forecasting models when applied to individual
products.
C2.2. Multiple Linear Regression (MLR)
For each product, our group was only provided with 12 to 13 months of sales data. It was
decided that the dataset was too small to be split into training and testing datasets for the
purposes of MLR. Hence, all data points were used for MLR model training, and the
performance of MLR models was evaluated based on the RMSE of the model when applied to
the training dataset.
C3. Data Cleaning
The datasets given by the client were processed to generate datasets summarizing the sales
performance (in units) of the 275 SKUs over the past 12 (for retail) to 13 months (for online).
No missing data was found.
The total sales, average monthly sales, standard deviation (SD), coefficient of variation (CV),
and simple linear regression statistics (INTERCEPT, LINEST) were generated for each
product, in order to provide some general measures of the performance of each SKU:
1. Total sales and average monthly sales were indicators of general popularity
2. SD and CV were measures of sales volatility
3. Simple linear regression intercept was a possible measure of general demand, while
gradient was a possible measure of the rate of sales growth
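The per-SKU measures listed above can be sketched in a few lines. This is a minimal illustration, not the spreadsheet workflow used in the project: the function name `sku_summary` is hypothetical, the sample standard deviation is assumed (as Excel's STDEV would give), and the regression is assumed to use the month index 1, 2, 3, ... as the independent variable, which is how INTERCEPT and LINEST are commonly applied to a monthly series.

```python
import statistics

def sku_summary(monthly_units):
    """Summary measures for one SKU from its monthly unit sales (needs >= 2 months)."""
    n = len(monthly_units)
    months = range(1, n + 1)              # assumed x-axis: month index 1..n
    total = sum(monthly_units)
    mean = total / n
    sd = statistics.stdev(monthly_units)  # sample SD (assumption)
    cv = sd / mean if mean else float("nan")  # CV is undefined for zero-sales SKUs

    # Ordinary least squares fit of units against month index.
    x_mean = sum(months) / n
    sxx = sum((x - x_mean) ** 2 for x in months)
    sxy = sum((x - x_mean) * (y - mean) for x, y in zip(months, monthly_units))
    slope = sxy / sxx                     # gradient: rate of sales growth per month
    intercept = mean - slope * x_mean     # possible measure of general demand

    return {"total": total, "average": mean, "sd": sd, "cv": cv,
            "intercept": intercept, "gradient": slope}

# Hypothetical four-month series, for illustration only.
print(sku_summary([2, 4, 6, 8]))
```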
There were significant numbers of SKUs that had no (0) sales within the given dataset. These
SKUs were hence removed from the dataset – it is also recommended that these SKUs be
discontinued, if they have not already been. A total of 79 SKUs with no sales were removed
at this point.
SKU  Jan 2016 to Jan 2017 Sales  SKU  Jan 2016 to Jan 2017 Sales
2051 0 7970 0
2060 0 7979 0
2069 0 8020 0
2100 0 8029 0
2107 0 9107 0
2109 0 9117 0
2110 0 9127 0
2117 0 9137 0
2119 0 9147 0
2120 0 9157 0
2129 0 9167 0
2130 0 9177 0
2137 0 9187 0
2139 0 9908 0
2140 0 9940 0
2147 0 91501 0
2149 0 91502 0
2150 0 91511 0
2155 0 91512 0
2159 0 91521 0
2160 0 91522 0
2167 0 91531 0
2169 0 91532 0
2769 0 91541 0
3110 0 91542 0
3119 0 91551 0
3140 0 91552 0
3149 0 91561 0
3707 0 91562 0
6002 0 91571 0
6009 0 91572 0
7677 0 91580 0
7799 0 91587 0
7920 0 91589 0
7927 0 91641 0
7929 0 91651 0
7930 0 91661 0
7937 0 91671 0
7939 0 92063 0
7967 0
Figure C3-1. – List of SKUs removed due to zero sales in 2016
C4. Cluster Analysis
Cluster analysis was conducted using SAS through a four-step approach:
1. Variable clustering (proc varclus) was conducted to identify key variables to be used for
clustering, which were then standardized to account for the different scales of the variables
2. Hierarchical clustering (proc cluster) was conducted to determine the number of clusters,
based on the dendrogram, as well as CCC, pseudo-F and pseudo-T statistics
3. K-means clustering (proc fastclus) was then conducted based on the number of clusters
identified in the previous step
4. Lastly, each cluster was qualitatively analysed based on cluster characteristics
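To make steps 1 and 3 more concrete, the sketch below standardizes two clustering variables and runs a plain k-means loop in Python. It is not the SAS code used in the project: `standardize` and `kmeans` are hypothetical helper names, the seeding (first k points) is a simplification that differs from how proc fastclus selects seeds, and the four sample rows are invented for illustration. Step 2 (choosing the number of clusters) is not covered here.

```python
import statistics

def standardize(rows):
    """Z-score each column so variables on different scales are comparable."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=50):
    """Basic k-means: assign each point to the nearest centroid, then update centroids."""
    centroids = [list(p) for p in points[:k]]   # naive seeding (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # converged
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical SKUs described by (average monthly sales, CV).
skus = [(99.0, 0.5), (2.0, 3.0), (3.0, 2.8), (95.0, 0.6)]
labels, _ = kmeans(standardize(skus), k=2)
print(labels)
```

Standardizing first matters because average monthly sales and CV differ by orders of magnitude; without it, the distance calculation would be dominated by the sales column.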
A summary of the clustering results is shown below:
                                                   Online        Retail
Variables used for clustering (based on step 1)    Average, CV   Average, LINEST, CV
Number of clusters (based on step 2)               7             8
Figure C4-1. – Summary of Clustering Results
C4.1. Clustering Results – Online Store
Variable Clustering
Hierarchical Clustering
7 Clusters
Summary of Clusters
Cluster Count of SKU Avg Mnthly Sales Avg CV Sales Volatility
1 1 99.23 0.57 High Low
2 29 3.47 2.08 Low High
3 61 6.65 1.30 Low High
4 6 2.18 3.26 Very Low Very High
5 23 23.03 0.69 Medium Low
6 5 42.18 0.67 Medium Low
7 71 7.75 0.66 Low Low
Overall 196 9.74 1.15
C4.2. Clustering Results – Physical Stores
Variable Clustering
Hierarchical Clustering
8 Clusters
Summary of Clusters
Cluster  Count of SKU  Avg Mnthly Sales  Avg CV  Avg LINEST  Sales     Volatility  Growth
1        48            1.18              1.33    0.05        Low       High        Low
2        1             62.18             0.19    1.99        High      Low         High
3        18            0.42              2.38    -0.01       Very Low  Very High   Very Low
4        18            8.29              0.62    -0.63       Medium    Low         Negative
5        1             10.83             0.85    2.21        Medium    Medium      High
6        6             23.35             0.34    -0.50       High      Very Low    Negative
7        15            8.54              0.52    0.48        Medium    Low         Low
8        79            4.57              0.59    -0.06       Low       Low         Very Low
Overall  186           4.92              0.94    -0.03
C5.2. Multiple Linear Regression
To build the MLR models, we generated suitable training datasets from the original data using
VBA script and generated the models using proc reg in SAS.
Coefficients
Cluster Model Intercept m1 m2 m3 m4 m5 R^2 Adj R^2
3MLR 19.58385 0.66666 -0.76213 1.09402 0.8099 0.7149
1 4MLR 31.08823 0.00815 0.75115 -0.93848 1.09101 0.8105 0.6209
5MLR 49.76952 0.5159 -0.02174 0.29874 -0.69366 0.84688 0.8946 0.6312
3MLR 3.78635 0.02451 -0.03767 0.27658 0.0182 0.0079
2 4MLR 2.15003 1.33043 -0.16889 -0.06605 0.25221 0.3574 0.3474
5MLR 2.29972 0.93038 1.21554 -0.20938 -0.11117 0.21428 0.3662 0.3522
3MLR 2.82864 0.16834 0.51185 0.32853 0.2019 0.1979
3 4MLR 2.27089 0.58894 0.02598 0.39712 0.29359 0.2542 0.2487
5MLR 2.20838 0.5354 0.48601 -0.05239 0.38556 0.25823 0.2541 0.2463
3MLR 3.07176 -0.1756 -0.10971 -0.03396 0.0024 -0.0511
4 4MLR 3.52825 -0.15677 -0.19443 -0.12478 -0.06043 0.0044 -0.0769
5MLR 1.31604 27.83916 0.05014 -2.0954 -0.16564 0.09528 0.8272 0.8066
3MLR 8.82733 -0.05576 0.53157 0.36594 0.2854 0.2759
5 4MLR 7.7207 0.70147 -0.38977 0.41867 0.30306 0.3968 0.3848
5MLR 9.78585 0.33409 0.52556 -0.40355 0.43694 0.17284 0.3771 0.3596
3MLR 16.94745 0.10306 0.25895 0.46696 0.2347 0.1848
6 4MLR 17.15713 0.89002 -0.35981 0.15046 0.28902 0.4133 0.3547
5MLR 20.68515 0.64236 0.5724 -0.43256 0.24059 0.04884 0.4336 0.3503
3MLR 2.31325 0.09215 0.31917 0.47227 0.3643 0.3616
7 4MLR 1.73266 0.44873 -0.05222 0.2087 0.42784 0.4249 0.4213
5MLR 1.67713 0.36797 0.34954 -0.14243 0.21626 0.36373 0.454 0.4492
C5.4. MLR Models for Physical Stores
Coefficients
Cluster Model Intercept m1 m2 m3 m4 m5 R^2 Adj R^2
3MLR 0.9119 0.04217 0.11034 0.13539 0.0414 0.0347
1 4MLR 0.83977 0.07518 0.03663 0.08565 0.14647 0.0489 0.0389
5MLR 0.82491 -0.0137 0.09429 0.03736 0.06166 0.15216 0.0504 0.036
3MLR 141.691 0.25547 -0.0271 -1.43 0.6903 0.5045
2 4MLR 119.247 -0.6618 0.4112 0.42417 -1.0643 0.8246 0.5908
5MLR 102.566 -0.3438 -0.4509 0.94383 0.46224 -1.2709 0.8605 0.1631
3MLR 0.45378 -0.0278 -0.0312 -0.0464 0.0037 -0.0152
3 4MLR 0.47776 -0.0036 -0.0345 -0.0174 -0.0668 0.0054 -0.0232
5MLR 0.5098 0.05056 0.00285 -0.0353 -0.0416 -0.0712 0.01 -0.0312
3MLR 1.05392 0.36863 0.18199 0.1373 0.3831 0.3714
4 4MLR 0.19253 0.15944 0.35502 0.19597 0.06158 0.4379 0.4217
5MLR -0.1653 0.19213 0.07833 0.27365 0.10177 0.11956 0.4787 0.457
3MLR 6.83998 -0.2668 -0.1857 0.88777 0.5658 0.3052
5 4MLR 10.2918 -0.3562 0.03784 -0.1732 0.67378 0.5324 -0.091
5MLR 16.7356 -2.4536 1.62043 0.15607 -1.7231 1.40069 0.9641 0.7845
3MLR 1.4608 0.1749 0.44065 0.2063 0.3257 0.2852
6 4MLR -0.137 -0.1679 0.26919 0.45087 0.32427 0.3636 0.3044
5MLR -1.1899 0.30047 -0.1229 0.12705 0.29859 0.29051 0.3361 0.2439
3MLR 2.44354 0.14476 0.34843 0.2184 0.4158 0.4024
7 4MLR 2.46195 0.00569 0.12928 0.37117 0.216 0.3962 0.3752
5MLR 2.77609 0.07898 -0.0321 0.11686 0.33436 0.20338 0.3573 0.3248
3MLR 0.65406 0.25169 0.28268 0.23739 0.451 0.4487
8 4MLR 0.35684 0.17987 0.19596 0.25806 0.17864 0.4745 0.4712
5MLR 0.37506 0.12628 0.10958 0.21547 0.22589 0.11985 0.4729 0.4681
For ES, the simple moving average of January and February was taken as the forecast for March,
and the ES model was applied to April and the following months. Alpha for ES models was
determined by the script – the alpha that produced the lowest RMSE for each cluster was
selected.
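The ES procedure described above can be sketched as follows. This is an illustrative reconstruction, not the project's script: the function names are hypothetical, the alpha grid (0.05 to 1.00 in steps of 0.05) is an assumption suggested by the alpha values reported in the results tables, and the RMSE is assumed to be computed from the third month onwards.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error over paired actual and forecast values."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))

def es_forecasts(sales, alpha):
    """One-step-ahead forecasts for the third month onwards.

    The third-month forecast is seeded with the simple average of the first
    two months; ES is applied from the fourth month.
    """
    forecasts = [(sales[0] + sales[1]) / 2]
    for t in range(3, len(sales)):
        forecasts.append(alpha * sales[t - 1] + (1 - alpha) * forecasts[-1])
    return forecasts

def best_alpha(sales, step=0.05):
    """Grid-search alpha and return (lowest RMSE, alpha)."""
    grid = [round(step * i, 2) for i in range(1, int(round(1 / step)) + 1)]
    scored = [(rmse(sales[2:], es_forecasts(sales, a)), a) for a in grid]
    return min(scored)

# Hypothetical steadily growing series: the search should favour a high alpha.
error, alpha = best_alpha([10, 20, 30, 40, 50, 60])
print(alpha, round(error, 2))
```

An alpha of 1 reduces ES to a naive "last month's sales" forecast, which is what the grid search tends to select for strongly trending series.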
The summary of model results, as well as the models selected for each cluster are discussed
below.
C5.6. Model Results – Online Store
Characteristics (Sales, Volatility) and RMSE (2SMA, 3SMA, ES, 3MLR, 4MLR, 5MLR)
Cluster  Sales     Volatility  2SMA   3SMA   ES Alpha  ES     3MLR   4MLR   5MLR
1        High      Low         39.51  43.13  1         33.02  21.98  21.70  13.84
2        Low       High        11.80  12.26  0.25      11.78  11.71  9.92   10.36
3        Low       High        11.81  12.44  0.45      12.39  11.98  12.03  12.53
4        Very Low  Very High   14.07  14.71  0.15      14.33  14.59  15.06  6.64
5        Medium    Low         15.87  17.20  0.55      16.70  15.45  14.16  14.33
6        Medium    Low         27.40  29.45  0.6       28.19  26.36  23.19  22.72
7        Low       Low         5.65   5.91   0.45      5.68   5.55   5.41   5.37
Legend: Yellow Box = Best performing / selected model
Red Box = Best performing but model not selected
Figure C5-4. – Model Results – Online Store
For online sales, we found that MLR produced the least forecasting error as compared to
traditional forecasting methods when applied to clusters that had relatively low sales volatility
as indicated by CV.
Cluster 4 proved difficult to forecast, as SKUs in the segment generally had very low sales and
very high volatility. Since none of the models were applicable, we decided to forecast the
cluster based on the 1-year simple moving average, in order to achieve realistic projections.
For Cluster 7 where MLR provided no significant advantage over SMA, we decided to select
the simpler method, which in this case was SMA.
C5.7. Model Results – Physical Stores
Characteristics (Sales, Volatility, Growth) and RMSE (2SMA, 3SMA, ES, 3MLR, 4MLR, 5MLR)
Cluster  Sales     Volatility  Growth     2SMA    3SMA    ES Alpha  ES      3MLR    4MLR    5MLR
1        Very Low  High        Low        2.21    2.15    0.15      2.02    1.93    1.95    1.96
2        High      Low         High       20.23   19.69   0.2       20.58   9.68    7.61    7.22
3        Very Low  Very High   Very High  1.33    1.28    0.25      1.22    1.08    1.12    1.18
4        Medium    Low         Low        5.35    4.86    0.35      4.9352  4.319   4.19    3.64
5        Medium    Medium      Medium     8.82    10.40   1         7.8174  5.9843  5.75    1.31
6        High      Very Low    Very Low   8.26    8.19    0.45      8.26    7.44    7.43    7.54
7        Medium    Low         Low        4.95    4.95    0.25      4.81    4.67    4.88    5.06
8        Low       Low         Low        3.1261  2.9727  0.2       2.8505  2.8358  2.7919  2.7229
Legend: Yellow Box = Best performing / selected model
Red Box = Best performing but model not selected
Figure C5-5. – Model Results – Physical Stores
For brick-and-mortar sales, we similarly found that MLR generally performed well for clusters
that had low volatility.
Once again, for clusters such as 1 and 3 that had very low sales, none of the models could
produce realistic forecasts. Hence, for both clusters, we decided to forecast sales based on the
past 1-year simple moving average.
For clusters 7 and 8, MLR provided no significant advantage over SMA, so SMA was
selected.
C5.8. Summary of Selected Models
Below is a summary of all clusters and the models selected for each cluster.
Cluster | Model Selected | Intercept | Coefficients m1 to m5 (as listed)
Online / 1 5MLR 49.76952 0.5159 -0.02174 0.29874 -0.69366 0.84688
Online / 2 4MLR 2.15003 1.33043 -0.16889 -0.06605 0.25221
Online / 3 2SMA
Online / 4 1-Year SMA
Online / 5 4MLR 7.7207 0.70147 -0.38977 0.41867 0.30306
Online / 6 5MLR 20.68515 0.64236 0.5724 -0.43256 0.24059 0.04884
Online / 7 2SMA
Brick-and-mortar / 1 1-Year SMA
Brick-and-mortar / 2 5MLR 102.566 -0.34383 -0.45088 0.94383 0.46224 -1.2709
Brick-and-mortar / 3 1-Year SMA
Brick-and-mortar / 4 5MLR -0.16532 0.19213 0.07833 0.27365 0.10177 0.11956
Brick-and-mortar / 5 5MLR 16.73564 -2.45361 1.62043 0.15607 -1.72305 1.40069
Brick-and-mortar / 6 4MLR -0.13701 -0.16794 0.26919 0.45087 0.32427
Brick-and-mortar / 7 2SMA
Brick-and-mortar / 8 3SMA
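Applying a selected MLR model amounts to a linear combination of recent monthly sales. The sketch below rests on two assumptions that the report does not state explicitly: that a kMLR model regresses a month's unit sales on the previous k months, and that m1 multiplies the most recent month. The function names and the sample history are hypothetical; the coefficients are those listed for Online / 5 (4MLR).

```python
def mlr_next(intercept, coeffs, history):
    """Forecast the next month from the last len(coeffs) months of sales.

    Assumed ordering: coeffs[0] (m1) applies to the most recent month.
    """
    if len(history) < len(coeffs):
        raise ValueError("not enough history for this model")
    lags = history[::-1][:len(coeffs)]    # most recent month first
    return intercept + sum(c * x for c, x in zip(coeffs, lags))

def mlr_horizon(intercept, coeffs, history, months):
    """Multi-month forecast that feeds each forecast back in as history (assumption)."""
    series = list(history)
    forecasts = []
    for _ in range(months):
        nxt = mlr_next(intercept, coeffs, series)
        forecasts.append(nxt)
        series.append(nxt)
    return forecasts

# Coefficients for Online / 5 (4MLR), in the order listed in the table above.
online_5 = {"intercept": 7.7207, "coeffs": [0.70147, -0.38977, 0.41867, 0.30306]}
sample_history = [20, 25, 22, 24]         # hypothetical monthly unit sales
print(mlr_horizon(online_5["intercept"], online_5["coeffs"], sample_history, 3))
```

If the actual lag ordering is the reverse (m1 applied to the oldest month), only the slicing line in `mlr_next` needs to change.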
C5.9. Identification of SKUs for Discontinuation
To identify further SKUs to be discontinued, we compared the total annual sales forecast
against the minimum order quantity (MOQ) provided by the client. We recommend
discontinuing SKUs whose MOQ is greater than the total annual sales forecast, since the
minimal sales generated by these products do not justify the risk of holding the large amounts
of inventory imposed by the MOQ.
A total of 55 SKUs fell into this category and are identified in the table below.
SKU  Forecasted Feb 2017 to Jan 2018 Sales  MOQ  SKU  Forecasted Feb 2017 to Jan 2018 Sales  MOQ
1259 36 250 7687 69 150
1359 136 250 7689 47 250
1469 140 250 7717 145 150
1569 197 250 7719 105 250
1720 41 48 7729 54 250
1869 105 250 7769 218 250
3109 124 250 7789 230 250
3257 16 150 7809 100 250
3259 73 250 7819 56 250
3709 61 250 7847 61 150
5009 236 250 7867 143 150
5579 17 250 7880 35 48
5709 209 250 8509 239 250
7607 37 150 8707 98 150
7609 56 250 8709 204 250
7617 58 200 8719 196 250
7619 28 250 8729 31 250
7629 152 250 8749 130 250
7639 48 250 9129 54 250
7647 56 150 9139 232 250
7649 162 250 3357 1 150
7659 118 250 3609 104 250
7667 85 150 5569 83 250
7669 21 250 5809 87 250
7679 34 250 7969 15 250
7680 48 48 8717 6 150
8737 101 130 8727 95 150
8747 3 130
Figure C5-6. – SKUs Identified for Discontinuation due to high MOQ
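The screening rule behind Figure C5-6. can be expressed as a simple filter. The sketch below is illustrative: the function name is hypothetical, and only three rows from the figure are used as sample input.

```python
def flag_for_discontinuation(forecast_units, moq):
    """Return SKUs whose minimum order quantity exceeds the 12-month sales forecast."""
    return sorted(sku for sku, units in forecast_units.items() if moq[sku] > units)

# Three rows taken from Figure C5-6. (forecast units, MOQ).
forecast_units = {"1259": 36, "1720": 41, "7680": 48}
moq = {"1259": 250, "1720": 48, "7680": 48}

print(flag_for_discontinuation(forecast_units, moq))
```

Under a strict "greater than" comparison, SKU 7680 (forecast 48, MOQ 48) is not flagged even though it is listed in the figure. One possible explanation is that the underlying forecast was fractional and rounded for display, but this cannot be confirmed from the figure alone.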
C5.10. Final Sales Forecast
Using the models selected in the earlier sections, we generated a 12-month sales forecast for
all applicable SKUs (excluding SKUs recommended for discontinuation), from February 2017
to January 2018 as shown in the table below.
[Chart: 12-month revenue by channel. Online: $778,402 and $1,833,615 (+135.6%);
Brick-and-Mortar: $424,334 and $282,121 (-33.5%)]
Service level refers to the client’s commitment to meeting customer demand. For example, a
90% service level means that the client will be able to meet customer demand 90% of the time.
For the purposes of this project, we seek to achieve a service level of 90% which corresponds
to a z-score of 1.28.
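The z-score for a given service level is the corresponding quantile of the standard normal distribution. A minimal check using Python's standard library (the variable names are our own):

```python
from statistics import NormalDist

service_level = 0.90
# Quantile of the standard normal distribution at the chosen service level.
z = NormalDist().inv_cdf(service_level)
print(round(z, 2))
```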
The formula for safety stock is as follows:
2800 4 9 8010 7 15
2809 4 22 8017 3 11
3100 6 7 8500 4 6
3250 4 7 8510 3 4
3350 4 11 8519 2 8
3359 2 17 8520 4 3
3400 9 21 8529 3 8
3409 8 24 8700 4 8
3500 6 8 8710 4 4
3600 4 9 8720 4 4
3700 4 8 8730 3 6
5000 4 6 8740 6 7
5200 3 3 8750 2 3
5209 2 11 8760 2 2
5500 3 6 9100 4 7
5560 4 4 9109 2 22
5570 3 3 9110 6 3
5700 6 16 9119 3 17
5800 3 9 9120 3 3
5900 6 12 9130 3 4
5909 2 22 9140 3 6
6000 11 43 9149 4 12
6007 3 49 9150 3 8
6100 6 21 9159 3 22
6107 3 29 9160 3 6
6110 6 27 9169 3 15
6117 4 22 9170 3 7
6130 7 11 9179 3 22
6137 3 15 9180 4 7
6200 7 25 9189 3 16
6207 4 32 9945 2 3
6210 9 24 90530 2 3
6217 4 22 92062 2 2
6240 6 13 92075 2 2
7600 6 9 92076 3 3
7610 6 17 7960 N/A 17
7620 3 6 7660 4 11
7630 3 6 7670 6 11
7640 6 9 7680 4 4
7650 4 11 7690 8 20
Figure C6-3. – Safety Stock Requirements for each SKU
All in all, replenishment amount for each SKU will be as follows, and will have to be calculated
by the client on a monthly basis depending on actual product sales:
C8. Top 10 Products
Based on our forecasts, the top 10 selling products (in terms of units) were identified. The
following table summarizes the forecasted annual unit sales, the safety stock required, and
the reorder point and reorder period for each SKU.
SKU | Forecasted Annual Sales (Units) | Historical Standard Deviation | Safety Stock | Reorder Point / Order Quantity | Reorder Period (Months)
2010 4646 69 153 1314 3
7770 1593 35 78 476 3
6000 1558 40 89 478 3
2017 1540 47 105 490 3
7820 1430 27 60 417 3
6007 1305 39 87 413 3
7980 1262 41 91 406 3
1350 1193 29 65 363 3
7780 1111 23 51 329 3
7760 1044 22 49 310 3
Figure C8-1. – Top 10 Products
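The safety stock values in Figure C8-1. are consistent with the common textbook formula: safety stock = z × (standard deviation of monthly sales) × √(period in months), rounded up, with z = 1.28 and a 3-month period. This is an inference from the table rather than the report's stated formula, and the function name below is hypothetical. The sketch recomputes the column from the listed standard deviations.

```python
import math

Z_90 = 1.28        # z-score for a 90% service level, as stated in the report
PERIOD_MONTHS = 3  # reorder period from Figure C8-1., assumed to act as the lead time

def safety_stock(monthly_sd, z=Z_90, months=PERIOD_MONTHS):
    """Safety stock rounded up to whole units (assumed formula, see lead-in)."""
    return math.ceil(z * monthly_sd * math.sqrt(months))

# (SKU, historical standard deviation, reported safety stock) from Figure C8-1.
top10 = [
    ("2010", 69, 153), ("7770", 35, 78), ("6000", 40, 89), ("2017", 47, 105),
    ("7820", 27, 60), ("6007", 39, 87), ("7980", 41, 91), ("1350", 29, 65),
    ("7780", 23, 51), ("7760", 22, 49),
]

for sku, sd, reported in top10:
    print(sku, safety_stock(sd), reported)
```

The standard deviations in the figure are themselves rounded, so agreement with the reported column supports the assumed formula but does not prove it.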
C9. Sample Codes used for Analysis
SAS – Variable Clustering
VBA – Regression Dataset Building
Python – Model RMSE Calculation
- End of Appendix C -
Appendix D – Glossary
Terms – Meaning
SAS Enterprise Miner – A solution for creating accurate predictive and descriptive models on
large volumes of data across different sources in the organization. In this case, it is used for
Market Basket Analysis.
Market Basket Analysis – A modelling technique based on the theory that when a consumer
purchases a certain group of items, he or she is likely to purchase another group of items.
Confidence – The likelihood or probability of purchasing B given that A has already been
purchased: P(purchase B | purchase A).
Lift – How useful the generated rule A → B is compared to a random guess.
Lift > 1 means the generated rule is more useful than a random guess.
Lift = 1 means the generated rule is the same as a random guess.
Lift < 1 means the generated rule is less useful than a random guess.
Support – The frequency with which a rule occurs across all the transactions.
Root mean squared error (RMSE) – A measure of error, whereby errors for individual data
points are squared, averaged and square-rooted. This prevents negative and positive errors
from cancelling each other out, and more accurately reflects the scale of errors.
Multiple-Linear Regression (MLR) – Used in predictive modelling to explain the relationship
between one continuous dependent variable and two or more independent variables.
Cluster Analysis – An explorative analysis that divides a multivariate dataset into “natural”
clusters (groups).
Variable clustering – A procedure used to group redundant variables, in order to identify key
variables to be used for hierarchical, k-means or other types of clustering of the actual dataset.
Hierarchical clustering – A type of cluster analysis that builds a hierarchy of clusters for more
distinct clustering.
K-means clustering – A type of cluster analysis that divides n observations into k clusters (the
user can specify k as the number of clusters), in which each observation belongs to the cluster
with the nearest mean, which serves as a prototype of the cluster.
- End of Glossary -
Appendix E – References
81817777.com. (2016) District Map. 81817777.com. Available at: http://81817777.com/new-
launch-condos/directory-SgCondo-apartment/singapore-district-map.jpg [Accessed: 22
March, 2017].
Baker, J. A. (2016) Compass One opens - with strong focus on young families. The Straits
Times. Available at: http://www.straitstimes.com/singapore/compass-one-opens-with-strong-
focus-on-young-families [Accessed: 22 March, 2017].
Cheng, K. (2016) A peek into Tengah, the next new HDB town the size of Bishan. Today.
Available at: http://www.todayonline.com/singapore/peek-tengah-next-new-hdb-town-size-
bishan [Accessed: 22 March, 2017].
Chin, D. (2014) New Safra Club at Punggol which will cater to young families. The Straits
Times. Available at: http://www.straitstimes.com/singapore/new-safra-club-at-punggol-
which-will-cater-to-young-families [Accessed: 22 March, 2017].
CNA. (2015) Bukit Panjang Integrated Transport Hub to open in 2017. Channel NewsAsia.
Available at: http://www.channelnewsasia.com/news/singapore/bukit-panjang-
integrated/1943524.html [Accessed: 22 March, 2017].
CNA. (2016) More than 10,000 flats launched in largest HDB sales exercise this year.
Channel NewsAsia. Available at: http://www.channelnewsasia.com/news/singapore/more-
than-10-000-flats-launched-in-largest-hdb-sales-exercise/3308458.html [Accessed: 22 March,
2017].
CNA. (2016) Two more years before decision on Cross Island Line: Khaw Boon Wan.
Channel NewsAsia. Available at: http://www.channelnewsasia.com/news/singapore/two-
more-years-before/2558414.html [Accessed: 22 March, 2017].
Heng, J. (2016) Singapore-KL High Speed Rail targeted to start running by around 2026;
journey will take 90 minutes. The Straits Times. Available at:
http://www.straitstimes.com/singapore/singapore-kl-high-speed-rail-targeted-to-start-
running-by-around-2026-journey-will-take-90 [Accessed: 22 March, 2017].
Lim, A. (2016) Plans to develop Jurong Lake Gardens Central and East unveiled. The Straits
Times. Available at: http://www.straitstimes.com/singapore/environment/plans-to-develop-
jurong-lake-gardens-central-and-east-unveiled [Accessed: 22 March, 2017].
Lim, P. J. (2016) Waterway Point officially opens. Channel NewsAsia. Available at:
http://www.channelnewsasia.com/news/business/singapore/waterway-point-
officially/2709726.html [Accessed: 24 March, 2017].
Ng, J. S. (2017) Parliament: Punggol North to become Singapore's first 'enterprise district',
home to digital and cyber-security industries. The Straits Times. Available at:
http://www.straitstimes.com/singapore/housing/punggol-north-to-become-spores-first-
enterprise-district-home-to-digital-and-cyber [Accessed: 22 March, 2017].
Ng, K. and Chua, A. (2016) Jurong Lake District to be second CBD, call for plans issued.
Today. Available at: http://www.todayonline.com/singapore/ura-seeks-proposals-develop-
jurong-lake-district-spores-2nd-cbd [Accessed: 22 March, 2017].
SingStat. (2016) Population Trends 2016. Department of Statistics Singapore. Available at:
http://www.singstat.gov.sg/publications/publications-and-papers/population-and-population-
structure/population-trends [Accessed: 24 March, 2017].
SingStat. (2016) Singapore Residents by Planning Area/Subzone, Age Group and Sex, June
2000 - 2016. Department of Statistics Singapore. Available at:
http://www.singstat.gov.sg/docs/default-source/default-document-
library/statistics/browse_by_theme/population/statistical_tables/tablea12-2000-2016.xls
[Accessed: 24 March, 2017].
URA. (2016) List of Postal Districts. Urban Redevelopment Authority. Available at:
https://www.ura.gov.sg/realEstateIIWeb/resources/misc/list_of_postal_districts.htm
[Accessed: 22 March, 2017].
Yeo, S. J. (2016) Tengah to be developed into a 'Forest Town'. The Straits Times. Available
at: http://www.straitstimes.com/singapore/housing/tengah-to-be-developed-into-a-forest-town
[Accessed: 22 March, 2017].
Yip, W. Y. (2016) Shaw's Waterway Point cineplex has most screens in the heartlands. The
Straits Times. Available at: http://www.straitstimes.com/lifestyle/entertainment/shaws-
waterway-point-cineplex-has-most-screens-in-the-heartlands [Accessed: 22 March, 2017].
- End of References -
BC3406
Business Analytics Consulting
Data Hackathon
Group Two | Alvon Chua Kang Jin | He renyi jonathan | Tan Jun Lek, Jerry
Overview of Paula’s Choice Data Hackathon Challenge
Background
Paula’s Choice Singapore is a skincare company that aims to provide the best skincare and
makeup products to consumers. Their main target audience is customers aged 25-35 (50%) and
18-24 (22%) years old. Paula’s Choice believes that their brand exemplifies the essence of
‘masstige’ where high quality products are offered to consumers at affordable prices.
[Diagram labels: Provides Dataset; Price; SKU; Place; Promotion; Complementary SKU]
Justifications for the Five Proposed Bundles
Key Technique Used:
• Market basket analysis
• Descriptive analysis at:
• SKU Level
• Bundle Level
• Channel Level
• Category Level
• Combination of the above
Interesting Insight
Customers mostly purchased items of the same pack
size when purchasing multiple items.
Flagship Store – Punggol
• Large existing customer base,
• Population profile fits target audience
• Data shows high sales performance;
great potential for high revenue
• Extensive development plans
Waterway Point
• ~$18.00 - $40.00 psf (24 Mar 17)
Forecasting Analysis
Methodology
• Clustering: Variable clustering + hierarchical clustering + k-means clustering to group SKUs
for forecasting
• Forecasting Model Selection: Comparison between simple moving average models, exponential
smoothing and multiple linear regression models for different clusters
• Final Forecast: Models and guidelines for stock replenishment and ordering proposed based on
generated forecast, constraints, and inventory management concepts such as lead time, safety
stock, reorder points
[Charts: Top 10 Products; Forecasted 12-Months Revenue]