Path Breaking Case Studies in E-Commerce Using Data Mining: Rupesh Sanchati, P.C. Patidar, Gaurav Kulkarni


International Journal of Computer Technology and Electronics Engineering (IJCTEE)

Volume 1, Issue 1

Path Breaking Case Studies in E-commerce using Data Mining
Rupesh Sanchati, P.C. Patidar, Gaurav Kulkarni

Manuscript received July 25, 2011.
Rupesh Sanchati, Sr. Lecturer, Department of Computer Science & Engg., MIT, Mandsaur (M.P.), India. (e-mail: [email protected]).
P.C. Patidar, Department of Computer Science & Engg., MIT, Mandsaur (M.P.), India. (e-mail: [email protected]).
Gaurav Kulkarni, Department of Computer Application, MIT, Mandsaur (M.P.), India. (e-mail: [email protected]).

Abstract: The e-commerce domain can provide all the right ingredients for successful data mining, and we claim that it is a killer domain for data mining. The architecture of various e-commerce sites has supported data collection, transformation, and data mining since its inception. With click-streams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing data mining using weblogs (e.g., sessionization and conflating multi-sourced data) are obviated, thus allowing one to concentrate on actual data mining goals. The paper briefly reviews the architecture of integrated E-Commerce with Data Mining, discusses some case studies, and puts forward some conclusions inferred from them. While the conclusions are drawn from case studies in the retail e-commerce domain, they are equally applicable to other data mining domains.

Index Terms: Web analytics, retail e-commerce, Simpson's paradox, Timeout Analysis, bot analysis.

I. INTRODUCTION

The advent of e-commerce revolutionized every industry. Every aspect of commerce, from sales pitch to final delivery, could be automated and made available 24 hours a day, all over the world. B2B solutions carried this one step further, allowing vertical partnerships and co-branding. Businesses found a new incentive to bring their data into the digital age. And dynamic content allowed the first truly personalized, interactive websites to come into being, all through the magic of e-commerce. Meanwhile, away from the digital storefront, data warehouses were springing up in the machine rooms of industry: gargantuan information repositories for the collection of every bit of business trivia. The sources of data were of the old realm of business: point-of-sale terminals, inventory databases, transaction records. Attempts to understand the data, first with statistical tools, later with OLAP systems, met with limited success until the introduction of Data Mining. Using machine learning algorithms, Data Mining software finds hidden patterns in the data, and uses them to form new rules and predict the future behavior of customers, turning that mountain of data into valuable knowledge and untapped business opportunities.
Using a website as a data collection tool is now commonplace, because of its interactivity, simplicity, and unobtrusiveness. So naturally one would want to analyze this data with the best data mining techniques available. The results of the data mining (the rules which say which customers are likely to buy what products at the same time, or who is about to switch to your competitor) would ideally then be integrated into your dynamic website, thus providing an automated, end-to-end, targeted marketing and e-CRM tool.

II. LITERATURE REVIEW

A. Data Mining and Retail E-Commerce
Notwithstanding several notable successes, data mining projects remain in the realm of research: high potential reward, accompanied by high risk. The risk stems from several sources. It has been reported by many researchers, and has been our experience, that the data mining or algorithmic modeling phase of the knowledge discovery process occupies at most 20% of the effort in a data mining project. Unfortunately, the other 80% contains several substantial hurdles that without heroic effort may block the successful completion of the project.
So, why is e-commerce different? In short, many of the hurdles are significantly lower. As compared to ancient or shielded legacy systems, data collection can be controlled to a larger extent. We now have the opportunity to design systems that collect data for the purposes of data mining, rather than having to struggle with translating and mining data collected for other purposes. Data are collected electronically, rather than manually, so less noise is introduced from manual processing. E-commerce data are rich, containing information on prior purchase activity and detailed demographic data. In addition, some data that previously were very difficult to collect are now easily accessible. For example, e-commerce systems can record the actions of customers in the virtual store, including what they look at, what they put into their shopping cart and do not buy, and so on. Previously, in order to obtain such data, companies had to trail customers (in person), surreptitiously recording their activities, or had to undertake complicated analyses of in-store videos. It was not cost-effective to collect such data in bulk, and correlating them with individual customers was practically impossible. For e-commerce systems, massive amounts of data can be collected inexpensively.
Unlike many data mining applications, the vehicle for capitalizing on the results of mining, the system, already is

automated. Also, because the mined models will fit well with the existing system, computing return on investment can be much easier.
The lowering of several significant hurdles to the applicability of data mining will allow many more companies to implement intelligent systems for e-commerce. However, there is an even more compelling reason why it will succeed. As implied above, the volume of data collected by systems for e-commerce dwarfs prior collections of commerce data. Manual analysis will be impossible, and even traditional semi-automated analyses will become unwieldy. Data mining soon will become essential for understanding customers.
The lessons described in this paper are based on case studies and an extensive study of contemporary literature. While lessons can be drawn on both the business implementation and technical fronts, in this paper we attempt to summarize our inferences on the business front.

B. Integrating E-Commerce and Data Mining: Architecture
In this section we give a high-level overview of an architecture for an e-commerce system with integrated data mining. In this architecture there are three main components: Business Data Definition, Customer Interaction, and Analysis. Connecting these components are three data transfer bridges: Stage Data, Build Data Warehouse, and Deploy Results. The relationship between the components and the data transfer bridges is illustrated in Figure 1.

Figure 1: Architecture of Integrated Data Mining with E-Commerce

In the Business Data Definition component the e-commerce business user defines the data and metadata associated with their business. This data includes merchandising information (e.g., products, assortments, and price lists), content information (e.g., web page templates, articles, images, and multimedia) and business rules (e.g., personalized content rules, promotion rules, and rules for cross-sells and up-sells). From a data mining perspective the key to the Business Data Definition component is the ability to define a rich set of attributes (metadata) for any type of data.
The Customer Interaction component provides the interface between customers and the e-commerce business. This interaction could take place through a web site (e.g., a marketing site or a web store), customer service (via telephony or email), a wireless application, or even a bricks-and-mortar point-of-sale system. For effective analysis of all of these data sources, a data collector needs to be an integrated part of the Customer Interaction component. To provide maximum utility, the data collector should not only log sale transactions, but should also log other types of customer interactions, such as web page views for a web site.
The Analysis component provides an integrated environment for decision support utilizing data transformations, reporting, data mining algorithms, visualization, and OLAP tools. The richness of the available metadata gives the Analysis component significant advantages over horizontal decision support tools, in both power and ease of use.
The Stage Data bridge connects the Business Data Definition component to the Customer Interaction component. This bridge transfers (or stages) the data and metadata into the Customer Interaction component. Having a staging process has several advantages, including the ability to test changes before having them implemented in production, allowing for changes in the data formats and replication between the two components for efficiency, and enabling e-commerce businesses to have zero down-time.
The Build Data Warehouse bridge links the Customer Interaction component with the Analysis component. This bridge transfers the data collected within the Customer Interaction component to the Analysis component and builds a data warehouse for analysis purposes. The Build Data Warehouse bridge also transfers all of the business data defined within the Business Data Definition component (which was transferred to the Customer Interaction component using the Stage Data bridge).
The last bridge, Deploy Results, is the key to closing the loop and making analytical results actionable. It provides the ability to transfer models, scores, results, and new attributes constructed using data transformations back into the Business Data Definition and Customer Interaction components for use in business rules for personalization.

III. PROPOSED FRAMEWORK
After analyzing various retail e-commerce sites, we propose some analyses that would be useful in practice. In each of the following subsections we describe the lessons learned from path breaking case studies.

Case 1: Bot Analysis
Web robots, spiders, crawlers, and aggregators, which we collectively call bots, are automated programs that create traffic to websites. Bots include search engines, such as Google, web monitoring software, such as Keynote and Gomez, and shopping comparison agents, such as mySimon. Because such bots crawl sites and may bring in additional


human traffic through referrals, it is not a good idea for websites to block them from accessing the site. In addition to these good bots, there are e-mail harvesters, which look for e-mail addresses that are then sold as e-mail lists, offline browsers (e.g., Internet Explorer has such an option), and many experimental bots by students and companies trying out new ideas.

Figure 2: Bot Analysis (chart annotations mark Sept-11, weekends, and registration at search engines)

Here are the data obtained from some case studies:

Percentage of sessions generated by bots is
23% at MEC (outdoor gear)
40% at Debenhams

Observations:
1. Bots account for 5 to 40% of sessions. Due to the volume and type of traffic that they generate, bots can dramatically skew site statistics.
2. Even when the human traffic is fluctuating substantially, the bot traffic remains the same.
3. After registering with search engines, the external bot traffic increases substantially, as expected.

Lesson:
1. Accurately identifying bots and eliminating them before performing any type of analysis on the website is critical.
2. Just because traffic increases immediately after registering with search engines, one should not be overwhelmed, because a substantial part of that increase might be bot traffic.
3. Many commercial web analytic packages include basic bot detection through a list of known bots, identified by their user agent or IP. However, such lists must be updated regularly to keep track of new, evolving, and mutating bots.

Case 2: Session Timeout Analysis
Enhancing the user browsing experience is an important goal for website developers. One hindrance to a smooth browsing experience is the occurrence of a session timeout. A user session is determined by the application logic to have timed out (ended) after a certain predefined period of inactivity.
Figure 2 shows the impact of different session timeout thresholds set at 10-minute intervals on two large clients.

Figure 2: Setting a suitable session timeout threshold.

Observations:
1. If the session timeout threshold were set to 25 minutes, then for client A, 7% of all sessions would experience a timeout and 8.25% of sessions with active shopping carts would lose their carts as a result. However, for client B, the numbers are 3.5% and 5% respectively.
2. Several user sessions were experiencing a timeout as a result of a low timeout threshold and lost their active
shopping cart.

Lesson:
1. The software should save the shopping cart automatically at timeout and restore it when the visitor returns.
2. Clients must determine the timeout threshold only after careful analysis of their own data.
3. Setting the session timeout threshold high means that fewer users experience a timeout, which improves the user experience.
4. However, a higher threshold also means that a larger number of sessions must be kept active (in memory) at the website, resulting in a higher load on the website's system resources.
5. Setting an appropriate session timeout threshold involves a trade-off between website memory utilization (which may impact performance) and user experience, so the right balance must be maintained.

Case 3: Simpson's paradox
On a few occasions it becomes difficult to present insights that are seemingly counter-intuitive. For instance, when analyzing a client's data we came across an example of Simpson's paradox (Simpson, 1951). Simpson's paradox occurs when the correlation between two variables is reversed when a third variable is controlled.
We were comparing customers with at least two purchases and looking at their channel preferences, i.e., where they made purchases. Do people who shop from the web only spend more on average as compared to people who shop from more than one channel, such as the web and physical retail stores?

Observations:
1. The line chart in Figure 3 shows that for each group of shoppers who shopped once, twice, three times, four times, five times, and more than five times respectively, the average spending per customer on the web-only channel is more than the average spending per customer on multiple channels.
2. However, the bar chart in Figure 3 shows that the average spending per customer for multi-channel customers is more than that of the web-only channel.

Figure 3: Average yearly spending per customer for multi-channel and web-only purchasers by number of purchases (left), and average yearly spending per customer for multi-channel and web-only purchasers (right).

Lesson:
1. Explain counter-intuitive insights. The reversal of the trend in the above case happens because a weighted average is being computed, and the number of customers who shopped more than five times on the web is much smaller than the number of customers who shopped more than five times across multiple channels. Such insights must be explained to business users.

Figure 4: Clarification of Simpson's paradox

Case 4: Search Effectiveness Analysis
Significant time and effort is spent in designing forms that are aesthetically pleasing. The eventual use of the collected form data for the purpose of data mining must also be kept in mind when designing forms.

Observation:
1. On the basis of average sales per visit, customers that search are worth two times as much as customers that do not search.
2. Failed searches hurt sales severely.
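
The comparison behind these two observations can be sketched as a small computation of average sales per visit, grouped by search behaviour. This is only an illustrative sketch: the record layout, the segment names, and the `SESSIONS` figures are hypothetical assumptions made for the example and are not data from the case studies.

```python
from collections import defaultdict

# Hypothetical visit records: (used_search, search_failed, sales_amount).
# The layout and the numbers are illustrative assumptions, not case-study data.
SESSIONS = [
    (False, False, 0.0),
    (False, False, 20.0),
    (False, False, 10.0),
    (True, False, 40.0),
    (True, False, 30.0),
    (True, True, 5.0),
]


def segment(used_search, search_failed):
    """Label a visit by its search behaviour."""
    if not used_search:
        return "no_search"
    return "failed_search" if search_failed else "successful_search"


def average_sales_per_visit(sessions):
    """Return {segment: total sales / number of visits} for each segment present."""
    totals = defaultdict(float)
    visits = defaultdict(int)
    for used_search, search_failed, sales in sessions:
        key = segment(used_search, search_failed)
        totals[key] += sales
        visits[key] += 1
    return {key: totals[key] / visits[key] for key in totals}


if __name__ == "__main__":
    for name, value in sorted(average_sales_per_visit(SESSIONS).items()):
        print(f"{name}: {value:.2f} per visit")
```

Keeping failed searches as their own segment, rather than folding them into a single "searched" group, is a design choice made here so that the effect of failed searches on sales stays visible.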


Figure 5: Effectiveness of search.

Lesson:
1. Design forms with data mining in mind.
2. Create custom pages for often searched keywords.
3. Do not allow empty search.

Case 5: Data Auditing
Data cleansing is a crucial prerequisite to any form of data analysis. Even when most of the data are collected electronically, as in the case of e-commerce, there can be serious data quality issues.
Consider the following graph, which shows the distribution of visits and orders by hour-of-day for a real website.

Figure 6: Distribution of visits and orders by hour-of-day.

Observation:
The above graph shows an interesting pattern in visits and orders. Orders follow visits by five hours, whereas we would expect visits and orders to be close to each other in time.

Lesson:
1. Data cleansing is a crucial prerequisite to any form of data analysis.
2. We've found serious data quality issues in data warehouses that should contain clean data, especially when the data were collected from multiple channels, archaic point-of-sale systems, and old mainframes. As shown in Figure 6, orders seem to follow visits by five hours. It turned out that different servers were being used to log clickstream (visits) and transactions (orders), and these servers' system clocks were off by five hours: one was set to GMT and the other to EST.

IV. CONCLUSION AND FUTURE WORK

We reviewed the integrated architecture of Data Mining with E-Commerce, which provides powerful capabilities to collect additional click stream data not usually available in web logs, while also obviating the need to solve problems that usually bottleneck analysis (and which are much less accurate when done as an afterthought), such as sessionization and conflating data from multiple sources. We believe that such architectures, where click streams are logged by the application server layer, are significantly superior and have proven themselves with various e-commerce sites.
Our focus on Business to Consumer (B2C) e-commerce for retailers allowed us to drill deeper into business needs to develop the required expertise and design out-of-the-box reports and analyses in this domain. Further, we believe that most lessons will generalize to other domains outside of retail e-commerce.
The top 3 lessons are:
1. Accurately identifying bots and eliminating them before performing any type of analysis on the website is critical.
2. Setting an appropriate session timeout threshold involves a trade-off between website memory utilization (which may impact performance) and user experience, so the right balance must be maintained.
3. Counter-intuitive insights must be explained to business users in depth.

E-commerce is still in its infancy, with less than a decade of experience. Best practices and important lessons are being learned every day. The Science of Shopping is well developed for bricks-and-mortar stores. Although the techniques show very promising results, the topic is still in its infancy. Surveying this topic, we listed some challenges that constitute promising research directions.

REFERENCES
[1] Ansari, S., Kohavi, R., Mason, L., & Zheng, Z. (2001). Integrating E-Commerce and Data Mining: Architecture and Challenges. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2001). IEEE.
[2] Kohavi, R. (1998). Crossing the Chasm: From Academic Machine Learning to Commercial Data Mining. Invited talk at the Fifteenth International Conference on Machine Learning (ICML 98), Madison, WI. Morgan Kaufmann.
[3] Kohavi, R. (2001). Mining e-commerce data: The good, the bad, and the ugly. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), 8-13. ACM Press.
[4] Kohavi, R. & Provost, F. (2001). Applications of data mining to e-commerce. In Data Mining and Knowledge Discovery, 5(1/2). Kluwer Academic.
[5] Blue Martini Software. (2003a). Blue Martini Business Intelligence at Work: Charting the Terrains of MEC Website Data
