Customer Relationship Manageme
CHURN TO WINBACK
A Dissertation
Submitted
to the Temple University Graduate Board
In Partial Fulfillment
of the Requirements for the Degree of
Doctor of Philosophy
By
Ke Li
May, 2013
UMI Number: 3564826
Published by ProQuest LLC (2013). Copyright in the Dissertation held by the Author.
©
Copyright
2005
by
Ke Li
All Rights Reserved
ABSTRACT
With the grant of a big CRM dataset from a large media company, this
dissertation examines four different categories of factors that could impact three stages of
of lost customers. Specifically, with the aid of the machine learning method of random
forests and text mining techniques, this study identifies, among the factors of customer
responsiveness to marketing actions), firm’s marketing initiatives (e.g. the volume of the
channels they use, and the marketing penetration in different geographical areas),
customer self-reported deactivation reasons, as well as the call centers notes in text form,
which factors play bigger roles than others during each of the three stages of CRM.
Furthermore, the author also examines how these factors evolve throughout these three
whether to convert to paid customer, to churn, or to reactivate their service with the
company. The findings help managers better allocate their resources in the processes of
ACKNOWLEDGMENTS
committee members who have guided me and assisted me throughout the completion of
this dissertation. Dr. Anthony Di Benedetto, it is your book New Product Management
that convinced me that Marketing is the most interesting research area and the one I should spend
the rest of my life on. Dr. Eric Eisenstein, it is your consistent belief in me and
your persistence in pushing me into this quantitative world that enabled me to handle the analysis
of the big data from SiriusXM. Dr. Ji Zhu, we met only once, when you came to Temple
University to give a presentation, but you quickly impressed me by coming up with
machine learning methods to deal with the classification problem in the Sirius data after
we had chatted for only 5 minutes. Afterwards, both your expertise and your spiritual support
carried me through the process of completing the dissertation. Finally, Dr. Eric Bradlow,
Wharton in June 2009, but it was not until several years later that you finally became my
dissertation committee member, and you have been the game-changer in my
career ever since. Your guidance on my dissertation direction, your prompt replies to my
every weekly report with your insights and encouragement, as well as your word-by-word
review and revision of my dissertation not only turned the dissertation into an
extremely interesting project for me but also paved the way for me to become a serious
researcher. With all of your deep intelligence and great patience this experience of
Dr. Sanat Sarkar, you are not only the external reader on my committee but also
Dr. Aaronson, when you interviewed me for the PhD program at Fox School of
Business you told me Fox School needs students like me. Because of these very words I
have striven not to let you down. In addition, throughout all these years at Temple, you
have always given me a hand in my most dire situations.
I would also like to thank CIBER Temple, particularly Ms. Kim Cahill and Mr. Arvind
Phatak; without your financial support I would not have been able to accomplish this task.
When other people doubted me, it was you, Kim, who always held strong confidence in me
I would also like to acknowledge Dr. Bei Kang, who assisted me in many ways to bring
Finally, I would like to thank my family: my mother Xi Liu, my father Faliang Li,
and my sister Hao Li. Without their unconditional love and support I would never have been able
to make it. To my mom in particular: even though I did not become a scientist as she wished,
this is at least the first step toward becoming a scholar, if she could compromise her
This dissertation is dedicated to my grandparents Cuihua Zeng, Mengqiu Zhang,
Yingquan Li, Yingzhou Liu, and my parents Xi Liu and Faliang Li.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGMENTS
CHAPTER
1. INTRODUCTION
2.1.3 WINBACK
3. HYPOTHESES
3.2.1 MARKETING COMMUNICATION CHANNELS AND CUSTOMER RESPONSIVENESS
4. METHODOLOGY
4.1.2 ADABOOST
REFERENCES
APPENDICES
CHAPTER 1
INTRODUCTION
in marketing since firms realized that, in order to maximize their profit, they should
carefully manage their relationship with their customers, often attempting to anticipate
consumer behavior. This is particularly the case in the service industry. Even though
CRM is a relatively recent term, the emphasis of marketing strategies on customers has
had a long history. From the very beginning of marketing campaigns, entrepreneurs and
firms have relied on all kinds of means to attract potential customers and to keep them
loyal for repeated purchases while providing them with the “right” products. Practitioners
have always emphasized both products and customers simultaneously, but the shift of
focus of marketing research from product to customers became practicable only after the
of customers’ demographics and their transactions with firms, but these early efforts held
the promise of enabling targeted direct marketing to individual customers. Over time,
increased processing power and advances in methodology raised the competitive bar so
that, in order to be competitive, firms must have some form of CRM strategy in place.
Today, many companies apply RFM (Recency, Frequency, and Monetary Value)
methods as their CRM models of choice (Reinartz & Kumar, 2003; Fader, Hardie, & Lee,
2005; Cui et al., 2006). These models use secondary data as an input, and segment
customers so that they can target the right customers for promotion, and to profit from the
relationship (Berger & Magliozzi 1992). The most valuable customers identified by RFM
segmentation are the ones with the highest recency, frequency, and monetary value.
Critics of this segmentation approach point out that it assumes that customers’
behavior patterns are static throughout their whole life cycle. If incorrect, this assumption
application that database marketing has facilitated is the customer life cycle or customer
lifetime value (CLV) analysis. As early as the 1940s, companies had already begun to
approximate the average value of their customers (Petrison et al., 1997), but academic
research on CLV flourished only after the introduction of database marketing. Most of
the CLV literature has been focused on constructing different probabilistic models of the
underlying consumer behavior. A smaller set of papers explore the relationship between
firm profitability, marketing actions and CLV (Hogan et al., 2002; Reinartz & Kumar,
2003; Rust, Zeithaml, & Lemon, 2004). Studies show that CLV is an effective metric to
segment customers and allocate marketing resources (Haenlein, Kaplan, & Schoder,
2006). A recent study by Fader, Hardie and Lee (2005) links RFM and CLV through
iso-value curves in customer base analysis, showing the statistical sufficiency of RFM under a
Database marketing evolved into relationship marketing during the 1990s due to
four different drivers according to Fletcher, Wheeler and Wright (1991): the changing
role of direct marketing, changing cost structures, technology and economic conditions.
With increasing marketing costs, companies realize that to gain competitive advantage
customers based on their specific needs so as to increase the length of customer life cycle
as well as the value of the transactions between the firm and the customer during each
stage of that life cycle. As a result, marketing activities are increasingly being organized
around relationships with customers rather than products and services. One example of
this has been the increased use of loyalty programs (Tellis & Zufryden 1995; Kim, et al.
2001; Kivetz & Simonson 2002; Roehm, et al. 2002; Kivetz & Simonson 2003; Lewis
2004). In addition, advances in technology made it inexpensive to move from a buyer
database to a customer database integrated with marketing activities (Shaw
and Stone 1988); this change, together with the separation of consumer and business
markets, also eased the evolution from database marketing to
relationship marketing.
customer data. It involves using technology to organize, automate, and synchronize sales,
marketing, customer service, and technical support so that firms can track not only the
behavior and spending history of their customers but also the marketing activities,
customer services as well as the communications that companies have had with each
customer communications brings about the analysis of communication data from social
media sites such as Twitter, LinkedIn, and Facebook. Thus companies can track and
communicate with customers who share opinions and experiences about their company,
products, or services, even when the sharing has taken place in a non-company forum.
The challenge facing both marketing practitioners and academics is no longer the scarcity
of data but the tremendous abundance and variety of data that CRM systems generate.
The explosion of data has forced market researchers to seek new techniques, which have
been borrowed from machine learning, computer science, and database management.
The goal of CRM remains to achieve customer loyalty and to maximize customer
lifetime value. Modern CRM systems attempt to realize these goals by approaching each
stage of the customer’s relationship with the company as a separate “stage”: customer
acquisition, customer retention, and winning back lost customers. All activities
performed during each of these three stages are of great importance since customer
lifetime value can be maximized only when the allocation of resources is balanced across
these activities (Reinartz, Thomas, & Kumar, 2005). Although researchers have
recognized the importance of each stage, the limited availability of CRM data in the past,
particularly the focus of most company databases solely on existing customers, has led
(Thomas, Blattberg, & Fox, 2004). Even though there is some existing research done on
research has investigated all three stages of CRM activities together. Thanks to
the Wharton Customer Analytics Initiative (WCAI), this dissertation has
access to the CRM database of a large North American media company that covers all
three stages of CRM activities, as well as the text data from the call center notes.
Equipped with this rich, multi-stage subscriber interaction data, I apply classification
methods from machine learning to predict whether customers will convert, churn or
reactivate their service based on their past usage patterns, their text communications with
deactivation reasons. With the aid of the variable importance from random forests and
gradient boosting, I am able to identify the most important factors that impact each of the
retention, and winback of lost customers, and I am also able to explore how each factor
evolves throughout these stages with regard to their effects on shaping customers’
relationship management in the academic realm in terms of data range and variety, as
well as the possible factors that impact customer acquisition, retention and winback. For
the first time in the marketing literature, different factors such as customer duration,
quality, customers’ usage of self-service, as well as text mining of call center notes are
empirically examined together throughout the three stages of CRM. Moreover, given that
current research on winback is scant, this study enriches the literature by exploring how
factors affecting customer conversion and retention might also impact customer winback.
customer relationship management, as well as which factors are more influential than
others, is critical to firms because companies are frequently faced with the problem of
limited resources and how to optimally allocate them. If it is marketing activities that
play a bigger role in acquiring, retaining and winning back lost customers, firms should
or in their sensitivity to price can better forecast customers’ future behavior, this
dissertation can identify those customer characteristics and thus help companies
target the right customers at the right time. If the service dissatisfaction, product
deactivation reasons turn out to be the major factors that drive customers to discontinue
their services with the company, managers may find it profitable to improve their service
or product quality. If the text mining of the customer call center notes is more predictive
relationship with the company, firms should invest more in understanding their customers’
direct feedback. Furthermore, this dissertation helps managers find answers to the
following relevant questions: 1) Among all the marketing communication channels that
they use to reach customers (e.g. email, direct mail, or phone calls), which ones are
the most effective? 2) Which of the customers’ own activities, such as contacting the customer
service center or using online self-service, are more indicative of their future
behavior? 3) Should they apply the same strategy to recapture the customers who never
converted to a paid customer and the customers who churned after being a paid one? This
dissertation sheds light on all of these important issues, and it should aid firms in
deciding whether they should attach more importance to their own marketing initiatives,
manner: chapter 2 reviews the literature in customer relationship management and text
mining; chapter 3 lays out the conceptual hypotheses; chapter 4 discusses the
methodology that will be applied; chapter 5 describes the data and preprocessing steps;
chapter 6 discusses the results and implications; and finally chapter 7 points out the
CHAPTER 2
LITERATURE REVIEW
Drawing from the CRM literature and based on my research objectives, this
review focuses on the contributing factors that extant research has revealed in influencing
the three stages of a company’s CRM activities: 1) customer acquisition, 2) retention and
3) winback of lost customers. I also discuss how prior researchers treat the relationship
among customer acquisition, retention and winback from an analytical point of view.
adoption and diffusion of innovation (Bass 1969; Mahajan, Muller, & Kerin, 1984;
Norton & Bass 1987; Davis 1989; Davis et al., 1989; Rogers, 1995; Zhu et al., 2003; Zhu
& Kraemer, 2005; Campbell & Frei, 2010). Early research investigates what factors are
related to aggregate product adoption; for example, in the Bass model, Bass (1969) uses
the product characteristics and the number of previous adopters to predict product
adoption. Mahajan et al. (1982) also discover that the pricing package for new products
has a major impact on their adoption and retention. Subsequently, using individual data,
role in product adoption as product characteristics and marketing efforts (Chatterjee &
Eliashberg, 1990; Sinha & Chandrashekaran, 1992; Goolsbee & Klenow, 2002). Lewis
(2006) in particular reveals that acquisition discount depth is negatively related to repeat-
buying rates and customer asset value using customer-level data from a newspaper and an
acquisition on customer equity growth and find that customers acquired by marketing
actions are of less value to the firm than customers persuaded by word-of-mouth;
this is because the former add only short-term value, but the latter add more
long-term value to the firm. Musalem and Joshi (2009) identify the relationship between
customer responsiveness to a firm’s CRM efforts and customer acquisition and retention.
They recommend that firms invest more in moderately responsive customers than in
highly responsive customers, since the latter are also pursued by competing companies,
which erodes the effects of an individual firm’s CRM efforts. Using
a nine-year period of U.S. airline industry data, Grewal et al. (2010) discover that
heterogeneity in customer satisfaction impacts both acquisition and retention sales. Nam et
al. (2010), on the other hand, explore the role of word-of-mouth resulting from service
signal quality as well as advertising and the retail environment on service adoption.
Finally Xue et al. (2011) relate customer demand for banking services, the availability of
The study of free trials and subsequent conversion dates back to
Scott (1974), who examines whether offering a two-week free newspaper trial can induce a
the control group. Research on conversion explores how free samples can help induce
paid purchases (Gedenk and Neslin 1999; Bawa and Shoemaker 2004). Contrary to
Scott’s negative results, Gedenk and Neslin show positive short-term and long-term
contributed to customer conversion from free to fee, Pauwels and Weiss (2008) find that
adjusting price promotion, email and search-engine referrals can all facilitate customer
conversion.
Factors                                     Articles
Acquisition
  product characteristics, pricing          Bass (1969), Mahajan et al. (1982)
  individual characteristics                Chatterjee and Eliashberg 1990, Sinha and
                                            Chandrashekaran 1992, Goolsbee and Klenow
                                            2002, Xue et al. 2011
  marketing efforts                         Lewis (2006)
  word-of-mouth                             Villanueva, et al. (2008), Nam et al. (2010)
  customer responsiveness to a              Musalem and Joshi (2009)
    firm’s CRM
  customer satisfaction heterogeneity       Grewal et al. (2010)
  customers' efficiency in using            Xue et al. 2011
    self-service channels, local
    Internet banking penetration
Conversion
  free trial, free samples                  Scott (1974), Grewal et al. (2010)
  price promotion, email and                Pauwels and Weiss (2008)
    search-engine referrals
In summary, prior research has uncovered a wide range of factors that could affect
examine these factors all together and rank their importance so that I know, at least which
Customer retention and churn have long been a concern for the service industry,
and marketers and marketing researchers have tried to understand the underlying
mechanisms. Two research streams can be identified in the customer retention literature
and they are closely related to one another: 1) research exploring how some factors can
increase or decrease the customer retention rates and 2) research studying why customers
churn, or discontinue their relationship/service with the company. In essence, these two
streams aim for the same goal: to retain customers as long as possible so that companies
can maximize customer lifetime value. Identifying the churn factors and finding out
In examining the potential factors and their relationship with customer retention,
McGahan and Ghemawat (1994) analyze life insurance data and find that in very
competitive environments, large firms achieve greater customer retention than their
smaller rivals. Ganesh et al. (2000) also show that customer heterogeneity can
result in different retention rates. Rust and Zahorik (1993), Bolton (1998) and Gustafsson
et al. (2005) discover that customer satisfaction levels are positively related to the
the impact of affective commitment (an emotional factor that indicates the degree of
retention. Both Danaher (2002) and Lambrecht (2006) study the effects of dual pricing, namely access and usage
prices for subscription services, on customer retention. Danaher (2002) finds that access
rates have a much stronger effect on retention than usage prices, while Lambrecht (2006)
finds that a flat rate does not increase the churn rate, but pay-per-use does.
Some researchers have examined the role of the channel of acquisition in CRM.
For example, online versus traditional channels could have different impacts on customer
retention, and Hitt and Frei (2002) show that retention is marginally higher for customers
using online channels. In exploring how service quality can affect retention, Boulding et
al. (1993) link perceptions of the dimensions of service quality to a person's overall
quality perception to predict customer retention; the researchers find that service quality
perceptions positively affect intended behaviors. Zeithaml, Berry, and Parasuraman (1996)
show that service quality affects subscriber behavioral intentions with respect to the
large U.S. bank, Campbell and Frei (2010) discover that customers’ usage of self-service
channels is positively associated with customer retention. With the subscription
and Bradlow (2008) apply five factors to forecast customer retention behavior: (1)
duration dependence, (2) promotional effects, (3) subscriber heterogeneity, (4) cross-
cohort effects, and (5) calendar-time effects (e.g., seasonality). The results show
across all seven services. Customer heterogeneity, calendar-time effects, and duration
dependence only improve in five services, while cross-cohort effects are insignificant in
all services.
Zeithaml, Berry, and Parasuraman (1996) look at whether service quality on particular
behaviors signals customer switching and they find that customers’ behavioral intentions
are strongly influenced by service quality. Bolton (1998), Verhoef (2003), and
Gustafsson et al. (2005) examine whether customers’ dissatisfaction with service and
Keaveney (1995) conducts a field study which identifies more than 800 critical
behaviors of service companies that could cause customers to churn. The author groups
these 800 critical behaviors into 8 general categories: 1) pricing (which includes the
subcategories of high prices, price increases, unfair pricing practices, and deceptive
hours of operation, waiting time for service or for an appointment, etc.); 3) core service
failures (which is the interaction failure between customers and customer service
conflicts of interest); and 8) involuntary switching. Most of these factors were echoed or
further studied by other researchers (e.g., Shaffer and Zhang (2002) also find that
Braun and Schweidel (2011) classify the causes of churn into three categories: 1)
controllable factors such as price, dissatisfaction with service, or with the product; 2)
uncontrollable reasons which include moving from the service area and
nonpayment or abuse of service. The researchers use the data from a land-based
conclude that the ability of a firm to reduce customer churn is diminished by the
Other factors that are found to cause churn include customer heterogeneity
(Morrison and Schmittlein 1980), payment equity (Bolton, Kannan, and Bramlett 2000;
Bolton and Lemon 1999), loyalty programs (Bolton, Kannan, and Bramlett 2000;
Verhoef 2003), marketing activity (Lewis 2004, 2005), retention duration (Hughes 2006;
Reichheld, 1996), and multichannel communication (Godfrey, et al. 2011), to name a few.
In summary, prior research shows that the factors that influence a consumer’s
decision to remain a customer with a company or to churn are similar to the ones that
well as customer satisfaction and perceived service quality all impact both acquisition
and retention. However, no prior studies have examined all of these factors concurrently
in both of these stages or in the winback stage. This study fills this gap by exploring how
Table 2
Factors                                     Articles
Retention
  Firm size in competitive environments:    McGahan and Ghemawat (1994)
    large vs. small
  Customer heterogeneity                    Ganesh, et al. (2000), Schweidel, Fader,
                                            Bradlow (2008)
  Customer satisfaction levels              Rust and Zahorik (1993), Bolton (1998) and
                                            Gustafsson et al. (2005)
  Affective commitment, calculative         Gustafsson et al. (2005)
    commitment, prior churn on retention
  Access rate vs. usage price               Danaher (2002)
  Online vs. traditional channels           Hitt and Frei (2002)
  Service quality                           Boulding et al. (1993); Zeithaml, Berry, and
                                            Parasuraman (1996)
  Self-service channels                     Campbell and Frei (2010)
  Duration dependence, promotional          Schweidel, Fader, and Bradlow (2008)
    effects, cross-cohort effects,
    calendar-time effects (e.g., seasonality)
Churn
  Flat rate vs. pay-per-use                 Lambrecht (2006)
  Pricing, inconvenience, core service      Keaveney (1995)
    failures caused by mistakes, billing
    errors or technical problems, service
    encounter failures, employee responses
    to service failures, ethical problems,
    involuntary switching
  Competition                               Keaveney (1995), Shaffer and Zhang (2002)
  Controllable factors, uncontrollable      Braun and Schweidel (2011)
    reasons, firm-initiated churn
  Customer heterogeneity                    Morrison and Schmittlein 1980
  Payment equity                            Bolton, Kannan, and Bramlett 2000; Bolton
                                            and Lemon 1999
  Loyalty programs                          Bolton, Kannan, and Bramlett 2000; Verhoef
                                            2003
  Marketing activity                        Lewis 2004, 2005
  Retention duration                        Hughes 2006; Reichheld 1996
  Multichannel communication                Godfrey, et al. 2011
2.1.3 Winback
Although each of the CRM stages, from acquisition to retention
to winback, is important to any firm that intends to maximize the profitability of
its customer relationships, winback, or the reacquisition of lost customers, is by far the least
researched area among the three. As Griffin and Lowenstein (2001) and Thomas, et al.
(2004) point out, the importance of allocating more resources to winning back customers
is apparent: the probability of a firm winning back a churned customer and the net return
on investment can be eight times as high as those of winning a new customer. However, very
limited research has explored which factors can help firms win back lost
customers, with the exception of Stauss and Friege (1999) and Thomas, et al. (2004). Stauss and Friege
(1999) develop the second lifetime value (SLTV) metric to facilitate evaluating the NPV
generated after a customer has been won back and point out that not every customer
should be won back. Thomas, et al. (2004) focus on the reacquisition pricing strategy and
find that the optimal pricing strategy for a firm to recapture its churned customers
involves a low reacquisition price and higher prices once customers have been
reacquired.
Even though both customer heterogeneity and pricing strategy are important
factors in winning back lost customers, I assert that other factors that have been influential
in shaping customer acquisition and retention should also be examined for their effect on
winback. Therefore, in this dissertation, I explore how prior marketing activities, prior
responsiveness to marketing initiatives, and prior call center notes by customers
affect the odds of winning back customers. In particular, using the rich
data, I classify lost customers into two categories: those who tried the product for a
trial period but decided not to convert to paid customers, and those who were
paid customers for some time and then decided to churn. Making this distinction in the
lost customer base is of great importance to both academics and managers because it
allows us to assume that there are different underlying factors that influence recapturing
these two different kinds of lost customers. Hence this research can assist companies in
allocating their resources so that they can achieve a greater return on investment.
Table 3
Factors                                     Articles
Winback
  Customer heterogeneity                    Stauss and Friege (1999)
  Optimal pricing strategy for a firm to    Thomas, et al. (2004)
    recapture their churned customers
    involves a low reacquisition price and
    higher prices when customers have
    been reacquired
acquisition, retention and winback, prior researchers also have had extensive discussions
about how these three events are related, especially between customer acquisition and
retention. Two opposing assumptions about their relationship have been made.
One assumption is that customer acquisition and retention are two independent events
(Blattberg and Deighton 1996; Gupta, Lehmann et al. 2004). The other assumes that the
customer acquisition process affects the customer retention process (Hansotia and Wang
1997; Thomas 2001; Reinartz, Thomas et al. 2005; Schweidel, Fader, Bradlow 2008).
customer acquisition and retention rates. Thomas (2001), on the other hand, develops a
model for estimating the length of a customer's lifetime linking customer acquisition to
customer retention and shows the financial impact of not accounting for the effect of
acquisition on customer retention. Gupta, et al. (2004) also study the customer value by
treating customer acquisition and retention as two independent events, but Schweidel,
Fader, and Bradlow (2008) explore the duration dependency between acquisition and
retention.
No matter whether we believe these events are independent or not, the consensus
is that we should truncate and censor the data (Bolton 1998; Schmittlein and Helsen
1993) in order to separate acquisition data from retention data for further statistical
analysis. In this study, I hold that all CRM events are related: in particular, acquisition
activities affect customer retention, and both acquisition and retention efforts have an impact
In the following section, I briefly review the text mining and text classification
Today about 80% of the data held within an organization is in the form of text
documents, for example, reports, Web forms, open-ended survey responses, news feeds,
e-mails, and call center notes. Texts are essential for an organization to gain a better
understanding of its customers’ behavior, and the abundance of text data has forced
organizations to seek ways to explore and leverage this information. Text mining is a
technology that aims to capture key concepts and themes and to uncover hidden
knowledge of the precise words or terms that authors have used to express those concepts.
Text mining is one branch of data mining, or the analysis step of the Knowledge Discovery
in Databases (KDD) process, which results in the discovery of new patterns in large data
sets. The data that text mining deals with are usually referred to as unstructured data
since the data cannot be stored in a relational database and thus cannot be structured by
categorical, ordinal, and continuous variables. Data consisting of both structured and
Text mining usually involves two steps: the first step transforms the textual data
into structured data with specialized text analytic techniques; and the second step applies
relationships within the data. Sometimes the transformed data can be combined with
other structured data to make predictions of the future behavior. In this dissertation, I use
The primary problem with the management of all of the unstructured text data is
that there are no standard rules for writing text so that a computer can understand it. The
language, and consequently the meaning, varies for every document and every piece of
text. The only way to accurately retrieve and organize such unstructured data is to
analyze the language and thus uncover its meaning. There are several different automated
approaches can be broken down into two kinds, linguistic and nonlinguistic.
treat each document $d_j$ as an array of words and apply computer technology to quickly
scan and categorize key concepts within the text. These key concepts are called
vocabulary and they are used as the feature set to train the classifier. After key concepts
are identified in each document, one counts the number of times each feature word occurs
and calculates their statistical proximity to related concepts. From raw text documents to
the data matrix that is needed for further analysis, there are several preprocessing steps
1) The removal of function words: since not all words can be used to train the
classifier, some words or phrases have to be removed. The first of these words are
function words. Function words are also called stop words, which include: auxiliary verbs,
conjunctions, articles, prepositions, etc. These words appear frequently in most text
documents but they do not contribute to the training of the classifier, therefore they need
2) Stemming: the process of grouping words that share the same
morphological root, along with their misspellings. For example, the words “precise”, “precision”, and
“preciseness” should be grouped as one word, “precise”. Even though Baker and
McCallum (1998) argue that stemming can sometimes hurt effectiveness, the common
practice is still to adopt it since it can reduce both the dimensionality of term space and
I need to decide how to represent the feature values. One way is to use a binary vector,
assigning 1 to the feature value if the document contains the feature word and 0 otherwise.
Another way is to count the frequency of the feature words that appear in each document.
According to Salton and McGill (1983), there are four kinds of automatic
weighting systems to assign weights to feature terms extracted from the documents.
Therefore, terms occurring in every document of a collection are treated equally with
$$\mathrm{WEIGHT}_{ik} = \mathrm{FREQ}_{ik}$$
account the number of documents that each term is assigned and assumes that the content
frequency of the number of documents $\mathrm{DOCFREQ}_k$ that each term is assigned to.
$$\mathrm{WEIGHT}_{ik} = \frac{\mathrm{FREQ}_{ik}}{\mathrm{DOCFREQ}_k}$$
which the assignment of a term to the documents is capable of decreasing the average
theory stating that the best index terms tend to occur in the relevant documents with
respect to some query. If we define $\mathrm{TERMREL}_k$ as the ratio of the proportion of relevant
items in which term $k$ occurs to the proportion of nonrelevant items in which the term
occurs, then the term relevance weighting system can be expressed as:
The most commonly used weighting systems are actually the product of the
term frequency and the inverse document frequency (tf × idf, or tfidf) and normalized
tf × idf, which incorporates the length of document vectors into the weighting system
4) Feature selection: since the number of extracted feature words for most
documents of a collection is several thousands, it can induce high dimensional term space
and overfitting problems. Feature selection is the name given to a broad set of methods
an attempt to select, from the original set $T$, the set $T'$ of terms (with $|T'| \ll |T|$) that yields
The most commonly used feature selection method applies an evaluation function
to a single word (Soucy and Mineau, 2003). First, all words are evaluated and sorted
features are used to form the best feature subset. Individual words are scored using
measures such as mutual information, information gain, odds ratio, $\chi^2$ statistics, term
strength and so on (Brank et al. 2002; Torkkola 2002; Forman 2003; Sousa et al. 2003).
These metrics have their origin in machine learning or text retrieval but there are also
new related statistical regularization and variable selection methods such as LASSO
(least absolute shrinkage and selection operator, Tibshirani 1996) and elastic net (Zou &
Hastie, 2005). These techniques are similar to ridge regression. The lasso minimizes the
residual sum of squares, subject to the sum of the absolute values of the coefficients
being less than a constant. It thus forces some coefficients to be exactly 0, which has the effect
of reducing the dimensionality of the data. The elastic net combines ridge regression
and the lasso, since it applies both L2 and L1 penalties for regularization. Elastic net is
particularly applicable for large datasets when the number of predictor variables is much
However, even after all these procedures are performed, the text classification
completed by a nonlinguistic approach still suffers from limited accuracy. Due to the
large presence of polysemy (i.e., a word or phrase that has multiple related meanings),
homonymy (i.e., a group of words that share the same spelling and pronunciation but have
different meanings), and synonymy (i.e., different words with the same or similar meanings),
the original terms may not be the optimal dimensions for document content
representation. To compensate for this, linguistic approaches have been applied to help
reduce the dimensionality even further and hence substantially enhance the prediction
accuracy.
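
The nonlinguistic preprocessing steps described above (stop-word removal, stemming, term-frequency counting, and tf × idf weighting) can be illustrated with a minimal sketch. The stop-word list, the suffix-stripping rule, the sample call center notes, and the particular tf × idf variant (tf multiplied by log(N / df)) are all assumptions made for illustration; they are not the tools or data used in this dissertation.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list and suffix rules; both are assumptions and are
# far smaller than what a real text-mining tool would use.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "was", "in", "on", "for"}
SUFFIXES = ("ness", "ion", "ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def term_frequencies(docs):
    """One Counter per document, i.e. WEIGHT_ik = FREQ_ik."""
    return [Counter(tokenize(d)) for d in docs]


def tfidf(docs):
    """Weight each term by tf * log(N / df), one common tf-idf variant."""
    freqs = term_frequencies(docs)
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for f in freqs for term in f)
    return [
        {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in f.items()}
        for f in freqs
    ]


# Hypothetical call center notes.
notes = [
    "Customer called to cancel the service",
    "Customer canceled and asked for a refund",
    "Customer asked for the billing date",
]
weights = tfidf(notes)
```

In this toy collection the stem "customer" appears in every note and therefore receives a weight of zero, while a stem that appears in only one note receives the largest weight, which is the behavior the inverse document frequency is meant to produce.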
of words, phrases, and syntax, or structure, of text. A system that incorporates NLP can
products, organizations, or people, using meaning and context. This approach offers the
speed and cost-effectiveness of statistics-based systems, but it offers a far higher degree
Two commonly used linguistic approaches to further reduce the dimension of the
text transformed data matrix are term clustering and latent semantic indexing. Term
clustering groups words with a high degree of pairwise semantic relatedness, so that the
groups (or their centroids, or a representative of them) may be used instead of the terms as
dimensions of the vector space. Term clustering addresses synonymy and it can be done
means clustering is not affected by the category labels attached to the documents (Lewis
1992; Li and Jain 1998). Supervised clustering gathers those terms that tend to indicate
the presence of the same category, or group of categories (Baker and McCallum 1998;
Tishby 2001).
in Information Retrieval to address the dimensional problem deriving from the usage of
synonyms and polysemes. It infers the dependence among the original terms from a
corpus and transfers this dependence into the newly obtained, independent dimensions.
LSI applies singular value decomposition to the matrix formed by the original document
vectors to map it to the new vectors. Therefore, unlike term clustering, the newly
obtained dimensions by LSI are not intuitively interpretable. However, they work well in
bringing out the “latent” semantic structure of the vocabulary used in the corpus. The
drawback is that, if some original terms are particularly good for discriminating a
category, that discrimination power may be lost in the new vector space (Sebastiani,
2002).
After both non-linguistic and linguistic approaches are performed to transform the
original text data, the newly generated data matrix is ready for further text classification.
document is classified under <category> iff (if and only if) it satisfies at least one of the
clauses. The most famous example of this approach is the CONSTRUE system (Hayes et
al. 1990) constructed by Carnegie Group for the Reuters news agency. DNF systems
suffer from the same problems as other expert systems, such as the well-studied
Machine learning methods can be used to overcome the bottleneck problem, because
machine learning methods do not require antecedent expert knowledge, but rather rely
solely on the data itself to build the classifiers. For this reason, machine learning methods
applications. In Chapter 4 I will give a detailed description of the two machine learning
methods used in this study. Below I talk about two other issues in text classification:
binary versus multiclass text classification and single-label vs. multi-label text
classification.
Most text classification research today is about binary coding of content since the
most important text classification applications are binary classifications. For example,
filtering is used to decide whether a document is about one particular category or not
(Sebastiani 2002). In addition, binary classification algorithms serve as the basis for
binary counterparts. They decompose the multi-class classifier into binary classification
training set to build one classifier per class and to distinguish the samples in a single class
from the samples in all remaining classes. The all-against-all method builds a classifier for
each pair of classes. There are also direct approaches to build multi-class classifications,
multi-labeling cases, any number of categories from 0 to C can be assigned to the same
multi-label classification is the categorization of movies. For example, the best motion
picture award winner at the 2012 Academy Awards, The Artist, can be classified into
methods have been increasingly called for by modern applications such as music
categorization (Li & Ogihara, 2003), semantic scene classification (Boutell et al., 2004)
ranking, to order a set of labels L, so that the topmost labels are more related to the new
structure; if each document is labeled with more than one node of the hierarchical
The extant multi-label classification methods can be grouped into two categories:
transformation methods transform the multi-label classification problem either into one or
adaptation methods adapt machine learning algorithms to multi-label cases. For example,
AdaBoost.MH and AdaBoost.MR (Schapire & Singer, 2000) are two extensions of
AdaBoost (Freund & Schapire, 1997) for multi-label classification. Other algorithm
adaptation methods include ML-kNN (Zhang & Zhou, 2005) which is an adaptation of
the kNN lazy learning algorithm for multi-label data, and improved SVMs using ranking
(Elisseeff & Weston, 2002) and stacking (Godbole & Sarawagi, 2004). In this
dissertation, I apply text mining techniques to examine the text notes from the call center
of a large media and entertainment company to explore if the text data can help predict
whether customers will convert from free to fee, churn, or be won back.
CHAPTER 3
HYPOTHESES
3.1 Self-care
Self-care, or self-service, is one of the customer contact modes that customers use
in order to interface with service providers. Kellogg and Chase (1995) define customer
contact as the function of the interaction between a customer and a service provider.
Based on whether customer contact is technologically involved, Froehle and Roth (2004)
classify the context under which customers and their service providers interact with each
other into five categories: 1) technology-free customer contact, in which a customer has
face-to-face service interactions (Chase 1978) with a human service provider or customer
representative employs technology as an aid to improve the face-to-face contact, but the
contact, in which both customer representative and customers have access to the
contact, where the customer and customer representative are not physically present with
each other so they communicate via a voice telephone call or online instant messaging, or
Huete and Roth, 1988; Haynes and Thies, 1991; Hill et al., 2002; Oliveira et al., 2002;
Roth, 2000; Boyer et al., 2002; Menon, 2003; Karmarkar and Pitbladdo, 1995; Heskett et
al. 1997), more and more service companies are encouraging customers to adopt self-care
or self-service. This trend has been called the “self-service revolution”, which provides a
to adopt self-care or not, and of which ones to choose. Customers’ technology readiness
(Parasuraman 2000) states that there are four characteristics that indicate whether
customers are ready for technology or not: optimism, innovativeness, discomfort, and
insecurity about technology. Optimism means customers think technology is a good thing;
skepticism about its ability to work properly.” Based on this concept, if a customer is
optimistic and innovative, he or she will adopt the newer technology such as using online
self-care while if a customer is feeling discomfort and insecure, he or she may opt to pick
Action (Fishbein and Ajzen, 1975) and Theory of Planned Behavior (Ajzen, 1985, 1991)
from psychology hold that an individual’s intention, their perceived control over their
behavior, their attitude towards the behavior, as well as the rational cognitive assessments
of the behavior are essential in their decision whether to take action or not. So when an
individual feels he or she has the control over whether to perform or not to perform the
behavior, the stronger the intention to engage in a behavior, the more likely the action is
unfavorableness towards some stimulus object” (Fishbein and Ajzen 1975), drives an
experience with the service provider could result in customers’ attitude formation
towards the company and affect their future behavior and decision on whether they would
continue or terminate their service. Thus when customers first started to apply new
such as the availability of different channels, the cost and benefits of using different
channels, as well as the different activities that they perform through these channels also
impact their choice of self-care channel. If the activity they need to perform is to make
payment and it can be accomplished by simply logging onto their online account,
Even though customers’ choice of whether to utilize self-care and which channel
to use is not intended to maximize service provider’s profits, prior studies have found a
causal link between utilization of self-service channels and the relationship between
customers and the service provider. For example, Hitt and Frei (2002), Campbell and
Frei (2010), and Xue et al. (2007, 2011) all find that employing a self-care channel can
increase customer retention rates. Based on these past findings and related theories, I
assume, given the different channels that customers can employ (e.g., online help (CUWI),
interactive voice response with call center (IVR), and making calls to a live agent in a call
center) and the different activities they perform, such as changing account information,
making payments, or changing service (e.g., upgrading or downgrading service plans), that
not only the employment of self-service but also the interactions between self-service
conversion, churn and reactivation with their service provider. Furthermore, if a customer
utilized self-care channel during the trial period, and he or she converted to a paid
customer, it indicates a positive attitude being formed and a positive impact on his or her
future behavior. Thus, I expect that the customer will be less likely to churn and more
likely to be won back if for any reason they discontinued their service with the service
provider:
H1a: The employment of self-care channel in the conversion period and churn period
H1b: The interaction of self-care channels and customers’ activities in both the trial period
and the churn period is highly predictive of customer retention and winback.
service-providers are just one direction of the interactions between customers and firms.
To acquire customers more effectively, to maintain a long term relationship with them
afterwards, as well as to regain lost customers so that customers’ lifetime value with the
company can be maximized, firms also reach out to customers using various marketing
initiatives. Past literature has shown that marketing actions are typically effective in
customer acquisition, retention, and winback (Bass 1969; Mahajan et al. 1982; Berger et
al. 2002; Danaher 2002; Lewis 2004, 2005, 2006) and such actions include pricing,
look into the following marketing initiatives and their respective impact on acquisition,
channels to contact customers, the problem that firms are faced with is which channel
they should utilize to interact with which customers. One consideration is the cost-benefit
analysis of each communication channel and the other is the effectiveness of each
implement (e.g. sales forces, direct mail). With the availability of electronic media,
companies tend to send e-mails to customers to welcome them to join the club, to instruct
them how to start their service, to remind them of the service or payment due, or to send
promotional offers at certain occasions, based on the customer information stored in their
database. However, do emails work better than direct mail or telemarketing? What is the
effect when a firm applies all three communication channels to reach the same customer?
Prior researchers have explored the volume and mixed effect of communication channel
in non-service settings.
relationship between customer and firms have been disclosed. Venkatesan and Kumar
communication between customers and firms and purchase frequencies. Godfrey et al.
(2011) find an inverted-U-shaped relationship between customers’ repurchases and the volume of
three individual communication channels: telephone, e-mail, and direct mail. Godfrey et
al. (2011) further explain these two different phenomena by two social norm theories:
reciprocal action theory and reactance theory. Reciprocal action theory holds that people
1990). Therefore, when firms invest more in customers by increasing marketing volume,
customers would reciprocate with repeat purchases. Reactance theory, on the other hand,
explains why excessive marketing communication volume would diminish the effect of
certain extent, customers would perceive the incoming marketing communication efforts
as persuasive attempts to force them to make a purchase (Clee & Wicklund 1980) and thus
respond negatively to all kinds of marketing efforts such as personal selling (e.g.,
Wicklund, Slattum, & Solomon 1970), advertising (e.g., Robertson & Rossiter 1974),
direct marketing (e.g., Morimoto & Chang 2006), and rewards programs (e.g., Kivetz
2005).
However, what Venkatesan and Kumar (2004) and Godfrey et al. (2011) study as
communications defined in this dissertation. Venkatesan and Kumar (2004) count both
inbound and outbound communication by customers and the firm, and the volume in Godfrey et al.
(2011) is for each individual communication channel. What I refer to as the
total volume is the additive effect of multiple communication channels that occur in both
the conversion and retention period. Nonetheless, the reciprocal action theory and
communications in different stages of CRM. In the trial period, when customers initially
receive the welcome or promotional communications from a firm, it helps them to learn
about the company and their products so the higher the communication volume, the
stronger the bond and thus higher probability for them to convert to a paid customer.
However, after customers have converted and started a steady relationship with the
company, the cumulative effect of the marketing efforts in both the trial and paid periods
makes communication in the retention period likely to trigger reactance
behavior and thus result in churn. As to winning back lost customers, since I use trial
customers and both trial period and retention period marketing volume to forecast the
following:
H2a: The total volume of marketing outbound in the trial period is positively associated
H2b: The volume of marketing communication in the paid period, particularly the ones
closer to the churn date or right-censored date, is positively associated with customer churn.
interaction level with their service provider. The more responsive the customers are, the
stronger the relationship between the customer and the firm, and thus the more likely they
are to convert from free to fee and to be won back, and the less likely they are to churn.
H3: The response rate of a customer during both the trial period and the paid period is
positively associated with customer conversion and winback but negatively associated
In addition to the total volume of contact channels, each individual channel (e.g.,
telephone, direct mail, e-mail) has been characterized as more or less interpersonal.
since most of them do not require any customers’ involvement but to open and read it. In
addition, due to the cheap cost associated with sending out emails, customers have been
bombarded with all kinds of commercial emails and would simply discard them as junk
mail without ever opening them. Prior research shows that a more involving and
interpersonal contact channel has a much higher conversion rate on average than a less
involving contact channel (Anderson and Narus 1999, p. 302). Moreover, compared with
inexpensive and prevalent email communications, customers feel that they are more
valued or cared about by the company when they receive direct mail since it shows that
the company is willing to invest in their relationship. As direct mail is more and
more replaced by email, the use of direct mail may actually enhance
relationships with customers and thus offer a better chance of retaining them. Finally, regarded as the most
interpersonal channel among the three, the telephone channel requires both a service
well as their ability to solve customers’ issues varies among different representatives,
which could create uncertainty and a negative impact on customers’ perception of the firm’s
service quality if an encounter did not have a satisfactory outcome. Since each
communication channel has its own advantages and disadvantages, I expect different
communication channel is the synergy effect that previous researchers have studied
(Reinartz, Thomas, and Kumar 2005, Godfrey et al. 2011). The common understanding is
that the interaction effect is greater than the sum of the individual channel effects, and previous
researchers again use reciprocal action theory and reactance theory to explicate the two
being invested but excessive communication coming in all kinds of forms can drive
customers away from rather than closer to a company. Therefore I postulate the following:
Radio density is defined as the total number of radio stations in each zip code
which captures the regional differences in terms of product availability and product
customers during their trial period largely impacts customers’ experience with the firm’s
products and thus affects their decision whether to convert from a free trial customer to a
paid customer. However, if a customer has opted to terminate their service with the firm,
the factor of radio density should not be significant in affecting their decision to
reactivate service with the company or not, since radio density is static for long periods
of time.
H5a: The radio density is positively related with the probability of customer conversion,
H5b: The radio density is not significant in predicting winning back both of unconverted
Promotional depth is one of the factors that early researchers have explored in its
association with customer lifetime value. Two contradictory findings concerning its
effect on customer lifetime value have been revealed. Analyzing the long-term behavior
of some newspaper and online grocery customers and their acquisition discount,
Anderson, et al. (2004) find a negative relationship between acquisition discounts and
customer value; meanwhile, Anderson and Simester (2004) discover that customers
acquired through catalogs with more discounted items have higher long-term value.
Despite the disparity of the two outcomes, prior researchers are in agreement that deep
price cuts can increase the temporary sales for the first-time customers but decrease
future purchases by established customers. Applying this concept to the three stages of
CRM, one can reasonably expect that the promotion given out in the trial period would
encourage customers to convert. However, discounts given to established paid
customers identify the segment of the customer base that is price-sensitive, and these customers are
more likely to churn once the promotional discounts come to a halt. Furthermore, once
customers who have enjoyed too many promotions churn, they would be hard to
win back. In this dissertation, the promotional depth is denoted by two terms: the number
of promotional discounts which shows how many times a customer was offered the
promotional price, and the price range, which indicates how deep a price cut a customer
H6a: Promotional discounts are positively associated with customer churn but negatively
H6b: There is a positive relationship between price range and customer churn but a
Pricing is another “old” topic that researchers first started with to explore its
relationship with acquisition (Bass 1969; Mahajan, Muller et al. 1984; Norton and Bass
1987; Davis 1989; Davis, et al. 1989; Rogers 1995) and it is closely related to promotions.
Most extant literature focuses on what is the optimal pricing strategy to acquire new
customers and even win back lost customers (Thomas, et al. 2004). This dissertation
examines how the average price that a customer pays for his or her overall services with
the service provider each month helps to predict their probability of churn or future
winback. The average price that a customer pays is used to segment a company’s customer
base: companies usually regard customers who pay higher prices as
high value, and those who pay low average prices as low value. Therefore,
studying the average price a customer pays and its relationship with the
CRM targeting strategy is important for targeting the right customer at the right time.
Therefore one can expect that the higher price a customer pays each month, the happier
he or she is with the service that the company provides, and the lower probability he or
she would churn. However, once a customer churns, firm’s strategy to win back lost
customers is usually to offer deeper discounts (Thomas, et al. 2004) than to existing
customers. The customers who used to enjoy the discounts, or are of lower customer
value to the company, may be enticed to accept the new discount and reactivate their
service with the company. As to the high value customers, since they are mostly loyal,
companies tend not to offer discounts to them until they decide to terminate the service
one day. The deep discount that the company offers to win them back could be tempting
since now they discover they could actually enjoy the same service with much lower
H7a: The average price a customer pays to his or her service provider is negatively related
to their churn;
H7b: The average price a customer pays before he or she churned has a U-shaped
reasons for relationship termination. Prior researchers have explored the various causes
that lead to customer churn (Keaveney 1995), and Braun and Schweidel (2011) even use
the multiple causes of churn to model customer lifetime. Combining both Keaveney
(1995) and Braun and Schweidel (2011)’s categorization methods, we group our
"Non Usage", "Prepaid period ended" (this one is grouped into “Free ended” for trial
This grouping of the deactivation reasons covers from customer satisfaction for both
reasons. They reveal whether these factors play a role in influencing customers’ decision
making of whether to convert, churn or reactivate service with their company during each
stage of CRM. Particularly for conversion from free to fee, the reason of “Free ended”
H8a: Deactivation reasons should be among the most important factors that are predictive of
H8b: Among all deactivation reasons, “Free ended” plays a major role in why a customer
churn (Hughes 2006; Reichheld 1996), and the impact of the acquisition duration on
retention duration (Schweidel, et al. 2008). They find that while acquisition duration has
a positive relationship with retention, retention has a negative effect on churn. Therefore,
the longer the trial period, the longer the subsequent retention duration. On the other
hand, the longer a customer stays with the company, the less likely he or she will churn.
Building on these research findings, I speculate that on the one hand, the longer the
acquisition duration and retention duration, the stronger the relationship between the
customer and company it indicates, therefore the more likely that a customer could be
won back once they discontinued their service for any reason. However, if a customer has
had too long a history with the company, he or she tends to have experienced the product
of the company so well that once he or she decides to churn, it could imply that he or she
has already had it all so it will be hard to win them back. Therefore, I postulate the
following:
H9: There is an inverted U-shape relationship between the customer acquisition and
CHAPTER 4
METHODOLOGY
Based on the research objectives of this dissertation, as well as the four binary
outcome events that this study examines, namely, conversion, churn, winning back
unconverted customers and winning back churned customers, in this section, I discuss
three classification methods that are used in this study: Bayesian logit/probit, boosting,
and random forest. I focus on the binary class version of these methods.
The standard approaches in marketing for binary and multiclass classification are
generalized linear models. The commonly used generalized linear models include the
probit, logit, log-log and complementary log-log models. These models are all
exponential-family models and differ according to the choice of link function used to
transform the DV that will be expressed as a linear function of the Xs and they also may
differ on the assumptions underlying the regression structure on latent continuous data.
The most commonly used generalized linear models are probit and logit models, and
there is actually not much difference between them. The logistic distribution is similar to
the normal distribution except in the tails, which are considerably fatter (It more closely
tend to give similar probabilities for the intermediate values of $x'\beta$, such as between -1.2
and +1.2, where $x$ is the vector of input variables and $\beta$ are the parameters. When $x'\beta$
gets extremely small or extremely large, the logistic distribution tends to give larger
employ Bayesian logit or probit models for binary outcomes, and Bayesian multinomial
logit or probit for polychotomous response data. There are several possible prior
assumptions for Bayesian probit or logistic models. Readers who wish to pursue a full
discussion of the intricacies of choosing a prior can consult Gelman, Carlin, Stern, and
choose priors in a Bayesian context, here I provide a brief justification of my chosen prior.
the class label for each binary outcome with 1 being the code for membership and 0 for
the nonmembership, and define $x$ as the $n \times p$ matrix of standardized input variables, then
$$\Pr(y_i = 1 \mid x_i) = Z(x_i'\beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} \qquad (1)$$
where $\beta$ is the vector of parameters, and $Z = (Z_1, \ldots, Z_n)$ is the latent variable. Each $Z_i$ is
$$y_i = 1 \text{ if } Z_i > 0, \qquad y_i = 0 \text{ if } Z_i \le 0$$
Therefore the decision boundary for classification is the set of points for which z(s)
each outcome is ½.
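
As a small illustration of Equation (1) and of the decision boundary at probability ½, the sketch below evaluates the logistic probability for a given input vector and classifies by thresholding at ½, which is equivalent to checking the sign of $x'\beta$. The coefficient values and function names are hypothetical and chosen only for illustration.

```python
import math


def logistic_probability(x, beta):
    """Pr(y = 1 | x) = exp(x'beta) / (1 + exp(x'beta))."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    # Numerically stable evaluation that avoids overflow for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    return math.exp(eta) / (1.0 + math.exp(eta))


def classify(x, beta):
    """Assign class 1 when the probability exceeds 1/2, i.e. when x'beta > 0."""
    return 1 if logistic_probability(x, beta) > 0.5 else 0


beta = [-1.0, 2.0]        # hypothetical intercept and slope
on_boundary = [1.0, 0.5]  # x'beta = 0, so the probability is exactly 1/2
```

Calling `logistic_probability(on_boundary, beta)` returns 0.5, so this point lies on the decision boundary; inputs with a larger second component fall on the class-1 side.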
The Bayesian approach to the logistic classifier assumes that each parameter $\beta_j$
follows a prior distribution and this prior distribution is usually assumed to be normal
$$p(\beta_j \mid \mu_j, V) \sim N(\mu_j, V) = \frac{1}{\sqrt{2\pi V}} \exp\left(-\frac{(\beta_j - \mu_j)^2}{2V}\right), \quad j = 1, \ldots, d \qquad (2)$$
make a hierarchical model. For the hierarchical logistic model with normal prior, I could
further assume that the hyperparameters $\mu$ come from a normal distribution and $V$ from
$$\mu \sim N(\bar{\mu}, V \otimes A^{-1}), \qquad V \sim IW(\nu, V_0) \qquad (3)$$
where $A^{-1} = 100I$ or larger to set a diffuse prior for the variance.
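$$p(\beta) = \prod_j p(\beta_j)$$
$$p(\beta \mid D) \propto p(D \mid \beta)\, p(\beta) = \left\{ \prod_{i=1}^{I} \prod_{j=1}^{n_i} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}} \right\} p(\beta) \qquad (4)$$
$$p(\beta) \sim N(\bar{\beta}, A^{-1})$$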
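
To make the structure of the posterior concrete, the following sketch evaluates an unnormalized log posterior for a binary logit model: the Bernoulli log likelihood plus a normal log prior. It is a simplification under stated assumptions: the prior is taken to be independent normal with a single common variance (`prior_var`, a hypothetical parameter), observations are not grouped, and normalizing constants are dropped. It is not the estimation routine used in this dissertation.

```python
import math


def log_posterior(beta, X, y, prior_mean, prior_var):
    """Unnormalized log posterior: Bernoulli log likelihood + normal log prior."""
    log_lik = 0.0
    for x_row, y_obs in zip(X, y):
        eta = sum(xi * bi for xi, bi in zip(x_row, beta))
        # log of p^y * (1 - p)^(1 - y) with p = exp(eta) / (1 + exp(eta)).
        log_lik += y_obs * eta - math.log1p(math.exp(eta))
    # Independent normal priors with a common variance (an assumption).
    log_prior = sum(
        -0.5 * (b - m) ** 2 / prior_var for b, m in zip(beta, prior_mean)
    )
    return log_lik + log_prior


# Hypothetical data: an intercept column and one standardized predictor.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
diffuse_variance = 100.0  # mirrors the diffuse prior setting described above
```

A sampler or optimizer would call `log_posterior` repeatedly; for instance, a coefficient vector such as `[0.0, 1.0]` that agrees with the direction of the data scores higher than `[0.0, -1.0]`.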
4.1.2 Boosting
Boosting is one of the most powerful learning ideas introduced in the last twenty
“committee.” From this perspective boosting is similar to bagging and other committee-
based approaches which take a simple unweighted average of the predictions from each
model, essentially giving equal probability to each model. However this resemblance is
only on the surface and boosting is fundamentally different from the committee-based
approaches. For example, compared with bagging which uses bootstrap to sample the
There are a number of boosting algorithms, but the most popular one is Adaptive
Boosting (AdaBoost), introduced by Freund and Schapire (1996). Later, researchers
explored modifications of the original AdaBoost algorithm, such as Gentle AdaBoost,
Logit AdaBoost, and Real AdaBoost (Friedman et al. 2000). Recent developments in this
area include the regularization of the boosting algorithm (Friedman 2001; Rosset, Zhu,
and Hastie 2004), such as utilizing a learning rate parameter to regularize the boosting
when classification trees constitute the base
random permutation sampling strategy at each iteration to obtain a refined training set
Binary AdaBoost
The original AdaBoost algorithm by Freund and Schapire (1996) has two versions,
equivalent in dealing with binary classification problems but differ in handling problems
with more than two classes. Here I only introduce “AdaBoost.M1”, since Friedman et al.
(2000) show that AdaBoost.M1 fits a forward stepwise additive logistic
regression model that minimizes the expectation of the exponential loss function, $E[e^{-yF(x)}]$,
with F(x) denoting the boosted classifier. Consider a two-class problem, with the output
produces a prediction taking one of the two values {−1, 1}. The error rate on the training
sample is
$$\mathrm{err} = \frac{1}{N} \sum_{i=1}^{N} I\big(y_i \neq C(x_i)\big) \tag{4}$$
A weak classifier is defined as the one whose error rate is only a little better than
random guessing. The boosting algorithm has access to the weak learning algorithm and
predictions from all of the weak classifiers are then pooled through a weighted majority
$$C(x) = \mathrm{sign}\!\left( \sum_{m=1}^{M} \alpha_m C_m(x) \right) \tag{5}$$
Here $\alpha_1, \alpha_2, \ldots, \alpha_M$ denote the contributions (or weights) of each weak learner $C_m(x)$ and
are computed by the boosting algorithm. More accurate classifiers receive higher
weights.
The data are also modified at each boosting step. Initially all training observations
modified so that the observations that were misclassified at the previous step are given
more weight, whereas the weights are decreased for those that were classified correctly.
As a result as iterations proceed, observations that are hard to classify correctly receive
training observations that are misclassified in the previous iteration in the sequence.
very weak classifier. This AdaBoost.M1 algorithm is also called “Discrete AdaBoost”
since the base classifier $C_m(x)$ returns a discrete class label. Friedman et al. (2000)
modified the algorithm to enable the base classifier to return real-valued predictions; this
version is called “Real AdaBoost”. Other modifications, such as Gentle AdaBoost, require fitting a
regressor at each iteration and result in the original GentleBoost algorithm whenever
m = 1.
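The reweighting and weighted-vote logic of Discrete AdaBoost described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this dissertation: the base learner is assumed to be a one-split decision stump, class labels are coded as −1 and +1, and all function names are hypothetical.

```python
import math


def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def stump_predict(stump, row):
    _, j, t, sign = stump
    return sign if row[j] <= t else -sign


def adaboost_m1(X, y, M=10):
    """Fit up to M weighted stumps; returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n                      # start from equal observation weights
    ensemble = []
    for _ in range(M):
        stump = fit_stump(X, y, w)
        err = max(stump[0] / sum(w), 1e-10)
        if err >= 0.5:                     # no better than random guessing: stop
            break
        alpha = math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified observations receive more weight in the next round.
        w = [wi * math.exp(alpha) if stump_predict(stump, row) != yi else wi
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted majority vote of the weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]                   # no single stump can fit this pattern
ensemble = adaboost_m1(X, y, M=3)
print([predict(ensemble, row) for row in X])
```

A single stump misclassifies at least two of the six toy observations, whereas the weighted vote of three stumps reproduces every training label.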
As to the “base learner” $C_m(x)$ for the AdaBoost algorithm, the most commonly used
one is the classification tree, where improvements are often most dramatic. In fact, Breiman
(1996, 1998) referred to AdaBoost with classification trees as the “best off-the-shelf
classification accuracy are random forests. The idea of a random forest is to combine tree
predictors so that each tree depends on the values of a random vector sampled
independently and with the same distribution for all trees in the forest.
Random forests were developed by Breiman (2001), the same author who
introduced bagging. Compared with AdaBoost, random forests are more robust with
respect to noise. AdaBoost has no random elements and grows an
ensemble of trees by successive reweightings of the training set, where the current
weights depend on the past history of the ensemble formation, whereas random forests do
not depend on the past history of the ensemble. Unlike SVM, random forests require little
Like bagging, random forests also use the bootstrap to generate samples and to
grow a committee of classification trees. To classify a new object from an input vector,
one puts the input vector down each of the trees in the forest and each tree gives a
classification. Then the new object is classified into the class which “won the most votes”
from all the trees in the forest. Unlike bagging, the trees in random forests are
decorrelated because only a random selection of features is considered to split each node.
Bagging can actually be regarded as a special case of random forests, and random forests
are the combination of bagging and random selection of subsets of features (Ho 1998;
Amit & Geman 1997).
The random forests algorithm proceeds as follows. For each tree b (b = 1 to B),
1) Draw a bootstrap sample Z* of size N from the original data to be the training set.
2) Grow a random forest tree $T_b$ with the sample Z*. If there are p input variables, a
number m << p is specified such that at each node, m variables are selected at
random out of the p variables and the best split on these m is used to split the node
into daughter nodes. The value of m is held constant during the forest growing.
3) Each tree $T_b$ is grown to the largest extent possible and there is no pruning.
Then output the ensemble of trees $\{T_b\}_1^B$. For the classification prediction at a new point x,
let $\hat{C}_b(x)$ be the class prediction of the bth random forest tree, then
$$\hat{C}_{rf}^{B}(x) = \text{majority vote}\, \{\hat{C}_b(x)\}_1^B$$
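The three steps and the majority vote can be sketched as follows. This is a simplified illustration under stated assumptions rather than the software used in this dissertation: splits are chosen by Gini impurity at midpoints between observed values, class labels are assumed to be integers, a node whose sampled variables are all constant becomes a majority-class leaf, and every function name is hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree, trying m randomly chosen variables per node."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure node becomes a leaf
    best = None
    for j in rng.sample(range(len(rows[0])), m):      # random subset of variables
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:                                  # sampled variables are constant
        return Counter(labels).most_common(1)[0][0]
    _, j, t = best
    li = [i for i, r in enumerate(rows) if r[j] <= t]
    ri = [i for i, r in enumerate(rows) if r[j] > t]
    return (j, t,
            grow_tree([rows[i] for i in li], [labels[i] for i in li], m, rng),
            grow_tree([rows[i] for i in ri], [labels[i] for i in ri], m, rng))


def tree_predict(tree, row):
    while isinstance(tree, tuple):                    # internal nodes are tuples
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def random_forest(rows, labels, B=25, m=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap sample of size N
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest


def forest_predict(forest, row):
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]                 # class with the most votes


X = [[i, 2 * i] for i in range(5)] + [[i, 2 * i] for i in range(10, 15)]
y = [0] * 5 + [1] * 5
forest = random_forest(X, y, B=25, m=1, seed=0)
print(forest_predict(forest, [2, 4]), forest_predict(forest, [12, 24]))
```

Setting m equal to the number of input variables in this sketch recovers plain bagging, which is the sense in which bagging is a special case of random forests.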
In the original paper on random forests, Breiman (2001) shows that the forest
error rate relies on two parameters: the strength of each individual tree in the forest and
the correlation between any two trees in the forest, defined as follows.
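classifiers $\{h(x, \Theta_k), k = 1, \ldots\}$ where the $\{\Theta_k\}$ are independently and identically
distributed (i.i.d.) random vectors and each tree $T_b$ casts a unit vote for the most popular
class at input x. Define the margin function for a random forest as:
$$mr(X, Y) = P_\Theta\big(h(X, \Theta) = Y\big) - \max_{j \neq Y} P_\Theta\big(h(X, \Theta) = j\big)$$
and the strength of the set of classifiers as
$$s = E_{X,Y}\, mr(X, Y). \tag{8}$$
The raw margin function is
$$rmg(\Theta, X, Y) = I\big(h(X, \Theta) = Y\big) - I\big(h(X, \Theta) = \hat{j}(X, Y)\big)$$
where I is the indicator function and $\hat{j}(X, Y)$ is the most popular incorrect class. Hence $mr(X, Y) = E_\Theta\, rmg(\Theta, X, Y)$. Since any two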
In brief, the higher the correlation, the greater the forest error rate, while increasing
the strength of the individual trees decreases the forest error rate. Reducing the number of
predictors m considered at each node reduces both the correlation and the strength;
increasing it increases both. For classification, the default value for m is $\lfloor \sqrt{p} \rfloor$ and the minimum node
size is one (Hastie et al. 2009). However, the "optimal" value of m can be found quickly
by using the oob (out-of-bag) error rate, and this is the only adjustable parameter to which
Random forests use out of bag (oob) data as the test set; therefore cross-validation
is not needed to estimate the test set error. When the training set for the current tree is
drawn by sampling with replacement, about one-third of the cases are left out of the
sample. This oob (out-of-bag) data is used to get a running unbiased estimate of the
classification error as trees are added to the forest. It is also used to get estimates of
variable importance.
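The "about one-third" figure follows from the bootstrap itself: the chance that a given case is never drawn in N draws with replacement is (1 − 1/N)^N, which approaches e^(−1) ≈ 0.368 as N grows. The short sketch below checks this both analytically and by simulation; the function names are hypothetical.

```python
import math
import random


def expected_oob_fraction(n):
    """Probability that a given case is left out of a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, seed=0):
    """Share of cases not drawn in one simulated bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


print(expected_oob_fraction(1000), math.exp(-1))
print(simulated_oob_fraction(100_000))
```

Each case is therefore out of bag for roughly 37% of the trees, which is what supplies every case with an internal test-set classification.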
During the run, each tree is constructed using a different bootstrap sample from
the original data but the oob data are not used in the construction of the kth tree. Put each
case left out in the construction of the kth tree down the kth tree to get a classification.
Thus, a test set classification is obtained for each case in about one-third of the trees. At
the end of the run, take j to be the class that got most of the votes every time case n was
oob. The proportion of times that j is not equal to the true class of n, averaged over all
cases, is the oob error estimate.
this dissertation to rank the factors that could influence customers’ decision making
during the three stages of CRM. The variable importance in random forest is calculated in
the following way: first, input the oob cases in every tree grown in the forest and count
the number of votes cast for the correct class; second, permute the values of variable m
randomly in the oob cases and enter these cases into the tree again; third, subtract the
number of votes for the correct class in the variable-m-permuted oob data from the
number of votes for the correct class in the intact oob data; finally, average this number
over all trees in the forest to obtain the raw importance score for variable m.
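The four steps of the raw importance score can be sketched as follows. The representation of the forest as a list of (predict, oob_indices) pairs is an assumption made for illustration, as are the function name and the toy data; this is not the code used to produce the results in this dissertation.

```python
import random


def raw_importance(trees, rows, labels, var, seed=0):
    """Average drop in correct oob votes after permuting one variable.

    `trees` is a list of (predict, oob_indices) pairs: `predict` maps a row
    to a class label and `oob_indices` lists the cases that were out of bag
    for that tree (a hypothetical representation).
    """
    rng = random.Random(seed)
    drops = []
    for predict, oob in trees:
        # Step 1: votes for the correct class on the intact oob cases.
        intact = sum(predict(rows[i]) == labels[i] for i in oob)
        # Step 2: permute the values of the variable among the oob cases.
        shuffled = [rows[i][var] for i in oob]
        rng.shuffle(shuffled)
        permuted = 0
        for i, value in zip(oob, shuffled):
            row = list(rows[i])
            row[var] = value
            permuted += predict(row) == labels[i]
        # Step 3: difference between intact and permuted correct votes.
        drops.append(intact - permuted)
    # Step 4: average over all trees.
    return sum(drops) / len(drops)


rows = [[0, 5], [1, 3], [0, 7], [1, 1], [0, 2], [1, 9], [0, 4], [1, 8]]
labels = [r[0] for r in rows]


def tree(row):
    return row[0]          # toy "tree" that only looks at variable 0


trees = [(tree, [0, 1, 2, 3]), (tree, [4, 5, 6, 7]), (tree, [0, 3, 5, 6])]
print(raw_importance(trees, rows, labels, var=0))
print(raw_importance(trees, rows, labels, var=1))
```

Because the toy trees ignore variable 1, permuting it cannot change any vote and its raw importance is zero, whereas permuting variable 0 can only reduce the number of correct votes.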
If the values of this score from tree to tree are independent, one can calculate the
standard error by a standard computation. The correlations of these scores between trees
have been calculated for numerous data sets and proved to be quite low; therefore
standard errors are computed in the classical way: one can divide the raw score by its
standard error to get a z-score, and assign a significance level to the z-score assuming
normality. When the number of variables is very large, forests can be run with all the
variables once, and then be run again using only the most important variables from the
first run. For each case, a local importance score for variable m for this case is generated
by subtracting the percentage of votes for the correct class in the variable-m-permuted
oob data from the percentage of votes for the correct class in the untouched oob data.
random forests in the following way: every time a split of a node is made on variable m,
the Gini impurity criterion decreases. Adding up the Gini decreases for each
individual variable over all trees in the forest gives a Gini importance value.
accuracy among current algorithms, many later empirical studies showed that this is not
the case. However, random forests do possess the following advantages: 1) they run
efficiently on large data sets; 2) they can handle thousands of input variables without
variable deletion; 3) they give estimates of what variables are important in the
the forest building progresses; 5) they incorporate methods for balancing error in class
population unbalanced data sets; 6) the generated forests can be saved for future use on
other data; 7) they compute proximities between pairs of cases and hence offer an
experimental method for detecting variable interactions; 8) random forests do not overfit
(Breiman, 2001).
classification for the following three events: 1) conversion (or trial conversion), the event
when a subscriber who is not paying for a subscription out of his own pocket (but
provided free by the company) decides to become a paying subscriber and purchases a
subscription after his trial period ends; 2) churn, the event when a revenue-generating
subscriber who pays out-of-pocket decides to terminate service, for some reason; and 3)
winback, the event when a deactivated subscriber who either discontinued his service at
the end of trial-period or churned as a paid-subscriber reactivates his service with the
company.
rate (1- error rate), precision and recall, sensitivity and specificity, F-measure, Youden’s
(correlation between the actual and predicted), as well as graphical tools such as ROC
curves and Cumulative Lift charts. However, all of these performance measures are
based on a confusion matrix, which consists of four cells: true positive (TP), true
negative (TN), false positive (FP), and false negative (FN). Each measure has
strengths and weaknesses, which I outline in the discussion of commonly used measures
$$\text{Accuracy} = \frac{TP + TN}{T}$$
where T = TP + TN + FP + FN, or the total size of the test set. This is the most
straightforward measure; however, when a certain category comprises the majority of the
test set, the accuracy can easily reach a high percentage. Sometimes it is worth trying to
maximize the accuracy score, but accuracy (and its counterpart, error) is considered a
fairly crude score that does not give much information about the performance of a classifier.
is also called Positive predictive value (PPV), which measures the proportion of predicted
positives which are actually positive. The issue with precision is that when a
classification method outputs only confident categories (the categories that take the
measures the proportion of actual positives which are predicted positive. The problem
with the measurement is that when the classifier outputs loosely, the recall rate can get
really high.
The F-scores are in the interval [0, 1]. The higher the score, the higher the
between recall and precision, the F-measure is usually used as a simple measure to
evaluate the classifier. Consider that one can get a perfect precision score by
assigning zero categories, or a perfect recall score by assigning every category.
To maximize the F-score, however, the classifier has to assign the correct categories and
only the correct categories, which maximizes both precision and recall at the same time.
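The measures discussed above can all be computed from the four cells of the confusion matrix. The sketch below uses the balanced F-measure 2PR/(P + R) and returns 0.0 for any undefined ratio, which is an assumption of this illustration; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # 0.0 when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# A classifier that labels every case positive on a test set with
# 95 positives and 5 negatives: accuracy and recall look high even
# though the classifier has learned nothing.
print(classification_metrics(tp=95, tn=0, fp=5, fn=0))

# A more informative classifier on a balanced test set.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```

The first call illustrates why accuracy and recall alone can be misleading when one category dominates the test set.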
There are two types of F-measures, resulting from two different ways of
multiclass classification, for each class i, I define $TP_i$ as the number of documents
assigned correctly to class i, $FP_i$ as the number of documents that do not belong to class
i but are assigned to class i, $FN_i$ as the number of documents that are not assigned to
class i by the classifier but which actually belong to class i, π as the precision, and ρ as
the recall:
$$\pi = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FP_i)}, \qquad \rho = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FN_i)}$$
$$F\ \text{(micro-averaging)} = \frac{2\pi\rho}{\pi + \rho} \tag{15}$$
where M is the total number of categories. In contrast, the micro-averaging F-value gives
equal weight to each document, therefore it is the same as accuracy rate when you are
The macro averaging F-measure is calculated locally over each category first and
$$F_i = \frac{2\pi_i \rho_i}{\pi_i + \rho_i}, \quad \text{and} \quad F\ \text{(macro-averaging)} = \frac{\sum_{i=1}^{M} F_i}{M} \tag{16}$$
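The difference between the two averages in equations (15) and (16) is easiest to see on a small example. The sketch below is illustrative only: the class names and counts are hypothetical, and undefined ratios are returned as 0.0 by assumption.

```python
def f_score(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def micro_macro_f(counts):
    """counts maps each class to a (TP, FP, FN) tuple."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    # Micro-averaging pools the counts first, so every document weighs equally.
    micro = f_score(ratio(tp, tp + fp), ratio(tp, tp + fn))
    # Macro-averaging computes F per class first, so every class weighs equally.
    per_class = [f_score(ratio(t, t + p), ratio(t, t + n))
                 for t, p, n in counts.values()]
    macro = sum(per_class) / len(per_class)
    return micro, macro


counts = {"large_class": (90, 10, 10), "small_class": (5, 10, 10)}
micro, macro = micro_macro_f(counts)
print(round(micro, 4), round(macro, 4))
```

The poorly predicted small class pulls the macro-averaged value well below the micro-averaged one, because macro-averaging gives that class the same weight as the large class.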
CHAPTER 5
5.1 Data
The data utilized in this dissertation come from SiriusXM, the largest worldwide
satellite radio company, and were granted through the Wharton Customer Analysis Initiative
(WCAI). This data set provides a “360 view” of SiriusXM’s 1,000,000 customers who
first subscribed or tried in late 2009, and are tracked through roughly 2.5 years up to
radios, product usage, billing & payments, outbound direct marketing, and customer
service interactions (online and voice). While customers may access the service via the
Internet and mobile phones, the majority of subscribers listen to SiriusXM in their
vehicles.
The dataset I received includes 9 tables from Sirius XM: Services Table, MDB
(Outbound Marketing) Table, Billing Table, Collections Table, Notes Table, Self Care
Table, Listening Log Table, ESN Table (Device Information), VLC Table (Vehicle
Drawing from this data, I first focus on customers who own one vehicle since the
sole vehicle owners’ activation, deactivation, or reactivation activities can provide clean
Moreover, many of the multi-vehicle subscribers are business customers who are likely to
be qualitatively different from non-business subscribers. This reduces the sample size
from 1 million to fewer than 700,000 customers. Second, I determine which customers are
paid customers, since only these customers should be examined in the retention period.
The number of paid customers with only one vehicle is 464,810. Third, I separate the data
into three parts: trial period, paid period and winback period by combining the
information from each customer’s trial period end date, first paid service start date,
deactivation date and reactivation date. For paid customers, their trial period records
consist of all interactions that occur before the first paid service start date and their
retention period starts from their first paid service start date; for non-converted customers
the trial period records consist of interactions either before their deactivation date or the
end date in the data. Using these definitions results in 392,532 trial customers, of whom
221,333 converted into paid customers. The reported conversion rate of the company for
To decide which paid customers churned, I used their deactivation date, if it was
before the end of the data in the sample (March 1, 2012). I treat customers who have not
customers. Thus I obtain 228,597 churned customers and 236,213 non-churned customers.
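The churn labeling rule can be sketched as a small function. This is an illustration under stated assumptions: the function name is hypothetical, and a missing deactivation date is assumed to be represented as None. The final lines simply check that the two reported group sizes add up to the 464,810 paid single-vehicle customers.

```python
from datetime import date

CENSOR_DATE = date(2012, 3, 1)   # end of the data in the sample


def churn_label(deactivation_date):
    """Return 1 for a churned customer, 0 for a right-censored (non-churned) one."""
    churned = deactivation_date is not None and deactivation_date < CENSOR_DATE
    return int(churned)


print(churn_label(date(2011, 7, 15)))   # deactivated before the end of the data
print(churn_label(None))                # no deactivation observed

churned, non_churned = 228_597, 236_213
print(churned + non_churned)
```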
Finally the winback customer base consists of customers who have a clear deactivation
date. Among them, customers who have a specific reactivation date are defined as the
ones that have been successfully won back, while the others are not. For my research
purposes, I further divided the winback customers into two categories: the ones who
deactivated during or at the end of their trial period and the ones who churned after
conversion. Among the non-converted customers, 1,539 were reacquired and 157,158
were not. Among the churned customers, 67,124 were recaptured and 148,047 were not.
Table 4 below shows the final datasets that I used to predict conversion, churn, and
winback of both unconverted and churned customers, together with the rates of
conversion, churn, and winback for each of the three periods.
5.2 Variables
duration, retention/paid/churn period duration, the dummy variable for usage of self-care
in both conversion and churn period, customer response rate to firm’s marketing actions,
customer representatives, “IVR” for interactive voice response, and “CUWI” for internet
help), as well as the interactions between self-care channels and activities, which count
how many times each customer uses each self-care channel to do each activity; 2)
contacts, the counts of each communication channel that each customer receives from the
firm, number of promotions that each customer gets, mean-centered average price in both
linear and quadratic forms, and total-radio number for each zip code which I treat as radio
density variable (Xue, et al. 2011); 3) Deactivation reasons, which have been grouped
grouped into “Free ended” for trial deactivation reason), "Free ended", "Competition",
"Unknown reason", and “NonPay”; 4) Text mining variables. I analyzed the call center
notes text using LIWC2007 (Linguistic Inquiry and Word Count). LIWC2007 is a text
analysis software program that calculates the degree to which people use different
categories of words across text along 64 language dimensions. The variables generated by
LIWC include 22 linguistic dimensions such as percentage of words in the text that are
personal concern categories (e.g., work, home, leisure activities), and 3 paralinguistic
dimensions (assents, fillers, nonfluencies). Therefore, these variables are clusters from
the results of a word extraction process, which groups the words and phrases that express
the same meaning or have the same linguistic function together. For a complete list of the
LIWC output variables and their examples, please see Appendix 1.
The variable values generated by LIWC are the percentage of total words for each
entry. For example, an incomplete snapshot of the LIWC output in Table 5 shows that
11.59% of note #1 is composed of pronouns compared with 15% of note #2. Also 5.92%
of note #5 consists of positive emotional words compared with 12.5% of those for note
#6. Since these values are the percentage of frequency in each note, the total word count
All time-varying variables, including all self-care activities that each customer
performed, the self-care channels that each customer used, all marketing variables such as
communications, each marketing outbound channel), as well as the call center notes
during both conversion and retention period, were separated into two time frames: within
60 days of the conversion and retention period end dates, and beyond 60 days of (that is,
earlier than 60 days before) those end dates. Therefore, for these variables the number of
variables is doubled. I used 60 days instead of another time frame
because SiriusXM usually communicates with its customers 30 days before their
contract ends to ask them whether they want to renew their services, and if not,
SiriusXM takes some action (such as offering a discount) to retain them. In
addition, prior research shows that the recency effect plays a role in
customers’ decision making, so by dividing the data into these time-based categories, I can
also test whether activities that happened closer to the conversion or churn dates are more
predictive of these events. It seems reasonable that activities that occurred further in the
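The doubling of each time-varying variable can be sketched as follows. The sketch assumes that "within 60 days" means the 60 days up to and including the period end date and that "beyond 60 days" means any earlier date; the function name, the dictionary keys, and the example dates are hypothetical.

```python
from datetime import date, timedelta


def split_by_window(event_dates, period_end, window_days=60):
    """Count events within and beyond `window_days` of the period end date."""
    cutoff = period_end - timedelta(days=window_days)
    within = sum(1 for d in event_dates if cutoff <= d <= period_end)
    beyond = sum(1 for d in event_dates if d < cutoff)
    return {"within_60": within, "beyond_60": beyond}


trial_end = date(2010, 6, 30)
outbound_contacts = [date(2010, 3, 15), date(2010, 6, 1), date(2010, 6, 20)]
print(split_by_window(outbound_contacts, trial_end))
```

Applying the same split to every self-care, marketing, and call center variable yields one "within" and one "beyond" version of each.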
In order to model the winback event, I used the conversion period variables to
predict winning back non-converted customers and the retention period variables to
Table 6 shows some selected variables and their definition. The summary
Variable               Definition
RADIO_TOTAL            The total number of radios in a given zip code
Acquisition_duration   Mean-centered trial period length in months
CHAPTER 6
Four variable importance graphs were produced by random forests to exhibit the
30 most important variables for each of the four events: conversion, churn, winning back
them, I find:
1) For the conversion period (see Figure 1 below): a) the 30 most predictive variables
are dominated by text mining variables from LIWC, covering both linguistic words and
psychological vocabulary, regardless of whether they fall within or beyond 60 days of the trial
period end date. This result indicates that analyzing what and why customers made phone
Other factors that play roles in converting customers include marketing actions, such as
the total number of outbound communications that the company sent to customers beyond 60
days of the trial period end date and customers’ responsiveness to them, the number of direct
mail pieces that the company sent to customers within 60 days of the trial period end date, as well
as the number of phone calls that the company made to customers beyond 60 days of
the trial period end date. This result shows that during the early stage of the trial period,
when customers were trying out the company’s product, the additive effect of the company’s
communications by phone plays a more important role than other
communication methods. However, close to the date that the trial period ends,
direct mail is more effective in converting customers. c) The duration of customers’ trial
period, the radio density in each zip code, whether customers use self-care channels, and
how many times they called the customer care center are all also highly indicative of whether
customers would convert to paid customers. This result is consistent with prior
research findings. d) Among all the deactivation reasons that customers report, only
not. This result supports Hypothesis 8b and shows that the major reason that customers
did not convert was the end of the free trial period.
roles in shaping customers’ decisions on whether to churn or stay with the
company. The total number of outbound communications that the company sent to customers
during the paid period regardless of the time frame, the total marketing communications
occurring during the trial period, the number of times the company reached customers by
email, direct mail, or phone call both within and beyond 60 days of the paid period end date, as well
as the number of promotions each customer gets are all highly predictive of customers’
churn or no-churn behavior. b) On the customer side, whether customers use self-care
services, their responsiveness to the firm’s marketing actions throughout the whole retention
period, the number of services they had with the company as well as the average price
and price range associated with them, and both the trial period duration and paid period
duration are all differentiating characteristics that reveal the direction of customers’ future
behavior. c) A few psycholinguistic variables, such as the total number of function words,
the number of verbs and prepositions, and the number of relativity words (e.g. area,
bend, exit, stop) that appear in the call center notes, are also good predictors of whether
3) To win back the customers who failed to become paid customers, the factors that
influenced their initial conversion decisions still play a large
role. Even though the number of text mining variables decreases by 2 and their importance
ranking declines compared with marketing variables, they still constitute nearly half
of the top 30 variables, which shows the importance of investing in studying
the content of calls that customers made during their trial period. Radio density becomes
the top factor influencing whether the company can win back a customer, which
indicates that the regional difference in availability and variety of the product is the major
aspect that firms should consider when they allocate marketing resources regarding
who should be won back. The total number of marketing initiatives and customers’
responsiveness to them, the trial period duration, as well as the communication channels that
the firm uses to reach customers are all effective in winning back the non-converted
customers.
4) For winning back churned customers, whether customers used self-care during
both trial and paid periods, the number of promotions that customers received during the
retention period, the average price and price range that customers received during the
retention period, customers’ trial period duration, customers’ total number of services with
the company, as well as the firm’s marketing communication variables are all highly
predictive of whether customers could be won back. In particular, the number of
promotions that customers received during the paid period became the top predictor of
whether customers churn or not, which could imply that the customers who churned were mostly
price sensitive. The larger the number of promotions each customer gets, the more
price-sensitive they are, and the more difficult it is for the firm to reacquire them. A few
psycholinguistic variables such as “cogmech”, “social”, function words and verbs are
5) Examining all four variable importance graphs together, I discovered that all
marketing action variables, whether customers use self-care, and customer duration in
both conversion and retention periods are strong predictors in classifying converted and
unconverted customers, churned and non-churned customers, as well as the customers that
could be won back or not. Finally, nearly all deactivation variables except “Free_ended”
are missing from the top 30 list. Some of them are significant in predicting one or two
events (see Table 9 below) but compared with other variables, they are not that
Random forests only give us the overall importance ranking of all variables in
effect and some interaction effects among certain variables on the probability of four
events, namely conversion, churn, winning back unconverted customers, and winning
back churned customers, I rely on logistic regression to provide the parameter estimates
6.2.1 Self-care
The major results for the analysis of self-care channels, and related activities, are
during their trial period and subsequent paid period are strongly significant (p-value <
0.01) in predicting all four events: customer conversion, customer churn, winback of
non-converted customers, and winback of churned customers. However, the only positive
effect of adopting self-care is on winning back churned customers (β = 0.3862,
p-value < 0.01), which barely supports H1a. Using self-care during the trial
period actually decreases customers’ likelihood of converting from free to fee and of
subsequently reactivating service with the company (β = -0.5640 and -0.7260 respectively, p-value <
0.01). In addition, the use of self-care during the retention period increases
customer churn probability. All these negative effects of using self-care could suggest that
customers mainly contact the service provider voluntarily when they encounter problems,
and that they use self-care to seek help from the company. To investigate further whether this
assumption holds, I proceed to examine the interaction and marginal effects of
customers’ usage of each of the three self-care channels (e.g. call a live agent, use the online help
channel, or use interactive voice response) and the activity they performed (e.g. Change
2) Interaction and marginal effects of self-care channel usage and activities: The
interactions between self-care channels and self-care activities here are not the
product of their marginal effects but the number of times two events happen
agent to make payment within 60 days of their churn or the right-censored date of March
1st, 2012. From Table 7, I find that the estimates of all marginal effects and of the
interactions of self-care channels and self-care activities are insignificant except the
value with regard to churn probability (β = -0.1722, p-value < 0.05). So statistically, the
more phone calls a customer makes to a customer representative to make payments within
60 days of the churn period end date, the lower the probability that he or she will churn. The
result of overall performance of the interaction and marginal effects of self-care channel
following:
respectively, p < 0.001) indicates that there is a positive association between trial period
marketing volumes and conversion and retention. Also, the marketing volume beyond the
60 days increases the probability of winning back both unconverted and churned
customers; hence H2a is largely supported. For the marketing communication volume
in the churn period, the more outbound marketing, the higher the probability that the
customer will churn. This supports H2b and is also consistent with the reciprocal action
2) The response rate of customers to marketing efforts: all response rates except
customers’ response rate within 60 days of the trial end date are highly significant in
predicting conversion, churn, or winning back both unconverted and
churned customers. Moreover, a high response rate during the trial period beyond the 60
days of the trial end date increases the chance of conversion, but the response rate within 60
days of the trial end date is insignificant in predicting conversion. This could be explained by
the firm’s specific marketing practice of always sending mail to customers 30 days
before their trial end date to remind them to convert to paid customers. Thus the increase
in marketing volume and in the response rate within the 60 days before the end of the
conversion period may reflect not a real investment in the relationship but a compelling
message to persuade customers to become paid
customers. Customers’ response rate could also be forced to be high since they have to
is partially supported.
marginal effects are not significant in predicting winning back unconverted customers.
This could suggest that the unconverted customers were not really
interested in the product that the company provides, so they resisted the company’s
marketing efforts. The interaction effects of all three channels both within and beyond 60
days of both the trial end date and the churn end date are not significant in predicting conversion,
churn, and winback; therefore, the synergy or multiplicative effect that prior researchers
(Naik and Raman 2003; Godfrey et al. 2011) found in their studies is not supported by
this study. On the contrary, the additive effect of multichannel communication (denoted
customer’s decision on conversion, churn, and winback. This result is consistent with the
and churn, with telemarketing being the most effective one. Therefore H4a is supported
Price_range are highly significant in predicting customer churn and winback. In particular,
the positive values for all four coefficients indicate that the more discounts a customer
gets, the more likely he or she is to churn and to be won back. Therefore both H6a and H6b are
partially supported.
5) Average price a customer pays per month: A negative main effect and a
positive quadratic coefficient for the average price in predicting both churn and winback
show that there is indeed a U-shaped relationship between the average price that a
customer pays and the probability that he or she can be reacquired. Therefore H7b is
fully supported by the result. Meanwhile, the average price that a customer pays per
month has a negative main effect on churn: the higher the price a customer pays each
month, the less likely he or she is to churn. However, once this diminishing effect
reaches a certain point, a higher price results in higher customer churn. So H7a is
partially supported. Both results imply that, in order to win back lost customers, the
company should target both low-value and high-value customers, since they are the ones
most likely to be recaptured. To reduce customer churn, the company has to control price
increases so that they do not drive its most valuable customers away.
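The U-shape described above can be made concrete with a short worked expression. This is
only a sketch under an assumption not stated in this section: that the probability p of
churn (or winback) is modeled with a logit-type link in which the average monthly price x
enters both linearly and quadratically. The symbols \beta_0, \beta_1, \beta_2 and x^{*}
are introduced here purely for illustration and are not the estimates reported in the tables.

```latex
% Assumed illustrative specification (not the estimated model itself):
\operatorname{logit}(p) = \beta_0 + \beta_1 x + \beta_2 x^2,
\qquad \beta_1 < 0,\ \beta_2 > 0

% Turning point: set the derivative with respect to price x to zero
\frac{\partial\,\operatorname{logit}(p)}{\partial x} = \beta_1 + 2\beta_2 x = 0
\quad\Longrightarrow\quad
x^{*} = -\frac{\beta_1}{2\beta_2} > 0
```

Under this assumption the predicted probability falls as price rises toward x^{*} and rises
beyond it, which is the U-shape. For prices below x^{*} the marginal effect
\beta_1 + 2\beta_2 x is negative but shrinks in magnitude, which corresponds to the
diminishing effect mentioned in the text.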
The results of the analysis of the effect of radio density on customer conversion,
churn and winback are displayed in Table 9. The negative coefficient of Radio_Total
in predicting conversion shows that the availability and variety of products are not the
reason that customers convert from free to fee. In addition, the insignificant coefficient of
Radio_Total in predicting the winback of churned customers partially supports H5b. Finally,
the negative effect of radio density on customer churn partially supports H5a. In short, radio
density can indeed decrease customer churn but shows no effect in winning back
churned customers.
duration in predicting the winback of both unconverted customers and churned customers
indicate that there is indeed an inverted U-shaped relationship between prior duration and
customer winback. This result indicates that companies should not target customers
who have had extremely short or extremely long durations with them, but should focus on the customer
CHAPTER 7
This study aims to explore how different factors play roles during the three
stages of CRM, namely acquisition, retention and winback of lost customers, by
identifying the most influential factors for each stage and examining how each factor’s
marginal effect evolves in shaping customers’ decisions on conversion, churn and
reactivation of service with the company. With the grant of a large CRM dataset from a large
multimedia company in the country through WCAI, which covers all three stages of
CRM activities as well as the text data from call center notes, this dissertation examines
variables from the text mining of call center notes, as well as the customer self-reported
deactivation reasons, impact customers’ decisions through the three stages of CRM on
whether to convert to a paid customer, to churn, or to reactivate their service with the
firm. Meanwhile, by applying a machine learning method, random forests with its
variable importance ranking, this dissertation also sheds light on which factors, among
all the factors from both the customer side and the firm side, most affect customer
conversion, churn and winback. In summary, this study obtained the following findings.
This study reveals that: 1) during the trial period, the content of the
customer call center notes is the top indicator of whether customers will convert or
not; 2) after customers have become paid customers, the marketing actions that the
company takes, which include the volume of its communication with the customer, the
depth of the promotion it offers to each customer, the communication channels it
uses to reach customers, as well as the radio density in different areas, all play major
roles in retaining customers and preventing them from churning; 3) the factors that most
influence whether the company can win back lost customers who only tried its
products but never converted to paid customers are the same ones that most affect
conversion; and the factors that are most predictive of churn or no-churn are also highly
self-reported deactivation reasons are not top predictors compared with other factors
throughout all three stages of CRM. Since all deactivation reasons carry a negative
impact that could affect the company’s CRM, and since this impact is company-specific, the absence of an effect of
dissatisfaction or competition issue for the specific company I study. 5) Customer
usage of self-care service increases the likelihood of customer retention and of winning back
churned customers but decreases the likelihood of customer conversion and of winning back
unconverted customers. 6) Both acquisition and retention durations affect all three events
of conversion, churn and winback, and they both have a nonlinear relationship with the
many customers as possible and retain them as long as they can, this study helps them to
identify which factors are more important in shaping customers’ decisions on
whether to stay with the company or terminate the service, which can help the company better
terms of data range and variety, as well as the possible factors that impact customer
acquisition, retention and winback, largely enriched the CRM literature and especially the
This study has several limitations. First, it focuses only on
predicting whether a customer will convert, churn or be won back, but did not have
apply the same factors to calculate customer lifetime value, which is what CRM in
each company is most concerned with. Therefore, one future direction is to extend the
current research to customer lifetime value analysis. Second, due to time limitations, this
study could not conduct a true time series study with machine learning methods but merely
divided the data into two time frames, namely within and beyond 60 days of the trial and
paid periods. Hence the second possible direction is to explore machine learning methods
that can deal with true time series data. Third, the psycholinguistic analysis of the
customer’s conversion, churn and winback is limited, and I included all 64 LIWC
variables in the prediction of conversion, churn and winback. A possible extension is to
select only certain variables from the 64 LIWC variables and examine how different
psycholinguistic factors evolve throughout the three stages of CRM and their impact on
another important aspect in CRM, but I only used the data for customers who own one
vehicle. Therefore, including customers with more than one vehicle in the analysis of
Summary
Even though CRM research has flourished in marketing with the availability of
CRM databases, none of the existing research has examined all three stages of CRM
together and how factors’ effects evolve throughout customer acquisition, retention and
winback. Customer winback is extremely under-researched, even though studies have shown
that winback can greatly cut a company’s costs and achieve greater returns. This dissertation
REFERENCES
Aitchison, J., and S. D. Silvey, (1957). “The Generalization of Probit Analysis to the
Case of Multiple Responses,” Biometrika. 57, 253–262.
Albert, J.H, and S. Chib, (1993). “Bayesian Analysis of Binary and Polychotomous
Response Data,” Journal of the American Statistical Association. 88(422):669–679.
Allenby, Greg M., Robert P. Leone, et al. (1999). “A Dynamic Model of Purchase
Timing with Application to Direct Marketing,” Journal of American Statistical
Association. 94 (June). 365–74.
Amit, Y. and Geman, D. (1997). “Shape quantization and recognition with randomized
trees,” Neural Computation. 9, 1545–1588.
Anderson, Eric and Duncan Simester (2004), "Long-Run Effects of Promotional Depth
on New Versus Established Customers: Three Field Studies," Marketing Science, 23 (1),
4-21.
Andrews, R. L., A. Ainslie, et al. (2002). “An empirical comparison of logit choice
models with discrete versus continuous representations of heterogeneity,” Journal of
Marketing Research. 39(4):479–487.
Bass, F. M. (1969). “A new product growth for model consumer durables,” Management
Science. 15(5):215–227.
Bass, F. M., N. Bruce, et al. (2007). “Wearout effects of different advertising themes: A
dynamic Bayesian model of the advertising–sales relationship,” Marketing Science.
26(2):179–195.
Berger, P., T. Magliozzi. (1992). “The effect of sample size and proportion of buyers in
the sample on the performance of list segmentation equations generated by regression
analysis,” Journal of Direct Marketing. 6(1):13–22.
Blattberg, Robert C. and John Deighton (1996). “Manage Marketing by the Customer
Equity Test,” Harvard Business Review. 74 (July/August). 136–44.
Bock, H–H., (2002). The Goal of Classification. Handbook of Data Mining and
Knowledge Discovery. 254–258, Oxford University Press.
Boutell, M. R., J. Luo, et al. (2004). “Learning multi–label scene classification,” Pattern
Recognition. 37(9):1757–1771.
Boulding, W., A. Kalra, et al. (1993). “A Dynamic Process Model of Service Quality –
from Expectations to Behavioral Intentions,” Journal of Marketing Research 30(1):7–27.
Boyer, K.K., Hallowell, R., et al. (2002). “E–Services: operating strategy—a case study
and a method for analyzing operational benefits,” Journal of Operations Management.
20 (2):175–188.
Brank J., Grobelnik M., et al. (2002). “Interaction of Feature Selection Methods and
Linear Classification Models,” Proc. of the 19th International Conference on Machine
Learning, Australia, 2002.
Berger, Paul, Ruth Bolton, Douglas Bowman, Elten Briggs, V. Kumar, A. Parasuraman,
and Creed Terry (2002), "Marketing Actions and the Value of Customer Assets: A
Framework for Customer Asset Management," Journal of Service Research, 5 (1), 39-55
Campbell, D. and F. Frei (2010). “Cost Structure, Customer Profitability, and Retention
Implications of Self–Service Distribution Channels: Evidence from Customer Behavior
in an Online Banking Channel,” Management Science 56(1):4–24.
Crammer, K. and Y. Singer (2000). “On the learnability and design of output codes for
multiclass problems,” Comput. Learning Theory. 35–46.
Cui, D. P. and D. Curry (2005). “Prediction in marketing using the support vector
machine,” Marketing Science, 24(4):595–615.
Cui, G., M. L. Wong, et al. (2006). “Machine learning for direct marketing response
models: Bayesian networks with evolutionary programming,” Management Science,
52(4):597–612.
Davis, D. 1989. Perceived usefulness, perceived ease of use and user acceptance of
information technology. MIS Quart. 13(3).318–339.
Elisseeff, A., and J. Weston, (2002). “A kernel method for multi–labelled classification,”
Advances in Neural Information Processing Systems. 14.
Fader, P. S., B. G. S. Hardie, K. L. Lee, (2005). “RFM and CLV: Using iso–value curves
for customer base analysis,” Journal of Marketing Research, 42(4):415–430.
Fletcher, Keith, and Colin Wheeler, and Julia Wright (1992). “Success in Database
Marketing: Some Critical Factors,” Marketing Intelligence & Planning. 10, 18–23.
Forman, G., (2003). “An Experimental Study of Feature Selection Metrics for Text
Categorization,” Journal of Machine Learning Research. 3:1289–1305.
Friedman J. H. (1997). “Data Mining and Statistics: What’s the connection?” Available at:
http://www.stat.standford.edu/~jhf/.
Friedman J, Hastie T, Tibshirani R (2000). “Additive Logistic Regression: A Statistical
View of Boosting,” The Annals of Statistics. 28(2):337–407.
Friedman J. H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine,”
The Annals of Statistics. 29(5):1189–1232.
Friedman J (2002). “Stochastic Gradient Boosting,” Computational Statistics & Data
Analysis, 38(4):367–378. doi:10.1016/S0167–9473(01)00065–2.
Freund Y, and R. Schapire. (1996). “Experiments with a New Boosting Algorithm,” In
International Conference on Machine Learning. 148–156.
Freund, Yoav and Robert E. Schapire (1997); “A Decision–Theoretic Generalization of
On–Line Learning and an Application to Boosting,” Journal of Computer and System
Sciences. 55(1):119–139
Froehle C., A. Roth. 2004. New measurement scales for evaluating perceptions of the
technology-mediated customer service experience. Journal of Operations Management
22(1) 1–21.
Ganesh, J., M. J. Arnold, et al. (2000). “Understanding the customer base of service
providers: An examination of the differences between switchers and stayers,” Journal of
Marketing. 64(3):65–87.
Gelman, Andrew, J. B. Carlin, H.S. Stern, and D. B. Rubin (2004) Bayesian Data
Analysis, Second Edition, Chapman & Hall/CRC.
Ghose, A. and P. G. Ipeirotis (2011). “Estimating the Helpfulness and Economic Impact
of Product Reviews: Mining Text and Reviewer Characteristics,” IEEE Transactions on
Knowledge and Data Engineering. 23(10):1498–1512.
Godfrey, A., K. Seiders, et al. (2011). “Enough Is Enough! The Fine Line in Executing
Multichannel Relational Communication,” Journal of Marketing. 75(4):94–109.
Griffin, Jill and Michael W. Lowenstein (2001). Customer Win–back: How to Recapture
Lost Customers–And Keep Them Loyal. San Francisco: Jossey–Bass.
Gurland, J., I. Lee, et al. (1960). “Polychotomous Quantal Response in Biological Assay,”
Biometrics. 16, 382–398.
Haenlein, M., A. M. Kaplan, D. Schoder, (2006). “Valuing the real option of abandoning
unprofitable customers when calculating customer lifetime value,” Journal of Marketing.
70(3):5–20.
Hastie, Trevor, Robert Tibshirani, Jerome Friedman (2009). The Elements of Statistical
Learning, Second Edition, Springer.
Hayes, P. J., P. M. Andersen, et al. (1990). “Tcs: a shell for content–based text
categorization,” In Proceedings of CAIA–90, 6th IEEE Conference on Artificial
Intelligence Applications. 320–326.
Hill, T., M. O’Connor, et al. (1996). “Neural network models for time series forecasts,”
Management Science. 42(7):1082–1092.
Hill, A.V., Collier, D.A., Froehle, et al. (2002). “Research opportunities in service process
design,” Journal of Operations Management. 20(2):189–202.
Hitt, L. M. and F. X. Frei (2002). “Do better customers utilize electronic distribution
channels? The case of PC banking,” Management Science. 48(6):732–748.
Ho, T. K. (1998). “The random subspace method for constructing decision forests,” IEEE
Trans. on Pattern Analysis and Machine Intelligence. 20(8). 832–844.
Hogan, John E., Donald R. Lehmann, Mario Merino, Rajendra K. Srivastava, Jacquelyn
S. Thomas, and Peter C. Verhoef (2002). “Linking Customer Assets to Financial
Performance,” Journal of Service Research. 5(August). 26–38.
Kantor, J. R. (1953). The Logic of Modern Science. Bloomington IN: Principle Press.
Kim, B. D., M. Shi, et al. (2001). “Reward programs and tacit collusion,” Marketing
Science. 20(2):99–120.
Kim, Y., W. N. Street, et al. (2005). “Customer targeting: A neural network approach
guided by genetic algorithms,” Management Science. 51(2):264–276.
Kivetz, R. and I. Simonson (2002). “Earning the right to indulge: Effort as a determinant
of customer preferences toward frequency program rewards,” Journal of Marketing
Research. 39(2):155–170.
Kivetz, R. and I. Simonson (2003). “The idiosyncratic fit heuristic: Effort advantage as a
determinant of consumer response to loyalty programs,” Journal of Marketing Research.
40(4):454–467.
Kreßel, U., (1999). Pairwise Classification and Support Vector Machines, in Advances in
Kernel Methods–Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J.
Smola, Eds. Cambridge, MA: MIT Press, 1999, 255–268.
Lemmens, A. and C. Croux (2006). “Bagging and boosting classification trees to predict
churn,” Journal of Marketing Research. 43(2):276–286.
Lewis, M. (2006). “Customer acquisition promotions and customer asset value,” Journal
of Marketing Research. 43(2):195–203.
Li, T., & Ogihara, M. (2003). Detecting emotion in music. Proceedings of the
International Symposium on Music Information Retrieval.
Mahajan, Vijay, Paul E. Green, S. M. Goldberg (1982). “A conjoint model for measuring
self– and cross–price/demand relationships,” Journal of Marketing Research. 19(3). 334–
342.
Mahajan, V., E. Muller, R. A. Kerin. (1984). “Introduction Strategy for New Products
with Positive and Negative Word–of–Mouth,” Management Science. 30(12):1389–1404.
McCullagh, P. (1980). “Regression Models for Ordinal Data,” Journal of the Royal
Statistical Society. Ser. B, 109–142.
McCullagh, P., and J. Nelder. (1983). Generalized Linear Models. Chapman and Hall,
London.
McKelvey, R., and Zavoina, W. (1975). “A Statistical Model for the Analysis of Ordinal
Level Dependent Variables,” Journal of Mathematical Sociology. 4:103–120.
Mitchell, T.M. (1996). Machine Learning. McGraw Hill, New York, NY.
Musalem, A. and Y. V. Joshi (2009). “How Much Should You Invest in Each Customer
Relationship? A Competitive Strategic Approach,” Marketing Science. 28(3):555–565.
Nahshon Wingard, “CRM History: The Evolution Of Better Customer Service,”
http://www.streetdirectory.com/.
Nam, S., P. Manchanda, et al. (2010). “The Effect of Signal Quality and Contiguous
Word of Mouth on Customer Acquisition for a Video–on–Demand Service,” Marketing
Science. 29(4):690–700.
Oliveira, P., A.V. Roth, et al. (2002). “Achieving competitive capabilities in E–services,”
Technological Forecasting and Social Change. 69(7):721–731.
Petrison, Lisa A., Robert C. Blattberg, et al. (1997). “Database Marketing Past, Present,
and Future,” Journal of Direct Marketing. 7(3):27–43.
Piramuthu, S., H. Ragavan, et al. (1998). “Using feature construction to improve the
performance of neural networks,” Management Science. 44(3):416–430.
Riedmiller, M., and H. Braun. (1993). “A direct method for faster backpropagation
learning: the rprop algorithm,” Proceedings of the IEEE International Conference on
Neural Networks (ICNN). 1:586–591.
Rogers, E. M. (1995). Diffusion of Innovations, 4th ed. Free Press, New York.
Roth, A.V. (2000). Service Strategy and the Technological Revolution: The 7 Myths of
E–Services. In: Machuca, J.A.D., Mandakovic, T. (Eds.). POM Facing the New
Millennium: Evaluating the Past, Leading with the Present and Planning the Future of
Operations. POM Sevilla, 159–168.
Rosset, S., Ji Zhu et al. (2004). “Boosting as a regularized path to a maximum margin
classifier,” Journal of Machine Learning Research. 5:941–973.
Rossi, P. E. and G. M. Allenby (1993). “A Bayesian–Approach to Estimating Household
Parameters,” Journal of Marketing Research. 30(2):171–182.
Rossi, P. E., G. M. Allenby, et al. (2005). Bayesian Statistics and Marketing. John Wiley
& Sons Ltd.
Schweidel, D. A., Peter. S. Fader, et al. (2008). “A Bivariate Timing Model of Customer
Acquisition and Retention,” Marketing Science. 27(5):829–843.
Shaw, R. and Stone, M. Database Marketing. New York: John Wiley & Sons, 1988.
Slonim, N. and N. Tishby, (2001). “The power of word clusters for text classification,” In
Proceedings of ECIR–01, 23rd European Colloquium on Information Retrieval Research
(Darmstadt, DE, 2001).
Soucy P. and Mineau G., (2003). “Feature Selection Strategies for Text Categorization,”
AI 2003, LNAI 2671, 505–509.
Sousa P., Pimentao J. P., et al. (2003). “Feature Selection Algorithms to Improve
Documents Classification Performance,” LNAI 2663, 288–296.
Stauss, Bernd and Christian Friege (1999). “Regaining Service Customers,” Journal of
Service Research. 1(4). 347–61.
Sun, M. H., A. Stam, et al. (1996). “Solving multiple objective programming problems
using feed–forward artificial neural networks: The Interactive FFANN Procedure,”
Management Science. 42(6):835–849.
Tellis, G. J. and F. S. Zufryden (1995). “Tackling the Retailer Decision Maze – Which
Brands to Discount, How Much, When and Why,” Marketing Science. 14(3):271–299.
Thieme, R. J., M. Song, et al. (2000). “Artificial neural network decision support systems
for new product development project selection,” Journal of Marketing Research.
37(4):499–507.
Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso,” Journal of the
Royal Statistical Society. Series B (Methodological). 58(1):267–288.
Villanueva, J., S. Yoo, et al. (2008). “The impact of marketing–induced versus word–of–
mouth customer acquisition on customer equity growth,” Journal of Marketing Research.
45(1):48–59.
Wolf, A. (1926). Essentials of Scientific Method. New York: Macmillan Company.
Wedel, M., W. A. Kamakura, et al. (2000). “Marketing data, models and decisions,”
International Journal of Research in Marketing. 17(2–3). 203–208.
West, P. M., P. L. Brockett, et al. (1997). “A comparative analysis of neural networks and
statistical methods for predicting consumer choice,” Marketing Science. 16(4):370–391.
Weston, J. and C. Watkins, (1999). “Multi–class support vector machines,” the Proc.
ESANN99. M. Verleysen, Ed., Brussels, Belgium, 1999.
Xue, M., L. M. Hitt, et al. (2011). “Determinants and Outcomes of Internet Banking
Adoption,” Management Science. 57(2):291–307.
Yang, S., G. M. Allenby, et al. (2002). “Modeling variation in brand preference: The
roles of objective environment and motivating conditions,” Marketing Science. 21(1):14–
31.
Zeithaml, Valarie A., Berry, Leonard L. & Parasuraman, A. (1996). “The Behavioral
Consequences of Service Quality,” Journal of Marketing, 60, 31-46.
Zeithaml, V.A., A. Parasuraman, et al. (2002). “Service quality delivery through web
sites: a critical review of extant knowledge,” Journal of the Academy of Marketing
Science. 30(4):362–375.
Zhang, M.–L., and Z.–H. Zhou, (2005). “A k–nearest neighbor based algorithm for
multi–label classification,” Proceedings of the 1st IEEE International Conference on
Granular Computing.
Zhu, J., H. Zou, S. Rosset, et al. (2009). “Multi–class AdaBoost,” Statistics and Its
Interface. 2:349–360.
Zou, H., and T. Hastie, (2005). “Regularization and variable selection via the elastic net,”
J.R. Statist. Soc B. 67(2):301–320.
Zou, H., J. Zhu, and T. Hastie, (2008). “New Multicategory Boosting Algorithms Based
on Multicategory Fisher–Consistent Losses,” The Annals of Applied Statistics.
2(4):1290–1306.
Appendices:
Appendix A
Category              Abbrev   Examples                    Words in category

Linguistic Processes
Total function words  funct                                464
Total pronouns        pronoun  I, them, itself             116
Personal pronouns     ppron    I, them, her                70
Impersonal pronouns   ipron    It, it’s, those             46
Articles              article  A, an, the                  3
Common verbs          verb     Walk, went, see             383
Auxiliary verbs       auxverb  Am, will, have              144
Past tense            past     Went, ran, had              145
Present tense         present  Is, does, hear              169
Future tense          future   Will, gonna                 48
Adverbs               adverb   Very, really, quickly       69
Prepositions          prep     To, with, above             60
Conjunctions          conj     And, but, whereas           28
Negations             negate   No, not, never              57
Quantifiers           quant    Few, many, much             89
Numbers               number   Second, thousand            34
Swear words           swear    Damn, piss, fuck            53

Psychological Processes
Social processes      social   Mate, talk, they, child     455
Family                family   Daughter, husband, aunt     64
Friends               friend   Buddy, friend, neighbor     37
Humans                human    Adult, baby, boy            61
Affective processes   affect   Happy, cried, abandon       915
Positive emotion      posemo   Love, nice, sweet           406
Negative emotion      negemo   Hurt, ugly, nasty           499
Anxiety               anx      Worried, fearful, nervous   91
Anger                 anger    Hate, kill, annoyed         184
Sadness               sad      Crying, grief, sad          101
Cognitive processes   cogmech  cause, know, ought          730
Insight               insight  think, know, consider       195
Causation             cause    because, effect, hence      108
Discrepancy           discrep  should, would, could        76
Tentative             tentat   maybe, perhaps, guess       155
Certainty             certain  always, never               83
Inhibition            inhib    block, constrain, stop      111
Inclusive             incl     And, with, include          18
Exclusive             excl     But, without, exclude       17
Perceptual processes  percept  Observing, heard, feeling   273
See                   see      View, saw, seen             72
Hear                  hear     Listen, hearing             51
Feel                  feel     Feels, touch                75
Biological processes  bio      Eat, blood, pain            567
Body                  body     Cheek, hands, spit          180
Health                health   Clinic, flu, pill           236
Sexual                sexual   Horny, love, incest         96
Ingestion             ingest   Dish, eat, pizza            111
Relativity            relativ  Area, bend, exit, stop      638
Motion                motion   Arrive, car, go             168
Space                 space    Down, in, thin              220
Time                  time     End, until, season          239

Personal Concerns
Work                  work     Job, majors, xerox          327
Achievement           achieve  Earn, hero, win             186
Leisure               leisure  Cook, chat, movie           229
Home                  home     Apartment, kitchen, family  93
Money                 money    Audit, cash, owe            173
Religion              relig    Altar, church, mosque       159
Death                 death    Bury, coffin, kill          62

Spoken categories
Assent                assent   Agree, OK, yes              30
Nonfluencies          nonflu   Er, hm, umm                 8
Fillers               filler   Blah, Imean, youknow        9