Certified Analytics Professional (CAP®)
&
Associate Certified Analytics Professional (aCAP™)
EXAMINATION STUDY GUIDE
FOREWORD
As chair of the Analytics Certification Board, I congratulate the Study Guide committee on having assembled in short
order such a comprehensive study guide for the Certified Analytics Professional (CAP®) program.
I know the guide is not going to satisfy everyone or directly provide them with answers for the test. It isn’t designed
to do so. It is designed to provide some information on central concepts embedded in the CAP program. It is up
to the individual to determine his/her familiarity with the concept and decide whether more review or study on that
topic is warranted.
The examination has 100 multiple-choice questions, each with only one correct answer. The questions are both vendor and software neutral, designed to confirm that the test taker has the underlying knowledge necessary to know which steps to follow in an analytics process and to select the correct tools. The exam covers seven domains, or areas of analytics practice: business problem framing, analytics problem framing, data, methodology (approach) selection, model building, deployment, and model life cycle management. A sample of the type of questions is available with this guide and can also be accessed through the Candidate Handbook. These sample questions will never appear on an exam. Each sample gives not only the correct answer but also the rationale for why each option is correct or incorrect.
You are to be applauded for seeking certification. While the exam is the most pressing hurdle to achieving the CAP, it is not the only criterion. The Certified Analytics Professional program depends on each of the five E's: adherence to the code of Ethics, Effective mastery of soft skills, acceptable levels of Experience and Education, and finally, successful passage of the Exam. The result of this program is a well-rounded analytics professional who can work in many fields to provide analytic leadership and support.
The Analytics Certification Board wishes all candidates complete success in their certification process. If I, or they, can
be of help, feel free to contact me at acb@informs.org or email our Certification Manager at info@certifiedanalytics.org.
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION TO THE CAP® PROGRAM
  About the Professional Job Task Analysis
CHAPTER 2: DOMAIN I – BUSINESS PROBLEM FRAMING
  Learning Objectives
  Key Concepts/Fundamentals
  Summary
  Further Reading
CHAPTER 3: DOMAIN II – ANALYTICS PROBLEM FRAMING
  Learning Objectives
  Key Concepts/Fundamentals
  Summary
  Further Reading
CHAPTER 4: DOMAIN III – DATA
  Learning Objectives
  Key Concepts/Fundamentals
  Objective 3. Determine how and why to harmonize, rescale, clean and share data
  Objective 6. Use data analysis results to refine business and analytics problem statements
  Summary
  Further Reading
CHAPTER 5: DOMAIN IV – METHODOLOGY (APPROACH) SELECTION
  Learning Objectives
  Summary
  Further Reading
CHAPTER 6: DOMAIN V – MODEL BUILDING
  Learning Objectives
  Summary
  Further Reading
CHAPTER 7: DOMAIN VI – DEPLOYMENT
  Learning Objectives
  Summary
  Further Reading
CHAPTER 8: DOMAIN VII – MODEL LIFE CYCLE MANAGEMENT
  Summary
  Further Reading
APPENDIX A: SOFT SKILLS
  Learning Objectives
  Task 1: Talking intelligibly with stakeholders who are not fluent in analytics
  Summary
  Further Reading
APPENDIX B: USING THE STUDY GUIDE TO HELP PREPARE FOR THE CAP® EXAM
GLOSSARY
List of Figures
Figure 4: Black Box Sketch by Alan Taber, CAP® (used with permission)
Figure 6: Sample software application characteristics by Rami Musa, CAP® (used with permission)
The Institute for Operations Research and the Management Sciences (INFORMS)
is an international scientific society with more than 11,000 members, including
Nobel Prize laureates, dedicated to applying scientific methods to help improve
decision making, management, and operations. Members of INFORMS work in
business, government, and academia.
The goal of the CAP® program is to advance the analytics profession by providing a high-quality program of certification and by promoting continuing competence for practitioners.
For more information on the development of the CAP program, read “Steering
Toward Analytics” in OR/MS Today, June 2013 (p. 30) by Gary Bennett and Jack
Levis.
ABOUT THE PROFESSIONAL JOB TASK ANALYSIS
The job task analysis (JTA) study defines the current knowledge, skills, and abilities (KSAs) that
must be demonstrated by analytics professionals to effectively and successfully
provide these services. KSAs are validated according to their frequency of use and
importance. The JTA also serves as a “blueprint” for the content (performance
domains) of the INFORMS CAP® examination.
Members of the JTA working group included:
• Arnold Greenland (IBM Global Business Services)
• Bill Klimack (Chevron)
• Jack Levis (UPS)
• Daymond Ling (Canadian Imperial Bank of Commerce)
• Freeman Marvin (Innovative Decisions, Inc.)
• Scott Nestler (Naval Postgraduate School)
• Jerry Oglesby (SAS)
• Michael Rappa (North Carolina State/Institute for Advanced Analytics)
• Tim Rey (Dow Chemical)
• Rita Sallam (Gartner)
• Sam Savage (Stanford/Vector Economics)
The findings of this working group were then validated by a random sample of
practicing analytics professionals. Feedback from this survey resulted in slight
modifications of the performance domains, tasks, and knowledge that comprise
the test blueprint that determines the content of the CAP® examination.
In developing the JTA, members of the working group relied on their knowledge
of practice gained from years of experience, academic program content,
corporate job descriptions in analytics, and articles from professional and scholarly
publications.
The final domains and weights were derived from the JTA and a review of validation survey recommendations. The INFORMS CAP® examination is based on the test blueprint derived from the JTA process; the final agreed-upon weights reflect the percentage of questions from each domain that will be included in each test form.
The JTA and the test blueprint resulting from this process will be reviewed
periodically and updated as needed to reflect current practices in analytics. The
list of domains and key tasks follows:
Domain I – Business Problem Framing
  T-5 Define an initial set of business benefits
  T-6 Obtain stakeholder agreement on the business problem statement
Domain III – Data (The ability to work effectively with data to help identify potential relationships that will lead to refinement of the business and analytics problem.)
Domain IV – Methodology (Approach) Selection (The ability to identify and select potential approaches for solving the business problem.)
Domain V – Model Building (The ability to identify and build effective model structures to help solve the business problem.)
Domain VI – Deployment (The ability to deploy the selected model to help solve the business problem.)
Domain VII – Model Life Cycle Management (The ability to manage the model life cycle to evaluate business benefit of the model over time.)
The knowledge statements for the CAP® program have been identified but not individually assigned to each task. The knowledge statements appropriate to a given task have been used. Not all statements are appropriate for all tasks; although there may appear to be some blanks in coverage, this is not the case.
K-3 Client business processes (i.e., the processes used by the client or
project sponsor that are related to the problem)
*Tasks that are beyond the scope of the CAP® certification exam and will not be tested.
K-7 Performance measurement (i.e., the technical and business
metrics by which the client and the analyst measure the success of
the project)
THE FIVE E’S
The five E’s are ethics, education, experience, examination, and effectiveness.
These are the five pillars of the Certified Analytics Professional.
The CAP® credentialed person will have read, agreed to, and signed the code of ethics that governs the behavior of a professional analyst. This code was created by the task force whose members were among the originators of the program (see Figure 1). The code is intended to describe the accepted behavior of an analytics professional. All candidates for the CAP® must agree to the code of ethics as part of the application process. Actions contrary to the code of ethics may be grounds for rescinding the CAP® credential.
Examination is the fourth leg or pillar of the CAP® program. Through examination
we seek confirmation that the applicant has knowledge of those areas of the job/
task analysis that are considered essential for practice. Because the examination is
based on a broad spectrum of practice rather than the content of a course or series
of courses, it must be constructed with due care. Each test item or question has
been created carefully so as to ensure a fair, valid, and reliable examination that
discriminates against no one except for those who do not have the knowledge to
earn the CAP® credential. Each item is reviewed and refined numerous times by
a committee of subject matter experts in the field of analytics. The sole reason to
use a test item is as a tool to determine who is knowledgeable. Because there may
be a lot riding on the successful completion of the exam, the test items must be
carefully crafted.
All test items are written with reference to the specific domain, task, and knowledge statements outlined earlier. Test items are also sourced to ensure that the underlying material is readily available and should be known to any practicing analytics professional. No items are written based on proprietary data or sources that are known only to a select few. For examples, see the Candidate Handbook, which contains 24 questions that are indicative of the style of test item but that do not themselves appear on the exam. In the future, additional items may be released from the item bank for use as practice test questions. The CAP® program is so new that INFORMS does not yet have items that have outlasted their usefulness as a tool to distinguish between the knowledgeable and those who do not yet possess the knowledge.
The rules for item writing are specific and few:
• Do not use ‘All of the above’ or ‘None of the above’ as answer options
• Avoid disadvantaging any part of the test population but the unknowing
• Ensure that the incorrect answers are incorrect for a specific reason
Effectiveness is the art of applying your knowledge and skill in a way that enables
achievement of your organization’s goals. The soft skills required are dealt with
more fully in Appendix A: Soft Skills. Nevertheless, the skilled analyst must be
diplomatic and aware enough to understand the context of the business problem
and the stakeholder agendas involved while not allowing that understanding to
bias the process or the truth thereby developed.
The Certified Analytics Professional (CAP®) program is not the work of one person
or one department: it would not have been possible without the support of
professionals in the field. You can see a long list of those professionals on the
INFORMS website under Contributors (www.informs.org/Certification-Continuing-
Ed/Analytics-Certification/Contributors).
If you have comments on the guide, the certification program, or wish to assist with
the further development and dissemination of the CAP® program, please feel free
to e-mail certification@informs.org.
CHAPTER 2
DOMAIN I – BUSINESS PROBLEM FRAMING
In this chapter, you will learn about the first step of an analytics project: framing the
business problem. You will learn, as a part of these processes, how to determine
the business problem, identify and enlist stakeholders, determine if the problem
has an analytics solution, refine the problem statement as necessary, and define
the set of business benefits.
Learning Objectives
1. Determine the business problem
2. Identify stakeholders
3. Determine whether the problem is amenable to an analytics solution
4. Refine the problem statement
5. Define an initial set of business benefits
6. Obtain stakeholder agreement on the business problem statement
Key Concepts/Fundamentals
Another factor to consider is that the client firm representatives in these meetings
also play an important role in what is reported and how it is reported. It is natural
that each representative (of the firm) uses their own lenses and contexts to report
(and thus frame) the way they see the problem. These views are all very important
on their own merits because they inform the analyst in some useful way. However,
because of the individual lenses used to report these observations, sometimes
these views can have a good degree of variance regarding causes and effects, and
thus may obscure the real issues.
• Who: Who are the stakeholders? Stakeholders satisfy one or more of the following with respect to the project: funding, using, creating, or being affected by the project's outcome.
• Where: Where does the problem occur? Or where does the function need to be performed? Are the physical and spatial characteristics articulated?
Of the five W's, who (the stakeholders are) is probably the most critical to the long-term success of the project. Stakeholders are anyone affected by the project, not just those in the initial meetings, and they may have different levels of input or involvement during the project. A stakeholder analysis helps identify the following:
OBJECTIVE 3. DETERMINE WHETHER THE PROBLEM IS AMENABLE TO AN
ANALYTICS SOLUTION
Before more time and money are spent on solving the problem, it is time to figure out whether this problem is likely to have an analytics solution. First, do the answer and the change process to get there lie within the organization's control? Second, does the requisite data exist, or can it be obtained? Third, can the likely problem be solved and/or modeled? Last, but perhaps most importantly, can the organization accept and deploy the answer? The problem may not be amenable to an analytics solution because of the characteristics of the problem or the limitations of the analytic tools/methods available. The problem statement could be reassessed to make it amenable to the available analytic tools/methods, or, if this is not possible, the project deemed not feasible. If there isn't a feasible way forward, the ethical analyst will say so to the key stakeholders.
For the Seattle plant example, it may be decided to use mathematical optimization
software to improve the plant’s process. This will work as long as data exist on inputs
and outputs for each step in the plant process, and as long as the stakeholders are
willing to accept new ways of operating that won’t necessarily match current work
policies and procedures.
OBJECTIVE 4. REFINE THE PROBLEM STATEMENT
After the initial analysis, it may be necessary to refine the problem statement to make it more accurate, more appropriate to the stakeholders, or more amenable to available analytic tools/methods. As part of this process, it will become necessary to define the constraints under which the project will operate. These constraints could be analytical, financial, or political in nature.
For the Seattle plant example, an optimization problem with a large number of constraints or a complex objective function may not be solvable within the capability of the available software/hardware combination. In this case the problem may need to be restated with fewer constraints and/or a less complex objective function. This may cause the problem statement to be updated to make sure that the approach will satisfy the project's constraints: desired accuracy and repeatability, program cost, timeframe, and the number of stakeholders impacted, either positively or negatively, to name a few.
OBJECTIVE 5. DEFINE AN INITIAL SET OF BUSINESS BENEFITS
With the problem statement set, it is now possible to define the initial set of business benefits. These benefits may be determined quantitatively or qualitatively. If quantitative, they may be financial (e.g., net present value) or contractual (e.g., service level agreements). This is also known as the business case.
For the Seattle plant example, an initial determination of the financial benefit due to optimal use of resources should be made, along with an initial view of the required project goals, e.g., the plant is currently losing money at the rate of 3% of gross sales and needs to come to a 5% margin on gross sales. The key profit driver is on-time performance, which is currently 68% and needs to reach 98%. How will it get there? At this stage we think it is because plant capacity is being wasted, so we're going to look at optimizing our scheduling and manufacturing processes to reduce overall time by reducing queue and wait time. You'll note that we haven't said, yet, that we're going to simulate incoming orders with one distribution and the performance of each machine on the floor with its own distribution, even though we may be thinking about doing just that. At this stage, the problem is a business problem and the objectives are business objectives.
OBJECTIVE 6. OBTAIN STAKEHOLDER AGREEMENT ON THE BUSINESS PROBLEM STATEMENT
With the problem statement refined and the initial business benefits determined, it is necessary to obtain stakeholder agreement before proceeding further with the project. It may be necessary to repeat this cycle several times until stakeholder concurrence with the particulars of the project is achieved and permission to proceed is granted. At the end of this process, you will have agreement on the project's objectives, initial approach, and resources to get there.
SUMMARY
FURTHER READING
Nixon NW (2013) Focus first on framing, not solving, the problem. April 18, http://philadelphia.regionsbusiness.com/print-edition-commentary/focus-first-on-framing-not-solving-the-problem/.
Seelig T (2013) Shift your lens: The power of re-framing problems. Seelig T, ed. inGenius: A Crash Course on Creativity (HarperOne, New York), http://stvp.stanford.edu/blog/?p=6435.
Spradlin D (2012) The power of defining the problem. September 25, http://blogs.hbr.org/cs/2012/09/the_power_of_defining_the_prob.html.
CHAPTER 3
DOMAIN II – ANALYTICS PROBLEM FRAMING
This chapter is all about the dialogue between the business people who have a problem they need to solve and the analytics folks who will give them the information required to solve it. This dialogue is mediated by the analytics professional (YOU), who is trusted by both sides because you are fluent in the language and culture of each side. As with any translation effort between two different groups, much of what follows is simple precepts for keeping the sense of the business problem while decomposing it into actionable analytics pieces.
Learning Objectives
Key Concepts/Fundamentals
There’s an apocryphal story of a Black & Decker sales convention. The VP of sales
gets up to the dais, and says, “Folks, I have some bad news for you. We’ve done
some detailed customer surveys to find out what our customers care about. They
couldn’t care less about our carbide tips, or the voltage rating of our drills. In fact,
they’d rather not think about drills at all! What our customers want is to hang a
picture, or put up drywall, or do any number of other jobs. Our job is to help them
do just that.” Similarly, your business and operational stakeholders likely could not
care less about how you and your team are going to solve their problem. They just
want it to be solved reliably and deliver the results.
The first step is to decode the business problem statement to get to the analytics
problem. There are many ways to do this, some more formal than others. In simple
terms, you are translating the “what” of the business problem into the “how” of
the analytics problem.
For example, a company wishes to increase market share, but what is the underlying
problem they need to address? Are they, for instance, emphasizing carbide-tipped
drills to someone who only wants to hang a picture?
Whether you are formally decomposing and parsing a complex business statement,
or you are less formally brainstorming with a project sponsor, it is critically important
to account for tacit as well as formal requirements. The best known model in this
area is Kano’s requirements model (Figure 2). It distinguishes between unexpected
customer delights, known customer requirements, and customer must-haves that
are not explicitly stated.
requirements,” not the “expected requirements.” As the analytics professional
charged with translating business requirements into the problem statement, you
really need to probe to make sure that you have the entire appropriate context as
well, including the expected requirements.
These next three items are related. Your input/output functions are strongly related
to your assumptions about what is important about this problem as well as the key
metrics by which you’ll measure the organizational response to the problem.
We’ll start by defining the input/output functions of the problem at hand. As with
any of these areas, you can be as formal or informal as you like, but sketches
and diagrams certainly help communicate with your stakeholders and help get
everyone on the same page.
Once you have these inputs and a general sense of their predicted effects, you
have a choice of how to communicate them to the team at large. A simple table
(Figure 3) is one approach. A black box sketch (Figure 4) is another approach.
How you do it isn’t nearly as important as doing it in a way that the people you’re
working with will understand.
[Figures 3 and 4 depict the software-testing example: inputs such as Test Team Size, Test Intensity, Test Level, Interface Changes, and Location Changes map to outputs such as Rate of software defect detection and Remaining Defects.]
Even these simple examples help illustrate the concept. The idea here is to make
the inputs visible and start getting agreement among the team on the direction
and scale of the relationships to bound the problem and to create the related
hypotheses that you’ll use later to attack the data. A point you’ll want to emphasize
to the team is that these are preliminary assumptions and while your best estimate
is needed, it is still just an estimate and is subject to change depending on what
reality turns out to be. The danger we’re trying to avoid here is what Kahneman
calls “anchoring.” People have a tendency to hang on to views that they’ve seen
and held before, even if they are incorrect. Reminding them that these are initial
and preliminary, rather than finalized views, helps mitigate the anchoring effect.
This is where you set the boundaries of the problem. As you look at your input drivers, each likely has one or more embedded assumptions that need to be surfaced and listed. Additionally, some complexities can be trimmed away if their presumed effect on the answer is less than the effort required to handle them.
As Stephen R. Covey (2004, p. 24) said, “We simply assume that the way we see
things is the way they really are or the way they should be. And our attitudes and
behaviors grow out of these assumptions.” Common practice assumptions in your
organization also need to be listed and questioned regularly to ensure that they
are either still valid or that the problem statement needs to change to incorporate
changes to them.
OBJECTIVE 4. DEFINE THE KEY METRICS OF SUCCESS
Although you've been in touch with your business stakeholders at some level all along, this is when you come back to them to walk them through your assumptions and approach and what the final answer will look like, to be sure that you really are answering the business problem. Whether or not this takes the form of a formal presentation, you want your assumptions acknowledged, along with the reframing you did of the business problem and the key metrics you will use to mark progress toward the solution.
The output of this stakeholder agreement will vary by organization, but should
include the budget, timeline, interim milestones (if any), goals, and any known
effort that is excluded as out of scope. The key is to get all the pieces we’ve
noted in this chapter verbally discussed, documented, and visibly agreed to by
all parties. It can be tempting to settle for e-mails or written documents only and
desk-side reviews. For all but the simplest problems, this is a mistake. Translation
of problems from the business domain to the analytics domain, or truly from any
given domain to another domain, requires that all parties agree to definitions
and terms, which really does require full and frank discussion. Otherwise, errors
will creep in and what was delivered will miss critical unstated requirements. If
you allow your project to rely on written communication only, you’ve missed the
opportunity to correct misapprehensions when it is still cheap to do so.
SUMMARY
The analytics problem framing domain culminates in a full and frank review of the approach with the business stakeholders and the analysts to ensure that the problem can be attacked as planned and that a successful attack will yield the desired business result.
FURTHER READING
Albright SC, Winston W, Zappe C (2011) Data Analysis and Decision Making, 4th
ed. (South-Western Cengage Learning, Mason, OH).
Covey S (2004) The 7 Habits of Highly Effective People (Simon & Schuster, New
York).
Crow KA (1992) Quality Function Deployment, http://www.ieee.li/tmc/quality_function_deployment.pdf.
CHAPTER 4
DOMAIN III - DATA
Learning Objectives
3. Determine how and why to harmonize, rescale, clean and share data
Key Concepts/Fundamentals
Data reduces our uncertainty about the values assigned to variables of interest in
the analysis.
Analysis typically uses "hard data," i.e., data that is obtained by scientific observation and measurement (e.g., experimentation). But much of our information is frequently soft, e.g., gleaned from interviews and reflective opinions and preferences. Hence it will be important to convert this soft information into scientific data. The traditional way in which soft data is converted into hard data is to hypothesize an artificial individual whose preferences and beliefs can be completely described with hard data. (In economics, this artificial individual is called the "economic man" and is viewed as totally rational.) We then determine what hard data would be required so that this artificial individual's behavior coincides with that of the actual individual with soft data. We then solve the analytical problem as if our actual individual could be described by this artificial individual.
While conjoint analysis focuses on assessing utility functions for known outcomes, the decisions that will be informed by analysis are typically gambles that do not have guaranteed outcomes. As a result, it becomes important to extend the concept of utility to gambles with uncertain outcomes. To construct these utilities, define an experiment where M is some best possible outcome and m is a worst possible outcome. Consider a gamble that leads to M with frequency f and m otherwise. Again consider a carefully designed laboratory environment where the individual must decide between a fixed consequence and the gamble. Then there will be some maximum value of f for which the individual still prefers the consequence to the gamble. This maximum value measures the utility the individual places on the consequence.
In gathering data, it is usually important to have some measure of the confidence placed on each of the various data points. To translate this notion of confidence into something tangible, consider two individuals, both of whose measures of belief in event E are described by the subjective probability p. Consider a carefully designed laboratory experiment in which each individual observes one success in one trial. Each individual's new belief in the event is then measured. Suppose the resulting value for both individuals is U. A parallel experiment is then run in which the individual's belief in event E is measured after the individual, instead of observing a success in the one trial, observes a failure. Let L be the measure of belief which both individuals now have in the event. Now suppose that one individual's original assessment of p is based only on observing n trials. (More precisely, we assume that the individual had a non-informative prior over p and then updated it based on the information in n trials.) Then it can be shown that n = 1/(U - L). Suppose that the other individual's beliefs are based on soft data. Then for analytical purposes, it is still legitimate to use 1/(U - L) as a measure of confidence in p.
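A quick numeric check makes this concrete. The sketch below adds an assumption not in the guide's text, namely that belief is modeled as a Beta distribution; under that assumption, 1/(U - L) recovers the trial count up to a small constant.

```python
# Minimal check of the confidence measure, assuming belief in event E is a
# Beta(a, b) distribution with mean p = a/(a+b) and "sample size" n = a+b.
a, b = 7, 3
n = a + b
p = a / n                 # subjective probability, 0.7

U = (a + 1) / (n + 1)     # belief after additionally observing one success
L = a / (n + 1)           # belief after additionally observing one failure

print(U - L)              # 1/(n+1): the spread shrinks as experience grows
print(1 / (U - L))        # 11, i.e., recovers n up to a small constant
```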
The focus of this stage is on identifying which kinds of data collection will have the most favorable impact on the quality of the actions and recommendations supported by the analysis. An especially useful tool for doing this analysis is the decision tree. (While the decision tree as applied to uncertainty was formalized in the mid-twentieth century, it can be argued that the Pythagorean Y might have been the first decision tree.) Consider the following very simple decision tree where there are two choices: continue the present course or make a specific change. If a change is made, the outcome of the change could be favorable or unfavorable. We can write this decision tree in outline form as
a. Continue the present course
b. Implement a change
There are two possible outcomes of making the change. If the chance of getting a good outcome is high enough, then it will be better to implement the change. Otherwise implementing the change will be unwise. For example, suppose that we attach a probability p to getting a good outcome if we make a change. Suppose we believe that U is the value (utility) of making a change with the good outcome, L is the value (utility) of making a change if the poor outcome occurs, and u is the value (utility) of continuing the present course. Then we will only make a change if
p U + (1 - p) L > u.
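A minimal sketch of this decision rule follows; the numbers are illustrative assumptions, not values from the guide.

```python
# Make the change only when its expected utility beats the present course.
p = 0.6    # probability of a good outcome if we make the change
U = 100    # utility of the change with a good outcome
L = -40    # utility of the change with a poor outcome
u = 30     # utility of continuing the present course

expected_utility_of_change = p * U + (1 - p) * L   # 0.6*100 + 0.4*(-40) = 44
if expected_utility_of_change > u:
    print("Implement the change")                  # 44 > 30, so change
else:
    print("Continue the present course")
```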
Suppose we find that the best decision (i.e., the one with highest utility to the
customer) is to continue the present course. Then we will get utility score u.
But instead of simply making a decision, we could have chosen to gather data and then make our decision based on the results of the data gathering exercise. If we choose to gather data, then our decision tree becomes
1. Gather data and observe favorable information
   a. Continue the present course
   b. Implement a change
2. Gather data and observe unfavorable information
   a. Continue the present course
   b. Implement a change
Now suppose we gather data and get favorable information. This increases the
probability of getting a good outcome given we implement a change. Suppose
the change in probability is not enough to justify implementing the change. So
our conditional decision is if we get favorable information, we continue with the
present course. Now suppose that instead of getting favorable information, our
data gathering led us to collect unfavorable information. This lowers the probability
of getting a good outcome given we implement a change. As a result, our other
conditional decision is if we get unfavorable information, continue with the present
course. Thus our two conditional decisions are if we get favorable information,
we continue with the present course; if we get unfavorable information, continue
with the present course. Hence regardless of the outcome of the information, we
continue our present course. This simple example demonstrates an important principle: before collecting information, think about everything you might discover from collecting it. If none of these discoveries would lead you to change your decision, then do not collect the information. Of course, sometimes people collect information, even though they know what decision they will make, in order to defend themselves against criticism from others. And sometimes people collect information to postpone making the decision.
When would information be valuable? Suppose that the favorable information led
to a substantial change in the probability of getting a good outcome. Suppose
that this change in probability was enough to justify implementing the change.
Then our two conditional decisions would be if we get favorable information,
implement a change; if we get unfavorable information, continue with the present
course. We can assign a value (or utility) to these two conditional decisions. Let u* be the utility of implementing a change, given that we get favorable information. Let u be the utility of continuing the present course, given that we get unfavorable information. Let q be the probability of our getting favorable information if we collect data. Then the utility if we decide to gather data will be
q u* + (1 - q) u.
Since the utility if we did not gather data was u, this tells us that our overall utility has increased from u to q u* + (1 - q) u. Since u* > u, collecting the information can only improve our utility. This demonstrates a well-established principle: the value of information is non-negative, i.e., it can never make you worse off if you behave rationally.
But in reality, there is a cost to collecting this information. Suppose that paying this cost would reduce our utility by some factor d. Thus our utility if we collect information is
d (q u* + (1 - q) u).
What determines u*? Before making a decision, the chance of getting a good outcome after making a decision was p. Suppose that if we get favorable information, this probability changes to p*, while if we get unfavorable information, it changes to p**. Then if q is the chance of getting favorable information, the rules of probability require that p = q p* + (1 - q) p**. Thus while the utility of making the change was originally
p U + (1 - p) L,
it is now
u* = p* U + (1 - p*) L = p* (U - L) + L.
So the critical value u* depends upon p* and, in particular, on how much p* differs from p.
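The whole calculation can be put together in a few lines. The sketch below uses illustrative numbers that are assumptions, not values from the guide.

```python
# Value-of-information sketch following the notation above.
q = 0.5                    # chance the gathered data is favorable
p_fav, p_unf = 0.8, 0.2    # P(good outcome) after favorable/unfavorable news
U, L, u = 100, -40, 35     # utilities: change+good, change+poor, present course

p = q * p_fav + (1 - q) * p_unf      # consistency: prior P(good) = 0.5
u_star = p_fav * (U - L) + L         # utility of changing after good news: 72

utility_without_data = max(p * U + (1 - p) * L, u)    # 35: keep present course
utility_with_data = q * max(u_star, u) + (1 - q) * u  # 53.5: act on the signal
print(utility_without_data, utility_with_data)

d = 0.9                               # cost of collection as a utility factor
print(d * utility_with_data)          # 48.15: data still worth collecting here
```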
The degree to which the new information can change the value of p depends upon the confidence in the original value of p as well as on the impact of the data. One key question is: if the new information tells us something unexpected (i.e., the favorable outcome), how much will our initial beliefs change? But given that they do change, we need to know what the potential payoff might be. In this example, U was the maximum payoff if we knew for certain that there would be a good outcome. If the potential payoff, U, were small, then gathering more information would also be pointless.
The final consideration is cost. Since analysts often collect information from the
client’s subject matter experts, it is important to treat the time of these subject
matter experts as precious. If they feel their time is being wasted, then they will
complain to the client who will eventually begin to wonder about the value of
doing your analysis. There are many cases in which an organization chooses a
flawed heuristic over a more sophisticated procedure just because the flawed
heuristic seems to require less painful information collection.
There are also privacy issues. Invasion of privacy can lead to a loss of customer goodwill and, in some cases, legal repercussions. And if we are gathering information that is potentially proprietary, intellectual property issues become paramount. The fact that information technology has made it easier to collect information does not mean that information collection is costless.
Once you identify the variables on which you should collect data, the next step is
collecting that data. Data collection is analogous to asking certain subjects certain
close-ended questions under certain circumstances. Hence there are five steps
involved in data collection:
3. Determining the questions to be asked
SAMPLE DESIGN
E[Y] = g(a1 X1 + ... + an Xn)
Because time is often an important dimension, there is a separate body of time-series methods for observations collected over time. Time-series analysis typically corrects for seasonal patterns (e.g., unusually high sales during holiday seasons) and provides a natural way of identifying trends.
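As a concrete illustration, the sketch below decomposes a synthetic monthly sales series into trend and seasonal components. The use of Python's statsmodels library here is an assumption for illustration only; the guide itself is software neutral.

```python
# Separate trend from a repeating holiday-season bump in monthly sales.
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Three years of synthetic monthly sales with a November/December spike.
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
sales = pd.Series([100 + 2 * i + (30 if m in (11, 12) else 0)
                   for i, m in enumerate(idx.month)], index=idx)

result = seasonal_decompose(sales, model="additive", period=12)
print(result.trend.dropna().head())   # underlying trend, seasonality removed
print(result.seasonal.head(12))       # the repeating holiday pattern
```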
SAMPLING PLAN
DETERMINING THE QUESTIONS TO BE ASKED
A key issue in designing the experiment is determining the nature of the variable being assessed. Is the variable categorical (e.g., values of the variable are blue, red, white), where there is no natural ordering between the values? When we have categorical scales, the data can be summarized by the proportion of observations which assumed each of the possible values of the categorical variable (e.g., the proportion of blue responses, red responses, etc.).
One can ask YES/NO questions or multiple-choice questions for nominal scales. One extension (Likert-type questions) asks subjects to indicate whether they fully agree, partially agree, are neutral, partially disagree, or fully disagree with the statement.
Alternatively, the variable might be ordinal (e.g., short, medium, tall), where there is a natural ordering between the values of the variable. When we have ordinal scales, it is possible to define a normalized quantity for each response x as the fraction of responses less than or equal to x (e.g., the fraction of people who are either short or medium).
A second approach, the semantic differential, has the form "What is your experience navigating our website?" with answers like "very hard, somewhat hard, okay, somewhat easy, very easy," where the two ends of the scale represent opposites. In this case, the response is ordinal.
In both Likert and semantic differential scales, the response scales may be improved by providing concrete examples of what would have to be true for a "fully agree" or a "fully disagree" response to be true.
Alternatively, the variable might be interval (e.g., thirty degrees centigrade, forty degrees centigrade, fifty degrees centigrade), where the differences between values (e.g., forty degrees minus thirty degrees) are meaningful. Note that when we have interval scales, it is possible to define a normalized quantity for each response x by subtracting the lowest possible value from x and dividing the result by the difference between the highest and lowest values.
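Both the ordinal and interval normalizations described above are easy to express in code. The sketch below uses illustrative category orderings and values.

```python
# Two normalizations: empirical CDF for ordinal data, min-max for interval data.
def ordinal_normalize(responses, x):
    """Fraction of responses less than or equal to x (empirical CDF)."""
    order = {"short": 0, "medium": 1, "tall": 2}   # illustrative ordering
    return sum(order[r] <= order[x] for r in responses) / len(responses)

def interval_normalize(x, lo, hi):
    """Rescale an interval-scale value to [0, 1]."""
    return (x - lo) / (hi - lo)

print(ordinal_normalize(["short", "medium", "medium", "tall"], "medium"))  # 0.75
print(interval_normalize(40, 30, 50))  # forty degrees on a 30-50 scale -> 0.5
```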
It is important to remember that subjects will often answer a question even when they have no idea what is being asked or what their answer means. (For example, individuals will generally answer the question of which is more important, diamonds or water, even though the answer clearly depends upon whether the individual feels that the choice is between having no water at all for a week (and dying of dehydration) or simply going without an added glass of water for an hour or two.) Questions must be designed with care.
While some data needs to be created and collected, some data already exists. The
purpose of extraction is to collect all this data from the many sources in which it
appears so that it can eventually be loaded into a common database. In extracting
this data, it is critical to know the data source from which each data element was
taken, i.e., the data must be traceable to its source. If the results of an analysis
depend critically on the data element, then understanding the validity of this data
element becomes critical. In addition, if there is some change in the clients for the
analysis, it will be important to transition the database to reflect the data sources
which these new clients consider important. This requirement is called traceability
and typically requires careful documentation.
OBJECTIVE 3. DETERMINE HOW AND WHY TO HARMONIZE, RESCALE, CLEAN &
SHARE DATA
Data cleaning, while often the least glamorous phase of analysis, is often the most necessary. This is especially the case with pre-existing databases. Because pre-existing databases were collected for other purposes, the quality of the data will be driven by what was important in the original use of this data and hence need not satisfy the quality requirements of the analysis at hand. For example, vendors often have to fill in various forms in order to get reimbursed for their services. Sometimes third parties successfully get their own questions added to these forms. But both vendor and buyer are primarily interested in the fields which determine how much the vendor gets compensated for its services. As a result, these decision-relevant fields get scrutinized carefully and the rest do not.
There are many other reasons why survey quality may be deficient:
1. Individuals asked to fill out a lengthy survey will get fatigued and simply put in default values so that they can finish the survey. If there are five possible answers to a survey question, they may simply check the neutral response. Or in a survey of satisfaction, they may either indicate that they are satisfied with everything or satisfied with nothing.
3. Biases can often arise because most people, when asked to fill out a survey,
simply refuse. Those who did fill out the survey are often people with more
leisure time or with more emotional commitment to the organization
asking that the survey be filled out.
Cleaning such data typically involves steps such as the following (a sketch in code appears after this list):
1. Identifying the range of valid responses for each question and labeling the data field
2. Identifying invalid data responses (e.g., where letters are used where numbers are required)
4. Identifying suspicious data responses (e.g., when physically questionable numbers are put in for a response). Are there outliers that don't seem to make sense?
5. Identifying suspicious distributions of values (e.g., when one finds that 99% of the respondents in a survey of poor neighborhoods have incomes of more than a million dollars). Descriptive statistics can be very helpful in identifying suspicious distributions. For example, histograms specify the frequency with which various data responses are used. Box and whisker charts as well as stem and leaf plots provide compact descriptions of the variation in the data within a field and help identify outliers. Scatterplots show how the value of one set of variables depends on another. Summary statistics like the mean, median, and upper and lower fractiles can also be useful.
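The sketch below illustrates checks 2, 4, and 5 with pandas; the column name, values, and outlier threshold are illustrative assumptions, not prescriptions from the guide.

```python
# Validity and outlier checks on a survey income field.
import pandas as pd

df = pd.DataFrame({"income": ["49000", "52000", "58000", "61000",
                              "abc", "1500000", None]})

# Check 2: invalid responses (letters where numbers are required) become NaN.
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Checks 4 and 5: summary statistics and fractiles expose suspicious values.
print(df["income"].describe())                 # count, mean, median, fractiles
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)                                # flags the million-dollar response
```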
So a key part of data cleaning is determining whether the data makes sense. It
also involves handling null or missing values. There are several possible solutions:
It is important to determine whether important observations (e.g., observations from a specific group of sub-users) are missing.
A field should be created with the date of each observation (a date stamp). A field should also be created identifying the data source from which the information was collected. This field will be important in the next step, where information from different data sources is combined into a single database.
While the individual responses come from different data sources, they need
to be placed into a common database (which typically is organized into rows
representing observations and columns representing observed characteristics of
that observation). This requires that all of the data be summarized at a common
level of granularity.
For example, we might have 1,000 observations of one product, 5,000 observations of a product and its location, and 3,000 observations of a product, its option content, and its location. If details about a product's location are not relevant for the analysis, then we can sum up our observations so that all data are at this less granular level. In other cases, we need to go to the more granular level. If we simply dropped all the observations that did not have this information, there could be insufficient sample size to support a meaningful analysis. Alternatively, we may rewrite all of our 9,000 records at the more granular level with fields for the product, its location, and its option content. We must then treat many of our records as if they had missing values for location and option content.
In some cases, the model may require information on a variable which is not in the
database but can be computed from items in the database. This may require the
creation of a new field in the database for this derived variable.
In some cases, a single observation may reflect the responses of 10,000 people while another observation may reflect the responses of 100 people. Rather than creating a database with 10,100 rows for these two observations, it may be useful to introduce a weighting field that identifies the number of respondents associated with each observation.
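A brief sketch of combining two sources while adding the source, date-stamp, and weighting fields described above; all names and values are illustrative.

```python
# Merge two survey sources into one table with provenance and weights.
import pandas as pd

survey_a = pd.DataFrame({"satisfaction": [4.2], "n_respondents": [10000]})
survey_b = pd.DataFrame({"satisfaction": [3.1], "n_respondents": [100]})

survey_a["source"], survey_b["source"] = "survey_a", "survey_b"
combined = pd.concat([survey_a, survey_b], ignore_index=True)
combined["date_stamp"] = pd.Timestamp("2014-06-01")  # when each row was observed

# Weighted mean instead of 10,100 separate rows.
w = combined["n_respondents"]
print((combined["satisfaction"] * w).sum() / w.sum())  # about 4.19
```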
Because different datasets are typically generated with different data architectures
and different programming languages, these languages may use different
standards for encoding information. Thus missing values can be represented by
spaces, the words NA, the words Not/Available, etc.
Some decisions may be required in how to handle textual fields. This could be handled by creating numeric columns describing the textual field and, without deleting the textual field, using those columns to classify it. For example, the textual field might contain verbatim user expressions of satisfaction. A column might be created which expresses the encoder's interpretation of that field as expressing satisfaction, dissatisfaction, or neutrality.
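A sketch of that encoding follows; the keyword-based coding function is a stand-in for the human encoder's judgment, not anything prescribed by the guide.

```python
# Add a numeric interpretation column alongside the verbatim text field.
import pandas as pd

df = pd.DataFrame({"verbatim": ["Love the new checkout!",
                                "The site kept crashing.",
                                "It was fine."]})

def encode_satisfaction(text):
    """Toy stand-in for the encoder's judgment of each verbatim response."""
    text = text.lower()
    if any(w in text for w in ("love", "great")):
        return 1      # satisfaction
    if any(w in text for w in ("crash", "fail", "hate")):
        return -1     # dissatisfaction
    return 0          # neutrality

df["satisfaction_code"] = df["verbatim"].apply(encode_satisfaction)
print(df)   # the verbatim column is kept, per the text above
```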
Before loading the database, it is useful to assess whether certain fields have the
same value across all datasets. If this is the case, then it may be worth deleting
those fields.
The data is then loaded into the common database. Information is typically normalized so that any given item of information occurs in the database exactly once. This is the place to do some final checks on the quality of the data:
3. Consistency: Is the data provided under a given field and for a given
concept consistent with the definition of that field and concept?
8. Common Format: Is the data in a format easily used in the application for
which it is intended?
10. Cost-effective: Is the cost of collecting and using the data commensurate
with its value?
The term data warehouse is generally used to describe:
1. A staging area, i.e., the operational data sets from which the information is extracted
3. Access layers, i.e., multiple OLAP (online analytical processing) data marts which store the data in a form that is easy for the analyst to retrieve
The data mart is organized along a single point of view (e.g., time, product type, geography) for efficient data retrieval. It allows analysts to (see the sketch after this list):
1. slice data, i.e., filter data by picking a specific subset of the data cube, choosing a single value for one of its dimensions;
2. dice data, i.e., group data by picking specific values for multiple dimensions;
3. drill down/up, i.e., navigate from the most summarized view (high level) to the most detailed (drill-down); and
4. roll up, i.e., summarize the data along a dimension (e.g., computing totals or using some other formula).
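Rough analogues of these four operations on a toy data cube are sketched below in pandas; the choice of tool and the data are illustrative, since any OLAP environment offers the same operations.

```python
# Slice, dice, roll-up, and drill-down on a toy data cube.
import pandas as pd

cube = pd.DataFrame({
    "year":    [2013, 2013, 2014, 2014],
    "product": ["drill", "saw", "drill", "saw"],
    "state":   ["WA", "WA", "OR", "OR"],
    "sales":   [120, 80, 150, 90],
})

sliced = cube[cube["year"] == 2013]                       # slice: fix one dimension
diced = cube[(cube["year"] == 2014) & (cube["product"] == "drill")]   # dice
rollup = cube.groupby("product")["sales"].sum()           # roll up along a dimension
drill_down = cube.groupby(["product", "state"])["sales"].sum()  # more detail
print(rollup)
```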
Fact tables are used to record measurements or metrics for specific events at a fairly granular level of detail. Transaction fact tables record facts about specific events (like sales events), snapshot fact tables record facts at a given point in time (like account details at month end), and accumulating snapshot tables record aggregate facts at a given point in time. Dimension tables have a smaller number of records than fact tables, although each record may have a very large number of attributes. Dimension tables include time dimension tables, geography dimension tables, product dimension tables, employee dimension tables, and range dimension tables.
Each dimension is typically arranged into hierarchies, e.g., the geography dimension might be arranged into stores, cities, states, and countries. These hierarchies are often dynamic, e.g., a firm may redraw its organizational boundaries. In the star schema, there is often a single fact table with many dimension tables surrounding it.
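A toy star schema makes this concrete: one fact table joined to two dimension tables, then rolled up along a hierarchy. All table and column names below are illustrative.

```python
# Star schema in miniature: fact table plus two dimension tables.
import pandas as pd

fact_sales = pd.DataFrame({"store_id": [1, 1, 2], "date_id": [101, 102, 101],
                           "units": [5, 3, 7]})
dim_store = pd.DataFrame({"store_id": [1, 2], "city": ["Seattle", "Portland"]})
dim_date = pd.DataFrame({"date_id": [101, 102], "month": ["Jan", "Feb"]})

star = fact_sales.merge(dim_store, on="store_id").merge(dim_date, on="date_id")
print(star.groupby(["city", "month"])["units"].sum())   # roll up the hierarchy
```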
This leads to a data mart which will serve the analysts in an efficient manner. However, the data warehouse and data marts are not finished until they are documented in a way that makes them usable by external parties. While it is tempting to assume that the modeler will know what the variables mean, the reality is that there will often be requests to revisit the data months or years after the analysis is done. These requests may come from the client, or they may come from peer reviewers interested in replicating your work. In this case, failure to document your data fields as well as the sources of the data can be very costly.
The many ways of understanding data can be organized into nine steps. The following list, from Booz Allen Hamilton's The Field Guide to Data Science, describes some of the techniques which can be useful in implementing each of these steps:
1. Filtering
d. Sensitivity analysis and wrapper methods are typically essential when you don't know which features of your data are important. Wrapper methods, unlike sensitivity analysis, typically involve identifying a set of features on a small sample and then testing that set on a holdout sample.
4. Extracting features
d. Fast Fourier Transforms and Discrete wavelet transforms are used for
frequency data.
b. Box plots, scatter plots, and box and whisker plots provide compact representations of how data is distributed. But when the data can be reasonably described by parametric distributions, distribution fitting provides an even more efficient way of summarizing the data.
7. Segmenting the data to find natural groupings
f. For text data, topic modeling allows for segmentation of the data
c. Hidden Markov models are useful in estimating an unobservable state
based on observable values.
Having solid data and relationships allows the first true refinement of your analytics and business problem, as you now have the ability to go beyond anecdotes and describe the situation with some level of mathematical rigor. You may find at this point that the true constraint of the system isn't what you thought it was, and that the analytics problem therefore needs to be reframed around that newly surfaced constraint. Or you may find that the business problem itself missed a key facet (interrelationships between customers and purchases, a time-series effect in the data, or anything else) that needs to be included before continuing. Once in a while, you actually do get the business problem and the analytics problem right the first time, and you can proceed to selecting your methodology and creating your model.
SUMMARY
It is no accident that the CAP exam weights data the most heavily of the seven
domains. Without proper data gathering, cleaning, transformation, and loading,
all you have are nice anecdotes. With reliable data sorted usefully, you can actually
solve your problem in a meaningful way.
FURTHER READING
Vose D (2008) Risk Analysis: A Quantitative Guide, 3rd ed. (John Wiley & Sons,
Chichester, UK).
CHAPTER 5
DOMAIN IV – METHODOLOGY (APPROACH) SELECTION
In this chapter, you will learn about examples of the various methods available to analytics professionals and how to go about choosing some of them over others for a specific task. This chapter does not intend to offer an exhaustive list of such methods; instead, it is illustrative, conveying the process of selecting among methods. There are myriad analytics methodologies in the literature from which a modeler can select. Later in this chapter, we will list typically used analytics methodologies, their characteristics, classifications, and when to use them.
The selection of methods is best informed by the problem framing, by prior experience, and by the depth of the analytics professional's knowledge of the available methods, the problem at hand, etc. However, it is very possible that the problem at hand is rather new or not framed completely. In such situations, selecting the best methodology to solve a problem can be iterative in nature: a methodology may prove to be ineffective, so other methodologies may need to be tried as well. It is often the case that an analyst (the modeler) is not given enough time to explore many options. Therefore, utilizing experience and knowledge will certainly help to improve the chance of selecting an effective method in a timely manner.
Learning Objectives:
Almost all analytical models can be classified into one of three categories:
descriptive, predictive, and prescriptive. These three categories of models do as
their names imply.
• Optimization
  - Linear programming
  - Integer programming
  - Nonlinear programming
  - Network optimization
  - Dynamic programming
  - Metaheuristics
• Simulation-optimization
• Stochastic optimization
• Simulation
  - Discrete event
  - Monte Carlo
  - Agent-based modeling
• Regression
  - Logistic
  - Linear
  - Step-wise
• Statistical inference
  - Confidence intervals
  - Hypothesis testing
  - Analysis of variance
  - Design of experiments
• Classification
• Clustering
• Game theory
OBJECTIVE 2. SELECT SOFTWARE TOOLS
The following are the primary factors that an analyst generally considers to select
an appropriate methodology:
1. Time–Typically modelers work under tight timelines. They are faced with
the challenge of choosing the right methodology and quickly running their
modeling and analysis to answer the business needs.
2. Accuracy of the model needed – A model (and its level of aggregation)
influences the accuracy of the results. This is closely related to the quality and
readiness of the data, in addition to the level of accuracy the customer requests.
If the available data are not accurate, using a very accurate model may be a
waste of time. For instance, a modeler is advised not to seek optimal solutions
when there is more noise in the data than signal separating the optimal answer
from the alternatives (see the sketch below).
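As an illustration only (the CAP program itself is tool neutral), here is a minimal Python sketch of that noise-versus-signal point. Two hypothetical options differ in true value by 1 unit, but each observation carries 5 units of noise, so picking the "optimal" option from noisy data is only slightly better than a coin flip. All numbers are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(seed=42)
    true_means = {"option_A": 100.0, "option_B": 101.0}  # signal: 1 unit apart
    noise_sd = 5.0                                       # noise: 5 units

    # Repeatedly observe both options once, with noise, and pick the "winner."
    picks = []
    for _ in range(1000):
        sample = {k: rng.normal(m, noise_sd) for k, m in true_means.items()}
        picks.append(max(sample, key=sample.get))

    # The truly better option wins only slightly more than half the time, so
    # "optimizing" on one noisy sample would be misleading.
    print(picks.count("option_B") / len(picks))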
There are countless methodologies in the literature from which a modeler can
choose to solve a problem. Here are a few types of analytics methodologies that
are commonly used:
• Economic analysis: evaluation often used to guide the optimal allocation
of scarce resources
• Statistical inferences:
▪▪ Confidence intervals
▪▪ Hypothesis testing
• Design of experiments
• Data mining
• Forecasting
• Artificial intelligence:
▪▪ Fuzzy logic
▪▪ Expert systems
• Decision trees
• Optimization
▪▪ Linear programming
▪▪ Integer programming
▪▪ Combinatorial optimization
▪▪ Nonlinear programming
▪▪ Constraint programming
▪▪ Metaheuristics
▪▪ Greedy heuristics
• Markov chain
Even if you have all the time in the world, knowledge of every methodology, and
available, accurate data, it is highly desirable and advisable to run scenarios on the
“back of an envelope,” often referred to as quick and dirty (Q-n-D). That approach
may provide the high-level understanding needed to decide quickly among
strategies and/or to orient the applied methodology accordingly. In all of these
endeavors, it is important to communicate with your stakeholders so that they
understand your approach and its pros and cons. Your managers will often have
far less analytics depth than you, but they will likely understand the business
needs better.
A key point for the CAP® is that the exam is vendor and toolset neutral. The society
is looking for understanding of how to apply tools, not certifying people in the use
of a particular tool. A good analyst will have a tool chest with several different tools
to fit various situations.
Here are software categories from an analytics point of view as we see it:
• Spreadsheet systems
• Statistical systems
• Optimization systems
• Simulation systems
• Data management systems
▪▪ Structured data
▪▪ Unstructured data
Figure 6 shows a set of sample software applications that are used in analytics,
compared across several characteristics.
Figure 6. Sample software application characteristics. *This category includes
Monte Carlo, discrete-event, system dynamics, and agent-based simulation.
Developed models need to be both verified and validated. Verification means
making certain that the model is built the way it was designed and meant to be.
Validation means making certain that the model represents real life to an
acceptable level of accuracy. If the modeler realizes that validation and verification
are unachievable, then the modeler needs to consider other approaches.
To help the testing process, it is advisable to divide the data into three portions (a
minimal split sketch follows the list):
• Training – This portion of data is used to fit the model.
• Testing – This portion of data is used to verify that the model behaves as
it was designed.
• Validating – This portion of data is used to confirm that the model behaves
closely to the physical behavior being modeled.
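A minimal Python sketch (numpy assumed) of such a three-way split; the 60/20/20 proportions are an illustrative choice, not a CAP-mandated ratio:

    import numpy as np

    def three_way_split(n_rows, seed=0):
        idx = np.random.default_rng(seed).permutation(n_rows)
        n_train, n_test = int(0.6 * n_rows), int(0.2 * n_rows)
        return (idx[:n_train],                  # training: fit the model here
                idx[n_train:n_train + n_test],  # testing: built as designed?
                idx[n_train + n_test:])         # validating: matches real life?

    train_idx, test_idx, valid_idx = three_way_split(10_000)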
OBJECTIVE 4. SELECT APPROACHES
As this topic is not covered in the CAP®, suffice it to say that after you test your
models, it makes sense to go with the most accurate model that otherwise complies
with your time and cost constraints.
SUMMARY
In this chapter, we covered how a class of analytics is chosen and then how a
particular methodology, as well as a supporting environment (software), is selected.
There is certainly no single methodology to choose, but some are a better fit than
others given the available time, data, expertise, and relevance.
FURTHER READING
Big data: The next frontier for innovation, competition, and productivity, a
McKinsey & Company report. http://www.mckinsey.com/insights/business_
technology/big_data_the_next_frontier_for_innovation.
CHAPTER 6
DOMAIN V – MODEL BUILDING
Model building is at the heart of any analytical effort; it is the culmination of the
analytics problem framing activities. Good models depend on all previous steps: framing
the business problem; framing the analytics problem; and acquiring, exploring,
and scrubbing the data. Now it is time to develop a model to show key drivers of
your outcomes, forecast your targets, determine the best use of resources, etc.
Effective model building requires identifying the relevant inputs and selecting the
model that performs best on holdout or testing data. The emphasis in this chapter
is on statistical methods for predictive modeling.
You have learned to frame the problem, acquire and clean the data, and select
a methodology or modeling approach to solve the problem. Now it is time to
build the model. This chapter introduces the process of model building in a
business analytics context. You will learn to identify model structures appropriate
to your analytical objective. Different model structures require different data
characteristics. You will also learn about the importance of honest assessment and
data splitting to assess models. You will learn to select a champion model from
several candidates, and to communicate the key findings to stakeholders.
Learning Objectives
1. Identify and build effective model structures to help solve the business
problem
By this point in the project plan, you should have collected the data or set out a
data collection plan. However, much of the work with data occurs alongside the
model building process. Different models assume particular data structures. Do
you need transactional data? Individual-level data? Household? Do your values
have a time component? What kinds of summary statistics should be used to roll
up values from lower to higher levels? If transactions are coded in dollars, should
a household-level value consist of a sum, average, maximum, or something else?
The answers to these questions can depend on the business objective and what
you intend to learn from that variable; they can also depend on the class of models
you have selected. A minimal sketch of such household rollups appears below.
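For illustration only, a minimal pandas sketch of rolling transaction-level dollars up to the household level in several ways; the column names and amounts are hypothetical, and which summary is "right" depends on the business objective:

    import pandas as pd

    tx = pd.DataFrame({
        "household_id": [1, 1, 2, 2, 2],
        "amount":       [20.0, 35.0, 5.0, 5.0, 90.0],
    })

    # Several candidate rollups of the same transactions.
    household = tx.groupby("household_id")["amount"].agg(
        total="sum", typical="mean", largest="max", n_tx="count")
    print(household)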
Building the models and working the data into an appropriate form require
collaboration among the analyst, the data owner, and the subject matter expert.
The subject matter expert should have a clear vision for the types of characteristics
needed for modeling. Demographics, historical behavior, attitudinal surveys, and
other characteristics should be identified and carefully selected by someone who
understands the business problem. This was likely addressed while framing the
analytics problem statement. The analyst must pay close attention to data quality
requirements for modeling. For example, in some models, data should be equally
spaced, missing values should be handled, variance stabilizing transformations
should be applied, and so on. This is more likely to be addressed at the time of
model building than during data acquisition. The data owner needs to know how
to bring the characteristics together, from potentially disparate sources, to create
the data structure that the analyst requires. Some of this work will be done early in
the project, but much of the data cleaning necessarily occurs at the time of model
development, as each modeling type has its own data obstacles.
You have determined the model type(s) and gathered the appropriate data. The
next step is building and refining the model. Building the model is, for many
analysts, the most enjoyable part of an analytical project. Although it can require a
great deal of work to define a simulation model, thinking through the relationships
and identifying all relevant sources of variability is problem solving at its best.
There is invariably a sense of anticipation in preparing to evaluate the results of a
predictive model and its assessment on holdout data.
If you have fit several models, then it will be necessary to perform an honest
assessment of their performance so that a champion can be selected. This is
discussed next. One key consideration in running models is how the models
will be used later. For example, a model that will result in scoring activity should
have a way to score new observations without refitting the model or estimating
new parameters. Preferably, it should be possible to perform scoring in a real-
time production environment where specialized analytical software might not be
available.
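As one hedged illustration of that point, the sketch below (plain Python; the parameter values, variable names, and file name are all invented) saves fitted logistic coefficients once, so a production system can score new observations with simple arithmetic and no modeling software:

    import json, math

    params = {"intercept": -2.1, "coef": {"recency": -0.04, "spend": 0.002}}
    with open("model_params.json", "w") as f:   # done once, at fit time
        json.dump(params, f)

    def score(row, p):
        # Logistic score in [0, 1] from stored parameters; no refitting.
        z = p["intercept"] + sum(p["coef"][k] * row[k] for k in p["coef"])
        return 1.0 / (1.0 + math.exp(-z))

    with open("model_params.json") as f:        # done in production
        print(score({"recency": 10, "spend": 500}, json.load(f)))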
Honest assessment techniques can vary, and might include data splitting, k-fold
cross validation, leave-one-out cross validation, etc. What is critical in honest
assessment is that the observations used to fit the model and estimate parameters
are not the observations that are scored in assessment. The process, conceptually,
is relatively simple. Honest assessment with data splitting on a binary target is
described as follows.
1. Select a large sample of data for modeling. For a binary target, a good
practice is to ensure that you have at least 2,000 observations in the smaller
of the two target classes.
2. Split the data into training and validation (holdout) portions.
3. Fit models and estimate parameters using the training data.
4. Assess each model by scoring the validation observations, which were never
used in fitting (a k-fold variant of this idea is sketched below).
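A minimal numpy sketch of k-fold cross validation indexing, one of the honest assessment variants mentioned above (the fold count and data size are illustrative). Each observation is scored by a model that never saw it during fitting, which is the essence of honest assessment:

    import numpy as np

    def kfold_indices(n_rows, k=5, seed=0):
        idx = np.random.default_rng(seed).permutation(n_rows)
        for fold in np.array_split(idx, k):
            train = np.setdiff1d(idx, fold)
            yield train, fold   # fit on train; score only the held-out fold

    for train_idx, assess_idx in kfold_indices(n_rows=2000, k=5):
        # Fit on rows train_idx, then score rows assess_idx; the two sets
        # never overlap, as the check below confirms.
        assert len(np.intersect1d(train_idx, assess_idx)) == 0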
Selecting the champion model can be as simple as selecting the model with
the best performance; alternatively, you might select the champion based on a
combination of model performance and interpretability. Some models, such as
neural networks, might not be selected (because they are difficult to interpret)
but might be used as a benchmark against which other models are compared.
Not all models are supervised or have data labeled with predefined classes. Some
are unsupervised with unknown class labels for the data. Examples of unsupervised
techniques commonly used in business analytics include: segmentation through
clustering, rule generation through market basket/association analysis, deriving
links among nodes through social network analysis, measurement of latent
variables through common factor analysis, etc. Unsupervised analyses should
also be empirically validated to ensure that your findings reflect more than the
idiosyncrasies of the sample. However, the techniques for validating unsupervised
analyses are not as straightforward and typically rely on the analyst’s best judgment.
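For example, here is a minimal sketch (using scikit-learn, an illustrative tool choice) of segmentation through clustering, with a crude bootstrap check on stability; the data are synthetic stand-ins for customer features:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))          # stand-in for customer features

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Refit on a bootstrap resample; if the cluster centers move far, the
    # segments may reflect sample idiosyncrasies rather than stable structure.
    boot = X[rng.integers(0, len(X), len(X))]
    km_boot = KMeans(n_clusters=3, n_init=10, random_state=1).fit(boot)

    print(np.sort(km.cluster_centers_, axis=0))       # crude comparison
    print(np.sort(km_boot.cluster_centers_, axis=0))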
Once your champion model is selected, it is time to improve both the model
and the data approach to refine it. This may be as simple as recognizing that
you really need to take time-series information into account, as well as household
transactions per year, and reformulating the data structure accordingly. Or you
may find that there is a subsegment of your population that your model doesn’t
measure well, so you need to create a subsidiary model for that subsegment.
A key concept here is managing the tension between “I need an answer” and “I
don’t fully trust the model yet.” Your business stakeholders chartered this project
because they need an answer. Every day that passes makes that need more acute.
At the same time, you as the analyst know the strengths and weaknesses of your
model in a way that your stakeholders may not appreciate. Negotiating a reasonable
level of confidence up front can help with this, along with communicating your
plan of how to get from where you are to where you need to be.
OBJECTIVE 4. INTEGRATE THE MODELS
Although this topic is beyond the scope of the CAP® exam, model integration is
needed when you bring a new model into an existing model environment. Often
your model will take outputs from other models, and its own output will feed their
inputs. Documenting your inputs and outputs in an API-like (application
programming interface) schema will help with that integration; a hypothetical
sketch follows.
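A minimal, hypothetical sketch of such an API-like interface document, here written as a Python dictionary (every model and field name is invented for illustration):

    MODEL_INTERFACE = {
        "model": "churn_score_v2",
        "inputs": {
            "customer_id":   {"type": "string",  "source": "crm_master"},
            "tenure_months": {"type": "integer", "units": "months"},
            "upstream_risk": {"type": "float",   "source": "credit_model_v1"},
        },
        "outputs": {
            # Downstream models consume this field; documenting its range
            # and type is what makes the integration predictable.
            "churn_probability": {"type": "float", "range": [0.0, 1.0]},
        },
    }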
SUMMARY
The model is the heart of the answer to the business analytics problem. Build it
carefully, test it thoroughly, calibrate it properly, but be willing to tear it down and
start over again to get a truthful answer to the business problem.
FURTHER READING
Berry MJA, Linoff GS (1999) Mastering Data Mining: The Art and Science of
Customer Relationship Management (Wiley, New York).
Few S (2012) Show Me the Numbers: Designing Tables and Graphs to Enlighten,
2nd ed. (Analytics Press, Burlingame, CA).
Hand DJ, Mannila H, Smyth P (2001) Principles of Data Mining (MIT Press,
Cambridge, MA).
Law AM, Kelton WD (2000) Simulation Modeling and Analysis, 3rd ed. (McGraw-
Hill, New York).
Ross SM (2010) Introductory Statistics, 3rd ed. (Academic Press, Burlington, MA).
Siegel E (2013) Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie,
or Die (Wiley, New York).
Tufte ER (2001) The Visual Display of Quantitative Information, 2nd ed. (Graphics
Press, Cheshire, CT).
CHAPTER 7
DOMAIN VI – SOLUTION DEPLOYMENT
An effective deployment requires careful planning so that all staff involved know
their roles in the deployment, and are trained so they appropriately use and
interpret the results.
The CRISP-DM standard is not the only tool for deployment. For example, Six
Sigma’s define, measure, analyze, improve, control (DMAIC) methodology includes
some of the same concepts, especially in the sections on proposed solution,
piloted solution, and sustained solution.
Learning Objectives
5. Support deployment
▪▪ Review Project – All projects should be reviewed for what went right or
wrong, and what should be improved in the future.
Quite simply, this comes down to making sure that your answer is still tied to the
original question. It is not uncommon for discrepancies to creep into the analysis
as the problem is framed and communicated. It is also not uncommon for the
business context to have changed since the project started, invalidating key
assumptions. While the above items must be taken into account, be wary of those
who will tell you that you have to change the answers in the model to fit the
existing biases of senior management or to “play politics.” For an organization to
accept the results of the process, those results must have integrity and be
acknowledged as having it, not just be the news that senior management wants
to hear.
That said, not all of your stakeholders will need to understand the ins and outs
of the model. A peer review of the model for technical correctness is strongly
recommended, but beyond that you need to focus on answering the actual
questions that stakeholders have, not just telling them all about how your model is
the best thing since sliced bread. It is also important to communicate the sensitivity
of the model to key assumptions and conditions.
Your report format will vary with the organization and how it will use your report. The
main thing is that the report needs to have a clear message. Either recommend a
course of action or no action and state your reasons. Basic report guidelines apply
to this as with any other report. An executive summary and recommendations for
further action should be at the front, with the supporting details, methodology, and
references for further knowledge in the main body. Be clear about the assumptions
and limitations of your model. Your audience should have enough information to
judge whether the model you have developed meets the needs of the project, or
where future resources should be directed. Use graphical aids to communicate
findings whenever possible. Well-constructed graphics can simplify results and
uncover patterns that are easily missed in tables. Always consider good graphical
practice. A poorly constructed graphic can be misleading, as outlined very well in
books by Tufte (2001), Few (2012), among others.
Two key items to consider as a model becomes the basis for an organization taking
action are stakeholder adoption and the continuing validity of key assumptions.
Periodically survey and interview key stakeholders to see how their day-to-day
interaction with the model is going and how their results have changed since they
started using it. Pay particular attention to functional areas where the model is
being ignored as irrelevant: they will tell you where key assumptions either have
already been invalidated or soon will be, and you can use that feedback to
strengthen and update the model.
SUMMARY
FURTHER READING
Chapman P, et al., CRISP-DM 1.0 Step by Step data mining guide, http://lyle.
smu.edu/~mhd/8331f03/crisp.pdf and http://www.the-modeling-agency.com/
crisp-dm.pdf.
Few S (2012) Show Me the Numbers: Designing Tables and Graphs to Enlighten,
2nd ed. (Analytics Press, Burlingame, CA).
Laursen GHN, Thorlund J (2010) Business Analytics for Managers: Taking Business
Intelligence Beyond Reporting (John Wiley & Sons, Hoboken, NJ).
Tufte ER (2001) The Visual Display of Quantitative Information, 2nd ed. (Graphics
Press, Cheshire, CT).
CHAPTER 8
DOMAIN VII – MODEL LIFECYCLE
Now is the time to think through the process you want to define for building and
deploying analytics, before the deadlines of the business require this to be done
in an ad hoc manner. An effective process requires defining the roles of the various
departments involved and the governance process that will be used to iron out
differences and make decisions.
Learning Objectives
It can be tempting during the rush of data collection and model building to skimp
on documentation, figuring that there will be time to write it down later once
things settle down. Do not fall for this. People will inevitably leave the project
before completing their documentation if you do. For the model to be trusted it
has to be repeatable, and that means writing down what you and your team did
and how you did it.
• Key assumptions made about the business context and analytics problem
Essentially you are leaving behind enough of a record for someone else to come in
and recreate the model and get the same results. This documentation should be
kept in a known place, ideally backed up in a few different places.
Evaluation criteria should be created up front both in terms of the business results
expected and the accuracy and confidence expected from the model. Some of the
criteria that might be used include:
• Can a “lift” or “gain” graph be constructed to show how well the model
is predicting? (A minimal sketch of a lift calculation appears below.)
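A minimal numpy sketch of a top-decile lift calculation, with synthetic scores and outcomes standing in for real model output:

    import numpy as np

    def decile_lift(scores, outcomes):
        order = np.argsort(scores)[::-1]            # best scores first
        top = outcomes[order][: len(order) // 10]   # top-scored decile
        return top.mean() / outcomes.mean()         # lift vs. baseline rate

    rng = np.random.default_rng(0)
    y = rng.random(10_000) < 0.1                    # 10% base response rate
    s = y * 0.5 + rng.random(10_000)                # scores loosely tied to y
    print(decile_lift(s, y))                        # > 1 means a useful model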
The model should be routinely checked over time and quality parameters recorded.
When the model quality starts to decay, it is time for the next step of recalibrating
the model and rechecking its assumptions.
The results from the model should be tracked over the long term because even
a model that performs well initially may degrade as input data change or user
requirements change. Additionally, the model results may also help in areas
beyond those expected, such as identifying data quality problems or new areas
for modeling. In the case of data quality problems or minor changes in the
business environment, a simple recalibration of the model (partitioning the data
into training and testing sets and re-estimating the parameters, much as was done
in the model building phase) will be sufficient to get the model working again.
If there has been a fundamental change in a key assumption or two, however,
then the project needs to be revalidated against the business problem to see if
the overall approach is even still valid. No model lasts forever. At some point the
resulting model will need to be improved, replaced, or sunset.
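As a hedged illustration, routine quality tracking can be as simple as the sketch below; the metric values, periods, and threshold are invented for illustration:

    # Quality metric (e.g., holdout accuracy) recorded each review period.
    history = {"2023-Q1": 0.81, "2023-Q2": 0.80,
               "2023-Q3": 0.74, "2023-Q4": 0.69}
    THRESHOLD = 0.75   # agreed with stakeholders at deployment time

    for period, quality in history.items():
        if quality < THRESHOLD:
            # Decay detected: time to recalibrate and recheck assumptions.
            print(f"{period}: quality {quality} below {THRESHOLD}; recalibrate")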
OBJECTIVE 4. SUPPORT TRAINING ACTIVITIES
As your analytics effort takes shape and grows within your organization, you will
be fighting for resources to do more and better projects. A key weapon in that
fight is being able to point to the benefits that your previous models have brought
to the organization. How much money has the organization made because your
models pointed the way? How much money has the organization saved because
your models pointed out wasted effort? To answer these questions in a defensible
manner, you have to be able to evaluate the business benefit of the model over
time. To do that, you need to be able to simulate what the organization would
have been doing without the changes wrought by the model.
One way to do that is by looking at how your organization is doing against industry
benchmarks during the time period in question. Have you grown from a second
quintile organization to a first quintile in a key area? Another way is to look at
how products that have been modeled have changed their financial returns to the
organization. Has net profit grown since the model was introduced? How about
return on net assets?
Whatever way you approach it, evaluating the business benefit allows you to “keep
score” and market your capabilities to the organization at large, helping it grow
and develop by solving business problems that are otherwise insoluble.
SUMMARY
FURTHER READING
Chapman P, et al., CRISP-DM 1.0 Step by Step data mining guide, http://lyle.
smu.edu/~mhd/8331f03/crisp.pdf and http://www.the-modeling-agency.com/
crisp-dm.pdf.
Wirth R (2000) CRISP-DM: Towards a standard process model for data mining.
Proc. Fourth Internat. Conf. Practical Appl. Knowledge Discovery Data Mining,
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.198.5133.
APPENDIX A
SOFT SKILLS FOR THE ANALYTICS PROFESSIONAL
Introduction
Learning Objectives
For consistency of organization, we are listing soft skills as tasks. However, these
are not tasks to be completed on a deadline; the need for these skills is consistent
throughout the entire project. Domain I of the CAP® program focuses on framing
the business problem, and Domain II on determining whether the business
problem has an analytics solution. Here we focus on communicating the content
of Domains I and II to external stakeholders, those who may be less versed in the
process.
For example, if you are approached by a client or employer who states that sales
of their industry-leading widget are falling and they want to know how to optimize
the pricing structure, your first response should not be “yes, of course” or “no, this
can’t be done.” Your first response is to engage the client in a dialog to discover
what they really want. Do they want to find out why sales are falling? Do they have
reason to believe changing the pricing is the best response? What data do they
have on past sales, customers, supply chain, and commodities pricing, all of which
could have an impact on sales figures? If price is the only thing they are concerned
with, it doesn’t matter what they’re selling or to whom. As Seth Godin (2013) wrote
in his blog post “Q&A: Purple Cows and Commodities” (http://sethgodin.typepad.
com/seths_blog/2013/08/qa-purple-cows-and-commodities.html), “If you tell me
that price is the only thing that matters to customers, I respond that nothing about
this product matters to them.”
The job of an analytics professional is to find the deep underlying motives of any
client engagement: he or she is (almost) an analyst in the mode of a psychiatric
worker. Question, question, question until it is clear what the problem is and how
a solution can be attempted.
For successful interactions in such cases, it may be helpful for the analytics
professional to unleash their inner four-year-old and keep asking “Why?” This
should be done with care; it does not take long for a client to grow impatient with
a long string of whys. Rather, the professional learns to reframe the answer to
each question in a way that continues to drill down toward the root of the problem.
Learning this skill can save time and money and improve the odds of project
success.
Another challenge often noted by researchers is the limitation of human factors,
otherwise known as the softer skills. Complexities arise because each party in the
communication exchange views the exchange through a different conceptual
framework; this difference in perspectives wedges a gap between what is
communicated by one party and what is heard by the other.
Note who attends the project management meetings; their presence may be an
indication of their status within the organization.
Determine who the project stakeholders are; it is likely more than one person.
It could be the C-level executives who want to optimize the bottom line; it might
be the IT people who want to optimize the use of their systems; it might be the
owner of the process; it might be the workers who will use the new process. In
short, it could be anyone from the highest to the lowest levels of the organization,
perhaps even including customers outside the organization, and often it is all of
the above.
Eliciting the needs of all stakeholders is essential. Once elicited, however, the
needs must be prioritized to shape the solution. If the solution requires the
purchase of new software or the collection of new data, if there are myriad ways
to present profit and loss along a sliding scale, or if the customer is not delighted
by all the background work, then communication is essential to explain why these
things happened, or will happen, in a way that excludes no one.
The analytics professional is the person at the heart of the analytics process.
That person has an understanding of the entire process from beginning to life-
cycle maintenance. He or she may be engaged to work with a client who is not
versed in the process but has heard that analytics is a wonderful tool to promote
one’s business. It is the job of the analytics professional to ensure that his or her
questions and comments are seen as necessary to the process, not as intrusive and
time wasting.
Many are familiar with ISO 9000 series accreditation. The process of obtaining
that accreditation is massive, intrusive, time consuming, and painstaking. However,
if all concerned are aware that the goal is to ensure that they contribute to this
demonstration of excellence, it becomes less of an intrusion and is seen more
as a contribution to an overall process. If this is not made clear, then individuals
may be less willing to spend time enumerating the processes they follow and will
slow the entire application process. This could in turn mean a less viable response
to requests for proposals, which could mean fewer dollars of income at the
corporate level and the loss of jobs. Some people may not see this dire progression
of events unless it is laid out clearly, which, in the case of the analytics process,
the professional must be sure to do.
Not only should the entire process be transparent to all involved, but there are
times when the analytics professional is called on to be a translator. He or she must
be able to move from very technical fields with associated jargon and acronyms to
a less technical field where there is little or no familiarity with the specific
terminology of the analytics process. Were the average client or stakeholder
familiar with the terminology and the steps of the process, the analytics
professional might find him- or herself to be an extraneous expense rather than
an added value after all.
SUMMARY
Not only does the analytics professional need to have knowledge of the analytics
process, he or she must also have sufficient command of the less science-based
skill set that enables easy communication and coordination with stakeholders,
clients, and
users. The analytics professional should be agile and able to move easily from the
technical to the nontechnical areas of an analytics project, and relay to each sector
the perspective of the other.
FURTHER READING
Pink D (2013) To Sell is Human: The Surprising Truth about Moving Others
(Riverhead Books, New York).
Timmer J (2013) Applying science to communicate science: Right now, it’s hard to
find relevant information on how to do it well, August 1, http://arstechnica.com/
staff/2013/08/applying-science-to-communicate-science/.
Weinschenk SM (2013) How to Get People to Do Stuff: Master the Art and
Science of Persuasion and Motivation (Peachpit, San Francisco).
APPENDIX B
USING THE STUDY GUIDE TO HELP PREPARE FOR THE CAP® EXAM
Every individual prepares for an exam in their own way. That said, and recognizing
that education and experience differ for everyone, we suggest that the following
may be of help.
First, cultivate good study habits. Most of us have done this over our academic
careers, but sometimes we forget that studying is not quite the same as reading,
and we tend to sit in our easy chairs, half-heartedly listening to TV while we study.
That might work for some but is generally not the best approach. Study in a quiet
place with minimal distraction; take notes, whether penciled in the margin or
highlighted on a computer screen. Ask yourself questions: on a scale of (1) ‘Never
heard of this before’ to (10) ‘I could teach this in my sleep’, where does your
knowledge lie? If it is closer to 1, you may need to find out more; if it is closer to
10, you may not need more study and can help others.
TOPIC AREAS
• What’s the business problem: too little revenue, too slow a process, too
many returns, too few customers? Are these problems that have an
analytics solution, or is a different solution needed?
• What data are available? What data are needed? Are the data usable in
their present format? What needs to change in the data collection or data
format to implement an analytics solution?
• What do you need to deploy the model? Who will be using the model?
Who will need to get the results?
• How will you know that the model is still providing the same solution to
the original problem? How can you tell when or if the data are no longer
providing the agreed upon results? What do you do if the model provides
skewed data?
GLOSSARY
TERM DEFINITION
80/20 Rule AKA the Pareto principle: roughly 80% of results come
from 20% of effort
Amortization allocation of cost of an item or items over a time period
such that the actual cost is recovered; often used to
account for capital expenditures
Automation use of mechanical means to perform work previously
done by human effort
Branch-and-Bound a general algorithm for finding optimal solutions of
various optimization problems; consists of a systematic
enumeration of all candidate solutions in which large
subsets of fruitless candidates are discarded en masse
using upper and lower estimated bounds of the quantity
being optimized (http://en.wikipedia.org/wiki/Branch_
and_bound)
Chief Analytics Officer (CAO) possible title of one overseeing analytics for a
company; may include mobilizing data, people, and systems for successful
deployment, working with others to inject analytics into company strategy and
decisions, supervising activities of analytical people, consulting with internal
business functions and units so they may take advantage of analytics, and
contracting with external providers of analytics (Davenport, Enterprise Analytics,
p. 173)
Confidence interval a type of interval estimate of a population parameter
used to indicate the reliability of an estimate. It is
an observed interval (i.e., it is calculated from the
observations), in principle different from sample
to sample, that frequently includes the parameter
of interest if the experiment is repeated (http://
en.wikipedia.org/wiki/Confidence_interval)
Cost of capital the cost of funds used for financing a business. Cost
of capital depends on the mode of financing used—it
refers to the cost of equity if the business is financed
solely through equity, or to the cost of debt if it is
financed solely through debt (www.investopedia.com)
Data warehouse a central repository of data that is created by integrating
data from one or more disparate sources; used for
reporting and data analysis (http://en.wikipedia.org/wiki/
Data_warehouse)
Discrete event simulation models the operation of a system as a discrete
sequence of events in time; between events, no change in the system is assumed,
thus a simulation can move in time from one event to the next (http://
en.wikipedia.org/wiki/Discrete_event_simulation)
Effective domain the domain of a function for which its value is finite (A.
Holder, editor. Mathematical Programming Glossary.
INFORMS Computing Society, http://glossary.
computing.society.informs.org/, 2006-08. Originally
authored by Harvey J. Greenberg, 1999-2006.)
Enterprise resource planning (ERP) a cross-functional enterprise system driven
by an integrated suite of software modules that supports the basic internal
business processes of a company (http://en.wikipedia.org/wiki/Enterprise_
resource_planning)
Failure Mode and Effects Analysis (FMEA) a systematic, proactive method for
evaluating a process to identify where and how it might fail, and to assess the
relative impact of different failures to identify the parts of the process that are
most in need of change (http://intranet.uchicago.edu/quality/
FailureModesandEffectsAnalysis_FMEA_1.pdf)
Fixed cost a cost that is some value, say C, regardless of the level as
long as the level is positive; otherwise the fixed charge
is zero. This is represented by Cv, where v is a binary
variable. When v = 0, the fixed charge is 0; when v = 1,
the fixed charge is C. An example is whether to open
a plant (v = 1) or not (v = 0). To apply this fixed charge
to the non-negative variable x, the constraint x <= Mv
is added to the mathematical program, where M is a
very large value, known to exceed any feasible value
of x. Then, if v = 0 (e.g., not opening the plant that is
needed for x > 0), x = 0 is forced by the upper bound
constraint. If v = 1 (e.g., plant is open), x <= Mv is a
redundant upper bound. Fixed charge problems are
mathematical programs with fixed charges (A. Holder,
editor. Mathematical Programming Glossary. INFORMS
Computing Society, http://glossary.computing.society.
informs.org/, 2006-08. Originally authored by Harvey J.
Greenberg, 1999-2006.)
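A minimal sketch of this fixed-charge construct, here expressed with the open-source PuLP library (an illustrative tool choice; the costs, demand, and big-M value are invented):

    from pulp import LpProblem, LpVariable, LpMinimize

    C, M = 1000, 10_000                  # fixed charge; big-M bound on x
    prob = LpProblem("fixed_charge", LpMinimize)
    x = LpVariable("x", lowBound=0)      # production level
    v = LpVariable("v", cat="Binary")    # open the plant (1) or not (0)?
    prob += 2 * x + C * v                # variable cost plus fixed charge Cv
    prob += x <= M * v                   # x can be positive only if v = 1
    prob += x >= 300                     # demand to satisfy
    prob.solve()
    print(x.value(), v.value())          # meeting demand forces v = 1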
Game Theory in general, a (mathematical) game can be played by
one player, such as a puzzle, but its main connection
with mathematical programming is when there are at
least two players, and they are in conflict. Each player
chooses a strategy that maximizes his payoff. When
there are exactly two players and one player’s loss is the
other’s gain, the game is called zero sum. In this case, a
payoff matrix A is given where Aij is the payoff to player
1, and the loss to player 2, when player 1 uses strategy
i and player 2 uses strategy j. In this representation
each row of A corresponds to a strategy of player 1,
and each column corresponds to a strategy of player 2.
If A is m × n, this means player 1 has m strategies, and
player 2 has n strategies (A. Holder, editor. Mathematical
Programming Glossary. INFORMS Computing Society,
http://glossary.computing.society.informs.org/, 2006-08.
Originally authored by Harvey J. Greenberg, 1999-2006.)
Global optimal refers to mathematical programming without convexity
assumptions, which are NP-hard. In general, there could
be a local optimum that is not a global optimum. Some
authors use this term to imply the stronger condition
there are multiple local optima. Some solution strategies
are given as heuristic search methods (including those
that guarantee global convergence, such as branch
and bound). As a process associated with algorithm
design, some regard this simply as attempts to assure
convergence to a global optimum (unlike a purely
local optimization procedure, like steepest ascent). (A.
Holder, editor. Mathematical Programming Glossary.
INFORMS Computing Society, http://glossary.
computing.society.informs.org/, 2006-08. Originally
authored by Harvey J. Greenberg, 1999-2006. See the
supplement by J.D. Pintér.)
Heuristic in mathematical programming, this usually means a
procedure that seeks an optimal solution but does not
guarantee it will find one, even if one exists. It is often
used in contrast to an algorithm, so branch and bound
would not be considered a heuristic in this sense. In
AI, however, a heuristic is an algorithm (with some
guarantees) that uses a heuristic function to estimate
the “cost” of branching from a given node to a leaf
of the search tree (Also, in AI, the usual rules of node
selection in branch and bound can be determined by
the choice of heuristic function: best-first, breadth-first,
or depth-first search) (A. Holder, editor. Mathematical
Programming Glossary. INFORMS Computing Society,
http://glossary.computing.society.informs.org/, 2006-08.
Originally authored by Harvey J. Greenberg, 1999-2006.)
Influence diagram depicts structure of decision process and notes the data
needed to make the decision
Innovative Applications in Analytics Award award administered by the Analytics
Section of INFORMS to recognize creative and unique developments,
applications, or combinations of analytical techniques. The prize promotes
awareness of the value of analytics techniques in unusual applications, or in
creative combination, to provide unique insights and/or business value (http://
www.informs.org/Community/Analytics/News-Events2/
Innovative-Applications-in-Analytics-Award)
Knapsack problem an integer program of the form max{cx: x in Z^n+ and ax
<= b}, where a > 0. The original problem models the maximum value of a
knapsack that is limited by volume or weight (b), where x_j = number of items of
type j put into the knapsack at unit return c_j, that uses a_j units per item (A.
Holder, editor. Mathematical Programming Glossary. INFORMS Computing
Society, http://glossary.computing.society.informs.org/, 2006-08. Originally
authored by Harvey J. Greenberg, 1999-2006.)
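Since x_j counts how many items of each type go in, this is the unbounded variant; a minimal Python sketch by dynamic programming for small integer capacity (the values, weights, and capacity are invented):

    def knapsack(values, weights, capacity):
        best = [0] * (capacity + 1)      # best[w] = max value with room w
        for w in range(1, capacity + 1):
            for c, a in zip(values, weights):
                if a <= w:               # unbounded: each item type reusable
                    best[w] = max(best[w], best[w - a] + c)
        return best[capacity]

    print(knapsack(values=[60, 100, 120], weights=[1, 2, 3], capacity=5))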
Lead time time between the initial phase of a process and the
emergence of results, as between the planning and
completed manufacture of a product (http://www.
thefreedictionary.com/lead+time)
Linear program opt{cx: Ax = b, x >= 0}. (Other forms of the constraints
are possible, such as Ax <= b.) The standard form
assumes A has full row rank. Computer systems ensure
this by having a logical variable (y) augmented, so the
form appears as Opt{cx: Ax + y = b, L <= (x, y) <=
U} (also allowing general bounds on the variables).
The original variables (x) are called structural. Note
that each logical variable can be a slack, surplus, or
artificial variable, depending on the form of the original
constraint. This computer form also represents a range
constraint with simple bounds on the logical variable.
Some bounds can be infinite (i.e., absent), and a
free variable (logical or structural) is when both of its
bounds are infinite (A. Holder, editor. Mathematical
Programming Glossary. INFORMS Computing Society,
http://glossary.computing.society.informs.org/, 2006-08.
Originally authored by Harvey J. Greenberg, 1999-2006.)
Logistic regression a type of probabilistic classification model used for
predicting the outcome of a categorical dependent
variable (i.e., a class label) based on one or more
predictor variables (features). Logistic regression
can be binomial or multinomial. Binomial or binary
logistic regression deals with situations in which the
observed outcome for a dependent variable can have
only two possible types (for example, “dead” versus
“alive”). Multinomial logistic regression deals with
situations where the outcome can have three or more
possible types (e.g., “better” versus “no change”
versus “worse”) (http://en.wikipedia.org/wiki/Logistic_
regression)
Mean time between failures (MTBF) a measure of how reliable a hardware
product or component is. For most components, the measure is typically in
thousands or even tens of thousands of hours between failures (http://whatis.
techtarget.com/definition/MTBF-mean-time-between-failures)
Median the value such that the number of terms having values
greater than or equal to it is the same as the number
of terms having values less than or equal to it (http://
searchdatacenter.techtarget.com/definition/statistical-
mean-median-mode-and-range)
Mode value of the term that occurs the most often (http://
searchdatacenter.techtarget.com/definition/statistical-
mean-median-mode-and-range)
Net present value value in today’s currency of an item or service
(Davenport, Enterprise Analytics, p. 22)
Next best offer (NBO) a targeted offer or proposed action for customers based
on analyses of past history and behavior, other customer preferences, purchasing
context, and attributes of the products or services from which they can choose
(Davenport, Enterprise Analytics, p. 83)
OLAP an abbreviation for “Online Analytical Processing”;
a type of database technology that has long been used
by the business community to analyze and interactively
explore large financial data sets. The basic idea is that
data sets are viewed as cubes with hierarchies along
each axis (http://biolap.sourceforge.net/whitepaper.pdf)
Pattern recognition in machine learning, pattern recognition is the
assignment of a label to a given input value (http://
en.wikipedia.org/wiki/Pattern_recognition)
Pricing a tactic in the simplex method, by which each variable
is evaluated for its potential to improve the value of
the objective function. Let p = c_B[B^-1], where B is a
basis, and c_B is a vector of costs associated with the
basic variables. The vector p is sometimes called a dual
solution, though it is not feasible in the dual before
termination; p is also called a simplex multiplier or
pricing vector. The price of the jth variable is c_j - pA_j.
The first term is its direct cost (c_j) and the second term
is an indirect cost, using the pricing vector to determine
the cost of inputs and outputs in the activity’s column
(A_j). The net result is called the reduced cost, and its
value determines whether this activity could improve
the objective value (A. Holder, editor. Mathematical
Programming Glossary. INFORMS Computing Society,
http://glossary.computing.society.informs.org/, 2006-08.
Originally authored by Harvey J. Greenberg, 1999-2006.)
Problem assessment/framing initial step in the analytics process; involves buy-in
from all parties involved on what the problem is before a solution can be found
Proprietary data data that no other organization possesses; produced
by a company to enhance its competitive posture
(Davenport, Enterprise Analytics, p. 37)
Response surface methodology (RSM) a surface in (n+1) dimensions that
represents the variations in the expected value of a response variable (see
regression) as the values of n explanatory variables are varied. Usually the interest
is in finding the combination that gives a global maximum (or minimum)
(http://www.answers.com/topic/response-surface)
Robust optimization a term given to an approach to deal with uncertainty,
similar to the recourse model of stochastic
programming, except that feasibility for all possible
realizations (called scenarios) is replaced by a penalty
function in the objective. As such, the approach
integrates goal programming with a scenario-based
description of problem data (A. Holder, editor.
Mathematical Programming Glossary. INFORMS
Computing Society, http://glossary.computing.society.
informs.org/, 2006-08. Originally authored by Harvey J.
Greenberg, 1999-2006.)
Sensitivity analysis the concern with how the solution changes if some
changes are made in either the data or in some of the
solution values (by fixing their value). Marginal analysis
is concerned with the effects of small perturbations,
maybe measurable by derivatives. Parametric analysis
is concerned with larger changes in parameter values
that affect the data in the mathematical program,
such as a cost coefficient or resource limit (A. Holder,
editor. Mathematical Programming Glossary. INFORMS
Computing Society, http://glossary.computing.society.
informs.org/, 2006-08. Originally authored by Harvey J.
Greenberg, 1999-2006.)
Six Sigma a set of strategies, techniques, and tools for process
improvement. It seeks to improve the quality of
process outputs by identifying and removing the
causes of defects (errors) and minimizing variability
in manufacturing and business processes (http://
en.wikipedia.org/wiki/Six_Sigma)
Stepwise regression a semi-automated process of building a model by
successively adding or removing variables based solely
on the t-statistics of their estimated coefficients (http://
people.duke.edu/~rnau/regstep.htm)
Validation (of a model) determining how well the model depicts the real-world
situation it is describing (http://www.easterbrook.ca/
steve/2010/11/the-difference-between-verification-and-
validation/)
Variable cost a periodic cost that varies in step with the output or the
sales revenue of a company. Variable costs include raw
material, energy usage, labor, distribution costs, etc.
(http://www.businessdictionary.com/definition/variable-
cost.html)
Verification (of a model) includes all the activities associated with producing
high-quality software: testing, inspection, design analysis, specification analysis
(http://www.easterbrook.ca/steve/2010/11/the-difference-between-verification-
and-validation/)
REVIEW QUESTIONS
These questions will never appear on the CAP® certification exam; they are
provided solely as study aids. All questions on the certification exam are multiple
choice with four possible answers, of which only one is correct.
2. What is a stakeholder?
4. Suppose that the business problem is that the organization wants to increase
sales by increasing cross-selling to existing customers. Your project sponsor
looks to you to tell her how the organization can get there based on the data
at hand. What’s your first move?
c. Talk with marketing to see what they have planned for the next sales
campaign
d. Ask your sponsor what the actual numeric target of increased sales is
overall
5. Your sponsor has come back with a numeric goal of increasing sales from
an average of $10,000 per customer to $11,000 per customer in the next 12
months. What’s your next move?
6. You now have a little more information from the project sponsor, along with
several rumors from other sources. You know that you should base the cost
of increased sales over current levels at the marginal cost, rather than the
fully allocated cost; that the company has to maintain at least the same return
on sales as it currently has as the sales increase from 10,000 per customer to
11,000 per customer; and that top-line revenue must also increase by 10% (i.e.,
you can’t get there by dropping your lowest performing customers). Once
you’ve listed these assumptions or rules in your project charter, what’s next?
b. Talk with your marketing and data groups to see what data exist
c. Figure out how the increased sales goal should be broken down into
metrics
a. Data group
c. Manufacturing
d. Contracts
9. A post office area manager received many complaints that the only branch she
has in the north side of the town has a very long waiting time. She hired you
as a consultant to recommend justifying opening new positions in her branch.
What would be a relevant methodology to use?
b. Queuing theory
c. Data mining
d. Linear programming
10. A major aircraft manufacturing company intends to determine the main
causes of fatal failures in its battery system. The best methodology to use
to pinpoint the root causes is:
d. Choice B or C
12. You are given three months to solve an analytics problem and the needed data
will require two months to collect. What would be the strategy with the best
outcome?
a. Wait until the data are available to choose the best methodology
c. Ignore the data and design a tool that fits all possible scenarios
13. One good methodology to reduce the dimensionality of a set of data is to use:
b. linear programming.
c. discrete-event simulation.
d. artificial intelligence.
14. You are given a set of data to be utilized for a model. Their level of accuracy
is within +/- 20%. What approach and/or software would you use for the
problem?
a. Approach and/or software that deals with data at +/- 1% accuracy level
b. Approach and/or software that deals with data at +/- 0.01% accuracy
level
c. Approach and/or software that deals with data at +/- 10% accuracy
level
d. Approach and/or software that deals with data at +/- 30% accuracy
level
15. You are asked to establish a model to map many independent variables (X’s) to
one dependent variable (Y). The model should explain the level of significance
of the X’s to Y and their level of correlation. What is the first methodology to
come to mind in this situation?
a. Stepwise regression
b. Fuzzy logic
17. A factory has skilled workers who operate complicated equipment, and there
is a need to transfer their knowledge to new hires. The procedure cannot be
explained in a crisp manner with exact numbers. For example, the operator
cannot explain what the right temperature and pressure are to maximize
the strength of the material under a certain condition; they simply know by
experience. One good candidate approach to model those variables and rules
is:
a. fuzzy logic.
b. neural network.
c. linear regression.
d. logistic regression.
a. Prescriptive
b. Descriptive
c. Soft skills
d. Predictive
b. stepwise regression.
c. decision tree.
d. Markov chain.
20. A chemical plant is under study to identify the bottleneck in its operation to
facilitate scheduling. One proper methodology to model the plant is:
a. system dynamics.
b. discrete-event simulation.
c. Markov chain.
d. fuzzy logic.
21. You are given a problem by a client in which you need to determine the right
amount to be purchased from which location so that the total cost of
manufacturing, transportation, and duties is minimized. The first methodology
to come to mind to model this problem is:
a. step-wise regression.
b. mixed-integer programming.
c. linear programming.
d. logistic regression.
22. Genetic algorithms, Tabu search, and ant colony optimization are
optimization algorithms inspired by natural phenomena and are examples
of the following type of analytics methodology:
a. Metaheuristics
b. Simulation
c. Pattern recognition
d. Visualization
23. Once you’ve built your model how do you know that the model will still answer
your business problem?
c. Both a and b
d. Neither a nor b
26. How often should model maintenance be done?
27. What will happen if you don’t ever bother to evaluate model performance and
returns over time?
28. Which of the following BEST describes the data and information flow within
an organization?
a. Information assurance
b. Information strategy
c. Information mapping
d. Information architecture
29. A multiple linear regression was built to try to predict customer expenditures
based on 200 independent variables (behavioral and demographic). 10,000
randomly selected rows of data were fed into a stepwise regression, each row
representing one customer. 1,000 customers were male, and 9,000 customers
were female. The final model had an adjusted R-squared of 0.27 and seven
independent variables. Increasing the number of randomly selected rows of
data to 100,000 and rerunning the stepwise regression will MOST likely:
30. A clothing company wants to use analytics to decide which customers to
send a promotional catalogue in order to attain a targeted response rate.
Which of the following techniques would be the MOST appropriate to use for
making this decision?
a. Integer programming
b. Logistic regression
c. Analysis of variance
d. Linear regression
32. A box and whisker plot for a dataset will MOST clearly show:
33. In the initial project meeting with a client for a new project, which of the
following is the MOST important information to obtain?
d. Available budget
34. Which of the following statements is true of modeling a multi-server checkout
line?
c. Variability in arrival and service times will tend to play a critical role in
congestion.
                             GASOLINE          “GREEN”
                             TECHNOLOGY        TECHNOLOGY
                             ($ thousands)     ($ thousands)
    WHOLESALE PRICE/VEHICLE       25                40
    VARIABLE COST/VEHICLE         15                35
How large a subsidy per vehicle sold will be required, assuming there will be
enough demand to motivate the switch?
c. Cannot be determined
d. Equal to $5000
36. A furniture maker would like to determine the most profitable mix of items to
produce. There are well-known budgetary constraints. Each piece of furniture
is made of a predetermined amount of material with known costs, and
demand is known. Which of the following analytical techniques is the MOST
appropriate one to solve this problem?
a. Optimization
b. Multiple regression
c. Data mining
d. Forecasting
37. You have simulated the Net Present Value (NPV) of a decision. It ranges
between –$10 million and +$10 million. To best present the likelihood of
possible outcomes, you should:
38. A company ships products from a single dock at their warehouse. The time
to load shipments depends on the experience of the crew, products being
shipped and weather. The company thinks there is significant unmet demand
for their products and would like to build another dock in order to meet this
demand. They ask you to build a model and determine if the revenue from
the additional products sold will cover the cost of the second dock within
two years of it becoming operational. Which of the following is the MOST
appropriate modeling approach and justification?
39. Two investors who have the same information about the stock market buy an
equal number of shares of a stock. Which of the following statements must be
true?
d. If the investors are optimistic, they should have borrowed rather than
bought the shares.
a. Use 70,000 randomly selected data points when building the model,
and hold the remaining 30,000 out as a test dataset.
c. Randomly partition the data into 4 datasets of equal size, build four
models and take their average.
d. Use 1,000 randomly selected data points when building the model.
42. One of the main advantages of tree-based models and neural networks is that
they:
c. P is equal to $3,000,000.
44. After building a predictive model and testing it on new data, an
underprediction by a forecasting system can be detected by its:
a. negative-squared.
b. bias.
45. All times in the decision tree below are given in hours. What is the expected
travel time (in hours) of the optimal (minimum travel time) decision?
[Decision tree, reconstructed in text: The decision node is Drive or Fly.
Drive leads to a chance node Rain? with P(rain) = 0.5 and P(dry) = 0.5. Under
rain, Traffic Jam? yields 9 hours with probability 0.6 and 6 hours with probability
0.4. Under dry, Traffic Jam? yields 9 hours with probability 0.3 and 6 hours with
probability 0.7.
Fly leads to a chance node Rain? with P(rain) = 0.5 and P(dry) = 0.5. Under rain,
Flight Delay? yields 10 hours with probability 0.8 and 5 hours with probability
0.2. Under dry, the travel time is 5 hours.]
a. 7.8
b. 6.9
c. 7.4
d. 7.0
a. Ensure that all the model input data items are available when needed.
c. Ensure that all users are reviewing the model results in a timely fashion.
47. A segmentation of customers who shop at a retail store may be performed
using which of the following methods?
[Figure, described: cumulative probability curves for Strategy A and Strategy B;
the vertical axis shows probability from 0.1 to 0.9, and the horizontal axis shows
NPV in millions of US dollars, running from about (400) to 1,600.]
b. Strategy B has the same downside risk as Strategy A since the curves
have the same shape.
49. Each month you generate a list of marketing leads for direct mail campaigns.
Which of the following should you do before the list is used?
c. Remove opt-outs.
50. When analyzing responses of a survey of why people like a certain restaurant,
factor analysis could reduce the dimension in which of the following ways?
51. A preferred method or best practice for organizing data in a data warehouse
for reporting and analysis is:
a. transactional-based modeling.
b. multidimensional modeling.
c. relation-based modeling.
d. tuple-based modeling.
ANSWERS TO REVIEW QUESTIONS:
2. Stakeholders are all who are affected by the problem and its solution. Note
that this may include more than those in the initial meetings and those in
charge of the problem solution.
4. Note that your sponsor didn’t give you much information to go on, and you
don’t know what your goal really is, except that you know you’re looking to
get more sales per customer. There’s not enough to go on here to start to
formulate the problem. Choice D would be the best response to start to get
some numbers to go with the business’ goal.
5. Even given the statement above, you don’t yet have a complete view of the
business problem. You don’t know why the organization has chosen to focus its
attention on increasing sales per customer. Without that, you don’t know what
margins are acceptable on those sales. You may assume that general business
rules apply and that you should assume that any sales under a 20% margin
are inherently unprofitable and should be rejected. But without surfacing and
clarifying that assumption and many others, you don’t know if it is valid or not.
You have to ask and keep asking until you know what assumptions are valid.
Again, choice D is the most appropriate answer.
6. Here the most appropriate answer is choice A. This is important because if you
go straight to looking at data, your hypotheses about what’s important will be
inherently biased by the existing data and explanations. If the answer were
in your existing explanations, you probably wouldn’t have the problem in the
first place. But now that you have the initial set of drivers, you can start talking
with your data group and decomposing your metrics to allocate the increased
performance to performing groups. Any group with changing goals needs to
be on your stakeholder list and part of the reviews.
7. Any group with changing requirements needs to be invited. If you plan on
selling more items, then the manufacturing group needs to be part of the
discussion so they can advise on how much they can actually produce before
requiring more investment for another line, more employees, etc.
9. b. Queuing theory
10. d. Choice B or C
14. c. Approach and/or software that deals with data at +/- 10% accuracy level
18. b. Descriptive
22. a. Metaheuristics
23. The answer is to go back to the original question or problem and see whether it has been answered. At times the original question may have become only part of the solution, but it still needs to have been answered.
24. Among other things, stakeholders may be concerned with the implications of the solution, the future impact on their business, whether the new solution will lead to more on-time performance in the long run, the ease of implementation, the impact on personnel of changes in processes, and other concerns related to their way of doing business.
25. c. Both a and b. If a change in business conditions has occurred that invalidates the assumptions of the original model, a new or revised model should be built, tested, and validated before being deployed as a replacement.
27. If the model's performance is not evaluated, its outputs may drift over time and no longer provide accurate answers to the original question.
28. d. Information architecture refers to the analysis and design of the data stored
by information systems, concentrating on entities, their attributes, and their
interrelationships. It refers to the modeling of data for an individual database
and to the corporate data models that an enterprise uses to coordinate the
definition of data in several (perhaps scores or hundreds) distinct databases.
29. a. have no impact upon the adjusted R-squared. The increase in size of the
data will not impact the adjusted R-squared calculation because both samples
are sufficiently large randomly selected subsets of data.
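For reference, with n observations and k predictors, the adjusted R-squared is 1 − (1 − R²)(n − 1)/(n − k − 1). For large n the correction factor (n − 1)/(n − k − 1) is essentially 1, which is why two sufficiently large random subsets of the same data yield essentially the same adjusted R-squared.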
30. b. Logistic regression. This type of classification model is often used to predict the outcome of a categorical dependent variable (response vs. no response) based on one or more predictor variables, so this is the most appropriate answer. The goal of the analytics in the stated problem is to determine who is most likely to respond, and this binary predicted outcome is handled directly by logistic regression.
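For study purposes, a minimal sketch of such a model in Python, assuming scikit-learn is available; the features and data below are illustrative, not from the question:

    # Minimal response/no-response model; features and data are illustrative.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))                         # e.g., recency, frequency, spend
    y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # 1 = responded, 0 = did not

    model = LogisticRegression().fit(X, y)
    p_respond = model.predict_proba(X)[:, 1]              # probability of response
    # Customers can be ranked by p_respond to target the most likely responders.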
32. d. if the data is skewed and, if so, in which direction. A box and whisker plot,
sometimes just called a “box plot,” was invented by John Tukey as a way to
graphically display the distribution of data. The ends of the box are at the
first and third quartiles, and there is a line somewhere in the box representing
the median value. The whiskers extend either to the minimum and maximum
values in the data set, or possibly less if they do not include points identified
as outliers.
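A box plot is easy to produce and inspect for skew; a minimal sketch, assuming matplotlib is available:

    # Draw a box-and-whisker plot of a (deliberately skewed) sample.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    data = rng.lognormal(mean=0.0, sigma=0.75, size=200)  # right-skewed data

    plt.boxplot(data)     # box at Q1/Q3, line at the median, whiskers beyond
    plt.ylabel("value")
    plt.show()
    # A median near the bottom of the box with a long upper whisker indicates
    # positive (right) skew; the mirror image indicates negative skew.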
33. c. Business issue and project goal. Understanding the business issue and
project goal provides a sound foundation on which to base the project.
34. c. Variability in arrival and service times will tend to play a critical role in
congestion. Arrival and service time distributions are inputs to a queuing
model that would be used to model a checkout line and directly influence
congestion.
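One way to see the role of variability is Kingman's approximation for the mean wait in a single-server queue, sketched below; this heuristic formula is supplementary, not part of the original question:

    # Kingman's G/G/1 approximation: mean wait in queue grows with utilization
    # and with the squared coefficients of variation of arrivals and service.
    def kingman_wq(rho, ca2, cs2, mean_service):
        """rho: utilization (< 1); ca2, cs2: squared coefficients of variation
        of interarrival and service times; mean_service: mean service time."""
        return (rho / (1.0 - rho)) * ((ca2 + cs2) / 2.0) * mean_service

    print(kingman_wq(0.9, 1.0, 1.0, 2.0))   # 18.0: exponential arrivals and service
    print(kingman_wq(0.9, 1.0, 4.0, 2.0))   # 45.0: same load, more variable service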
37. b. present a histogram to show likelihood of various NPVs. Net Present Value (NPV) takes as input a time series of cash flows (both incoming and outgoing) and a discount rate, and outputs a present value. By showing a histogram (a graphical representation of the distribution of data), it is possible to see how likely various NPVs between the given minimum and maximum are to occur. This would be useful information to have when considering a decision, especially since the range of outcomes includes $0, meaning the decision could result in a profit or a loss.
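Such a histogram is typically produced by Monte Carlo simulation; a minimal sketch, in which the discount rate and cash-flow distribution are hypothetical:

    # NPV = sum over t of CF_t / (1 + r)^t, minus the initial outlay.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    r, n_years, n_trials, outlay = 0.10, 5, 10_000, 1_000.0

    npvs = np.empty(n_trials)
    for i in range(n_trials):
        cash_flows = rng.normal(300.0, 150.0, size=n_years)   # uncertain flows
        years = np.arange(1, n_years + 1)
        npvs[i] = (cash_flows / (1.0 + r) ** years).sum() - outlay

    plt.hist(npvs, bins=50)
    plt.xlabel("NPV"); plt.ylabel("frequency")
    plt.show()   # shows how likely a loss (NPV < 0) is relative to a gain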
38. d. Discrete event simulation, because there is a sequence of random events through time. The time to load shipments depends on the experience of the crew, the products being shipped, and the weather. Given this sequence of random events through time, discrete event simulation is the most appropriate modeling approach.
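A minimal hand-rolled discrete event simulation of a single loading dock, to illustrate the idea; all distributions and parameters below are invented for the sketch:

    # Next-event simulation: shipments arrive at random and take a random time
    # to load at a single dock; we track how long each shipment waits.
    import heapq, random

    random.seed(3)
    t, dock_free_at, waits = 0.0, 0.0, []
    events = [(random.expovariate(1 / 30.0), "arrival")]   # first arrival (minutes)

    while events and t < 8 * 60:                 # one 8-hour shift
        t, kind = heapq.heappop(events)
        if kind == "arrival":
            start = max(t, dock_free_at)         # wait if the dock is busy
            waits.append(start - t)
            dock_free_at = start + random.triangular(15, 60, 25)   # loading time
            heapq.heappush(events, (t + random.expovariate(1 / 30.0), "arrival"))

    print(f"average wait: {sum(waits) / len(waits):.1f} min over {len(waits)} shipments")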
39. c. Both investors are subject to the same uncertainty regarding the stock
market.
40. a. Use 70,000 randomly selected data points when building the model, and hold the remaining 30,000 out as a test dataset. This split provides sufficient data to build the model and sufficient data to test it, and is the best allocation of the customer data points. (A common rule of thumb is to use about two-thirds of the data to build the model and one-third to test it.)
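The split itself is a line of index arithmetic; a sketch in Python, assuming the 100,000 records sit in an indexable array:

    # Randomly assign 70,000 records to training and hold 30,000 out for testing.
    import numpy as np

    rng = np.random.default_rng(4)
    idx = rng.permutation(100_000)
    train_idx, test_idx = idx[:70_000], idx[70_000:]
    # Fit the model on data[train_idx]; measure its accuracy on data[test_idx].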
42. c. reveal interactions without having to explicitly build them into the model.
Tree-based models and neural networks are employed to find patterns in
the data that were not previously identified (or input into the model building
process).
43. d. P is less than $3,000,000. When the demand is 1000 or greater, the profit
is $3,000,000. But when the demand is less than 1000, the profit is less than
$3,000,000. Given this and that the average demand is 1000 units, the expected
monthly profit must be less than $3,000,000.
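As a hypothetical numeric illustration (the exact figures of the original question are not reproduced here): suppose profit is $3,000 per unit and demand above 1,000 units cannot be served, so profit is $3,000 × min(demand, 1,000). If demand is 800 or 1,200 with equal probability, the average demand is 1,000, but the expected profit is 0.5 × $2,400,000 + 0.5 × $3,000,000 = $2,700,000, which is less than the $3,000,000 earned at a demand of exactly 1,000.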
44. b. bias. The bias measures the signed difference between the estimate and the right answer. Depending on whether it is positive or negative, it shows whether the system systematically over- or underpredicts.
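In symbols, using one common convention, bias = (1/n) × Σ(forecast − actual). For example, forecasts of 90, 95, and 100 against actuals of 100, 105, and 110 give a bias of (−10 − 10 − 10)/3 = −10: a systematic underprediction.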
45. d. 7.0. To answer this question, one needs to solve the decision tree using the "roll back" technique. On the drive branch, the expected time with rain is (0.6)(9) + (0.4)(6) = 7.8 hours and when dry it is (0.3)(9) + (0.7)(6) = 6.9 hours, so the expected time if you drive is (0.5)(7.8) + (0.5)(6.9) = 7.35 hours. On the fly branch, the rain sub-branch gives (0.8)(10) + (0.2)(5) = 9.0 hours, so the expected time if you fly is (0.5)(9.0) + (0.5)(5) = 7.0 hours. Now, when faced with the "drive or fly" decision, you should choose to fly (since 7.0 hours is less than 7.35 hours). Thus, answer d) 7.0 hours is the expected travel time of the optimal (minimum travel time) decision.
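The roll-back above can be reproduced in a couple of lines of Python, using the probabilities and times shown in the tree:

    # Expected travel times by rolling the tree back from its leaves.
    drive = 0.5 * (0.6 * 9 + 0.4 * 6) + 0.5 * (0.3 * 9 + 0.7 * 6)  # rain/dry, jam/no jam
    fly = 0.5 * (0.8 * 10 + 0.2 * 5) + 0.5 * 5                     # rain (delay?) / dry
    print(drive, fly)   # about 7.35 and 7.0 -> flying is the minimum-time decision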
46. b. Determine if there has been a change in model accuracy over time. The
most important maintenance activity for the analytics professional responsible
for maintaining the simulation model is to monitor the accuracy of the model
over time. If there has been a change in accuracy, the analytics professional
may need to revisit the assumptions of the model.
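A sketch of what such monitoring might look like in Python; the metric, threshold, and names are illustrative choices, not prescribed by the question:

    # Track a monthly error metric and flag drift in model accuracy over time.
    def monthly_mape(actuals, forecasts):
        """Mean absolute percentage error; assumes actuals are nonzero."""
        return sum(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)) / len(actuals)

    history = []          # one MAPE per month, reviewed for upward trends
    THRESHOLD = 0.15      # e.g., revisit the model if error exceeds 15%

    def review_month(actuals, forecasts):
        mape = monthly_mape(actuals, forecasts)
        history.append(mape)
        if mape > THRESHOLD:
            print(f"MAPE {mape:.1%}: revisit the model's assumptions")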
48. a. Strategy B exhibits stochastic (probabilistic) dominance over Strategy A. Because the cumulative probability curve for Strategy B lies below (or to the right of) the corresponding curve for Strategy A, Strategy B exhibits stochastic dominance (SD) over Strategy A. B stochastically dominates A when, for any good outcome x, B gives at least as high a probability of receiving at least x as does A, and for some x, B gives a strictly higher probability. Since the curves do not cross, B stochastically dominates A.
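Given simulated NPV samples for two strategies, first-order stochastic dominance can be checked by comparing empirical CDFs; a sketch:

    # B dominates A (for "more is better" outcomes) if B's empirical CDF lies
    # at or below A's everywhere, and strictly below somewhere.
    import numpy as np

    def stochastically_dominates(b, a):
        grid = np.union1d(a, b)   # evaluate both CDFs at every observed value
        def cdf(sample, x):
            return np.searchsorted(np.sort(sample), x, side="right") / len(sample)
        fa, fb = cdf(a, grid), cdf(b, grid)
        return bool(np.all(fb <= fa) and np.any(fb < fa))

    # Example: B is A shifted up by 100, so B dominates A.
    a = np.array([0.0, 100.0, 200.0])
    print(stochastically_dominates(a + 100, a))   # True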
49. c. Remove opt-outs. The list of marketing leads should not include people or
organizations that have opted out.
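Operationally this is a simple suppression-list filter; a sketch with hypothetical field names:

    # Drop any lead whose address appears on the opt-out (suppression) list.
    leads = [{"email": "a@example.com"}, {"email": "b@example.com"}]
    opt_outs = {"b@example.com"}

    mailable = [lead for lead in leads if lead["email"] not in opt_outs]
    print(len(mailable), "of", len(leads), "leads can be mailed")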
50. a. Collapse several survey questions regarding food taste, health value,
ingredients and consistency into one general unobserved “food quality”
variable. Factor analysis is a statistical method used to describe variability
among observed variables in terms of a potentially lower number of unobserved
variables called factors. The information gained about the interdependencies
between observed variables can be used later to reduce the set of variables
in a dataset.
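A minimal sketch in Python, assuming scikit-learn is available; the survey items and the choice of two factors are hypothetical:

    # Reduce six survey items to two unobserved factors.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(5)
    # rows = respondents; columns = taste, health value, ingredients,
    # consistency, price, service (ratings on a 1-5 scale)
    responses = rng.integers(1, 6, size=(200, 6)).astype(float)

    fa = FactorAnalysis(n_components=2)
    scores = fa.fit_transform(responses)   # each respondent's score on 2 factors
    print(fa.components_)                  # loadings: how items map onto each factor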
For more information on the review questions numbered 28 to 51, see https://www.certifiedanalytics.org.
STUDY GUIDE REFERENCES FOR SPECIFIC DOMAINS
Hand DJ, Mannila H, Smyth P (2001) Principles of Data Mining (MIT Press, Cambridge, MA).
Hillier FS, Lieberman GJ (2005) Introduction to Operations Research, 8th ed. (McGraw-Hill, New York).
Law AM, Kelton WD (2000) Simulation Modeling and Analysis, 3rd ed. (McGraw-Hill, New York).
Ross SM (2010) Introductory Statistics, 3rd ed. (Academic Press, Burlington, MA).
Siegel E (2013) Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Wiley, New York).
Tufte ER (2001) The Visual Display of Quantitative Information, 2nd ed. (Graphics Press, Cheshire, CT).
Domain VI — Deployment
Chapman P, et al., CRISP-DM 1.0 Step by Step data mining guide, http://lyle.smu.
edu/~mhd/8331f03/crisp.pdf and http://www.the-modeling-agency.com/crisp-dm.pdf.
Laursen GHN, Thorlund J (2010) Business Analytics for Managers: Taking Business Intelligence
Beyond Reporting (John Wiley & Sons, Hoboken, NJ).
Domain VII — Lifecycle Maintenance
Chapman P, et al., CRISP-DM 1.0 Step by Step data mining guide, http://lyle.smu.
edu/~mhd/8331f03/crisp.pdf and http://www.the-modeling-agency.com/crisp-dm.pdf.
Wirth R (2000) CRISP-DM: Towards a standard process model for data mining. Proc. Fourth Internat.
Conf. Practical Appl. Knowledge Discovery Data Mining, http://citeseerx.ist.psu.edu/viewdoc/
summary?doi=10.1.1.198.5133.
FURTHER READING
Albright SC, Winston W, Zappe C (2011) Data Analysis and Decision Making, 4th ed. (South-Western
Cengage Learning, Mason, OH).
Bartlett R (2013) A Practitioner’s Guide to Business Analytics: Using Data Analysis Tools to Improve
Your Organization’s Decision Making and Strategy (McGraw-Hill, New York).
Bennett G, Levis J (2013) Steering toward analytics certification. OR/MS Today (June).
Berry MJA, Linoff GS (1999) Mastering Data Mining: The Art and Science of Customer Relationship
Management (Wiley, New York).
Big data: The next frontier for innovation, competition, and productivity. Report, McKinsey &
Company. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_
innovation.
Breeden J (2013) Tipping Sacred Cows: Kick the Bad Work Habits that Masquerade as Virtues (Jossey-
Bass, San Francisco, CA).
Brohaugh W (2007) Write Tight: Say Exactly What You Mean With Precision and Power (Sourcebooks,
Naperville, IL).
Chapman P, et al., CRISP-DM 1.0 Step by Step data mining guide, http://lyle.smu.
edu/~mhd/8331f03/crisp.pdf and http://www.the-modeling-agency.com/crisp-dm.pdf.
Clemen RT (1997) Making Hard Decisions: An Introduction to Decision Analysis, 2nd ed. (Duxbury Press, Pacific Grove, CA).
Covey S (2004) The 7 Habits of Highly Effective People (Simon & Schuster, New York).
Davenport T, Harris J (2010) Analytics at Work: Smarter Decisions, Better Results (Harvard Business Review Press, Boston).
Davenport T, Kim J (2013) Keeping up with the Quants: Your Guide to Understanding and Using
Analytics (Harvard Business Review Press, Boston).
Duarte N (2012) HBR Guide to Persuasive Presentations (Harvard Business Review Press, Boston).
Eckerson W (2012) Secrets of Analytical Leaders: Insights from Information Insiders (Technics
Publications, Westfield, NJ).
Few S (2012) Show Me the Numbers: Designing Tables and Graphs to Enlighten, 2nd ed. (Analytics
Press, Burlingame, CA).
Framing the problem at https://www.boundless.com/business/management/decision-making/
observation-framing-the-problem/.
Franks B (2012) Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with
Advanced Analytics (John Wiley & Sons, Hoboken, NJ).
Hand DJ, Mannila H, Smyth P (2001) Principles of Data Mining (MIT Press, Cambridge, MA).
Hillier F, Hillier M (2013) Introduction to Management Science: A Modeling and Case Study Approach,
5th ed. (McGraw-Hill Higher Education, New York).
Hillier FS, Lieberman GJ (2010) Introduction to Operations Research, 9th ed. (McGraw-Hill, New York).
Hubbard DW (2010) How to Measure Anything: Finding the Value of “Intangibles” in Business, 2nd ed.
(John Wiley & Sons, Hoboken, NJ).
Jarman K (2013) The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics
(John Wiley & Sons, Hoboken, NJ).
Kirkwood CW (1997) Strategic Decision Making: Multiobjective Decision Analysis with
Spreadsheets (Duxbury Press, Pacific Grove, CA).
The Ladder of Inference: Avoiding “Jumping to Conclusions”, http://www.mindtools.com/pages/
article/newTMC_91.htm.
Law AM, Kelton WD (2006) Simulation Modeling and Analysis, 4th ed. (McGraw-Hill, New York).
Laursen GHN, Thorlund J (2010) Business Analytics for Managers: Taking Business Intelligence
Beyond Reporting (John Wiley & Sons, Hoboken, NJ).
Lindstrom C (2009) How to write a problem statement, March 18, http://www.ceptara.com/blog/how-
to-write-problem-statement.
Mayer-Schönberger V, Cukier K (2013) Big Data: A Revolution That Will Transform How We Live, Work, and Think (Houghton Mifflin Harcourt, New York).
Neter J, Kutner M, Nachtsheim C, Wasserman W (1996) Applied Linear Statistical Models, 4th
ed. (McGraw-Hill/Irwin, New York).
Nixon NW (2013) Focus first on framing, not solving, the problem, April 18, http://philadelphia.
regionsbusiness.com/print-edition-commentary/focus-first-on-framing-not-solving-the-problem/.
Phillips J (2013) Building a Digital Analytics Organization: Creating Value by Integrating Analytical Processes, Technology, and People into Business Operations (Pearson, Upper Saddle River, NJ).
Pink D (2013) To Sell is Human: The Surprising Truth about Moving Others (Riverhead Books, New York).
Provost F, Fawcett T (2013) Data Science for Business: What you need to know about data mining and
data-analytic thinking (O’Reilly Media, Sebastopol, CA).
Redman T (2001) Data Quality: The Field Guide (Digital Press, Woburn, MA).
Ross SM (2010) Introductory Statistics, 3rd ed. (Academic Press, Burlington, MA).
Sashihara S (2011) The Optimization Edge: Reinventing Decision Making to Maximize All Your
Company’s Assets (McGraw-Hill, New York).
Savage S (2012) The Flaw of Averages: Why We Underestimate Risk in the Face of Uncertainty (John
Wiley & Sons, Hoboken, NJ).
Saxena R, Srinivasan A (2012) Business Analytics: A Practitioner's Guide (Springer, New York).
Seelig T (2013) Shift your lens: The power of re-framing problems. Seelig T, ed. inGenius: A Crash
Course on Creativity (HarperOne, New York), http://stvp.stanford.edu/blog/?p=6435.
Shmueli G (2012) Practical Time Series Forecasting: A Hands-On Guide (Springer, New York).
Siegel E (2013) Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (Wiley, New York).
Silver N (2012) The Signal and the Noise: Why Most Predictions Fail but Some Don’t (Penguin Press,
New York).
Soares S (2013) Big Data Governance: An Emerging Imperative (MC Press Online, Boise, ID).
Spitzer DR (2007) Transforming Performance Management: Rethinking the Way We Measure and
Drive Organizational Success. (AMACOM, New York).
Spradlin D (2012) The power of defining the problem, September 25, http://blogs.hbr.org/cs/2012/09/
the_power_of_defining_the_prob.html.
Taylor J (2011) Decision Management Systems: A Practical Guide to Using Business Rules and
Predictive Analytics (Pearson Education, Boston, MA).
Timmer J (2013) Applying science to communicate science: Right now, it’s hard to find relevant
information on how to do it well, August 1, http://arstechnica.com/staff/2013/08/applying-science-to-
communicate-science/.
Tufte ER (2001) The Visual Display of Quantitative Information, 2nd ed. (Graphics Press, Cheshire, CT).
Tversky A, Kahneman D (1974) Judgment under uncertainty: Heuristics and biases. Science
185(4157):1124–1131.
Vose D (2008) Risk Analysis: A Quantitative Guide, 3rd ed. (John Wiley & Sons, Chichester, UK).
Weinschenk SM (2013) How to Get People to Do Stuff: Master the Art and Science of Persuasion and
Motivation (Peachpit, San Francisco).
Wirth R (2000) CRISP-DM: Towards a standard process model for data mining. Proc. Fourth Internat.
Conf. Practical Appl. Knowledge Discovery Data Mining, http://citeseerx.ist.psu.edu/viewdoc/
summary?doi=10.1.1.198.5133.