Empirical Probability Distribution
INTRODUCTION
The objective of this module is to demonstrate how to convert data into probabilities that
support managerial decisions. Real historical (empirical) data do not necessarily fit a known
distribution; however, the frequencies and rankings of the data can be used to estimate an
appropriate empirical probability distribution. Later, the empirical distributions are used in
decision trees and simulations to make optimum managerial decisions.
Empirical probability uses the number of occurrences of an outcome within a sample set as a
basis for determining the probability of that outcome. If "event X" happens k times out of 100
trials, the empirical probability of event X is estimated as k/100. An empirical probability is
closely related to the relative frequency of an event. An empirical distribution is one for which
each possible event is assigned a probability derived from experimental observation. It is
assumed that the events are independent and that the sum of the probabilities is 1.
Empirical probability, also called experimental probability, is the probability that an
experiment will give a certain result, estimated from the results actually observed. For example,
you could toss a coin 100 times to see how many heads you get, or you could perform a taste test
to see whether 100 people prefer cola A or cola B. You could use this information to make an
educated guess (a statistic) about what the probabilities would be if you performed the
experiment 1,000 times, 10,000 times, or an unlimited number of times. If you do not actually
perform the experiment and only reason about it, the result is called theoretical probability.
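The coin-toss idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coin is assumed fair (theoretical probability 0.5), the helper name empirical_probability_of_heads is hypothetical, and the seed is arbitrary.

```python
import random

def empirical_probability_of_heads(num_tosses, rng):
    """Return the relative frequency of heads in num_tosses simulated tosses.

    Assumes a fair coin: each toss is a head with probability 0.5.
    """
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses

rng = random.Random(42)  # arbitrary fixed seed so that runs are repeatable
for n in (100, 1_000, 10_000):
    # Each printed value is an empirical estimate of P(heads) from n tosses.
    print(n, empirical_probability_of_heads(n, rng))
```

Each printed value is an empirical probability. The value 0.5 that the code assumes for a single toss plays the role of the theoretical probability.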
Empirical Probability is probability based upon data. That data can be either the result of a
designed experiment (experimental data) or the result of situations that occur beyond the
control of the analyst (observational data). In the fields of medicine and business, data-driven
probability is referred to as “Evidence-based” probability.
Course Module
In order for a theory to be proved or disproved, empirical evidence must be collected. An
empirical study is performed using actual data, such as market data. In finance, for example,
many empirical studies have been conducted on the capital asset pricing model (CAPM), and the
results are mixed.
In some analyses the model does hold in real-world situations, but most studies have disproved
the model for projecting returns. Although the model is not completely valid, that is not to say
that there is no utility in using the CAPM. For instance, the CAPM is often used to estimate the
cost of equity in a company's weighted average cost of capital.
Table: 1 Table: 2
Recall that f(x) is the probability of a specific outcome x, that is, the probability of a specific
value of a random variable. Discrete empirical probability can be calculated by counting the
number of occurrences of each outcome (numeric or otherwise):
f(x) = n(x) / n
where n(x) is the number of data points equal to the value x and n is the total number of data
points (sample size).
It is useful to first sort the data. The frequencies and probabilities are readily computed after
sorting as in the figure below.
As would be expected due to the Law of Large Numbers, the accuracy of this method of
determining discrete probability improves for larger samples.
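As a minimal sketch of the counting rule f(x) = n(x) / n, the following Python computes the frequencies and probabilities after sorting the outcomes. The sample values and the function name empirical_pmf are hypothetical and serve only as an illustration.

```python
from collections import Counter

def empirical_pmf(data):
    """Map each distinct outcome x to f(x) = n(x) / n, in sorted order."""
    n = len(data)
    counts = Counter(data)  # n(x) for every distinct outcome x
    return {x: counts[x] / n for x in sorted(counts)}

# Hypothetical sample of a discrete variable (for example, units sold per day).
sample = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]

pmf = empirical_pmf(sample)
for x, p in pmf.items():
    print(x, p)
```

The probabilities sum to 1 because every data point is counted exactly once.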
Examining the previous spreadsheet reveals that there are two methods by which a set of
empirical data may be used to generate random variables:
1. Using the full list of data: Give each element a 1/n probability of selection. The data can
first be sorted. Sorted data show the likelihoods of the various outcomes more clearly, which in
turn gives the analyst a much better understanding of the data.
2. Using the data distribution: This works well when there are not an overly cumbersome
number of levels of the discrete variable.
The spreadsheet shown in Fig. 2 demonstrates how the discrete example could be simulated
using the full list of data.
The spreadsheet shown in Fig. 3 is an example of how the discrete empirical data could be
simulated using the probability distribution of the data computed from the previous example.
Fundamentals Of Business Analytics 5
Figure 3: Simulation of Discrete Data Using the Probability Distribution of the Data
The second method is fundamentally the same as the first, but takes advantage of the way
VLOOKUP works when using an approximate match for data in which the data key (first column
of the data) is sorted from smallest to largest. Compare the two methods mentioned previously
to note that the data distribution method is the full list of data method with the repeated
outcomes removed.
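The two methods can be sketched in Python as follows. This is not the spreadsheet itself. The sample data and function names are hypothetical, and bisect is used here as a stand-in for an approximate-match VLOOKUP on a key column sorted from smallest to largest.

```python
import bisect
import random
from collections import Counter

sample = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]  # hypothetical discrete data

def draw_from_full_list(data, rng):
    """Method 1: every observation has probability 1/n of being selected."""
    return rng.choice(data)

def build_lookup(data):
    """Method 2 setup: sorted outcomes and the cumulative probability below each."""
    n = len(data)
    counts = Counter(data)
    outcomes = sorted(counts)
    lower_bounds, cumulative = [], 0
    for x in outcomes:
        lower_bounds.append(cumulative / n)  # lookup key for outcome x
        cumulative += counts[x]
    return lower_bounds, outcomes

def lookup(u, lower_bounds, outcomes):
    """Return the outcome whose key is the largest key <= u (approximate match)."""
    return outcomes[bisect.bisect_right(lower_bounds, u) - 1]

rng = random.Random(1)  # arbitrary seed
lower_bounds, outcomes = build_lookup(sample)
print(draw_from_full_list(sample, rng))
print(lookup(rng.random(), lower_bounds, outcomes))
```

Both functions sample from the same distribution. The second stores one row per distinct outcome instead of one row per observation.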
With continuous empirical data, probabilities can be calculated using the cumulative distribution
function (cdf), F(x). When calculating probabilities from historical data, F(x) is called the
Empirical Cumulative Distribution Function and is abbreviated as ECDF(x). The ECDF(x) is easily
calculated by first sorting the data from smallest to largest and then using the frequency counts
to determine the cumulative probability:
ECDF(x) = (number of data points ≤ x) / n
so that the data point with rank i in the sorted list has ECDF = i / n.
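A minimal Python sketch of this calculation is given below. It uses the ten competitor bids from the bidding example in this module, listed in observation order, and assigns each sorted value the cumulative probability rank / n. The function name ecdf_table is hypothetical, and the sketch assumes there are no tied values.

```python
def ecdf_table(data):
    """Return (value, ECDF) pairs for the sorted data, with ECDF = rank / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

# Competitor bids from the bidding example, in observation order (Obs 1 to 10).
bids = [369_800, 403_200, 401_800, 387_600, 387_300,
        404_800, 389_700, 407_700, 401_400, 380_300]

for value, cumulative_probability in ecdf_table(bids):
    print(value, cumulative_probability)
```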
EXAMPLE: EMPIRICAL DISTRIBUTION IN A DECISION TREE: PRICING DECISIONS
A company is bidding to supply parts to an electronics manufacturer. The competitors’ bids for
10 previous similar contracts are shown in Table 4. If the bid is won, the total cost of completing
the contract is $350,000. What is the optimum bid?
Because the electronics manufacturer will purchase the least expensive components, the low bid
wins in this situation. The probability of winning given a specific bid is therefore the
probability that the competitor's bid is higher:
P(Win | Bid) = P(Competitor's Bid > Bid)
Figure 5: Empirical Cumulative Distribution Function (ECDF) for the Bidding Example
The probability of winning is then calculated as P(Win | Bid) = 1 − ECDF. In general, for LOW BID
WINS bidding the probability of winning is 1 − ECDF. Conversely, for HIGH BID WINS bidding the
probability of winning is the ECDF.
Figure 6: Probability of Winning Given a Specific Bid (1 − ECDF) for the Bidding Example
From the ECDF, the slopes and intercepts to calculate the probability of winning given a specific
bid using interpolation can be calculated using the method shown in Table 5.
Slope = ∆P(Win | Bid) / ∆Bid
Intercept = P(Win | Bid) − Slope × Bid

Rank   Obs   ECDF   Bid       P(Win | Bid)   Slope         Intercept
1      1     0.10   369,800   0.90           −0.0000095    4.42
2      10    0.20   380,300   0.80           −0.0000143    6.23
3      5     0.30   387,300   0.70           −0.0003333    129.80
4      4     0.40   387,600   0.60           −0.0000476    19.06
5      7     0.50   389,700   0.50           −0.0000085    3.83
6      9     0.60   401,400   0.40           −0.0002500    100.75
7      3     0.70   401,800   0.30           −0.0000714    29.00
8      2     0.80   403,200   0.20           −0.0000625    25.40
9      6     0.90   404,800   0.10           −0.0000345    14.06
10     8     1.00   407,700   0.00           −0.0000345    14.06
Table 5: Slope–Intercept Table to Calculate P(Win | Bid) = 1 − ECDF
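The slope and intercept columns of Table 5 can be reproduced with a short Python sketch. The function name slope_intercept_rows is hypothetical. The intercept is computed as P(Win | Bid) − Slope × Bid, which matches the tabulated values, and the last row reuses the slope of the preceding segment, as the table appears to do.

```python
# Competitor bids from Table 5, sorted from smallest to largest.
bids = [369_800, 380_300, 387_300, 387_600, 389_700,
        401_400, 401_800, 403_200, 404_800, 407_700]

def slope_intercept_rows(sorted_bids):
    """Return (bid, p_win, slope, intercept) for each row of the table."""
    n = len(sorted_bids)
    p_win = [1 - (i + 1) / n for i in range(n)]  # P(Win | Bid) = 1 - ECDF
    rows = []
    for i, bid in enumerate(sorted_bids):
        if i < n - 1:
            # Slope of the segment from this bid to the next higher bid.
            slope = (p_win[i + 1] - p_win[i]) / (sorted_bids[i + 1] - bid)
        else:
            slope = rows[-1][2]  # last row: reuse the previous segment's slope
        intercept = p_win[i] - slope * bid
        rows.append((bid, p_win[i], slope, intercept))
    return rows

for bid, p, slope, intercept in slope_intercept_rows(bids):
    print(f"{bid:>9,}  {p:.2f}  {slope:.7f}  {intercept:.2f}")
```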
Using Table 5 and the VLOOKUP function, the expected value for a bid, EVBid, can be calculated as:
EVBid = P(Win | Bid) × (Bid − $350,000)
      = (Slope × Bid + Intercept) × (Bid − $350,000)
The optimum bid is obtained using Excel’s One-Way Data Table command.
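A data table evaluates the expected-value formula over a column of candidate bids. The same search can be sketched in Python. The $100 grid step and the function names are assumptions made for this sketch, and candidate bids are restricted to the range of observed competitor bids.

```python
import bisect

COST = 350_000  # total cost of completing the contract if the bid is won

# Sorted competitor bids and the corresponding P(Win | Bid) = 1 - ECDF.
bids = [369_800, 380_300, 387_300, 387_600, 389_700,
        401_400, 401_800, 403_200, 404_800, 407_700]
p_win_at = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

def p_win(bid):
    """Linearly interpolate P(Win | Bid) between the observed bids."""
    if not bids[0] <= bid <= bids[-1]:
        raise ValueError("bid is outside the observed range")
    i = min(bisect.bisect_right(bids, bid) - 1, len(bids) - 2)
    slope = (p_win_at[i + 1] - p_win_at[i]) / (bids[i + 1] - bids[i])
    return p_win_at[i] + slope * (bid - bids[i])

def expected_value(bid):
    """EVBid = P(Win | Bid) x (Bid - cost)."""
    return p_win(bid) * (bid - COST)

# Candidate bids in $100 steps (assumed grid), like the rows of a data table.
candidates = range(bids[0], bids[-1] + 1, 100)
best_bid = max(candidates, key=expected_value)
print(best_bid, round(expected_value(best_bid)))
```

With these inputs the grid search selects a bid of $387,300, with an expected value of about $26,110.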
To simulate random competitor bids, in a manner similar to the method used to simulate the
five-point estimate, the ECDF must first be inverted as shown in Table 6 and in the corresponding
graph in Figure 8.
Slope = ∆Bid / ∆ECDF
Intercept = Bid − Slope × ECDF

Rank   ECDF = Rand()   Bid       Slope     Intercept
1      0.00            359,300   105,000   359,300
2      0.10            369,800   105,000   359,300
3      0.20            380,300   70,000    366,300
4      0.30            387,300   3,000     386,400
5      0.40            387,600   21,000    379,200
6      0.50            389,700   117,000   331,200
7      0.60            401,400   4,000     399,000
8      0.70            401,800   14,000    392,000
9      0.80            403,200   16,000    390,400
10     0.90            404,800   29,000    378,700
       1.00            407,700   0         407,700
Table 6: Slope–Intercept Table to Generate Random Bids for Simulation
Figure 8: Inverse ECDF for Generating Random Bids for Simulation
As in the case of simulating the five-point estimate, the RAND() must be calculated in a cell that
is separate from the cell used to compute the random variable, so that the slope and intercept
correspond to the same percentile that RAND() generated. The VLOOKUP function is then used to
find the slope and intercept for that percentile, and the random bid is calculated as
Slope × RAND() + Intercept.
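The inverse-ECDF lookup can be sketched in Python under the same logic. The percentile u is drawn once and then reused for both the lookup and the formula, which mirrors keeping RAND() in its own cell. The ECDF and bid values are taken from Table 6, while the function name random_bid and the seed are hypothetical.

```python
import bisect
import random

# Inverse ECDF knots from Table 6: percentile keys and the matching bids.
ecdf_keys = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
bid_values = [359_300, 369_800, 380_300, 387_300, 387_600, 389_700,
              401_400, 401_800, 403_200, 404_800, 407_700]

def random_bid(u):
    """Map a percentile u in [0, 1] to a bid using Slope x u + Intercept."""
    # Approximate-match lookup: the row with the largest key <= u.
    i = min(bisect.bisect_right(ecdf_keys, u) - 1, len(ecdf_keys) - 2)
    slope = (bid_values[i + 1] - bid_values[i]) / (ecdf_keys[i + 1] - ecdf_keys[i])
    intercept = bid_values[i] - slope * ecdf_keys[i]
    return slope * u + intercept

rng = random.Random(7)  # arbitrary seed
u = rng.random()        # drawn once, then used for both lookup and formula
print(u, random_bid(u))
```

If the random number were drawn separately for the lookup and for the formula, the slope and intercept could belong to a different segment than the percentile being converted.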
The main advantage of using empirical probability is that the probability is backed by
experimental studies and data. It is free from assumed data or hypotheses. However, there are
two big disadvantages of empirical probability to consider: