Data and Monte Carlo Simulations


Probability and Decision Analysis

Using Data and Monte Carlo Simulations

1
Overview
• Constructing probability distributions from
data
• Fitting data to theoretical probability
distributions
• Understanding the basics of Monte Carlo
Simulations

2
Preliminaries
We will look at two types of data:
– Sample data
• Denoted as x1, x2,…, xn, for n observations, where xi is a
known number
– Subjectively assessed data
• Denoted as (x1, p1), (x2, p2), …, (xn, pn), for n pairs of values,
where pi is the cumulative probability associated with xi, that
is, P(X ≤ xi) = pi, i = 1, 2, …, n
• In both cases, we are looking at a subset of
the uncertainty population.
3
Preliminaries
• And we will examine two ways to use data to
construct probability distributions:
– Directly construct the distribution based on the data
– Select a theoretical distribution that best fits the data
• Notice that:
– Both types of data (sample, assessed) can be used
with both types of distributions.
– Sometimes it is easier to model a discrete distribution
as a continuous distribution.

4
Using data to construct probability
distributions
• Constructing a discrete distribution from data
– Count the number of occurrences of each
category.
– Assign probabilities to the categories
• The probabilities are relative frequencies.


› If the data are sample values, the discrete probabilities are
relative frequencies.
› If the data are subjective assessments, the discrete
probabilities are simply the assessments.
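
The counting procedure above can be sketched in a few lines of Python. The category labels and the 20 observations are hypothetical, chosen only so that each category has at least five observations; this is an illustration, not data from the course.

```python
from collections import Counter

# Hypothetical sample: 20 observations of a three-category uncertainty.
observations = ["low"] * 5 + ["medium"] * 9 + ["high"] * 6

# Step 1: count the number of occurrences of each category.
counts = Counter(observations)

# Step 2: assign probabilities as relative frequencies.
n = sum(counts.values())
probabilities = {category: count / n for category, count in counts.items()}

for category, p in probabilities.items():
    print(f"P({category}) = {p:.2f}")
```

With subjectively assessed data there is nothing to count; the assessed probabilities would be entered into the dictionary directly.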

7
Using data to construct probability
distributions
• Some judgments are needed when using data.
– Ensure you have enough data
• A minimum of five observations per category.
– Familiarize yourself with the data to check for errors in
the data:
• Can be from many sources: e.g., data collection errors, data
entry errors, …
– “Get to know your data” to ensure that it is
representative of the uncertainty or underlying
population.
– Data is historical and you need to be cautious when
using it to predict the future.

8
Using data to construct probability
distributions
• We now look at constructing a discrete
probability distribution.
• First, construct an empirical distribution from
a sample:
– Sort the sample values from lowest to highest
– Assign probabilities to each value

9
Using data to construct probability
distributions
A sample of 10 observations from an exponential distribution
with rate parameter λ = 1/10.

Observation   Assigned Probability   Cumulative Probability
0.6           1/10                   1/10
2             1/10                   2/10
3.5           1/10                   3/10
4             1/10                   4/10
5.7           1/10                   5/10
7.1           1/10                   6/10
10.6          1/10                   7/10
14.1          1/10                   8/10
19.2          1/10                   9/10
23.7          1/10                   10/10
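
The table can be reproduced with a short Python sketch: sort the sample, give each value probability 1/n, and accumulate. The function names are illustrative, not part of any tool used in the course.

```python
# The ten observations from the table above.
sample = [0.6, 2, 3.5, 4, 5.7, 7.1, 10.6, 14.1, 19.2, 23.7]

def empirical_cdf(data):
    """Return (value, cumulative probability) pairs; each value has weight 1/n."""
    ordered = sorted(data)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

def ecdf_at(data, x):
    """Empirical P(X <= x): the fraction of observations at or below x."""
    return sum(1 for v in data if v <= x) / len(data)

ecdf = empirical_cdf(sample)
for value, cum_prob in ecdf:
    print(f"{value:>5}  {cum_prob:.1f}")
```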

10
Using data to construct probability
distributions

An empirical distribution can be shown as a CDF.

11
Using data to construct probability
distributions
This discrete distribution approximates the underlying continuous
distribution. The more observations used, the closer the approximation.

[Figure panels: empirical CDF based on 10 observations; empirical CDF based on 30 observations]

12
Using data to construct probability
distributions
How can we measure the quality or closeness of
a CDF to the continuous distribution?
– Measure how far apart the two distributions are
by measuring the vertical distance between them
• E.g., Kolmogorov-Smirnov distance
– Compare the mean and standard deviation of the
fitted distribution to the underlying distribution
• In both cases, the closer or smaller the
measured difference, the better the
approximation.
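
As a rough illustration of the vertical-distance idea, the sketch below computes the Kolmogorov-Smirnov distance between the empirical CDF of the ten-observation sample shown earlier and the exponential CDF with rate 1/10. It is a hand-rolled calculation under those assumptions, not the routine any particular package uses.

```python
import math

sample = [0.6, 2, 3.5, 4, 5.7, 7.1, 10.6, 14.1, 19.2, 23.7]
rate = 1 / 10  # rate parameter of the underlying exponential distribution

def exponential_cdf(x, rate):
    """P(X <= x) for an exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def ks_distance(data, cdf):
    """Largest vertical gap between the empirical CDF of data and cdf."""
    ordered = sorted(data)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        distance = max(distance, abs((i + 1) / n - f), abs(f - i / n))
    return distance

d = ks_distance(sample, lambda x: exponential_cdf(x, rate))
print(f"K-S distance: {d:.4f}")
```

A smaller distance means the empirical distribution tracks the continuous one more closely.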
13
Using data to construct probability
distributions
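• Some important formulas (point estimates):
Sample mean:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Estimate of the population standard deviation based on sample data:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2}$$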
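
The two point estimates named above can be computed directly. This sketch assumes the usual n − 1 divisor for the standard deviation estimate and reuses the ten-observation exponential sample from the earlier slide.

```python
import math

sample = [0.6, 2, 3.5, 4, 5.7, 7.1, 10.6, 14.1, 19.2, 23.7]
n = len(sample)

# Sample mean: the sum of the observations divided by n.
x_bar = sum(sample) / n

# Estimate of the population standard deviation (n - 1 divisor).
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

print(f"sample mean = {x_bar:.2f}, s = {s:.2f}")
```

The standard library's `statistics.mean` and `statistics.stdev` return the same two quantities.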

14
Using data to fit probability
distributions
• Instead of constructing a distribution empirically
from sample data, you can look for a theoretical
distribution that closely matches the data.
• Fitting a theoretical distribution to data means
finding the values of the parameters such that
the theoretical distribution matches the data as
closely as possible.
– Parameters are the key characteristics that specify a
distribution.
• E.g., the parameters of a normal distribution are mean and
standard deviation.

15
Using data to fit probability
distributions
Standard deviation (s or σ) of a normal
distribution

16
Using data to fit probability
distributions
However, the best theoretical distribution for
sample data is not always the best fitting one
based on parameters. Why?
– The top fitting distributions are very close to each
other.
– Also keep in mind that some distributions have a
great deal of flexibility in shaping to match data.
– Different measures of fit may produce different
results.
17
Using data to fit probability
distributions
@RISK is a good tool for matching distributions.
– It can run the fit on all of the distributions in its
library.
– It uses three measures of fit that compare the
fitted theoretical distribution to the
sample.
1. Kolmogorov-Smirnov distance
– Based on maximum vertical distance between distribution and
data
2. Anderson-Darling distance
– Similar to K-S distance but factors in the extreme tails
3. Chi-Squared distance
– Based on matching fractiles of distribution and data

18
Example 1
• Assessed yearly profits of an income property
(obtained from Triangular(-25000, 18300, 24000))

Assessed values (xi)            Prob. (pi)
P(Yearly Profit ≤ -$25,000)     0.00
P(Yearly Profit ≤ -$10,000)     0.10
P(Yearly Profit ≤ $0)           0.30
P(Yearly Profit ≤ $15,000)      0.75
P(Yearly Profit ≤ $24,000)      1.00
19
Example 1

To measure how close a fitted theoretical distribution is to the
assessed values:
1) Calculate the vertical distance between each assessed value and
the fitted distribution.
2) Compute the Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{\mathrm{Distance}_1^2 + \mathrm{Distance}_2^2 + \mathrm{Distance}_3^2}{3}}$$
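
A minimal sketch of the RMSE calculation for Example 1, assuming the fitted distribution is the triangular with minimum -25,000, mode 18,300 and maximum 24,000, and that the three distances are taken at the interior assessed points (the two endpoints match the fitted CDF by construction). The function name is illustrative.

```python
import math

def triangular_cdf(x, low, mode, high):
    """CDF of a triangular distribution with the given minimum, mode, maximum."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))

# Interior assessed points (xi, pi) from Example 1.
assessed = [(-10_000, 0.10), (0, 0.30), (15_000, 0.75)]

# Vertical distance between the fitted CDF and each assessed probability.
distances = [triangular_cdf(x, -25_000, 18_300, 24_000) - p for x, p in assessed]

rmse = math.sqrt(sum(d * d for d in distances) / len(distances))
print(f"RMSE = {rmse:.4f}")
```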
20
Mechanics of simulations
• A simulation model is a mathematical model
in which a probability distribution is used to
represent the possible values of an uncertain
variable.
– Similar to decision trees
– Allows for continuous as well as discrete
distributions

21
Simulation
• An imitation that reflects the operation of a real-
world process/system over time.
• Many real-world systems are so complex that they
cannot be solved mathematically.
– Hence, numerical, computer-based simulation can be
used to imitate the system behavior.
• Simulations are used as:
– Analytical tool: predicts the effect of changes to
existing systems.
– Design tool: predicts the performance of new systems.
• Simulation models are “run” rather than solved.
22
Introduction to Monte Carlo
simulations
• Generate a Uniform Random Variable
U~Uniform(0,1).
• In Excel, enter “=RAND()”, then press F9 to
generate a new random variable (RV).

• Consider an unfair coin with probability of heads
being 0.3. How can I simulate the coin toss by
generating a Uniform(0,1)?
• Generate U.
• If U < 0.3 then we obtain a head; otherwise, we obtain a tail.
coin_toss.xlsx
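
For readers without the spreadsheet, the same coin toss can be sketched in Python, with `random.random()` playing the role of Excel's RAND(). The seed and the number of tosses are arbitrary choices for the illustration.

```python
import random

def toss(p_heads, rng):
    """Simulate one toss: heads if the Uniform(0,1) draw falls below p_heads."""
    u = rng.random()
    return "H" if u < p_heads else "T"

rng = random.Random(42)  # fixed seed so the run is repeatable
tosses = [toss(0.3, rng) for _ in range(10_000)]
share_heads = tosses.count("H") / len(tosses)
print(f"Share of heads in {len(tosses)} tosses: {share_heads:.3f}")
```

Over many tosses the share of heads should settle near 0.3.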
23
Roulette wheel

American
Roulette Wheel:
- 18 black
- 18 red
- 2 green
- Total: 38 slots

24
Roulette – Monte Carlo simulation
• Example: Generate an integer in [0, 37].
• We have the capability to generate
U~Uniform(0,1).
– Generate U~Uniform(0,1)
– If i/38 ≤ U < (i+1)/38, then the generated
number is i, for i = 0, …, 37.
– In other words: outcome = floor(U * 38)
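
The floor rule translates directly into code. This sketch only generates slot numbers 0 to 37; which numbers are black, red or green is not specified here, so no colour mapping is attempted.

```python
import math
import random

def spin(rng):
    """Return a slot number in 0..37, each with probability 1/38."""
    u = rng.random()           # Uniform on [0, 1)
    return math.floor(u * 38)  # i such that i/38 <= u < (i+1)/38

rng = random.Random(7)
spins = [spin(rng) for _ in range(20_000)]
print("first ten spins:", spins[:10])
print("distinct slots seen:", len(set(spins)))
```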

25
Roulette
• Let’s transform coin_toss.xlsx into roulette.xlsx

26
Statistical recalls
• Population mean, µ: not a random variable
• Sample mean, $\bar{X}$: a random variable
• $\bar{X}$ is the best estimate of µ
• Using our sample data,
– Calculate a Confidence Interval on µ
– Construct a hypothesis test.

27
Constructing a confidence interval
• By the law of large numbers, if we take n samples from a
population, the mean of the n samples tends towards the
actual mean of the population when n tends towards
infinity.

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu$$

• Based on the central limit theorem, $\bar{X}$ is a random variable
that tends to become normally distributed, $\bar{X} \sim N(\mu, s/\sqrt{n})$, when
n is large, where $s$ is the estimated population standard
deviation based on the sample data.

28
Constructing a confidence interval
• A 1-α confidence interval implies that there is
1-α probability that the actual population
mean falls within the boundaries of the
confidence interval.
• Example:
– A 95% confidence interval for a simulation
outcome is [15, 25]. This means that we have 95%
confidence that the population mean is
somewhere between 15 and 25.
29
Constructing a confidence interval
• Given that $\bar{X}$ tends to become normally
distributed, $\bar{X} \sim N(\mu, s/\sqrt{n})$, the boundaries of
a 95% confidence interval are defined as:
– Low bound = $\bar{X} - 1.96 \cdot s/\sqrt{n}$
– High bound = $\bar{X} + 1.96 \cdot s/\sqrt{n}$

30
Example – 32 simulation runs of 100 coin tosses
each, with P(head) = 0.75

Simulation Run   X = Number of Heads
1                72
2                74
3                81
.                .
.                .
.                .
29               78
30               78
31               70
32               69

$\bar{X}$ = 74.5, s = 4.786, n = 32

95% Confidence Interval:
$$\bar{X} \pm 1.96 \cdot \frac{s}{\sqrt{n}} = 74.5 \pm 1.96 \cdot \frac{4.786}{\sqrt{32}} = 74.5 \pm 1.658$$
⇒ 95% Confidence Interval: [72.842, 76.158]

Notice that X is Binomial with parameters n = 100 and p = 0.75,
so the population mean of X is np = 75, which is within the 95% CI.
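
The experiment can be repeated in Python as a sketch: 32 runs of 100 tosses each with P(head) = 0.75, followed by the 95% interval from the formula on the previous slide. The seed is arbitrary, so the numbers will differ from the table, and the helper name is illustrative.

```python
import math
import random
import statistics

def confidence_interval_95(values):
    """Return (low, high) = sample mean -/+ 1.96 * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # n - 1 divisor
    half_width = 1.96 * s / math.sqrt(n)
    return mean - half_width, mean + half_width

rng = random.Random(1)
# Each run counts the heads in 100 tosses of a coin with P(head) = 0.75.
runs = [sum(1 for _ in range(100) if rng.random() < 0.75) for _ in range(32)]

low, high = confidence_interval_95(runs)
print(f"mean = {statistics.mean(runs):.2f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Across many seeds, roughly 95% of the intervals built this way should contain the population mean np = 75.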
31
Note
• Note that when n increases, the width of the
confidence interval is reduced.
– Our estimate of µ becomes more precise.

32
Example: warehouse storage
• Our warehouse can store 80 items.
• The warehouse should be refilled when it becomes half
empty.
• Daily demand probability distribution is:
• P(Daily demand = 0 items) = 0.10
• P(Daily demand = 1 item) = 0.15
• P(Daily demand = 2 items) = 0.20
• P(Daily demand = 3 items) = 0.30
• P(Daily demand = 4 items) = 0.20
• P(Daily demand = 5 items) = 0.05

What is the expected number of days until the
warehouse becomes half empty?

33
Random number mapping
A number between 0 and 1 is selected randomly.
The daily demand is determined by the mapping shown below.

Demand   Probability   Random number interval
0        0.10          0 to 0.1
1        0.15          0.1 to 0.25
2        0.20          0.25 to 0.45
3        0.30          0.45 to 0.75
4        0.20          0.75 to 0.95
5        0.05          0.95 to 1

If U = 0.345, then Demand is 2.
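
The mapping can be written as a cumulative-probability lookup. This is a sketch of the idea; the function name is illustrative.

```python
# (demand, probability) pairs from the warehouse example.
DEMAND_PROBS = [(0, 0.10), (1, 0.15), (2, 0.20), (3, 0.30), (4, 0.20), (5, 0.05)]

def demand_from_u(u):
    """Map a Uniform(0,1) draw to a demand using cumulative probabilities."""
    cumulative = 0.0
    for demand, p in DEMAND_PROBS:
        cumulative += p
        if u < cumulative:
            return demand
    return DEMAND_PROBS[-1][0]  # guard against floating-point round-off

print(demand_from_u(0.345))  # falls in the interval 0.25 to 0.45
```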


34
Simulation Run # 1
Let X be the number of days until the warehouse is half empty

Day   Random Number   Demand   Total Demand to Date
1     0.651           3        3
2     0.105           1        4
3     0.677           3        7
4     0.975           5        12
5     0.818           4        16
6     0.133           1        17
7     0.002           0        17
8     0.818           4        21
9     0.774           4        25
10    0.538           3        28
11    0.953           5        33
12    0.616           3        36
13    0.233           1        37
14    0.563           3        40

X1 = 14
35
Simulation Run # 2
Let X be the number of days until the warehouse is half empty

Day   Random Number   Demand   Total Demand to Date
1     0.166           1        1
2     0.963           5        6
3     0.632           3        9
4     0.828           4        13
5     0.191           1        14
6     0.919           4        18
7     0.195           1        19
8     0.64            3        22
9     0.951           5        27
10    0.785           4        31
11    0.247           1        32
12    0.396           2        34
13    0.191           1        35
14    0.799           4        39
15    0.836           4        43

X2 = 15
36
After 30 runs, we obtain the following:

Simulation Run   X = Number of Days
1                14
2                15
3                18
.                .
.                .
.                .
28               15
29               16
30               17

$\bar{X}$ = 16.7, s = 1.705, n = 30

95% Confidence Interval:
$$\bar{X} \pm 1.96 \cdot \frac{s}{\sqrt{n}} = 16.7 \pm 1.96 \cdot \frac{1.705}{\sqrt{30}} = 16.7 \pm 0.610$$
⇒ 95% Confidence Interval: [16.090, 17.310]
37
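
The whole warehouse experiment can be sketched in Python: replay the random numbers listed for the first run, then perform 30 fresh runs and build the 95% interval. The seed is arbitrary, so the fresh runs will not reproduce the numbers in the table, and the function names are illustrative.

```python
import math
import random
import statistics

DEMAND_PROBS = [(0, 0.10), (1, 0.15), (2, 0.20), (3, 0.30), (4, 0.20), (5, 0.05)]

def demand_from_u(u):
    """Map a Uniform(0,1) draw to a daily demand."""
    cumulative = 0.0
    for demand, p in DEMAND_PROBS:
        cumulative += p
        if u < cumulative:
            return demand
    return DEMAND_PROBS[-1][0]  # guard against floating-point round-off

def days_until_half_empty(uniforms, capacity=80):
    """Count days until total demand reaches half of the capacity."""
    total, days = 0, 0
    for u in uniforms:
        total += demand_from_u(u)
        days += 1
        if total >= capacity / 2:
            return days
    raise ValueError("ran out of random numbers before reaching half empty")

# Replay the random numbers listed for simulation run 1.
run_1 = [0.651, 0.105, 0.677, 0.975, 0.818, 0.133, 0.002,
         0.818, 0.774, 0.538, 0.953, 0.616, 0.233, 0.563]
print("run 1:", days_until_half_empty(run_1))

rng = random.Random(2024)
# iter(rng.random, None) is an endless stream of Uniform(0,1) draws.
runs = [days_until_half_empty(iter(rng.random, None)) for _ in range(30)]

mean = statistics.mean(runs)
half_width = 1.96 * statistics.stdev(runs) / math.sqrt(len(runs))
print(f"mean = {mean:.2f}, 95% CI = [{mean - half_width:.3f}, {mean + half_width:.3f}]")
```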
Back to central limit theorem
• Central limit theorem: if all samples of a particular
size are selected from any population with finite variance, the sampling
distribution of the sample mean is approximately a
normal distribution.

This approximation improves with larger samples.

› We can reason about the distribution of the sample mean
with no information about the shape of the population
distribution from which the sample is taken.
› The central limit theorem holds for any distribution with a finite mean and variance.
› As a rule of thumb, a sample of 30 or more is large enough to apply the CLT.

38
39
Mechanics of simulations
1. Construct a deterministic model
– No probability distributions are in this model.
2. Apply (“embed”) distributions to the
constant values in the deterministic model
where you expect variation or uncertainty
– You now have a probabilistic or stochastic model.
– These distributions may be assumed or based on
beliefs.

40
Mechanics of simulations
3. Randomly draw (sample) values from the
distributions to apply to the model for
recalculation
– With each new draw, you run a different combination of
input values through your model, over thousands of iterations.
– Each iteration is a single sample from the distribution.
4. Plot the outcomes of the iterations of the model
– This gives you the distribution of the uncertainty of
interest – risk profile
– You can now factor probabilities into your decisions.

41
Mechanics of simulations
Iteration – a recalculation of the model
• For every iteration, a new value is chosen for
each uncertainty according to the corresponding
probability distribution, and this value is used in
the calculations for that particular iteration.
• Increasing the number of iterations results in
sampled values more closely aligned with the
distribution
• At a minimum, run 1,000 iterations; tens of thousands is not
unusual
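
The four steps can be sketched end to end in Python. The profit model, its input values and the two distributions are entirely hypothetical, picked only to show the mechanics of turning a deterministic model into a stochastic one and summarizing the iterations.

```python
import random
import statistics

def profit(units_sold, unit_price, unit_cost, fixed_cost):
    """Step 1: deterministic model; the output follows from the inputs alone."""
    return units_sold * (unit_price - unit_cost) - fixed_cost

# Hypothetical base-case inputs.
base_case = profit(units_sold=1_000, unit_price=12.0, unit_cost=7.0, fixed_cost=3_000)

def one_iteration(rng):
    """Steps 2 and 3: draw each uncertain input, then recalculate the model."""
    units_sold = rng.triangular(600, 1_400, 1_000)  # arguments: low, high, mode
    unit_cost = rng.uniform(6.0, 8.0)
    return profit(units_sold, 12.0, unit_cost, 3_000)

rng = random.Random(0)
outcomes = [one_iteration(rng) for _ in range(10_000)]

# Step 4: summarize the risk profile of the outcome.
cuts = statistics.quantiles(outcomes, n=20)  # 19 cut points: 5%, 10%, ..., 95%
p5, p95 = cuts[0], cuts[-1]
mean_profit = statistics.mean(outcomes)
print(f"base case = {base_case:.0f}, mean = {mean_profit:.0f}")
print(f"5th percentile = {p5:.0f}, 95th percentile = {p95:.0f}")
```

A histogram or cumulative plot of `outcomes` would give the risk profile in graphical form.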
42
Simulation Process: (1) Deterministic model – no probabilities (not shown); (2)
Generic (stochastic) model with probabilities; (3) Iterations; (4) Risk profile

43
Sampling from probability
distributions
• Problem: how to draw a representative sample of
size n from a given probability distribution for an
uncertain variable X
– Needed to run iterations of the stochastic model
• Solution: a mathematical theorem states that as
long as we choose the probability values
uniformly (every possible value is equally likely)
from the interval (0, 1), then the corresponding x
values in the CDF will have approximately the
desired distribution
– This theorem is foundational for simulation programs.
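
This result is commonly called inverse transform sampling. A sketch for the exponential distribution with rate 1/10, whose CDF F(x) = 1 − exp(−rate·x) can be inverted in closed form; the seed and sample size are arbitrary.

```python
import math
import random

def exponential_inverse_cdf(p, rate):
    """Solve F(x) = p for x, where F(x) = 1 - exp(-rate * x)."""
    return -math.log(1.0 - p) / rate

rng = random.Random(7)
rate = 1 / 10
# Choose probabilities uniformly on [0, 1) and read off the matching x values.
draws = [exponential_inverse_cdf(rng.random(), rate) for _ in range(50_000)]

mean = sum(draws) / len(draws)
share_below_10 = sum(1 for x in draws if x <= 10) / len(draws)
print(f"sample mean = {mean:.2f} (theoretical mean 10)")
print(f"share of draws <= 10: {share_below_10:.3f} (theoretical 1 - exp(-1))")
```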

44
Sampling from probability
distributions

45
@Risk exercise
• Test the CLT on a sample of size 30 from a
population with each of the following probability
distributions:
› Exponential with mean 10.
› Uniform with minimum 15 and maximum 25,
U(15,25).
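
The exercise is meant for @RISK, but the same check can be sketched in Python: draw many samples of size 30, record each sample mean, and compare the spread of those means with the value the CLT predicts, σ/√30. The number of repetitions and the seed are arbitrary choices.

```python
import math
import random
import statistics

def sample_means(draw, sample_size=30, repetitions=5_000):
    """Return the means of many independent samples of the given size."""
    return [sum(draw() for _ in range(sample_size)) / sample_size
            for _ in range(repetitions)]

rng = random.Random(3)
exp_means = sample_means(lambda: rng.expovariate(1 / 10))  # rate 1/10, mean 10
uni_means = sample_means(lambda: rng.uniform(15, 25))

# CLT prediction: the sample mean has standard deviation sigma / sqrt(30).
exp_sd_predicted = 10 / math.sqrt(30)
uni_sd_predicted = (10 / math.sqrt(12)) / math.sqrt(30)

for name, means, sd in [("exponential", exp_means, exp_sd_predicted),
                        ("uniform", uni_means, uni_sd_predicted)]:
    print(f"{name}: mean of means = {statistics.mean(means):.3f}, "
          f"sd of means = {statistics.stdev(means):.3f}, predicted sd = {sd:.3f}")
```

A histogram of either list should look close to a bell curve, with the exponential case slightly more skewed at this sample size.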

46
Risk Profile Example
• A Risk Profile has been constructed for a certain
project,
› 45% chance profit is triangular Tr(25,36,40)
› 35% chance profit is uniform U(10,35)
› 20% chance of a loss of exactly 25

• Calculate the mean and standard deviation of the
profit.
• What is the probability the profit is greater than 20?
• What is the probability the profit is less than 14?
Check risk_profile.xlsx
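
As a cross-check on the spreadsheet, the risk profile can be simulated as a mixture. This sketch assumes Tr(25,36,40) means minimum 25, mode 36 and maximum 40, and treats the loss as a profit of -25.

```python
import random
import statistics

def draw_profit(rng):
    """Pick a branch with probability 0.45 / 0.35 / 0.20, then draw the profit."""
    u = rng.random()
    if u < 0.45:
        return rng.triangular(25, 40, 36)  # arguments: low, high, mode
    if u < 0.80:
        return rng.uniform(10, 35)
    return -25.0

rng = random.Random(11)
profits = [draw_profit(rng) for _ in range(200_000)]

mean = statistics.mean(profits)
sd = statistics.stdev(profits)
p_above_20 = sum(1 for x in profits if x > 20) / len(profits)
p_below_14 = sum(1 for x in profits if x < 14) / len(profits)

print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")
print(f"P(profit > 20) = {p_above_20:.3f}, P(profit < 14) = {p_below_14:.3f}")
```

Under the same assumptions the exact answers can be worked out by hand for comparison: the mean is 0.45·(101/3) + 0.35·22.5 − 0.20·25 = 18.025, P(profit > 20) = 0.45 + 0.35·(15/25) = 0.66, and P(profit < 14) = 0.20 + 0.35·(4/25) = 0.256.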
47
NPV Example
• Check NPV_uncertainty.xlsx

• What is the probability that the NPV is above
50,000 USD?
• What is the mean NPV?
• What is the 90% confidence interval?

48
Leah Sanchez Example
• Let’s look at the Leah Sanchez calendar sales
example (page 482).

49
50
Influence diagram

51
Develop a deterministic model
The next step is the deterministic model: a static model whose
consequence value is completely determined by the input values.
Here is Leah’s in Excel.
[Figure callouts: “This is what Leah wants to know.” “This is uncertain.”]
52
Demand is actually uncertain.
• Therefore, Leah wants to include the demand
uncertainty in the analysis.
• After assessing Leah’s cumulative probabilities for
various demand levels, we fit a probability
distribution.
• The distribution happens to be a general beta
distribution with:
– Min=600
– Max=1400
– α=2
– β=18
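
A general beta distribution on [Min, Max] can be sampled by rescaling a standard beta draw. The sketch below assumes that parameterization and estimates the mean demand and the chance that demand reaches 680 calendars; the seed and sample size are arbitrary.

```python
import random

def draw_demand(rng, minimum=600, maximum=1400, alpha=2, beta=18):
    """Rescale a Beta(alpha, beta) draw from [0, 1] to [minimum, maximum]."""
    return minimum + (maximum - minimum) * rng.betavariate(alpha, beta)

rng = random.Random(5)
demands = [draw_demand(rng) for _ in range(100_000)]

mean_demand = sum(demands) / len(demands)
share_at_least_680 = sum(1 for d in demands if d >= 680) / len(demands)

print(f"mean demand = {mean_demand:.1f}")
print(f"P(demand >= 680) = {share_at_least_680:.3f}")
```

Under this parameterization the theoretical mean is 600 + 800·2/(2 + 18) = 680, and the estimated probability should come out near 0.42.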
53
The demand distribution

54
Simulations
• Now Leah can simulate, for a given order
quantity, the associated distribution of the
profit given the uncertain demand.

55
Profit prob. distribution when 680 calendars are ordered.

56
Why do we have a big jump at $6,120?
• Because there is a 42% probability for demand
to be greater than or equal to 680 (the order
quantity).

57
Further investigations
• Leah decides to vary the order quantity and check E[profits]

58
Zooming in for orders around 700

59
Leah also checks the 5th and 95th percentiles.

60
Conclusion
• Leah has gained useful information and insight
from simulating the calendar ordering problem,
and now has a much better understanding of the
distribution of Profit for each value of Order
Quantity she is considering.

• She may go with the alternative that maximizes
expected Profit (700) or she may choose another
alternative, particularly if she wants to reduce
risks.
61
Simulation vs. decision tree
What if Leah used decision-tree modeling instead of
simulation modeling?
• First, substitute a discrete distribution for the
continuous beta distribution used in simulation:
– Leah uses the extended Pearson-Tukey (EP-T)
distribution – it requires only three points.
– The next slide shows a decision-tree for only two
different values for Order Quantity.
– However, a full analysis would use several possible
values, just as in the simulation model.

62
EP-T three-point approximation

Demand   Fractile   Probability
615      0.05       0.185
679      0.5        0.63
780      0.95       0.185
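
To see what the three-point approximation implies, its expected demand can be computed directly. The sketch takes the three demand values and probabilities exactly as listed on the slide.

```python
# (demand value, probability) pairs of the EP-T approximation, as listed.
ept_points = [(615, 0.185), (679, 0.63), (780, 0.185)]

# The three probabilities should sum to one.
total_probability = sum(p for _, p in ept_points)

# Expected demand under the discrete approximation.
expected_demand = sum(x * p for x, p in ept_points)

print(f"total probability = {total_probability:.3f}")
print(f"expected demand = {expected_demand:.3f}")
```

For comparison, a general beta with Min = 600, Max = 1400, α = 2, β = 18 has mean 600 + 800·2/20 = 680, so the discrete approximation lands within a few calendars of the continuous distribution's mean.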

63
64
Simulation vs. decision tree

65
Simulation vs. decision tree
“When should I use simulation, and when should
I use decision trees?”
– In many cases both approaches work fine.
– However, there are two key issues to consider:
• If your decision situation involves a large number of
uncertainties, the necessarily large decision tree can be
very clumsy to work with. Use a simulation approach.
• If your decision situation involves future or
“downstream” decisions, then a decision tree might be
easier to work with.
66
