06a Fenton Estimation
Introduction
Interpolation:
• the data are used to characterize the site at which they were obtained.
• estimator errors decrease with increasing correlation between
observations, i.e. the more highly correlated the site is, the fewer
samples are required to characterize it. (Unfortunately, we generally
don't know a priori how highly correlated a site is!)
• the data can be assumed known – uncertainty now occurs between data
points, so we need only model this residual variability.
• Kriging and/or conditional models can be used to characterize the
residual variability.
Interpolation vs. Extrapolation
Extrapolation:
• the data are being used to characterize the soil population (i.e. to
infer the population parameters for use at other sites).
• estimator error increases with increasing correlation between
observations, i.e. the more highly correlated a site is, the less
representative it is of other sites – you cannot expect to accurately
characterize a neighbouring site if all of your samples are taken from
a (highly correlated) soft clay layer at the current site.
• statistical estimates of population parameters are typically quite
inaccurate (due to correlation), especially estimates of the
correlation length.
Interpolation vs. Extrapolation
• Practicing geotechnical engineers are typically interpolating. That is,
they sample with the goal of characterizing the site at which the
samples are observed.
• Published research papers and textbooks are extrapolating (or at least
they should be). That is, they are expressing soil property information
that is meant to be useful at sites other than the single site at which
the data were obtained.
• Unfortunately, all too often, research papers provide soil property
statistics from which the locally observed trend has been removed. This
leads to significantly underestimated variabilities (only useful at
sites where similar trends occur and have been similarly removed).
• In extrapolation, trends should generally be considered part of the
uncertainty being characterized.
Choosing a Distribution
Once the data have been gathered, we need to decide how best to represent
the "population". The first step is to decide on a population
distribution. There are several possibilities:
1. Trace-driven simulation: use the data directly in a simulation. This
is the least preferable approach, since it can only reproduce the
observed data, not the full range of possibilities. This approach is
most commonly used in earthquake ground motion simulation.
2. Empirical distribution: the data are used to define an empirical
cumulative distribution function (e.g. P[X ≤ x] is estimated by the
fraction of observed values less than or equal to x). This does not
allow for the extremes that often control design. That is, most
samples will not include the 1/1000 extremes that would lead to
failure.
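As a small sketch of option 2 (not from the slides; the sample values are illustrative), an empirical CDF can be built directly from the observations. Note how it assigns zero probability beyond the sample range, which is exactly the extremes problem described above:

```python
# Minimal sketch of an empirical CDF built directly from the data.
def empirical_cdf(data):
    """Return F where F(x) = fraction of observations <= x."""
    s = sorted(data)
    n = len(s)
    return lambda x: sum(1 for v in s if v <= x) / n

sample = [3, 2, 8, 9, 10, 4, 4, 2, 7]   # illustrative values
F = empirical_cdf(sample)
print(F(4))     # 5/9 of the observations are <= 4
print(F(100))   # 1.0 -- no probability is ever assigned beyond the sample maximum
```

The step function jumps by 1/n at each observation, so an event rarer than roughly 1/n simply cannot be represented.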
Choosing a Distribution
Extrapolation:
• fit the simplest distribution that you can – you are trying to model
the population, not the specific data set.
Interpolation:
• fit a distribution of reasonable complexity – just remember that you
still need to capture the range of possibilities that might occur
between your observation points (so there is probably little point in
employing a 20-parameter distribution).
Frequency Comparison
Example:
Suppose that, just after construction, 50 randomly selected one-kilometre
sections of highway through a hilly region were chosen to evaluate the
annual probability of slope failure under the existing design code. The
number of years until an observable slope failure occurred within each
one-kilometre length, t_i, was recorded, with the following results:
3, 2, 8, 9, 10, 4, 4, 2, 7, 7, 1, 14, 2, 1, 8, 3, 4, 5, 4, 2, 10,
2, 1, 7, 8, 4, 3, 3, 21, 1, 3, 9, 1, 4, 5, 1, 4, 1, 4, 3, 5, 3, 1,
9, 1, 6, 3, 5, 12, 11
A previous analysis of similar data suggested that the annual probability of
observable slope failure in each one-kilometre section of highway is 0.2.
Assuming that sections fail independently and that each year constitutes an
independent trial, how reasonable does the hypothesis that the annual
probability of slope failure per km is 0.2 appear to be on the basis of
these data?
Frequency Comparison
If sections fail independently, and each year is also independent, then we have
50 independent observations of the ‘number of trials’ (i.e. years) to first failure
of a 1-km section. Under the given assumptions, the ‘number of trials to first
failure’ follows a geometric distribution.
The estimate of the annual probability of slope failure per km is just one
(year) over the average time to slope failure:

    p̂ = 1 / [ (3 + 2 + ⋯ + 11) / 50 ] = 50/251 = 0.199

which is very close to the hypothesized annual probability.
The following page compares the frequency histogram with that predicted by
theory along with the empirical and fitted cumulative distribution functions.
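The calculation above can be reproduced directly, and the observed counts compared with those predicted by the geometric distribution P[T = t] = p(1 − p)^(t−1); the year range 1–5 shown here is just an illustrative slice of that comparison:

```python
# Reproduce the slide's estimate from the 50 observed times-to-first-failure,
# then compare observed counts with the hypothesized geometric model (p = 0.2).
t = [3, 2, 8, 9, 10, 4, 4, 2, 7, 7, 1, 14, 2, 1, 8, 3, 4, 5, 4, 2, 10,
     2, 1, 7, 8, 4, 3, 3, 21, 1, 3, 9, 1, 4, 5, 1, 4, 1, 4, 3, 5, 3, 1,
     9, 1, 6, 3, 5, 12, 11]
n = len(t)
p_hat = 1 / (sum(t) / n)              # = 50/251
print(round(p_hat, 3))                # 0.199

p = 0.2
# expected vs observed number of sections first failing in year k
expected = [n * p * (1 - p) ** (k - 1) for k in range(1, 6)]
observed = [sum(1 for v in t if v == k) for k in range(1, 6)]
print(observed)
print([round(e, 1) for e in expected])
```

The two count vectors agree reasonably well, which is what the frequency-histogram comparison on the next slide shows graphically.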
Frequency Histogram
Parameter Estimation
Classical Estimators
Sample Mean:        μ̂_X = x̄ = (1/n) Σ_{i=1}^{n} x_i        (estimates the true mean μ_X)

Sample Variance:    σ̂_X² = s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²

Sample Correlation: ρ̂_X(jΔx) = [1 / (σ̂_X² (n − j − 1))] Σ_{i=1}^{n−j} (x_i − x̄)(x_{i+j} − x̄)
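A minimal sketch of these three classical estimators (plain Python, no libraries; the short data vector is purely illustrative):

```python
# Classical estimators of the mean, variance, and correlation at lag j,
# for equally spaced observations x (spacing dx implicit in the lag index j).
def sample_mean(x):
    return sum(x) / len(x)

def sample_variance(x):
    n, xb = len(x), sample_mean(x)
    return sum((xi - xb) ** 2 for xi in x) / (n - 1)

def sample_correlation(x, j):
    # rho_hat(j*dx), normalized by the sample variance as on the slide
    n, xb = len(x), sample_mean(x)
    c = sum((x[i] - xb) * (x[i + j] - xb) for i in range(n - j)) / (n - j - 1)
    return c / sample_variance(x)

x = [2.0, 3.0, 5.0, 4.0, 6.0]    # illustrative observations
print(sample_mean(x))            # 4.0
print(sample_variance(x))        # 2.5
print(sample_correlation(x, 1))
```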
Estimation in the Presence of Correlation
Friction angles measured along a 10 km line where soil properties are largely
spatially independent.
• in this case, both the estimated mean and variance obtained over
0 – 0.75 km are quite representative of the entire 10 km.
Classical Estimate of the Mean:  μ̂_X = x̄ = (1/n) Σ_{i=1}^{n} x_i
Introduction of Correlation Between Samples
Classical Estimate of the Variance:  s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²
Case 1: Data are Gathered over the Design Site
• we will know the soil properties at the data site locations and will
not be attempting to extrapolate beyond the site borders,
• estimates for μ_X, σ_X, and θ_X are "local" and can be considered to be
reasonably accurate,
• best estimates of the value and variability of the random field between
observation points can be obtained using Best Linear Unbiased
Estimation (BLUE) or Kriging,
• probability estimates should be obtained using a random field
conditioned on the data (possibly via conditional simulation).
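As an illustrative sketch of the BLUE/Kriging idea in Case 1, the following implements simple kriging with a known mean and an assumed exponential correlation model ρ(τ) = exp(−2|τ|/θ). All numbers (observations, mean, variance, correlation length) are hypothetical, and this is only one of several kriging variants:

```python
import math

def simple_krige(xs, zs, x0, mean, var, theta):
    """BLUE of the field at x0 given observations zs at locations xs,
    assuming a known mean and covariance C(tau) = var * exp(-2|tau|/theta)."""
    n = len(xs)
    cov = lambda a, b: var * math.exp(-2.0 * abs(a - b) / theta)
    # augmented system [C | c0]; solve C w = c0 by Gauss-Jordan (fine for small n)
    A = [[cov(xs[i], xs[j]) for j in range(n)] + [cov(xs[i], x0)]
         for i in range(n)]
    for k in range(n):
        piv = A[k][k]
        for j in range(k, n + 1):
            A[k][j] /= piv
        for i in range(n):
            if i != k:
                f = A[i][k]
                for j in range(k, n + 1):
                    A[i][j] -= f * A[k][j]
    w = [A[i][n] for i in range(n)]
    # BLUE: known mean plus weighted deviations of the observations from it
    return mean + sum(w[i] * (zs[i] - mean) for i in range(n))

# hypothetical friction-angle observations at 0 m and 10 m, estimated at 5 m
est = simple_krige(xs=[0.0, 10.0], zs=[22.0, 30.0], x0=5.0,
                   mean=25.0, var=4.0, theta=8.0)
print(round(est, 2))
```

The estimator interpolates exactly at the observation points and pulls toward the known mean between them, with the amount of pull controlled by the correlation length θ.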
Case 2: Data are Gathered at a Similar Site
Characterization of Trends?
Estimating the Mean
Classical sample mean:  μ̂_X = (1/n) Σ_{i=1}^{n} X_i

    E[μ̂_X] = E[(1/n) Σ_{i=1}^{n} X_i] = (1/n) Σ_{i=1}^{n} E[X_i] = μ_X    (unbiased)

    Var[μ̂_X] = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} Cov[X_i, X_j]
             = (σ_X²/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} ρ_ij = γ(T) σ_X²

    E[σ̂_X²] = σ_X² [1 − γ(T)]
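The effect of the variance function γ(T) can be illustrated numerically. Here γ is evaluated as (1/n²) Σ_i Σ_j ρ_ij for n equally spaced samples under an assumed exponential correlation ρ(τ) = exp(−2|τ|/θ); the unit spacing and the θ values are illustrative:

```python
import math

def gamma_discrete(n, theta, dx=1.0):
    """Discrete analogue of the variance function: (1/n^2) * sum_ij rho_ij
    for n equally spaced samples with rho(tau) = exp(-2|tau|/theta)."""
    rho = lambda tau: math.exp(-2.0 * abs(tau) / theta)
    return sum(rho((i - j) * dx) for i in range(n) for j in range(n)) / n ** 2

# nearly independent samples: gamma ~ 1/n, so Var[mean] ~ sigma^2 / n
print(gamma_discrete(10, theta=0.01))   # ~ 0.1
# strongly correlated samples: gamma -> 1, so averaging hardly helps
print(gamma_discrete(10, theta=1e4))    # ~ 1.0
```

This is the quantitative content of the earlier remark that estimator error in extrapolation grows with correlation: as θ grows, γ(T) → 1 and the n observations behave like a single observation.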
Estimating the Covariance Structure
Bias:

    E[Ĉ(τ_j)] ≈ σ_X² [(n − j + 1)/n] [ρ(τ_j) − γ(D)]

    E[ρ̂(τ_j)] ≈ [(n − j + 1)/n] · [ρ(τ_j) − γ(D)] / [1 − γ(D)]

Note that in a strongly correlated field, ρ̂(τ_j) will become negative,
often at about the field midpoint.
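The note above can be illustrated by applying the classical correlation estimator to a single strongly correlated "realization". A linear ramp is used here as the extreme, perfectly trended case (an assumption for illustration, not a simulated random field):

```python
# Apply the classical correlation estimator to a strongly correlated
# realization (a linear ramp) and watch rho_hat go negative near mid-lag.
def rho_hat(x, j):
    n = len(x)
    xb = sum(x) / n
    s2 = sum((v - xb) ** 2 for v in x) / (n - 1)
    c = sum((x[i] - xb) * (x[i + j] - xb) for i in range(n - j)) / (n - j - 1)
    return c / s2

x = [float(i) for i in range(20)]   # perfectly trended "field"
print(rho_hat(x, 1))                # strongly positive at short lags
print(rho_hat(x, 10))               # negative at about the field midpoint
```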
Estimating the Covariance Structure
The estimator of the semivariogram V(τ) is

    V̂(τ_j) = [1 / (2(n − j))] Σ_{i=1}^{n−j} (X_{i+j} − X_i)²,    j = 0, 1, …, n − 1

This estimator does not depend on μ̂_X, which is a significant advantage.
In particular, it means it is unbiased:

    E[V̂(τ_j)] = (1/2) E[(X_{i+j} − X_i)²] = V(τ_j)
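A direct sketch of this estimator (the short data vector is illustrative):

```python
# Sample semivariogram: V_hat(tau_j) = (1/(2(n-j))) * sum (X_{i+j} - X_i)^2.
# Only differences of observations appear, so the sample mean is never needed.
def semivariogram(x, j):
    n = len(x)
    return sum((x[i + j] - x[i]) ** 2 for i in range(n - j)) / (2 * (n - j))

x = [2.0, 4.0, 3.0, 5.0, 6.0]   # illustrative observations
print([round(semivariogram(x, j), 3) for j in range(1, 4)])  # [1.25, 1.833, 3.25]
```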
The Sample Semivariogram