Declustering and Debiasing: January 2007
Abstract
Strategic project decisions are based on the distributions of global variables, for example, total mineable
resource or recoverable oil volume. The distributions of these global variables are very sensitive to rock type
proportions and the histograms of continuous variables. Representivity of the input one-point statistics is
important in all spatial models.
The process of assembling representative one-point statistics is complicated by sample clustering and
spatial bias in the data locations. An explanation is provided of the sources of nonrepresentative sampling
and the need for declustering and debiasing. This work addresses some key implementation details of
declustering. Standard declustering is not always able to correct for spatial bias in sampling. Two methods
for correcting bias in the one-point statistics, “trend modeling for debiasing” and “debiasing by
qualitative data”, are reviewed and demonstrated with a polymetallic data set.
Introduction
Great computational effort is exerted to build realistic geostatistical simulation models. The goodness of
these models is judged by their ability to reproduce input one-point statistics and continuity structures.
Geostatistical techniques slavishly reproduce input lithofacies proportions and the histogram of
petrophysical properties (Gringarten et al., pg. 1, 2000). There is very little intrinsic declustering or
debiasing within geostatistical simulation algorithms; the input distribution is honored essentially as given.
Gaussian simulation in particular ensures that the input distribution is approximately
reproduced. Clustered sampling that misrepresents the proportions within specific bins, or spatially biased
sampling that does not characterize the full range of the distribution, must therefore be dealt
with explicitly in the input distributions.
The importance of representative input distributions must be evaluated with respect to the sensitivity of
the response variable to clustering or bias in the input statistics. Simulated models are only an intermediate
result. Management decisions focus on the results after the application of a transfer function.
Declustering methods are commonly applied in an automated fashion. This work examines the
properties of a variety of declustering algorithms and methods of improving their application, including
working with anisotropy, within facies and with multiple variables. It is essential to
understand the applicability and limitations of declustering algorithms since blindly applying declustering
may be worse than working with the naïve statistics.
Declustering is ineffective in cases with spatial bias. Debiasing tools such as “trend modeling for
debiasing” and “debiasing by qualitative data” should be brought into common practice for the purpose of
improving the inference of the one-point statistics.
Nonrepresentative Sampling
It is natural that spatial data are collected in a nonrepresentative manner. Preferential sampling in
interesting areas is intentional and facilitated by geologic intuition, analogue data and previous samples.
This practice of collecting clustered or spatially biased samples is encouraged by technical and
economic constraints, such as future production goals, accessibility and the costs of laboratory work.
The cost of uncertainty is not the same everywhere in the area of interest. For example, the cost of
uncertainty within a high grade region is much higher than the cost of uncertainty within clearly waste
material. Good delineation and a high level of certainty within the high grade materials allows for accurate
reserves estimation and optimum mine planning.
Future production goals may also encourage clustered or spatially biased sampling. It is common to
start mining in high grade regions. In this case it is desirable to delineate and characterize the high grade
regions.
Practical issues of accessibility can also cause spatially biased sampling. For example, the drilling
depth or available drilling stations may constrain sample selection. In the presence of a vertical trend,
limited depth of drilling may result in a subset of the underlying distribution not being sampled. There are
many possible scenarios under which accessibility would be a concern (see Figure 1).
Nonrepresentivity may also be introduced at the assaying stage. For example, when removing sections
of core for the purpose of permeability measurement, it is unlikely that a section of shale would be
subjected to expensive testing. Likewise, barren rock may not be sent for assays.
Conventional Statistics
Conventional statistics do not provide reasonable solutions to the problem of constructing representative
spatial distributions. A simple random sample from the population of interest would be representative, but
inappropriate in most cases. A sample is said to be representative when each unit of the population has the
same probability of being sampled. In conventional statistics this is accomplished by avoiding preferential
sampling or opportunity sampling. As explained above, there are many reasons that geologic samples are
collected in a biased manner.
Regular or random stratified sampling may be able to provide a good approximation of a representative
distribution. Sampling on a regular grid is rarely practical for the same accessibility and economic reasons
stated above. Regular sampling grids may be applied in preliminary resource investigation. These
sampling campaigns are often augmented by nonsystematic infill drilling. One approach would be to omit
the clustered infill samples for the purpose of building distributions. While this would more closely agree
with conventional statistical theory, throwing away expensive information is not very satisfying (Isaaks and
Srivastava, pg. 237-238, 1989).
Declustering
Declustering is well documented and widely applied (Deutsch, pg. 53-62, 2001; Isaaks and Srivastava, pg.
237-248, 1989; Goovaerts, pg. 77-82, 1997). There are various types of declustering methods, such as
cell, polygonal and kriging weight declustering. These methods rely on weighting the sample data
to account for spatial representivity. Figure 2 shows the effect of weighting the histogram. Note
that weighting does not change the values; only the influence of each sample is changed.
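As a sketch of this weighting, the declustered mean and variance can be computed directly from the weights; the values and weights below are invented for illustration, not taken from any data set in this work:

```python
import numpy as np

# Illustrative sample values and declustering weights; the weights are
# standardized to sum to the number of samples, n, as in the text.
values = np.array([0.4, 0.7, 1.2, 1.3, 1.4, 3.2])
weights = np.array([1.6, 1.5, 0.6, 0.5, 0.4, 1.4])

# Normalize the weights to probabilities, then compute weighted
# statistics; the sample values themselves are never changed.
p = weights / weights.sum()
naive_mean = float(values.mean())
declustered_mean = float(np.sum(p * values))
declustered_var = float(np.sum(p * (values - declustered_mean) ** 2))

print(naive_mean, declustered_mean)
```

Here the clustered high value (3.2) is down-weighted relative to the naïve histogram, so the declustered mean falls below the naïve mean.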
There are two important assumptions in all declustering techniques: (1) the entire range of the true
distribution has been sampled, that is, the data are not spatially biased, and (2) the nature of the clustering is
understood. Declustering may not perform well when these assumptions are violated. The first assumption is
required since the weighting only adjusts the influence of each sample on the distribution and does not
change the actual sample value. Figure 3 shows an example where declustering could not work; there are
no low samples to give more weight to.
The second assumption is that the nature of the clustering is understood. If the data have no spatial
correlation, there would be no reason to apply declustering. Each sample, regardless of location, would be
a random drawing from the underlying distribution. Without an understanding of the spatial nature of the
data, declustering may be incorrectly applied.
There are a variety of methods to calculate the declustering weights. Polygonal, cell and kriging weight
methods will be discussed.
Polygonal Declustering
Polygonal declustering is commonly applied in other scientific disciplines, such as hydrology, for the
purpose of correcting for clustering in spatial data. The method is flexible and straightforward. The
polygonal declustering technique is based on the construction of polygons of influence about each of the
sample data. These polygons of influence are described by all midpoints between each neighbouring
sample data. A simple example data set with polygons of influence is shown in Figure 4.
For each polygon of influence, the area is calculated and the weight assigned to each sample is the
proportion of the polygon area to the entire area of interest (the same as the sum of all polygon areas),
standardized to sum to the number of samples n:

w'_j = ( area_j / Σ_{i=1}^{n} area_i ) · n
The area associated with peripheral samples is very sensitive to the boundary location. If the boundary is
located far from the data, then the peripheral samples will receive a large amount of weight, since the areas
of their polygons of influence increase.
In general, this great sensitivity to the boundary is perceived as a weakness of polygonal declustering. A
common technique is to simply apply the boundary of the area of interest, which may be defined by geology,
leases, etc. This approach may be reasonable depending on the problem setting. A second technique is to
assign a maximum distance of influence to the samples.
The application of polygonal declustering to a 3D data set requires the calculation of complicated solid
boundaries and volumes, which is computationally expensive. A close approximation can be rapidly
calculated by discretizing the area of interest into a fine grid and assigning each node to the nearest datum.
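The gridded approximation just described can be sketched in a few lines; the sample locations, unit-square area of interest and grid resolution below are illustrative assumptions:

```python
import numpy as np

def polygonal_weights(x, y, extent, nx=200, ny=200):
    """Approximate polygon-of-influence declustering weights by
    discretizing the area of interest into a fine grid and assigning
    each node to its nearest sample."""
    xmin, xmax, ymin, ymax = extent
    gx, gy = np.meshgrid(np.linspace(xmin, xmax, nx),
                         np.linspace(ymin, ymax, ny))
    # Squared distance from every grid node to every sample.
    d2 = (gx[..., None] - x) ** 2 + (gy[..., None] - y) ** 2
    nearest = d2.argmin(axis=-1)          # closest sample per node
    counts = np.bincount(nearest.ravel(), minlength=len(x))
    # Standardize the weights to sum to the number of samples n.
    return counts / counts.sum() * len(x)

# One isolated sample and a tight cluster of three.
x = np.array([0.10, 0.80, 0.81, 0.80])
y = np.array([0.10, 0.80, 0.80, 0.81])
w = polygonal_weights(x, y, (0.0, 1.0, 0.0, 1.0))
print(w)  # the isolated sample receives the largest weight
```

Each node count is proportional to the (approximate) polygon area, so the isolated sample accumulates many more nodes than any member of the cluster.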
Directional weights may be applied to the polygonal declustering algorithm in order to account for
anisotropy. This is demonstrated in Figure 5 for anisotropy ratios of 1, 2, 5, and 10:1.
Cell Declustering
The cell declustering technique is the most common method applied in Geostatistics. It is insensitive to the
boundary locations and for this reason is seen as more robust than polygonal declustering. A cell
declustering algorithm, DECLUS is standard in GSLIB.
For a given cell grid, the weight of each sample is calculated as follows:

w'_j = ( 1 / n_i ) · ( 1 / L ) · n

where n_i is the number of samples in the cell in which sample j is located, L is the number of cells
with data, and n is the total number of samples.
The weights assigned by cell declustering are sensitive to the cell size. If the cell size is very small,
then every sample occupies its own cell and the result is equal weighting, that is, the naïve sample distribution.
If the cell size is very large, then all samples reside in the same cell and the result is once again equal
weighting.
A specific cell size will result in a unique set of weights. The question is, ‘which cell size identifies
the best weights?’ If there is a coarse grid with additional infill sampling, then the coarsest sample spacing
is the best cell size (see Figure 7).
If this configuration is not present, then a common procedure is to assign the cell size that maximizes
or minimizes the declustered mean. This is demonstrated in Figure 8. This procedure is applied when the
samples are clearly clustered in low or high values (apply the cell size that renders the maximum or
minimum declustered mean, respectively). The results are only accurate when there is a clear minimum or
maximum. One should not blindly assign the minimizing or maximizing cell size. It is shown in the next
section that such an assignment may, in expected value, produce poorer results than the naïve distribution.
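A minimal sketch of cell declustering, including the origin offsets discussed below and a scan over cell sizes; the coarse-grid-plus-infill data set is invented for illustration:

```python
import numpy as np

def cell_weights(x, y, cell_size, n_offsets=5):
    """Cell declustering weights (sketch): w_j is proportional to
    1/(n_i * L), where n_i is the number of samples in the cell
    containing sample j and L is the number of cells with data;
    averaged over grid-origin offsets, standardized to sum to n."""
    n = len(x)
    w = np.zeros(n)
    for k in range(n_offsets):
        off = cell_size * k / n_offsets
        ix = np.floor((x + off) / cell_size).astype(np.int64)
        iy = np.floor((y + off) / cell_size).astype(np.int64)
        # Group samples by occupied cell.
        _, inv, counts = np.unique(ix * 1_000_003 + iy,
                                   return_inverse=True,
                                   return_counts=True)
        w += 1.0 / (counts[inv] * counts.size)
    w /= n_offsets
    return w / w.sum() * n

# Coarse 10-unit grid with an infill cluster around high values.
x = np.array([0., 10., 20., 0., 10., 20., 11., 12., 11., 12.])
y = np.array([0., 0., 0., 10., 10., 10., 1., 1., 2., 2.])
v = np.array([1., 2., 1., 1., 2., 1., 3., 3., 3., 3.])
for size in (0.5, 10.0, 20.0):
    w = cell_weights(x, y, size)
    print(size, round(float(np.sum(w * v)) / len(v), 3))
```

With a tiny cell size every sample sits in its own cell and the declustered mean equals the naïve mean; at the coarse sample spacing the high-valued infill cluster is down-weighted and the declustered mean drops.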
A Warning on the Minimizing and Maximizing Cell Size
A large number of runs were carried out to determine whether in expected terms the minimum or maximum
result in an acceptable declustered mean (a declustered mean which closely approximates the true
underlying mean). 101 realizations were generated by sequential Gaussian simulation of a 50 x 50 space.
The continuity range was selected as small (range of 10) with respect to the simulation size so that the
realization mean and variance were consistently near 0.0 and 1.0 respectively (minimal ergodic
fluctuations). Regular samples were taken at 10 unit spacing, and then infill samples were taken around a
specific percentile data value. By changing this percentile, the level of clustering was changed.
The expected true mean, the sample mean, and the declustered means for cell sizes of 9 and 10 and for the
minimizing or maximizing cell size were compared for each percentile (see Figure 9). For cases with a percentile near the median (low magnitude of
clustering), application of the minimizing or maximizing cell size resulted in poorer results than the naïve
sample mean. The application of the coarsest regular sample spacing resulted in the best declustered mean in
expected terms. This confirms that there is a problem, in expected terms, with systematically applying the
cell size with min/max declustered mean, and that knowledge of the appropriate cell size provides better
results in expected terms.
Cell declustering weights are also sensitive to the grid origin. To remove this sensitivity, the procedure is
repeated with a specified number of origin offsets and the results are averaged. This averaging:
1. smoothes out steps that would occur in the declustered mean and cell size relationship.
2. makes large cell sizes (> ½ the data set size) unreliable.
These effects are demonstrated by an exercise. Cell declustering was applied to the simple data set with
a variety of origin offsets. Figure 10 shows the smoothing effect of the application of origin offsets. The
greater the number of offsets the smoother the relationship between declustered mean and cell size. Also, it
can be seen that offsets cause the results to be unreliable when large cell sizes are applied. The data set
dimensions are 10x5 units. It would be expected that at a cell size of 10 the declustered mean would
be equal to the naïve mean. With offsets this does not occur. The cause is demonstrated in Figure 11.
• Calculate the declustered mean for cell sizes from 5% to 50% of the size of the area of interest and
apply the minimum number of offsets required to get a reasonably continuous relationship (around
5 is usually sufficient).
The common practice is to apply cell declustering to the primary variable and apply these weights to all
other collocated variables. This is intuitive since clustering should be related only to the data locations, not
the data values. If the cell size is chosen based on the declustered mean and cell size relationship, this
practice may be questioned since the declustered mean is dependent on the data values. Will the
maximizing or minimizing cell size be the same for each variable?
To explore this issue an exhaustive data set was generated with three collocated and correlated standard
normal variables. Table 1 below lists the properties of each variable.
Samples (50) were drawn from the exhaustive data sets, and the sample variograms and correlations were
checked (see Figure 12). The sampling scheme was based on coarse grid (20 unit spacing) with some infill
clusters and random samples. The relationships of declustered mean vs. cell size are shown in Figure 13.
For all three variables there is a clear maximum or minimum at the same cell size, despite different
variograms and correlations. This exercise supports the current practice of applying
the same cell declustering weights to all collocated variables.
Kriging Weight Declustering
In general, kriging weight declustering is much more computationally intensive than the polygonal and cell
declustering techniques. Also, there may be artifacts in the weights due to the string effect, which is
illustrated in Figure 14.
The conditioning data at the extents of the string receive greater weight. The peripheral data receive
greater weight even at locations much closer to other data. This is caused by the implicit assumption
of kriging that the area of interest is embedded in an infinite domain.
Kriging Weight Declustering and Negative Weights
It is appropriate to include negative weights when calculating the sum of weights. It is possible that this
would result in a negative declustering weight; one way this could occur is if a conditioning datum lies
outside the area of interest and is screened. In general, the conditioning data are within the area of interest
and negative declustering weights do not occur.
Declustering within Rock Types
To illustrate the application of polygonal declustering with a rock type model, a synthetic example was
constructed. A random 2D data set was constructed with a uniform distribution in x and y and a standard
Gaussian property. The rock type model was constructed by smoothing an unconditional sequential
indicator simulation with 4 categories (see Figure 21). Then conventional polygonal declustering was
applied to the data set irrespective of the rock type (see left of Figure 22 for a map of the declustering weights
and the polygons of influence). Also, polygonal declustering was performed constrained by the rock types
(see right of Figure 22 for the declustered weights and the polygons of influence). Considering the rock
types significantly improves the declustering weights.
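A sketch of the constrained variant on a discretized grid: each node is assigned to the nearest sample of the node's own rock type, so polygons of influence honour rock type boundaries. The two-rock-type model and sample locations below are assumptions for illustration:

```python
import numpy as np

def polygonal_weights_by_rocktype(x, y, rt, gx, gy, grid_rt):
    """Gridded polygonal declustering constrained by rock type:
    each grid node counts toward the nearest sample of the node's
    own rock type."""
    counts = np.zeros(len(x))
    for nx_, ny_, nrt in zip(gx.ravel(), gy.ravel(), grid_rt.ravel()):
        same = np.flatnonzero(rt == nrt)     # candidate samples
        if same.size == 0:
            continue                         # rock type with no samples
        d2 = (x[same] - nx_) ** 2 + (y[same] - ny_) ** 2
        counts[same[d2.argmin()]] += 1
    return counts / counts.sum() * len(x)

# Toy model: rock type 0 on the left half, rock type 1 on the right.
gx, gy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid_rt = (gx > 0.5).astype(int)
x = np.array([0.2, 0.3, 0.7, 0.9])
y = np.array([0.5, 0.5, 0.5, 0.5])
rt = np.array([0, 0, 1, 1])
w = polygonal_weights_by_rocktype(x, y, rt, gx, gy, grid_rt)
print(w.round(2))
```

In this toy setting the sample at x = 0.7 collects more of the right-hand rock type than the sample at x = 0.9, and no weight leaks across the rock type boundary.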
When the entire range of the distribution has not been sampled, it is necessary to apply debiasing
techniques.
Debiasing
There are two methodologies that may be applied to correct biased samples. The first method, “trend
modeling for debiasing”, separates the variable into a trend and residual. The second approach, “debiasing
by qualitative data”, corrects the distribution with a representative secondary data distribution and
calibration relationship to the primary variable of interest.
In the presence of a clear and persistent trend, trend modeling may be applied to ensure that the correct
distribution is reproduced. Trend modeling is well established (Goovaerts, pg. 126, 1997; Deutsch, pg.
182, 2001). The steps are as follows: (1) remove an appropriate trend model, (2) stochastically model the
residuals, and (3) add the trend back a posteriori. The resulting models reproduce the trend. An advantage of this
technique is that the simulation step may be simplified since cosimulation is not necessarily required.
While this technique will often debias the distribution, there is no direct control over the resulting
distribution. The result should be checked. Care should also be taken to build an appropriate trend model.
This requires that the mean of the residuals is close to 0.0 and the correlation between the trend and
residual is close to 0.0.
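The three steps can be sketched as follows; the 1D depth setting, the linear trend model and the plain Monte Carlo draw standing in for stochastic simulation of the residuals are all simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 1D setting: a linear vertical trend plus residuals.
z = np.linspace(0.0, 100.0, 200)          # depth
trend = 1.0 + 0.02 * z                    # assumed trend model
grade = trend + rng.normal(0.0, 0.3, z.size)

# (1) remove the trend model.
res = grade - trend

# (2) stochastically model the residuals; a plain Monte Carlo draw
# stands in for sequential simulation here.
sim_res = rng.normal(res.mean(), res.std(), z.size)

# (3) add the trend back a posteriori.
realization = trend + sim_res

# Checks recommended in the text: residual mean near 0.0 and
# correlation between trend and residual near 0.0.
print(round(float(res.mean()), 3))
print(round(float(np.corrcoef(trend, res)[0, 1]), 3))
```

The two printed diagnostics are exactly the checks named above; a residual mean far from zero or a strong trend–residual correlation would indicate an inappropriate trend model.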
Another technique is to use soft data that is representative of the entire area of interest, and an
understanding of the relationship between the primary and soft secondary data to correct the primary
distribution (Deutsch et al., 1999). Then, this corrected distribution is applied as a reference distribution to
the subsequent simulation of the primary variable. The underlying relationship between the primary and
secondary data may be assessed from geologic modeling or the physics of the setting. This relationship
may not be directly observed due to a lack of data; nevertheless, a relationship between the secondary and
primary data, f̂_{x,y}(x, y), must be inferred for debiasing (see Figure 23).
The construction of the bivariate calibration is the difficult component of debiasing. There are a variety
of techniques for building this distribution. For example, the program SDDECLUS by Deutsch relies on
the user submitting data pairs which describe the bivariate relationship. This approach allows for the
greatest flexibility, since there is no constraint on the form of the bivariate calibration. For each paired
primary datum, a weight is assigned based on the secondary distribution.
Another method is to calculate a series of conditional distributions of the primary given the secondary
data, f_{primary|secondary}, over the range of observed secondary values. These can be extrapolated over the
range of all secondary data by a trend, as illustrated in Figure 24. The primary distribution is then calculated
by scaling the binned bivariate calibration by the secondary distribution; for the above bivariate
calibration this is illustrated in Figure 25. This is a discrete approximation of the solution for the primary
distribution, as expressed in Equation 1.
f_y(y) = ∫ f_{y|x}(y|x) · f_x(x) dx        (1)
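Equation (1) can be approximated numerically by summing conditional distributions scaled by the secondary density; the Gaussian secondary marginal and the linear conditional mean below are illustrative assumptions, not the calibration used later in this work:

```python
import numpy as np

# Discrete approximation of f_y(y) = ∫ f_{y|x}(y|x) f_x(x) dx.
xs = np.linspace(0.0, 10.0, 101)     # secondary (e.g. thickness)
ys = np.linspace(-2.0, 12.0, 281)    # primary (e.g. grade)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]

# Representative secondary marginal f_x: Gaussian, renormalized.
fx = np.exp(-0.5 * ((xs - 5.0) / 2.0) ** 2)
fx /= fx.sum() * dx

def f_y_given_x(y, x, slope=0.8, sigma=0.5):
    """Conditional f_{y|x}: Gaussian around a linear bivariate trend."""
    return np.exp(-0.5 * ((y - slope * x) / sigma) ** 2) / (
        sigma * np.sqrt(2.0 * np.pi))

# Scale each conditional by the secondary density and sum over x.
fy = np.zeros_like(ys)
for x, px in zip(xs, fx):
    fy += f_y_given_x(ys, x) * px * dx

print(round(float(fy.sum() * dy), 3))         # total mass, close to 1
print(round(float((ys * fy).sum() * dy), 3))  # debiased mean, near 4.0
```

Because the conditional mean is linear in x, the debiased primary mean lands near slope times the secondary mean, which is the sense in which the representative secondary distribution corrects the primary one.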
The trend method indirectly corrects the global distribution. This leads to models with precise trend
reproduction and indirect control over the distribution. The qualitative method focuses on directly
correcting the global distribution and retaining consistency by applying the secondary data as collocated
data in the simulation. The result is direct control over the reproduced distribution and indirect control over
trend reproduction.
The two techniques also differ in the information which is integrated into the numerical model. In the
first method the simulation is augmented by information concerning the spatial behavior of the primary
variable. The second method relies on information concerning a more representative secondary data and
the relationship between the primary and secondary data. The information available may limit the ability to
apply either method. The method chosen also affects the resulting model of uncertainty. Each will
potentially decrease the overall model uncertainty. This is expected since each option involves adding
conditioning to the numerical model.
Debiasing Example
A realistic data set based on a 2D polymetallic “red” vein is presented. This sample set was gathered by
drilling. Some data were removed for checking and illustrating the method. For the sake of comparison, an
approximation of the true gold distribution was constructed by applying polygonal declustering to the
complete “red” data set (see Figure 26). In the complete data set the entire area of interest is well
delineated (see left of Figure 27 for location map of the complete data set) and polygonal declustering
results in a reasonable distribution. Since the true underlying distribution is not available, this distribution
will be assumed to be a good approximation of the underlying distribution.
There is a significant positive correlation between the gold grade and the thickness of the vein
(correlation coefficient = 0.6), so it was decided to apply gold as the primary variable and a smooth kriged
thickness map as the representative secondary data, see Figure 27.
Polygonal declustering was applied to the reduced data set. The resulting declustered distribution and
the Voronoi polygons are shown in Figure 28. There is a great difference between the underlying mean
gold grade (0.69 g/t) and the declustered mean gold grade (1.25 g/t), and the distributions do not have the same
shape. There is additional information that could aid in the inference of the correct distribution, such as
thickness, which has a significant correlation with the primary variable, gold. There is also a clear trend in the
gold grades. This analogue information improves the distribution.
Debiasing by qualitative data with a bivariate trend was applied to correct the gold distribution. The
results are shown in Figure 29. The bivariate trend was set as a second-order function with a linear segment
for gold grades greater than 5.0 g/t. This density calibration table was weighted by the thickness
(secondary) distribution, and the resulting corrected gold (primary) distribution is shown on the left of
Figure 29. Any estimated negative grades were set to zero. Sequential Gaussian simulation (SGSIM) was
performed with the debiased distribution as a reference distribution and the thickness map as a collocated
secondary data. A correlation coefficient of 0.72, calculated from the density calibration table, was
applied to the secondary data. An omni-directional variogram with a nugget effect of 0.4 and an isotropic
spherical structure with a range of 140 units was used to represent the gold spatial continuity. No effort
was made to calculate and model a more specific variogram model since variogram inference is not a focus
of this work. Three realizations are shown in Figure 30.
The strong correlation between the primary data and the collocated secondary data has resulted in a
clear trend in the realizations. Some example simulated distributions are shown in Figure 31 and the
distribution of the realization means for 100 realizations is shown in Figure 32. The average of the
realization means is 0.84, which is higher than the average of the reference distribution (see Figure 26).
Nevertheless, the resulting distribution is closer to the reference true distribution in shape and statistics than
the declustering results.
Trend modeling for debiasing was also applied. A trend model was constructed from a moving window
average of all the gold samples in the complete data set. This model was scaled such that the mean of the
residuals was near 0. The gold samples, gold trend model and distribution of the residuals are shown in
Figure 33. Sequential Gaussian simulation was performed with the residuals and the trend model was
added a posteriori. Any negative estimates were set to 0. Three example realizations are shown in Figure
34. The trend is consistently reproduced in each realization. Some realization distributions are shown in
Figure 35 and the distribution of the realization means for 100 realizations is shown in Figure 36. The
mean of the realization means is 0.90. The resulting distributions are closer to the approximate true
distribution in shape and mean than the declustering results.
Conclusions
Nonrepresentative sampling is unavoidable in most geologic settings. Declustering techniques are widely
used and are generally effective for correcting for nonrepresentative data. It is important to understand the
appropriate methods and settings for the application of declustering. In settings where the underlying
distribution has not been adequately sampled, declustering may not be adequate and debiasing is required.
Debiasing relies on analogue information such as a trend in the primary variable or a well sampled
secondary variable and a calibration. Two debiasing methods, trend modeling for debiasing and debiasing
by qualitative data, have been demonstrated with a mining data set.
Acknowledgements
We are grateful to the industry sponsors of the Centre for Computational Geostatistics at the University of
Alberta and to NSERC and ICORE for supporting this research. Also, we would like to acknowledge
Julian Ortiz who contributed to work on declustering.
References
Deutsch, C.V. and A.G. Journel. 1998. GSLIB: Geostatistical Software Library and User’s Guide,
2nd Ed. New York: Oxford University Press.
Deutsch, C.V., P. Frykman, and Y.L. Xie, 1999. Declustering with Seismic or “soft” Geologic
Data, Centre for Computational Geostatistics Report One 1998/1999, University of Alberta.
Deutsch, C.V., 2001. Geostatistical Reservoir Modeling. New York: Oxford University Press (in
press).
Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation. New York: Oxford University
Press.
Gringarten, E., P. Frykman, and C.V. Deutsch. December 3-6, 2000. Determination of Reliable
Histogram and Variogram Parameters for Geostatistical Modeling, AAPG Hedberg Symposium,
"Applied Reservoir Characterization Using Geostatistics", The Woodlands, Texas.
Isaaks, E. H. and R. M. Srivastava. 1989. An Introduction to Applied Geostatistics, New York:
Oxford University Press.
Figure 1 - Some examples of accessibility constraints illustrated on a cross section.
Figure 2 - The influence of weighting on a distribution. On the right the naïve distribution (dotted
line) is superimposed on the declustered distribution with the weights indicated.
Figure 3 – An example underlying distribution (bold line) and the sample distribution (histogram).
The entire range of the true distribution has not been sampled.
Figure 4 - The polygon of influence.
Figure 5 – The effect of anisotropy on the polygons of influence. The horizontal distance was
weighted by factors 1, 2, 5, and 10.
Figure 6 – A simple illustration of the cell declustering technique.
Figure 8 – The relationship between declustered mean and cell size for the simple data set.
Figure 9 – A chart indicating the expected true, sample and declustered means (for cell size 9, 10 and
minimizing or maximizing), vs. the percentile of clustering. The consistent application of the coarsest
sample spacing as cell size results in a declustered mean closer to the true mean in expected terms.
Figure 10 – The declustered mean vs. cell size relationship with a variety of origin offsets. Note that
the offsets smooth out steps in the declustered mean and cell size relationship. Also, with origin
offsets the declustered mean does not approach the naïve sample mean as the cell size becomes large.
Figure 11 – The effect of offsets with large cell sizes. If the origin is shifted the data may not all
reside within a single cell, instead the data is divided into 4 cells. Thus, the declustered mean is not
the naïve sample mean.
Figure 12 – For each variable: the exhaustive data set with the samples indicated, the variogram
model and experimental variogram of the samples, and the scatter plot of the samples against the
variable 1 samples.
Figure 13 – The declustered mean vs. cell size for the three variables.
Figure 14 – A string of data, with weights from the kriging weight declustering method indicated,
superimposed on the maps of the weights assigned to each data at all locations. The string effect
causes the outer data to receive greater weight (see maps for Data 1 and 6).
Figure 16 – An example cross section with wells and sample locations.
Figure 18 – An illustration of the difference in weight assignments due to a boundary for cell,
polygonal and kriging weight declustering. Cell declustering would equally weight the data, while
polygonal declustering would assign larger weight to the data near the unsampled area. Kriging weight
declustering would result in weights similar to polygonal, but subject to screening and string effects.
Figure 20 – An example model broken up into separate rock types.
Figure 22 – Polygonal declustering weights and polygons, with and without facies.
Figure 23 – The calibration bivariate distribution, f̂_{x,y}(x, y), and the known marginal distribution of the
soft data variable, f_x(x).
Figure 24 - Calibration by bivariate trend. The points indicate the known primary and secondary
data. The arrow indicates a linear bivariate trend. The lines represent probability contours.
Figure 25 – An illustration of the numerical integration of the conditional distribution along the
previously indicated linear bivariate trend.
Figure 27 – The original red.dat database (on the left) and the modified data base with kriged
thickness map.
Figure 28 – The resulting distribution from polygonal declustering of the modified red.dat data set
and a location map of the data set, with the associated Voronoi polygons.
Figure 29 – The density calibration table with collocated thickness data, the thickness distribution,
the original gold distribution and the corrected gold distribution.
Figure 30 – Three realizations of gold grade using the debiased distribution and collocated thickness
(secondary) data.
Figure 31 – The histogram of one realization and the cumulative distribution of 20 realizations.
Figure 33 – The reduced “red” data set with a gold trend, and the distribution of the residuals at the
data locations.
Figure 34 – Three realizations of gold grade resulting from addition of a stochastic residual and a
deterministic trend model.
Figure 35 – The histogram of one realization and the cumulative distribution of 20 realizations.