Statistics Diffusions
Statistics Diffusions
Statistics Diffusions
Michael Sørensen
Department of Mathematical Sciences
University of Copenhagen
Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark
1
1 Introduction
In this chapter we consider parametric inference based on discrete time observations X0 , Xt1 ,
. . . , Xtn from a d-dimensional stochastic process. In most of the chapter the statistical model
for the data will be a diffusion model given by a stochastic differential equation. We shall,
however, also consider some examples of non-Markovian models, where we typically assume
that the data are partial observations of a multivariate stochastic differential equation. We
assume that the statistical model is indexed by a p-dimensional parameter θ.
The focus will be on estimating functions. An estimating function is a p-dimensional
function of the parameter θ and the data:
Usually we suppress the dependence on the observations in the notation and write Gn (θ).
We obtain an estimator by solving the equation
Gn (θ) = 0. (1.1)
Estimating functions provide a general framework for finding estimators and studying their
properties in many different kinds of statistical models. The estimating function approach
has turned out to be very useful for discretely sampled parametric diffusion-type models,
where the likelihood function is usually not explicitly known. Estimating functions are
typically constructed by combining relationships (dependent on the unknown parameter)
between an observation and one or more of the previous observations that are informative
about the parameters.
As an example, suppose the statistical model for the data X0 , X∆ , X2∆ , . . . , Xn∆ is the
one-dimensional stochastic differential equation
where θ > 0 and W is a Wiener process. The state-space is (−π/2, π/2). This model will
be considered in more detail in Subsection 3.6. For this process Kessler & Sørensen (1999)
proposed the estimating function
n
1
h i
sin(X(i−1)∆ ) sin(Xi∆ ) − e−(θ+ 2 )∆ sin(X(i−1)∆ ) ,
X
Gn (θ) =
i=1
which can be shown to be a martingale, when θ is the true parameter. For such martingale
estimating functions, asymptotic properties of the estimators as the number of observations
tends to infinity can be studied by means of martingale limit theory, see Subsection 3.2. An
explicit estimator θ̂n of the parameter θ is obtained by solving the estimating equation (1.1):
Pn !
−1 i=1 sin(X(i−1)∆ ) sin(Xi∆ ) 1
θ̂n = ∆ log Pn 2
− ,
i=1 sin(X(i−1)∆ ) 2
provided that
n
X
sin(X(i−1)∆ ) sin(Xi∆ ) > 0. (1.2)
i=1
2
If this condition is not satisfied, the estimating equation (1.1) has no solution, but fortunately
it can be shown that the probability that (1.2) holds tends to one as n tends to infinity. As
illustrated by this example, it is quite possible that the estimating equation (1.1) has no
solution. We shall give general conditions that ensure the existence of a unique solution
when enough data are available.
The idea of using estimating equations is an old one and goes back at least to Karl
Pearson’s introduction of the method of moments. The term estimating function may have
been coined by Kimball (1946).
3
The following condition ensures the existence of a consistent Gn –estimator. We denote
transposition of matrices by T , and ∂θT Gn (θ) denotes the p × p-matrix, where the ijth entry
is ∂θj Gn (θ)i .
Here and later Q(g(θ)) denotes the vector (Q(gj (θ)))j=1,...,p , where gj is the jth coordinate
of g, and Q (∂θT g(θ)) is the matrix {Q ∂θj gi (θ) }i,j=1,...,p .
To formulate the uniqueness result in the following theorem, we need the concept of
locally dominated integrability. A function f : D r × Θ 7→ IRq is called locally dominated
integrable with respect to Q if for each θ′ ∈ Θ there exists a neighbourhood Uθ′ of θ′ and a non-
negative Q-integrable function hθ′ : D r 7→ IR such that | f (x1 , . . . , xr ; θ) | ≤ hθ′ (x1 , . . . , xr )
for all (x1 , . . . , xr , θ) ∈ D r × Uθ′ .
Theorem 2.2 Assume Condition 2.1 and (2.3). Then a θ̄-consistent Gn –estimator θ̂n ex-
ists, and
√ L
−1
n(θ̂n − θ0 ) −→ Np 0, W −1V W T (2.6)
under P , where V = V (θ̄). If, moreover, the function g(x1 , . . . , xr ; θ) is locally dominated
integrable with respect to Q and
then the estimator θ̂n is the unique Gn –estimator on any bounded subset of Θ containing θ̄
with probability approaching one as n → ∞.
4
Lemma 2.3 Consider a function f : D r × K 7→ IRq , where K is a compact subset of Θ.
Suppose f is a continuous function of θ for all (x1 , . . . , xr ) ∈ D r , and that there exists a
Q-integrable function h : D r 7→ IR such that kf (x1 , . . . , xr ; θ)k ≤ h(x1 , . . . , xr ) for all θ ∈ K.
Then θ 7→ Q(f (θ)) is continuous, and
n
1X P
sup k f (Xi−r+1 , . . . , Xi ; θ) − Q(f (θ)) k → 0. (2.8)
θ∈K n i=r
Proof: That Q(f (θ)) is continuous follows from the dominated convergence theorem.
To prove (2.8), define for η > 0:
and let k(η) denote the function (x1 , . . . , xr ) 7→ k(η; x1 , . . . , xr ). Since k(η) ≤ 2h, it follows
from the dominated convergence theorem that Q(k(η)) → 0 as η → 0. Moreover, Q(f (θ)) is
uniformly continuous on the compact set K. Hence for any given ǫ > 0 we can find η > 0
such that Q(k(η)) ≤ ǫ and kθ − θ′ k < η implies that kQ(f (θ)) − Q(f (θ′ ))k ≤ ǫ. Define the
balls Bη (θ) = {θ′ : kθ − θ′ k < η}. Since K is compact, there exists a finite covering
m
[
K⊆ Bη (θj ),
j=1
≤ kFn (θ) − Fn (θℓ )k + kFn (θℓ ) − Q(f (θℓ ))k + kQ(f (θℓ )) − Q(f (θ))k
n
1X
≤ k(η; Xν−r+1, . . . , Xν ) + kFn (θℓ ) − Q(f (θℓ ))k + ǫ
n ν=r
n
1X
≤ k(η; Xν−r+1 , . . . , Xν ) − Q(k(η))
n ν=r
+Q(k(η)) + kFn (θℓ ) − Q(f (θℓ ))k + ǫ
≤ Zn + 2ǫ,
where
n
1X
Zn = k(η; Xν−r+1, . . . , Xν ) − Q(k(η))
n ν=r
+ max kFn (θℓ ) − Q(f (θℓ ))k.
1≤ℓ≤m
5
By (2.2), P (Zn > ǫ) → 0 as n → ∞, so
!
P sup kFn (θ) − Q(f (θ))k > 3ǫ → 0
θ∈K
for all ǫ > 0, where B̄ǫ (θ) is the closed ball with radius ǫ centered at θ. By Theorem 8.3
it follows that (8.4) holds with M = K for every ǫ > 0. Let θ̂n′ be a Gn –estimator, and
define a Gn –estimator by θ̂n′′ = θ̂n′ 1{θ̂n′ ∈ K} + θ̂n 1{θ̂n′ ∈
/ K}, where 1 denotes an indicator
function, and θ̂n is the consistent Gn –estimator we know exists. By (8.4) the estimator θ̂n′′
is consistent, so by Theorem 8.2, P (θ̂n 6= θ̂n′′ ) → 0 as n → ∞. Hence θ̂n is eventually the
unique Gn –estimator on K.
2
6
We shall, in this section, be concerned with statistical inference based on estimating
functions of the form n X
Gn (θ) = g(∆i , Xti−1 , Xti ; θ). (3.2)
i=1
for all ∆ > 0, x ∈ D and all θ ∈ Θ. Thus, by the Markov property, the stochastic
process {Gn (θ)}n∈IN is a martingale with respect to {Fn }n∈IN under Pθ . Here and later
Fn = σ(Xti : i ≤ n). An estimating function with this property is called a martingale
estimating function.
where y 7→ p(s, x, y; θ) is the transition density and t0 = 0. Under weak regularity conditions
the maximum likelihood estimator is efficient, i.e. it has the smallest asymptotic variance
among all estimators. The transition density is only rarely explicitly known, but several
numerical approaches and accurate approximations make likelihood inference feasible for
diffusion models. We shall return to the problem of calculating the likelihood function in
Subsection 4.
The vector of partial derivatives of the log-likelihood function with respect to the coor-
dinates of θ,
n
X
Un (θ) = ∂θ log Ln (θ) = ∂θ log p(∆i , Xti−1 , Xti ; θ), (3.5)
i=1
where ∆i = ti − ti−1 , is called the score function (or score vector). Here it is obviously as-
sumed that the transition density is a differentiable function of θ. The maximum likelihood
estimator usually solves the estimating equation Un (θ) = 0. The score function is a mar-
tingale with respect to {Fn }n∈IN under Pθ , which is easily seen provided that the following
interchange of differentiation and integration is allowed:
Eθ ∂θ log p(∆i , Xti−1 , Xti ; θ) Xt1 , . . . , Xti−1
Z
∂θ p(∆i , Xti−1 , y; θ)
= p(∆i , Xti−1 , y, θ)dy
D p(∆i , Xti−1 , y; θ)
Z
= ∂θ p(∆i , Xti−1 , y; θ)dy = 0.
D
Since the score function is a martingale estimating function, the asymptotic results presented
in the next subsection applies to the maximum likelihood estimator. Asymptotic results
7
for the maximum likelihood estimator in the fixed ∆ (low frequency) asymptotic scenario
considered in this section were established by Dacunha-Castelle & Florens-Zmirou (1986).
Asymptotic results when the observations are made at random time points were obtained
by Aı̈t-Sahalia & Mykland (2003).
A simple approximation to the likelihood function is obtained by approximating the tran-
sition density by a Gaussian density with the correct first and second conditional moments.
For a one-dimensional diffusion we get
(y − F (∆, x; θ))2
" #
1
p(∆, x, y; θ) ≈ q(∆, x, y; θ) = q exp −
2πφ(∆, x; θ) 2φ(∆, x; θ)
where Z r
F (∆, x; θ) = Eθ (X∆ |X0 = x) = yp(∆, x, y; θ)dy. (3.6)
ℓ
and
φ(∆, x; θ) = (3.7)
Z r
Varθ (X∆ |X0 = x) = [y − F (∆, x; θ)]2 p(∆, x, y; θ)dy.
ℓ
and by differentiation with respect to the parameter vector, we obtain the quasi-score func-
tion
n
(
X ∂θ F (∆i , Xti−1 ; θ)
∂θ log QLn (θ) = [Xti − F (∆i , Xti−1 ; θ)] (3.8)
i=1 φ(∆i , Xti−1 ; θ)
)
∂θ φ(∆i , Xti−1 ; θ) h 2
i
+ (X ti
− F (∆i , X ti−1
; θ)) − φ(∆i , X ti−1
; θ) .
2φ(∆i , Xti−1 ; θ)2
It is clear from (3.6) and (3.7) that {∂θ log QLn (θ)}n∈IN is a martingale with respect to
{Fn }n∈IN under Pθ . This quasi-score function is a particular case of the quadratic martin-
gale estimating functions considered by Bibby & Sørensen (1995) and Bibby & Sørensen
(1996). Maximum quasi-likelihood estimation for diffusions was considered by Bollerslev &
Wooldridge (1992).
3.2 Asymptotics
In this subsection we give asymptotic results for estimators obtained from martingale esti-
mating functions as the number of observations goes to infinity. To simplify the exposition
the observation time points are assumed to be equidistant, i.e., ti = i∆, i = 0, 1, . . . , n. Since
∆ is fixed, we will in most cases suppress ∆ in the notation and write for example p(x, y; θ)
and g(x, y; θ).
It is assumed that the diffusion is ergodic, that its invariant probability measure has
density function µθ for all θ ∈ Θ, and that X0 ∼ µθ under Pθ . Thus the diffusion is
stationary.
8
When the observed process, X, is a one-dimensional diffusion, the following simple con-
ditions ensure ergodicity, and an explicit expression exists for the density of the invariant
probability measure. The scale measure of X has Lebesgue density
!
x b(y; θ)
Z
s(x; θ) = exp −2 dy , x ∈ (ℓ, r), (3.9)
x# σ 2 (y; θ)
and Z r
[s(x; θ)σ 2 (x; θ)]−1 dx = A(θ) < ∞.
ℓ
Under Condition 3.1 the process X is ergodic with an invariant probability measure with
Lebesgue density
µθ (x) = [A(θ)s(x; θ)σ 2 (x; θ)]−1 , x ∈ (ℓ, r). (3.10)
For details see e.g. Skorokhod (1989). For general one-dimensional diffusions, the measure
with Lebesgue density proportional to s(x; θ)σ 2 (x; θ)]−1 is called the speed measure.
Let Qθ denote the probability measure on D 2 given by
This is the distribution of two consecutive observations (X∆(i−1) , X∆i ). Under the assumption
of ergodicity the law of large numbers (2.2) is satisfied for any function f : D 2 7→ IR such
that Q(|f |) < ∞, see e.g. Skorokhod (1989).
We impose the following condition on the function g in the estimating function (3.2)
Qθ g(θ)T g(θ) = (3.12)
Z
g(y, x; θ)T g(y, x; θ)µθ (x)p(x, y; θ)dydx < ∞,
D2
Since the estimating function Gn (θ) is a martingale under Pθ , the asymptotic normality
in (2.3) follows without further conditions from the central limit theorem for martingales,
see Hall & Heyde (1980). This result goes back to Billingsley (1961). In the martingale case
the asymptotic covariance matrix V (θ) in (2.3) is given by
V (θ) = Qθ0 g(θ)g(θ)T . (3.14)
9
Theorem 3.2 Assume Condition 2.1 is satisfied with r = 2, θ̄ = θ0 , and Q = Qθ0 , where θ0
is the true parameter value, and that (2.3) holds for θ = θ0 with V (θ) given by (3.14). Then
a θ0 -consistent Gn –estimator θ̂n exists, and
√ L
−1
n(θ̂n − θ0 ) −→ Np 0, W −1V W T (3.15)
under Pθ0 , where W is given by (2.5) with θ̄ = θ0 and V = V (θ0 ). If, moreover, the function
g(x, y; θ) is locally dominated integrable with respect to Qθ0 and
Qθ0 (g(θ)) 6= 0 for all θ 6= θ0 , (3.16)
then the estimator θ̂n is the unique Gn –estimator on any bounded subset of Θ containing θ0
with probability approaching one as n → ∞.
In practice we do not know the value of θ0 , so it is necessary to check that the conditions
of Theorem 3.2 hold for a neighbourhood of any value of θ0 ∈ int Θ.
The asymptotic covariance matrix of the estimator θ̂n can be estimated consistently by
means of the following theorem.
Theorem 3.3 Under Condition 2.1 (2) – (4) (with r = 2, θ̄ = θ0 , and Q = Qθ0 ),
n
1X Pθ0
Wn = ∂θT g(X(i−1)∆ , Xi∆ ; θ̂n ) −→ W, (3.17)
n i=1
where θ̂n is a θ0 -consistent estimator. The probability that Wn is invertible approaches one
as n → ∞. If, moreover, the function (x, y) 7→ kg(x, y; θ)k is dominated for all θ ∈ N by
a function which is square integrable with respect to Qθ0 , then
n
1X Pθ0
Vn = g(X(i−1)∆ , Xi∆ ; θ̂n )g(X(i−1)∆ , Xi∆ ; θ̂n )T −→ V. (3.18)
n i=1
Proof: Let C be a compact subset of N such that θ0 ∈ int C. By Lemma 2.3,
1 Pn
n i=1 ∂θ T g(X(i−1)∆ , Xi∆ ; θ) converges to Qθ0 (∂θ T g(θ)) in probability uniformly for θ ∈ C.
This implies (3.17) because θ̂n converges in probability to θ0 . The result about invertibility
follows because W is invertible. Also the uniform convergence in probability for θ ∈ C of
1 Pn T T
n i=1 g(X(i−1)∆ , Xi∆ ; θ) g(X(i−1)∆ , Xi∆ ; θ) to Qθ0 (g(θ)g(θ) ) follows from Lemma 2.3.
2
In the case of likelihood inference, the function Qθ0 (g(θ)) appearing in the identifiability
condition (3.16) is related to the Kullback-Leibler divergence between the models. Specifi-
cally, if the following interchange of differentiation and integration is allowed,
Qθ0 (∂θ log p(x, y, θ)) = ∂θ Qθ0 (log p(x, y, θ)) = −∂θ K̄(θ, θ0 ),
where K̄(θ, θ0 ) is the average Kullback-Leibler divergence between the transition distribu-
tions under Pθ0 and Pθ given by
Z
K̄(θ, θ0 ) = K(θ, θ0 ; x) µθ0 (dx),
D
with Z
K(θ, θ0 ; x) = log[p(x, y; θ0 )/p(x, y; θ)]p(x, y; θ0) dy.
D
Thus the identifiability condition can be written in the form ∂θ K̄(θ, θ0 ) 6= 0 for all θ 6= θ0 .
The quantity K̄(θ, θ0 ) is sometimes referred to as the Kullback-Leibler divergence between
the two Markov chain models for the observed process {Xi∆ } under Pθ0 and Pθ .
10
3.3 Godambe-Heyde optimality
In this section we present a general way of approximating the score function by means
of martingales of a similar form. Suppose we have a collection of real valued functions
hj (x, y, ; θ), j = 1, . . . , N satisfying
Z
hj (x, y; θ)p(x, y; θ)dy = 0 (3.19)
D
for all x ∈ D and θ ∈ Θ. Each of the functions hj could be used separately to define an
estimating function of the form (2.1), but a better approximation to the score function,
and hence a more efficient estimator, is obtained by combining them in an optimal way.
Therefore we consider estimating functions of the form
n
X
Gn (θ) = a(X(i−1)∆ , θ)h(X(i−1)∆ , Xi∆ ; θ), (3.20)
i=1
where h = (h1 , . . . , hN )T , and the p × N weight matrix a(x, θ) is a function of x such that
(3.20) is Pθ -integrable. It follows from (3.19) that Gn (θ) is a martingale estimating function,
i.e., it is a martingale under Pθ for all θ ∈ Θ.
The matrix a determines how much weight is given to each of the hj s in the estimation
procedure. This weight matrix can be chosen in an optimal way using the theory of optimal
estimating functions reviewed in Section 9. The optimal weight matrix a∗ gives the estimating
function of the form (3.20) that provides the best possible approximation to the score function
(3.5) in a mean square sense. Moreover, the optimal g ∗(x, y; θ) = a∗ (x; θ)h(x, y; θ) is obtained
from ∂θ log p(x, y; θ) by projection in a certain space of square integrable functions, for details
see Section 9.
The choice of the functions hj , on the other hand, is an art rather than a science. The
ability to tailor these functions to a given model or to particular parameters of interest is a
considerable strength of the estimating functions methodology. It is, however, also a source
of weakness, since it is not always clear how best to choose the hj s. In the following and in
the Sections 3.6 and 3.7, we shall present ways of choosing these functions that usually work
well in practice.
Example 3.4 The martingale estimating function (3.8) is of the type (3.20) with N = 2
and
where F and φ are given by (3.6) and (3.7). The weight matrix is
!
∂θ F (∆, x; θ) ∂θ φ(∆, x; θ)
, , (3.21)
φ(∆, x; θ) 2φ2 (∆, x; θ)∆
In the econometrics literature, a popular way of using functions like hj (x, y, ; θ), j =
1, . . . , N, to estimate the parameter θ is the generalized method of moments (GMM) of
11
Hansen (1982). In practice, the method is often implemented as follows, see e.g. Campbell,
Lo & MacKinlay (1997). Consider
n
1X
Fn (θ) = h(X(i−1)∆ , Xi∆ ; θ).
n i=1
where θ̃n is a θ0 -consistent estimator (for instance obtained by minimizing Fn (θ)T Fn (θ)).
The GMM-estimator is obtained by minimizing the function
where by (2.2)
n
1X Pθ0
Dn (θ) = ∂θ h(X(i−1)∆ , Xi∆ ; θ)T −→ Qθ0 ∂θ h(θ)T .
n i=1
and we see that GMM-estimators are covered by the theory for martingale estimating func-
tions presented in this section.
We now return to the problem of finding the optimal estimating function G∗n (θ), i.e. the
estimating functions of the form (3.20) with the optimal weight matrix. We assume that the
functions hj satisfy the following condition.
Condition 3.5
(1) The functions hj , j = 1, . . . N, are linearly independent.
(2) The functions y 7→ hj (x, y; θ), j = 1, . . . N, are square integrable with respect to p(x, y; θ)
for all x ∈ D and θ ∈ Θ.
(3) h(x, y; θ) is differentiable with respect to θ.
(4) The functions y 7→ ∂θi hj (x, y; θ) are integrable with respect to p(x, y; θ) for all x ∈ D and
θ ∈ Θ.
The class of estimating functions considered here is a particular case of the class treated
in detail in Example 9.3. By (9.16), the optimal choice of the weight matrix a is given by
12
where Z
Bh (x; θ) = ∂θ h(x, y; θ)T p(x, y; θ)dy (3.23)
D
and Z
Vh (x; θ) = h(x, y; θ)h(x, y; θ)T p(x, y; θ)dy. (3.24)
D
The matrix Vh (x; θ) is invertible because the functions hj , j = 1, . . . N are linearly inde-
pendent.Compared to (9.16), we have omitted a minus here. This can be done because
an optimal estimating function multiplied by an invertible p × p-matrix is also an optimal
estimating function and yields the same estimator.
The asymptotic variance of an optimal estimator, i.e. a G∗n –estimator, is simpler than
the general expression in (3.15) because in this case the matrices W and V given by (2.5)
and (3.14) are equal and given by (3.25). This is a general property of optimal estimating
functions as discussed in Section 9. The result can easily be verified under the assumption
that a∗ (x; θ) is a differentiable function of θ: by (3.19)
Z
[∂θi a∗ (x; θ)] h(x, y; θ)p(x, y; θ)dy = 0,
D
so that
Z
W = ∂θT [a∗ (x; θ0 )h(x, y; θ0 )]Qθ0 (dx, dy)
D2
= µθ0 (a∗ (θ0 )Bh (θ0 )T ) = µθ0 Bh (θ0 )Vh (θ0 )−1 Bh (θ0 )T ,
and by direct calculation
V = µθ0 Bh (θ0 )Vh (θ0 )−1 Bh (θ0 )T . (3.25)
Thus we have as a corollary to Theorem 2.2 that is g ∗(x, y, θ) = a∗ (x; θ)h(x, y; θ) satisfies
the conditions of Theorem 2.2, then a sequence θ̂n of G∗n –estimators has the asymptotic
distribution √ D
n(θ̂n − θ0 ) −→ Np 0, V −1 . (3.26)
Example 3.6 Consider the martingale estimating function of form (3.20) with N = 2 and
with h1 and h2 as in Example 3.4, where the diffusion is one-dimensional. The optimal
weight matrix has columns given by
∂θ φ(x; θ)η(x; θ) − ∂θ F (x; θ)ψ(x; θ)
a∗1 (x; θ) =
φ(x; θ)ψ(x; θ) − η(x; θ)2
∂θ F (x; θ)η(x; θ) − ∂θ φ(x; θ)φ(x; θ)
a∗2 (x; θ) = ,
φ(x; θ)ψ(x; θ) − η(x; θ)2
where
η(x; θ) = Eθ ([X∆ − F (x; θ)]3 |X0 = x)
and
ψ(x; θ) = Eθ ([X∆ − F (x; θ)]4 |X0 = x) − φ(x; θ)2 .
For the square-root diffusion (the CIR-model)
q
dXt = −β(Xt − α)dt + τ Xt dWt , (3.27)
13
where β, τ > 0, the optimal weights can be found explicitly. For this model
In Subsections 3.6 and 3.7 we shall present martingale estimating functions for which the
matrices Bh (x; θ) and Vh (x; θ) can be found explicitly, but for most models these matrices
must be found by simulation, a problem considered in Subsection 3.5. In situations where a∗
must be determined by a relatively time consuming numerical method, it might be preferable
to use the estimating function
n
G•n (θ) = a∗ (X(i−1)∆ ; θ̃n )h(X(i−1)∆ , Xi∆ ; θ),
X
(3.29)
i=1
where θ̃n is a weakly θ0 -consistent estimator, for instance obtained by some simple choice of
the weight matrix a. In this way a∗ needs to be calculated only once per observation point.
Under weak regularity conditions, the G•n -estimator has the same efficiency as the optimal
G∗n -estimator; see e.g. Jacod & Sørensen (2008).
Most martingale estimating functions proposed in the literature are of the form (3.20)
with
θ
hj (x, y; θ) = fj (y; θ) − π∆ (fj (θ))(x), (3.30)
or more specifically,
n h i
θ
X
Gn (θ) = a(X(i−1)∆ , θ) f (Xi∆ ; θ) − π∆ (f (θ))(X(i−1)∆ ) . (3.31)
i=1
14
applied to each coordinate of f . The polynomial estimating functions given by fj (y) = y j ,
j = 1, . . . , N, are an example. For martingale estimating functions of the special form (3.31),
the expression for the optimal weight matrix simplifies a bit because
θ θ
Bh (x; θ)ij = π∆ (∂θi fj (θ))(x) − ∂θi π∆ (fj (θ))(x), (3.33)
i = 1, . . . p, j = 1, . . . , N, and
θ θ θ
Vh (x; θ)ij = π∆ (fi (θ)fj (θ))(x) − π∆ (fi (θ))(x)π∆ (fj (θ))(x), (3.34)
A useful approximations to the optimal weight matrix can be obtained by applying the
formula
k
si i
πsθ (f )(x) = Aθ f (x) + O(sk+1),
X
(3.36)
i=0 i!
where Aθ denotes the generator of the diffusion
d d
Ckℓ (x; θ)∂x2k xℓ f (x),
X X
1
Aθ f (x) = bk (x; θ)∂xk f (x) + 2 (3.37)
k=1 k,ℓ=1
where C = σσ T . The formula (3.36) holds for 2(k + 1) times continuously differentiable
functions under weak conditions which ensure that the remainder term has the correct or-
der, see Kessler (1997) and Subsection 3.4. It is often enough to use the approximation
θ
π∆ (fj )(x) ≈ fj (x) + ∆Aθ fj (x). When f does not depend on θ this implies that for d = 1
h i
Bh (x; θ) ≈ ∆ ∂θ b(x; θ)f ′ (x) + 12 ∂θ σ 2 (x; θ)f ′′ (x) (3.38)
Example 3.7 If we simplify the optimal weight matrix found in Example 3.6 by the expan-
sion (3.36) and the Gaussian approximation (3.28), we obtain the approximately optimal
quadratic martingale estimating function
n
(
∂θ b(X(i−1)∆ ; θ)
G◦n (θ)
X
= [Xi∆ − F (X(i−1)∆ ; θ)] (3.40)
i=1 σ 2 (X(i−1)∆ ; θ)
∂θ σ 2 (X(i−1)∆ ; θ) h
)
i
2
+ (X i∆ − F (X (i−1)∆ ; θ)) − φ(X (i−1)∆ ; θ) .
2σ 4 (X(i−1)∆ ; θ)∆
15
As in Example 3.6 the diffusion is assumed to be one-dimensional.
Consider a diffusion with linear drift, b(x; θ) = −β(x − α). Diffusion models with linear
drift and a given marginal distribution were studied in Bibby, Skovgaard & Sørensen (2005).
If σ 2 (x; θ)µθ (x)dx < ∞, then the Ito-integral in
R
Z t Z t
Xt = X0 − β(Xs − α)ds + σ(Xs ; θ)dWs
0 0
is a proper martingale with mean zero, so the function f (t) = Eθ (Xt | X0 = x) satisfies that
Z t
f (t) = x − β f (s)ds + βαt
0
or
f ′ (t) = −βf (t) + βα, f (0) = x.
Hence
f (t) = xe−βt + α(1 − e−βt )
or
F (x; α, β) = xe−β∆ + α(1 − e−β∆ )
If only estimates of drift parameters are needed, we can use the linear martingale estimating
function of the form (3.20) with N = 1 and h1 (x, y; θ) = y − F (∆, x; θ). If σ(x; θ) = τ κ(x)
for τ > 0 and κ a positive function, then the approximately optimal estimating function of
this form is
n
1
h i
−β∆ −β∆
X
Xi∆ − X(i−1)∆ e − α(1 − e )
i=1 κ2 (X(i−1)∆ )
G◦n (α, β) =
,
n
X X(i−1)∆ h i
Xi∆ − X(i−1)∆ e−β∆ − α(1 − e −β∆
)
2
i=1 κ (X(i−1)∆ )
where multiplicative constants have been omitted. To solve the estimating equation G◦n (α, β) =
0 we introduce the weights
n
wiκ −2
κ(X(j−1)∆ )−2 ,
X
= κ(X(i−1)∆ ) /
j=1
Pn Pn
and let X̄ κ = i=1 wiκ Xi∆ and X̄−1 κ
= i=1 wiκ X(i−1)∆ be conditional precision weighted
sample averages of Xi∆ and X(i−1)∆ , respectively. The equation G◦n (α, β) = 0 has a unique
explicit solution provided that the weighted sample autocorrelation
Pn
i=1 wiκ (Xi∆ − X̄ κ )(X(i−1)∆ − X̄−1
κ
)
rnκ = Pn κ κ 2
i=1 wi (X(i−1)∆ − X̄−1 )
is positive. By the law of large numbers for ergodic processes, the probability that rnκ > 0
tends to one as n tends to infinity. Specifically, we obtain the explicit estimators
X̄ κ − rnκ X̄−1
κ
α̂n =
1 − rnκ
1
β̂n = − log (rnκ ) .
∆
16
A slightly simpler and asymptotically equivalent estimator may be obtained by substituting
X̄ κ for X̄−1
κ
everywhere, in which case α is estimated by the precision weighted sample√
average X̄ κ . For the square-root process (CIR-model) given by (3.27), where κ(x) = x,
a simulation study and an investigation of the asymptotic variance of these estimators in
Bibby & Sørensen (1995) show that they are not much less efficient than the estimators from
the optimal estimating function; see also the simulation study in Overbeck & Rydén (1997).
To obtain an explicit approximately optimal quadratic estimating function, we need
an expression for the conditional variance φ(x; θ). As we saw in Example 3.6, φ(x; θ) is
explicitly known for the square-root process (CIR-model) given by (3.27). For this model the
approximately optimal quadratic martingale estimating function is
n
X 1 h
−β∆ −β∆
i
Xi∆ − X(i−1)∆ e − α(1 − e )
i=1 X(i−1)∆
n
Xh i
Xi∆ − X(i−1)∆ e−β∆ − α(1 − e−β∆ )
i=1
.
n
1
X 2
Xi∆ − X(i−1)∆ e−β∆ − α(1 − e−β∆ )
i=1 X(i−1)∆
τ 2 n
#
o
−2β∆ −β∆
− α/2 − X(i−1)∆ e − (α − X(i−1)∆ )e + α/2
β
where
ψ(x; α, β) = ( 12 α − x)e−2β∆ − (α − x)e−β∆ + 21 α /β.
It is obviously necessary for this solution to the estimating equation to exist that the ex-
pression for e−β̂n ∆ is strictly positive, an event that happens with a probability tending to
one as n → ∞. Again this follows from the law of large numbers for ergodic processes. 2
When the optimal weight matrix is approximated by means of (3.36), there is a certain
loss of efficiency, which as in the previous example is often quite small; see Bibby & Sørensen
(1995) and Section 6 on high frequency asymptotics below. Therefore the relatively simple
estimating function (3.40) is often a good choice in practice.
17
θ
It is tempting to go on to approximate π∆ (fj (θ))(x) in (3.31) by (3.36) in order to obtain
an explicit estimating function, but as will be demonstrated in Subsection 3.6, this can be
θ
a dangerous procedure. In general the conditional expectation in π∆ should therefore be
approximated by simulations. Fortunately, Kessler & Paredes (2002) have established that,
provided the simulation is done with sufficient accuracy, this does not cause any bias, only a
minor loss of efficiency that can be made arbitrarily small; see Subsection 3.5. Moreover, as
θ
we shall also see in Subsection 3.6, π∆ (fj (θ))(x) can be found explicitly for a quite flexible
class of diffusions.
where θ = (α, β) ∈ Θ ⊆ IR2 . This is the simplest model type for which the essential features
of the theory appear. Note that the drift and the diffusion coefficient depend on different
parameters. It is assumed that the diffusion is ergodic, that its invariant probability measure
has density function µθ for all θ ∈ Θ, and that X0 ∼ µθ under Pθ . Thus the diffusion is
stationary.
Throughout this subsection, we shall assume that the observation times are equidis-
tant, i.e. ti = i∆, i = 0, 1, . . . , n, where ∆ is fixed, and that the martingale estimat-
ing function (3.2) satisfies the conditions of Theorem 3.2, so that we know that (even-
tually) a Gn -estimator θ̂n exists, which is asymptotically normal with covariance matrix
−1
M(g) = W −1 V W T , where W is given by (2.5) with θ̄ = θ0 and V = V (θ0 ) with V (θ) given
by (3.14).
The main idea of small ∆-optimality is to expand the asymptotic covariance matrix in
powers of ∆
1
M(g) = v−1 (g) + v0 (g) + o(1). (3.42)
∆
Small ∆-optimal estimating functions minimize the leading term in (3.42). Jacobsen (2001)
obtained (3.42) by Ito-Taylor expansions, see Kloeden & Platen (1999), of the random ma-
trices that appear in the expressions for W and V under regularity conditions that will be
given below. A similar expansion was used in Aı̈t-Sahalia & Mykland (2003) and Aı̈t-Sahalia
& Mykland (2004).
18
To formulate the conditions, we define the differential operator Aθ , θ ∈ Θ. Its domain, Γ
is the set of continuous real-valued functions (s, x, y) 7→ ϕ(s, x, y) of s ≥ 0 and (x, y) ∈ D 2
that are continuous differentiable in s and twice continuously differentiable in y. The operator
Aθ is given by
Aθ ϕ(s, x, y) = ∂s ϕ(s, x, y) + Aθ ϕ(s, x, y), (3.43)
where Aθ is the generator (3.37) , which for every s and x is applied to the function y 7→
ϕ(s, x, y). The operator Aθ acting on functions in Γ that do not depend on x is the generator
of the space-time process (t, Xt )t<geq0 . We also need the probability measure Q∆θ given by
(3.11). Note that in this section the dependence on ∆ is explicit in the notation.
Condition 3.8 The function ϕ belongs to Γ and satisfies that
Z
ϕ(s, x, y)Qsθ0 (dx, dy) < ∞
ZD2
for all s ≥ 0.
As usual θ0 = (α0 , β0 ) denotes the true parameter value. We will say that a function with
values in IRk or IRk×ℓ satisfies Condition 3.8 is each component of the functions satisfies this
condition.
Suppose ϕ satisfies Condition 3.8. Then by Ito’s formula
ϕ(t, X0 , Xt ) = (3.44)
Z t Z t
ϕ(0, X0, X0 ) + Aθ0 ϕ(s, X0 , Xs )ds + ∂y ϕ(s, X0 , Xs )dWs
0 0
under Pθ0 . A significant consequence of Condition 3.8 is that the Ito-integral in (3.44) is a
true Pθ0 -martingale, and thus has expectation zero under Pθ0 . If the function Aθ0 ϕ satisfies
Condition 3.8, a similar result holds for this functions, which we can insert in the Lebesgue
integral in (3.44). By doing so and then taking the conditional expectation given X0 = x on
both sides of (3.44), we obtain
Note that A0θ is the identity operator. The previously used expansion (3.36) is a particular
case of (3.46). In the case where ϕ does not depend on x (or y) the integrals in Condition
3.8 are with respect to the invariant measure µθ0 . If, moreover, ϕ does not depend on time
s, the conditions do not depend on s.
19
Theorem 3.9 Suppose that the function g(∆, x, y; θ0) in (3.2) is such that g, ∂θT g, gg T and
Aθ0 g satisfy Condition 3.8. Assume, moreover, that we have the expansion
g(∆, x, y; θ0) = g(∆, x, y; θ0) + ∆∂∆ g(0, x, y; θ0) + oθ0 ,x,y (∆).
If the matrix Z r
S= Bθ0 (x)µθ0 (x)dx (3.47)
ℓ
is invertible, where
Bθ (x) = (3.48)
1 2
∂α b(x; α)∂y g1 (0, x, x; θ) 2 ∂β v(x; β)∂y g1 (0, x, x; θ)
,
1 2
∂α b(x; α)∂y g2 (0, x, x; θ) 2 ∂β v(x; β)∂y g2 (0, x, x; θ)
for all x ∈ (ℓ, r). In this case, the second term in (3.42) satisfies that
Z r 2 −1
2 4
v0 (g)22 ≥ 2 ∂β σ (x; β0 ) /σ (x; β0 )µθ0 (x)dx
ℓ
with equality if
∂y2 g2 (0, x, x; θ0 ) = ∂β σ 2 (x; β0 )/σ 2 (x; β0 )2 , (3.52)
for all x ∈ (ℓ, r).
Thus the conditions for small ∆-optimality are (3.50), (3.51) and (3.52). For a proof of
Theorem 3.9, see Jacobsen (2001). The condition (3.51) ensures that all entries of v−1 (g)
involving the diffusion coefficient parameter, β, are zero. Since v−1 (g) is the ∆−1 -order term
in the expansion (3.42) of the asymptotic covariance matrix, this dramatically decreases the
asymptotic variance of the estimator of β when ∆ is small. We refer to the condition (3.51)
as Jacobsen’s condition.
The reader is reminded of the trivial fact that for any non-singular 2 × 2 matrix, Mn ,
the estimating functions Mn Gn (θ) and Gn (θ) give exactly the same estimator. We call them
versions of the same estimating function. The matrix Mn may depend on ∆n . Therefore a
given version of an estimating function needs not satisfy (3.50) – (3.52). The point is that
a version must exist which satisfies these conditions.
20
Example 3.10 Consider a quadratic martingale estimating function of the form
where F and φ are given by (3.6) and (3.7). By (3.36), F (∆, x; θ) = x+O(∆) and φ(∆, x; θ) =
O(∆), so
a1 (x, 0; θ)(y − x)
!
g(0, y, x; θ) = . (3.54)
a2 (x, 0; θ)(y − x)2
Since ∂y g2 (0, y, x; θ) = 2a2 (x, ∆; θ)(y − x), the Jacobsen condition (3.51) is satisfied for all
quadratic martingale estimating functions. Using again (3.36), it is not difficult to see that
the two other conditions (3.50) and (3.52) are satisfied in three particular cases: the optimal
estimating function given in Example 3.6 and the approximations (3.8) and (3.40). 2
The following theorem gives conditions ensuring, for given functions f1 , . . . , fN , that a
small ∆-optimal estimating function of the form (3.20) and (3.30) exists. This not always
the case. We assume that the functions f1 (·; θ), . . . , fN (·; θ) are of full affine rank for all θ,
i.e., for any θ ∈ Θ, the identity
N
aθj fj (x; θ) + aθ0 = 0,
X
x ∈ (ℓ, r),
j=1
Theorem 3.11 Suppose that N ≥ 2, that the functions fj are twice continuously differen-
tiable and satisfies that the matrix
D(x) = (3.55)
∂x f2 (x; θ) ∂x2 f2 (x; θ)
is invertible for µθ -almost all x. Moreover, assume that the coefficients b and σ are con-
tinuously differentiable with respect to the parameter. Then a specification of the weight
matrix a(x; θ), independent of ∆, exists such that the estimating function (3.20) satisfies the
conditions (3.51), (3.50) and (3.52). When N = 2, these conditions are satisfy for
For a proof of Theorem 3.11, see Jacobsen (2002). In Section 6, we shall see that the
Godambe-Heyde optimal choice (3.22) of the weight-matrix in (3.20) gives an estimating
function which has a version that satisfies the conditions for small ∆-optimality, (3.50) –
(3.52).
We have focused on one-dimensional diffusions to simplify the exposition. The situ-
ation becomes more complicated for multi-dimensional diffusions, as we shall now briefly
21
describe. Details can be found in Jacobsen (2002). For a d-dimensional diffusion, b(x; α)
is d-dimensional and v(x; β) = σ(x; β)σ(x; β)T is a d × d-matrix. The Jacobsen condition
is unchanged (except that ∂y g2 (0, x, x; θ0 ) is now a d-dimensional vector). The other two
conditions for small ∆-optimality are
and −1
vec ∂y2 g2 (0, x, x; θ0 ) = vec (∂β v(x; β0 )) v ⊗2 (x; β0 ) .
In the latter equation, vec(M) denotes for a d × d matrix M the d2 -dimensional row vector
consisting of the rows of M placed one after the other, and M ⊗2 is the d2 × d2 -matrix with
(i′ , j ′ ), (ij)th entry equal to Mi′ i Mj ′ j . Thus if M = ∂β v(x; β) and M • = (v ⊗2 (x; β))−1 , then
the (i, j)th coordinate of vec(M) M • is i′ j ′ Mi′ j ′ M(i• ′ j ′ ),(i,j) .
P
For a d-dimensional diffusion process, the conditions analogous to those in Theorem 3.11
ensuring the existence of a small ∆-optimal estimating function of the form (3.20) is that
N ≥ d(d + 3)/2, and that the N × (d + d2 )-matrix
∂xT f (x; θ) ∂x2T f (x; θ)
The variance of the error can be estimated in the traditional way, and by the cental limit
theorem, the error is approximately normal distributed. This simple approach can be im-
proved by applying variance reduction methods, for instance methods that take advantage
of the fact that πθ∆ f (x) can be approximated by (3.36). Methods for numerical simulation
of diffusion models can be found in Kloeden & Platen (1999).
The approach just described is sufficient when calculating the conditional expectation
appearing in (3.30), although it is important to use the same random numbers (seed) when
calculating the estimating functions for different values of the parameter θ, for instance
when using a search algorithm to find a solution to the estimating equation. More care is
needed if the optimal weight functions are calculated numerically. The problem is that the
22
optimal weight matrix typically contain derivatives with respect to θ of functions that must
be determined numerically, see e.g. Example 3.6. Pedersen (1994) proposed a procedure for
determining ∂θ πθ∆ f (x; θ) by simulations based on results in Friedman (1975). However, it
is often preferable to use an approximation to the optimal weight matrix obtained by using
(3.36), possibly supplemented by Gaussian approximations, as explained in Subsection 3.3.
This is not only much simpler, but also avoids potentially serious problems of numerical
instability, and by results in Section 6 the loss of efficiency is often very small. The approach
outlined here, where martingale estimating functions are approximated by simulation, is
closely related to the simulated method of moments, see Duffie & Singleton (1993) and
Clement (1997).
One might be worried that when approximating a martingale estimating function by
simulation of conditional moments, the resulting estimator might have considerably smaller
efficiency or even be inconsistent. The asymptotic properties of the estimators obtained
when the conditional moments are approximated by simulation were investigated by Kessler
& Paredes (2002), who found that if the simulations are done with sufficient care, there is
no need to worry. However, their results also show that care is needed: if the discretization
used in the simulation method is too crude, the estimator behaves badly. Kessler & Paredes
(2002) considered martingale estimating functions of the general form
n h
X i
Gn (θ) = f (Xi∆ , X(i−1)∆ ; θ) − F (X(i−1)∆ ; θ) , (3.57)
i=1
23
for all θ ∈ Θ, for all x in the state space of X and for δ sufficiently small. Here R(x; θ)
is of polynomial growth in x uniformly for θ in compact sets, i.e., for any compact subset
K ⊆ Θ, there exist constants C1 , C2 > 0 such that supθ∈K |R(x; θ)| ≤ C1 (1 + |x|C2 ) for all x
in the state space of the diffusion. The inequality (3.60) is assumed to hold for any function
g(y, x; θ) which is 2(β + 1) times differentiable with respect to x, and satisfies that g and its
partial derivatives (with respect to x) up to order 2(β + 1) are of polynomial growth in x
uniformly for θ in compact sets. This definition of weak order is stronger than the definition
in Kloeden & Platen (1999) in that control of the polynomial order with respect to the
initial value x is added, but Kessler & Paredes (2002) point out that theorems in Kloeden
& Platen (1999) that give the order of approximation schemes can be modified in a tedious,
but straightforward, way to ensure that the schemes satisfy the stronger condition (3.60). In
particular, the Euler scheme (3.58) is of weak order one if the coefficients of the stochastic
differential equation (3.1) are smooth enough.
Under a number of further regularity conditions, Kessler & Paredes (2002) showed the
following results about a GM,δ M,δ
n -estimator, θ̂n , with Gn
M,δ
given by (3.57). We shall not go
into these rather technical conditions. Not surprisingly, they include conditions that ensure
the eventual existence of a consistent and asymptotically
√ normal Gn -estimator, cf. Theorem
3.2. If δ goes to zero sufficiently fast that nδ β → 0 as n → ∞, then
√ M,δ
D
n θ̂n − θ0 −→ N 0, (1 + M −1 )Σ ,
where Σ denotes the asymptotic covariance matrix of a Gn -estimator, see Theorem 3.2. Thus
for δ sufficiently small and M sufficiently large, it does not matter much that the conditional
moment F (x; θ) has been determined by simulation in (3.59). Moreover,√ β we can control the
loss of efficiency by our choice of M. However, when 0 < limn→∞ nδ < ∞,
√ M,δ
D
n θ̂n − θ0 −→ N m(θ0 ), (1 + M −1 )Σ ,
√
and when nδ β → ∞,
δ −β θ̂nN,δ − θ0 → m(θ0 )
in probability. Here the p-dimensional vector m(θ0 ) depends on f and is generally different
from zero. Thus it is essential that a sufficiently small value of δ is used.
where the real number λj (θ) ≥ 0 is called the eigenvalue corresponding to fj (x; θ). Under
weak regularity conditions, fj is also an eigenfunction for the transition operator πtθ , i.e.
24
Theorem 3.12 Let φ(x; θ) be an eigenfunction for the generator (3.37) with eigenvalue
λ(θ). Suppose Z r
[∂x φ(x; θ)σ(x; θ)]2 µθ (dx) < ∞ (3.62)
ℓ
for all t > 0. Then
πtθ (φ(θ))(x) = e−λ(θ)t φ(x; θ). (3.63)
for all t > 0.
Example 3.13 For the square-root model (CIR-model) defined by (3.27) with α > 0, β > 0,
(ν) (ν)
and τ > 0, the eigenfunctions are φi (x) = Li (2βxτ −2 ) with ν = 2αβτ −2 − 1, where Li is
the ith order Laguerre polynomial
i
xm
!
(ν) X m i+ν
Li (x) = (−1) ,
m=0
i−m m!
(ν)
and the eigenvalues are {iβ : i = 0, 1, · · ·}. It is easily seen by direct calculation that Li
solves the differential equation
By Theorem 3.12, (3.63) holds, so we can calculate all conditional polynomial moments, of
which the first four were given in Example 3.6. Thus all polynomial martingale estimating
functions are explicit.
2
is an ergodic diffusion on the interval (−π/2, π/2) provided that θ ≥ 1/2, so that Condition
3.1 is satisfied. This process was introduced by Kessler & Sørensen (1999), who called it an
Ornstein-Uhlenbeck process on (−π/2, π/2) because tan x ∼ x near zero. The generalization
to other finite intervals is obvious. The invariant measure has a density proportional to
cos(x)2θ .
The eigenfunctions are
25
where Ciθ is a Gegenbauer polynomial of order i, and the eigenvalues are i(θ + i/2), i =
1, 2, . . .. This follows because the Gegenbauer polynomial Ciθ solves the differential equation
Condition (3.62) in Theorem 3.12 is obviously satisfied because the state space is bounded,
so (3.63) holds.
The first non-trivial eigenfunction is sin(x) (a constant is omitted) with eigenvalue θ+1/2.
From the martingale estimating function
n
sin(X(i−1)∆ )[sin(Xi∆ )) − e−(θ+1/2)∆ sin(X(i−1)∆ ))],
X
Ǧn (θ) (3.65)
i=1
(a,b)
with eigenvalues λi (ρ, ϕ, σ) = i ρ + 21 nσ 2 , i = 1, 2, . . .. Here Pi (x) denotes the Jacobi
polynomial of order i.
2
26
For most diffusion models where explicit expressions for eigenfunctions can be found,
including the examples above, the eigenfunctions are of the form
i
ai,j (θ) κ(y)j
X
φi (y; θ) = (3.66)
j=0
where κ is a real function defined on the state space and is independent of θ. For martingale
estimating functions based on eigenfunctions of this form, the optimal weight matrix (3.22)
can be found explicitly too.
Theorem 3.15 Suppose 2N eigenfunctions are of the form (3.66) for i = 1, . . . , 2N, where
the coefficients ai,j (θ) are differentiable with respect to θ. If a martingale estimating functions
is defined by (3.30) using the first N eigenfunctions, then
j
∂θi aj,k (θ)νk (x; θ) − ∂θi [e−λj (θ)∆ φj (x; θ)]
X
Bh (x, θ)ij = (3.67)
k=0
and
r=0 s=0
θ
where νi (x; θ) = π∆ (κi )(x), i = 1, . . . , 2N, solve the following triangular system of linear
equations
i
−λi (θ)∆
X
e φi (x; θ) = ai,j (θ)νj (x; θ) i = 1, . . . , 2N, (3.69)
j=0
with ν0 (x; θ) = 1.
Proof: The expressions for Bh and Vh follow from (3.33) and (3.34) when the eigenfunc-
θ
tions are of the form (3.66), and (3.69) follows by applying π∆ to both sides of (3.66).
2
Example 3.16 Consider again the diffusion (3.64) in Example 3.14. We will find the opti-
mal martingale estimating function based on the first non-trivial eigenfunction, sin(x) (where
we have neglected a non-essential multiplicative function of θ) with eigenvalue θ + 1/2. It
follows from (3.33) that
Bh (x; θ) = ∆e−(θ+1/2)∆ sin(x)
because sin(x) does not depend on θ. To find Vh we need Theorem 3.15. The second non-
trivial eigenfunction is 2(θ + 1) sin2 (x) − 1 with eigenvalue 2(θ + 1), so
1 1
ν2 (x; θ) = e−2(θ+1)∆ [sin2 (x) − (θ + 1)−1 ] + (θ + 1)−1 .
2 2
Hence the optimal estimating function is
n 1
sin(X(i−1)∆ )[sin(Xi∆ ) − e−(θ+ 2 )∆ sin(X(i−1)∆ )]
G◦n (θ)
X
= 1 2(θ+1)∆
i=1 2
(e − 1)/(θ + 1) − (e∆ − 1) sin2 (X(i−1)∆ )
27
where a constant has been omitted. When ∆ is small it is a good idea to multiply G◦n (θ) by
∆ because the denominator is then of order ∆.
Note that when ∆ is sufficiently small, we can expand the exponential function in the
numerator to obtain (after multiplication by ∆) the approximately optimal estimating func-
tion 1
n
X sin(X(i−1)∆ )[sin(Xi∆ ) − e−(θ+ 2 )∆ sin(X(i−1)∆ )]
G̃n (θ) = ,
i=1 cos2 (X(i−1)∆ )
which has the explicit solution
Pn !
−1 i=1 tan(X(i−1)∆ ) sin(Xi∆ ))/ cos(X(i−1)∆ ) 1
θ̃n = −∆ log Pn 2 − .
i=1 tan (X(i−1)∆ ) 2
The explicit estimator θ̃ can, for instance, be used as a starting value when finding the
optimal estimator by solving G◦n (θ) = 0 numerically. Note however that for G̃n the square
integrability under Qθ0 (3.12) required in Theorem 3.2 (to ensure the central limit theorem)
is only satisfied when θ0 > 1.5. This problem can be avoided by replacing cos2 (X(i−1)∆ ) in
the numerator by 1, which it is close to when the process is not near the boundaries. In that
way we arrive at the simple estimating function (3.65), which is thus also approximately
optimal.
2
where β > 0, and a, b and c are such that the square root is well defined when Xt is in the
state space. The parameter β > 0 is a scaling of time that determines how fast the diffusion
moves. The parameters α, a, b, and c determine the state space of the diffusion as well as
the shape of the invariant distribution. In particular, α is the expectation of the invariant
distribution. We define θ = (α, β, a, b, c).
In the context of martingale estimating functions, an important property of the Pearson
diffusions is that the generator (3.37) maps polynomials into polynomials. It is therefore
easy to find eigenfunctions among the polynomials
n
pn,j xj .
X
pn (x) =
j=0
or
n n−1 n−2
{λn − aj }pn,j xj + bj+1 pn,j+1xj + cj+2 pn,j+2xj = 0.
X X X
28
where aj = j{1 − (j − 1)a}β, bj = j{α + (j − 1)b}β, and cj = j(j − 1)cβ for j = 0, 1, 2, . . ..
Without loss of generality, we assume pn,n = 1. Thus, equating the coefficients we find that
the eigenvalue is given by
λn = an = n{1 − (n − 1)a}β. (3.71)
If we define pn,n+1 = 0, then the coefficients {pn,j }j=0,...,n−1 solve the linear system
29
The Pearson system is defined as the class of probability densities obtained by solving a
differential equation of this form, see Pearson (1895).
In the following we present a full classification of the ergodic Pearson diffusions, which
shows that all distributions in the Pearson system can be obtained as invariant distributions
for a model in the class of Pearson diffusions. We consider six cases according to whether
the squared diffusion coefficient is constant, linear, a convex parabola with either zero, one
or two roots, or a concave parabola with two roots. The classification problem can be
reduced by first noting that the Pearson class of diffusions is closed under location and
scale-transformations. To be specific, if X is an ergodic Pearson diffusion, then so is X̃
where X̃t = γXt + δ. The parameters of the stochastic differential equation (3.70) for X̃
are ã = a, b̃ = bγ − 2aδ, c̃ = cγ 2 − bγδ + aδ 2 , β̃ = β, and α̃ = γα + δ. Hence, up to
transformations of location and scale, the ergodic Pearson diffusions can take the following
forms. Note that we consider scale transformations in a general sense where multiplication
by a negative real number is allowed, so that to each case of a diffusion with state space
(0, ∞) there corresponds a diffusion with state space (−∞, 0).
Case 1: σ 2 (x) = 2β. The solution to (3.70) is an Ornstein-Uhlenbeck process. The state
space is IR, and the invariant distribution is the normal distribution with mean α and variance
1. The eigenfunctions are the Hermite polynomials.
Case 2: σ 2 (x) = 2βx. The solution to (3.70) is the square root process (CIR process)
(3.27) with state space (0, ∞). Condition 3.1 that ensures ergodicity is satisfied if and only
if α > 1. If 0 < α ≤ 1, the boundary 0 can with positive probability be reached at a finite
time point, but if the boundary is made instantaneously reflecting, we obtain a stationary
process. The invariant distribution is the gamma distribution with scale parameter 1 and
shape parameter α. The eigenfunctions are the Laguerre polynomials.
Case 3: a > 0 and σ 2 (x) = 2βa(x2 +1). The state space is the real line, and the scale density
1
is given by s(x) = (x2 +1) 2a exp(− αa tan−1 x). By Condition 3.1, the solution is ergodic for all
1
a > 0 and all α ∈ IR. The invariant density is given by µθ (x) ∝ (x2 + 1)− 2a −1 exp( αa tan−1 x)
If α = 0 the invariant distribution is a scaled t-distribution with ν = 1 + a−1 degrees of
1
freedom and scale parameter ν − 2 . If α 6= 0 the invariant distribution is skew and has tails
decaying at the same rate as the t-distribution with 1 + a−1 degrees of freedom. A fitting
name for this distribution is the skew t-distribution. It is also known as Pearson’s type IV
distribution. In either case the mean is α and the invariant distribution has moments of
order k for k < 1 + a−1 . With its skew and heavy tailed marginal distribution, the class of
diffusions with α 6= 0 is potentially very useful in many applications, e.g. finance. It was
studied and fitted financial data by Nagahara (1996) using the local linearization method of
Ozaki (1985). We consider this process in more detail below.
Case 4: a > 0 and σ 2 (x) = 2βax2 . The state space is (0, ∞) and the scale density is
1 α
s(x) = x a exp( ax ). Condition 3.1 holds if and only if α > 0. The invariant distribution is
1
α
given by µθ (x) ∝ x− a −2 exp(− ax ), and is thus an inverse gamma distribution with shape
1
parameter 1 + a and scale parameter αa . The invariant distribution has moments of order k
for k < 1 + a1 . This process is sometimes referred to as the GARCH diffusion model. The
polynomial eigenfunctions are known as the Bessel polynomials.
Case 5: a > 0 and σ 2 (x) = 2βax(x + 1).The state space is (0, ∞) and the scale density is
α+1 α
s(x) = (1 + x) a x− a . The ergodicity Condition 3.1 holds if and only if αa ≥ 1. Hence, for all
a > 0 and all µ ≥ a a unique ergodic solution to (3.70) exists. If 0 < α < 1, the boundary 0
30
can with positive probability be reached at a finite time point, but if the boundary is made
instantaneously reflecting, a stationary process is obtained. The density of the invariant
α+1 α
distribution is given by µθ (x) ∝ (1 + x)− a −1 x a −1 . This is a scaled F-distribution with
2α
a
and a2 + 2 degrees of freedom and scale parameter 1+a α
. The invariant distribution has
1
moments of order k for k < 1 + a .
Case 6: a < 0 and σ 2 (x) = 2βax(x − 1). The state space is (0, ∞) and the scale density
1−α α
is s(x) = (1 − x) a x a . Condition 3.1 holds if and only if αa ≤ −1 and 1−α a
≤ −1. Hence,
for all a < 0 and all α > 0 such that min(α, 1 − α) ≥ −a a unique ergodic solution to (3.70)
exists. If 0 < α < −a, the boundary 0 can with positive probability be reached at a finite
time point, but if the boundary is made instantaneously reflecting, a stationary process is
obtained. Similar remarks apply to the boundary 1 when 0 < 1 − α < −a. The invariant
1−α α
distribution is given by µθ (x) ∝ (1 − x)− a −1 x− a −1 and is thus the Beta distribution with
α
shape parameters −a , 1−α
−a
. This class of diffusions will be discussed in more detail below.
It is often referred to as the Jacobi diffusions because the related eigenfunctions are Jacobi
polynomials. Multivariate Jacobi diffusions were considered by Gourieroux & Jasiak (2006).
Example 3.17 The skew t-distribution with mean zero, ν degrees of freedom, and skewness
parameter ρ has (unnormalized) density
f (z) ∝
√ n √ o
{(z/ ν + ρ)2 + 1}−(ν+1)/2 exp ρ(ν − 1) tan−1 z/ ν + ρ ,
√
which is the invariant density of the diffusion Zt = ν(Xt − ρ) with ν = 1 + a−1 and
ρ = α, where X is as in Case 3. An expression for the normalizing constant when ν is
integer valued was derived in Nagahara (1996). By the transformation result above, the
corresponding stochastic differential equation is
q
1
dZt = −βZt dt + 2β(ν − 1)−1 {Zt2 + 2ρν 2 Zt + (1 + ρ2 )ν}dWt . (3.73)
p1 (z) = z,
1
4ρν 2
2 (1 + ρ2 )ν
p2 (z) = z − z− ,
ν−3 ν −2
1 3
3 12ρν 2 2 24ρ2 ν + 3(1+ρ2)ν(ν − 5) 8ρ(1+ρ2 )ν 2
p3 (z) = z − z + z+ ,
ν−5 (ν − 5)(ν − 4) (ν −5)(ν −3)
and
1
24ρν 2 3 144ρ2 ν − 6(1 + ρ2 )ν(ν − 7) 2
p4 (z) = z 4 − z + z
ν−7 (ν − 7)(ν − 6)
3 3 3
8ρ(1 + ρ2 )ν 2 (ν − 7) + 48ρ(1 + ρ2 )ν 2 (ν − 6) − 192ρ3 ν 2
+ z
(ν − 7)(ν − 6)(ν − 5)
3(1 + ρ2 )2 ν(ν − 7) − 72ρ2 (1 + ρ2 )ν 2
+ ,
(ν − 7)(ν − 6)(ν − 4)
31
provided that ν > 4 If ν > 2i the first i eigenfunctions are square integrable and thus
satisfy (3.62). Hence (3.63) holds, and the eigenfunctions can be used to construct explicit
martingale estimating functions. 2
where β > 0 and γ ∈ (−1, 1) has been proposed as a model for the random variation of the
logarithm of an exchange rate in a target zone between realignments by De Jong, Drost &
Werker (2001) (γ = 0) and Larsen & Sørensen (2007). This is a diffusion on the interval
(m − z, m + z) with mean reversion around m + γz. It is a Jacobi diffusion obtained by a
location-scale transformation of the diffusion in Case 6 above. The parameter γ quantifies
the asymmetry of the model. When β(1 − γ) ≥ σ 2 and β(1 + γ) ≥ σ 2 , X is an ergodic
diffusion, for which the stationary distribution is a Beta-distribution on (m − z, m + z) with
parameters κ1 = β(1 − γ)σ −2 and κ2 = β(1 + γ)σ −2 . If the parameter restrictions are not
satisfied, one or both of the boundaries can be hit in finite time, but if the boundaries are
made instantaneously reflecting, a stationary process is obtained.
The eigenfunctions for the generator of the diffusion (3.74) are
(κ1 −1, κ2 −1)
φi (x; β, γ, σ, m, z) = Pi ((x − m)/z), i = 1, 2, . . .
(a,b)
where Pi (x) denotes the Jacobi polynomial of order i given by
i
! !
(a,b) n+a a+b+n+j
2−j (x − 1)j ,
X
Pi (x) = −1 < x < 1.
j=0 n−j j
The eigenvalue of φi is i(β + 21 σ 2 (i − 1)). Since (3.62) is obviously satisfied, (3.63) holds, so
that the eigenfunctions can be used to construct explicit martingale estimating functions. 2
Explicit formulae for the conditional moments of a Pearson diffusion can be obtained
from the eigenfunctions by means on (3.61). Specifically,
n
n X
E(Xtn | X0 = x) = qn,k,ℓ · e−λℓ t · xk ,
X
(3.75)
k=0 ℓ=0
for k, ℓ = 0, . . . , n − 1 with λℓ and pn,j given by (3.71) and (3.72). For details see Forman &
Sørensen (2008).
Also the moments of the Pearson diffusions can, when they exist, be found explicitly by
using the fact that the integral of the eigenfunctions with respect to the invariant probability
measure is zero.We have seen above that E(|Xt |κ ) < ∞ if and only if a < (κ − 1)−1 . Thus if
a ≤ 0 all moments exist, while for a > 0 only the moments satisfying that κ < a−1 + 1 exist.
32
In particular, the expectation always exists. The moments of the invariant distribution can
be found by the recursion
E(Xtn ) = a−1 n−1
n {bn · E(Xt ) + cn · E(Xtn−2 )} (3.76)
where an = n{1 − (n − 1)a}β, bn = n{α + (n − 1)b}β, and cn = n(n − 1)cβ for n = 0, 1, 2, . . ..
The initial conditions are given by E(Xt0 ) = 1, and E(Xt ) = α. This can be found from the
expressions for the eigenfunctions, but is more easily seen as follows. By Ito’s formula
dXtn = −βnXtn−1 (Xt − µ)dt + βn(n − 1)Xtn−2(aXt2 + bXt + c)dt
+nXtn−1 σ(Xt )dWt ,
and if E(Xt2n ) is finite, i.e. if a < (2n − 1)−1 , the integral of the last term is a martingale
with expectation zero.
Example 3.19 Equation (3.76) allows us to find the moments of the skewed t-distribution,
in spite of the fact that the normalizing constant of the density is unknown. In particular,
for the diffusion (3.73),
E(Zt ) = 0,
(1 + ρ2 )ν
E(Zt2 ) = ,
ν −2
3
3 4ρ(1 + ρ2 )ν 2
E(Zt ) = ,
(ν − 3)(ν − 2)
24ρ2 (1 + ρ2 )ν 2 + 3(ν − 3)(1 + ρ2 )2 ν 2
E(Zt4 ) = .
(ν − 4)(ν − 3)(ν − 2)
2
For a diffusion T (X) obtained from a solution X to (3.70) by a twice differentiable and
invertible transformation T , the eigenfunctions of the generator are pn {T −1 (x)}, where pn
is an eigenfunction of the generator of X. The eigenvalues are the same as for the original
eigenfunctions. Since the original eigenfunctions are polynomials, the eigenfunctions of T (X)
are of the form (3.66) with κ = T −1 . Hence explicit optimal martingale estimating functions
are also available for transformations of Pearson diffusions, which is a very large and flexible
class of diffusion processes. Their stochastic differential equations can, of course, be found
by Ito’s formula.
33
which has invariant distribution with density f = F ′ . A particular example is the logistic
distribution
ex
F (x) = x ∈ IR,
1 + ex
for which n o q
4
dYt = −β sinh(x) + 8 cosh (x/2) dt + 2 β cosh(x/2)dWt .
If the same transformation F −1 (y) = log(y/(1 −y)) is applied to the general Jacoby diffusion
(case 6), then we obtain
n o
dXt = −β 1 − 2µ + (1 − µ)ex − µe−1 − 8a cosh4 (x/2) dt
q
+2 −aβ cosh(x/2)dWt ,
a diffusion for which the invariant distribution is the generalized logistic distribution with
density
eκ1 x
f (x) = , x ∈ IR,
(1 + ex )κ1 +κ2 B(κ1 , κ2 )
where κ1 = −(1 − α)/a, κ2 = α/a and B denotes the Beta-function. This distribution was
introduced and studied in Barndorff-Nielsen, Kent & Sørensen (1982).
Example 3.21 Let again X be a general Jacobi-diffusion (case 6). If we apply the trans-
formation T (x) = sin−1 (2x − 1) to Xt we obtain the diffusion
sin(Yt ) − ϕ q
dYt = −ρ dt + −aβ/2dWt ,
cos(Yt )
where ρ = β(1 + a/4) and ϕ = (2α − 1)/(1 + a/4). The state space is (−π/2, π/2). Note
that Y has dynamics that are very different from those of the Jacobi diffusion: the drift is
non-linear and the diffusion coefficient is constant. This model was considered in Example
3.14.
generator of the diffusion. Bayesian estimators with the same asymptotic properties as the
maximum likelihood estimator can be obtained by Markov chain Monte Carlo methods, see
Elerian, Chib & Shephard (2001), Eraker (2001), and Roberts & Stramer (2001). Finally,
exact and computationally efficient likelihood-based estimation methods were presented by
Beskos et al. (2006).
converges and is strictly positive definite, and that Qθ0 (gi (θ)2+ǫ ) < ∞, i = 1, . . . , p for some
ǫ > 0, see e.g. Doukhan (1994). To define the concept of α-mixing, let Ft denote the σ-
field generated by {Xs | s ≤ t} and let F t denote the σ-field generated by {Xs | s ≥ t}. A
stochastic process X is said to be α-mixing if

sup{ |P(A ∩ B) − P(A)P(B)| : A ∈ F_t, B ∈ F^{t+u} } ≤ α(u)

for all t > 0 and u > 0, where α(u) → 0 as u → ∞. This means that X_t and X_{t+u} are
almost independent when u is large. If there exist positive constants c_1 and c_2 such that

α(u) ≤ c_1 e^{−c_2 u},
for all u > 0, then the process X is called geometrically α-mixing. For one-dimensional
diffusions there are simple conditions for geometric α-mixing. If all non-zero eigenvalues of
the generator (3.37) are larger than λ > 0, then the diffusion is geometrically α-mixing with
c2 = λ. This is for instance the case if the spectrum of the generator is discrete. Ergodic
diffusions with a linear drift −β(x − α), β > 0, as for instance the Pearson diffusions, are
geometrically α-mixing with c2 = β; see Hansen, Scheinkman & Touzi (1998).
Genon-Catalot, Jeantheau & Larédo (2000) gave the following simple sufficient condition
for the one-dimensional diffusion that solves (3.1) to be geometrically α-mixing.
Condition 5.1
(i) The function b is continuously differentiable with respect to x and σ is twice continuously
differentiable with respect to x, σ(x; θ) > 0 for all x ∈ (ℓ, r), and there exists a constant
Kθ > 0 such that |b(x; θ)| ≤ Kθ (1 + |x|) and σ 2 (x; θ) ≤ Kθ (1 + x2 ) for all x ∈ (ℓ, r).
(ii) σ(x; θ)µθ (x) → 0 as x ↓ ℓ and x ↑ r.
(iii) 1/γ(x; θ) has a finite limit as x ↓ ℓ and x ↑ r, where γ(x; θ) = ∂_x σ(x; θ) − 2b(x; θ)/σ(x; θ).
Other conditions for geometric α-mixing were given by Veretennikov (1987), Hansen &
Scheinkman (1995), and Kusuoka & Yoshida (2000).
For geometrically α-mixing diffusion processes and estimating functions Gn satisfying
Condition 2.1 the existence of a θ̄-consistent and asymptotically normal Gn -estimator follows
from Theorem 2.2, which also contains a result about eventual uniqueness of the estimator.
and

W = µ_{θ_0}( ∂_{θ^T} h(θ_0) ) = ∫_ℓ^r ∂_{θ^T} h(x; θ_0) µ_{θ_0}(x) dx.
The condition for eventual uniqueness of the Gn -estimator (2.7) is here that θ0 is the only
root of µθ0 (h(θ)).
Kessler (2000) proposed
h(x; θ) = ∂θ log µθ (x), (5.4)
which is the score function (the derivative of the log-likelihood function) if we pretend
that the observations are an i.i.d. sample from the stationary distribution. If ∆ is large, this
might be a reasonable approximation. That (5.3) is satisfied for this specification of h follows
under standard conditions that allow the interchange of differentiation and integration:

∫_ℓ^r ( ∂_θ log µ_θ(x) ) µ_θ(x) dx = ∫_ℓ^r ∂_θ µ_θ(x) dx = ∂_θ ∫_ℓ^r µ_θ(x) dx = 0.
Hansen & Scheinkman (1995) and Kessler (2000) proposed and studied the generally
applicable specification
hj (x; θ) = Aθ fj (x; θ), (5.5)
where Aθ is the generator (3.37), and fj , j = 1, . . . , d are twice differentiable functions chosen
such that Condition 2.1 holds. The estimating function with h given by (5.5) can easily be
applied to multivariate diffusions, because an explicit expression for the invariant density
µθ is not needed. The following lemma for one-dimensional diffusions shows that only weak
conditions are needed to ensure (5.3).
lim_{x→r} f′(x)σ^2(x; θ)µ_θ(x) = lim_{x→ℓ} f′(x)σ^2(x; θ)µ_θ(x).    (5.6)

Then

∫_ℓ^r (A_θ f)(x) µ_θ(x) dx = 0.
Proof: Note that by (3.10), the function ν(x; θ) = ½σ^2(x; θ)µ_θ(x) satisfies ν′(x; θ) =
b(x; θ)µ_θ(x). In this proof all derivatives are with respect to x. It follows that

∫_ℓ^r (A_θ f)(x) µ_θ(x) dx = ∫_ℓ^r ( b(x; θ)f′(x) + ½σ^2(x; θ)f′′(x) ) µ_θ(x) dx
  = ∫_ℓ^r ( f′(x)ν′(x; θ) + f′′(x)ν(x; θ) ) dx = ∫_ℓ^r ( f′(x)ν(x; θ) )′ dx
  = ½ lim_{x→r} f′(x)σ^2(x; θ)µ_θ(x) − ½ lim_{x→ℓ} f′(x)σ^2(x; θ)µ_θ(x) = 0.
Example 5.3 Consider the square-root process (3.27) with σ = 1. For f1 (x) = x and
f_2(x) = x^2, we see that

A_θ f(x) = ( −β(x − α), −2β(x − α)x + x )^T,

which gives the simple estimators

α̂_n = (1/n) Σ_{i=1}^n X_{i∆},    β̂_n = ( (1/n) Σ_{i=1}^n X_{i∆} ) / ( 2[ (1/n) Σ_{i=1}^n X_{i∆}^2 − ( (1/n) Σ_{i=1}^n X_{i∆} )^2 ] ).
The condition (5.6) is obviously satisfied because the invariant distribution is a gamma
distribution.
2
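The two estimators in Example 5.3 only require sample moments. A minimal Python sketch, assuming the observations X_∆, . . . , X_{n∆} are stored in a numpy array obs (names ours):

import numpy as np

def simple_estimators(obs):
    """alpha and beta estimators from Example 5.3 (f_1(x) = x, f_2(x) = x^2)."""
    m1, m2 = np.mean(obs), np.mean(obs ** 2)
    return m1, m1 / (2.0 * (m2 - m1 ** 2))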
Sørensen (2001) derived the estimating function of the form (5.2) with
As mentioned above, an estimating function of the form (5.2) cannot be expected to
yield as efficient estimators as an estimating function that depends on pairs of consecutive
observations, and thus can use the information contained in the transitions. Hansen &
Scheinkman (1995) proposed non-martingale estimating functions of the form (3.2) with g
given by
gj (∆, x, y; θ) = hj (y)Aθ fj (x) − fj (x)Âθ hj (y), (5.8)
where the functions fj and hj satisfy weak regularity conditions ensuring that (2.4) holds
for θ̄ = θ0 . The differential operator Âθ is the generator of the time reversal of the observed
diffusion X. For a multivariate diffusion it is given by
Â_θ f(x) = Σ_{k=1}^d b̂_k(x; θ) ∂_{x_k} f(x) + ½ Σ_{k,ℓ=1}^d C_{kℓ}(x; θ) ∂²_{x_k x_ℓ} f(x),

where C = σσ^T and

b̂_k(x; θ) = −b_k(x; θ) + (1/µ_θ(x)) Σ_{ℓ=1}^d ∂_{x_ℓ}( µ_θ C_{kℓ} )(x; θ).
For one-dimensional ergodic diffusions, Âθ = Aθ . Obviously, the estimating function of the
form (5.2) with hj (x; θ) = Aθ fj (x) is a particular case of (5.8) with hj (y) = 1.
We assume that the solution is unique. Using the expansion (3.36), we find that
Q_{θ_0}(g(θ)) = µ_{θ_0}( a(θ)[ π_∆^{θ_0} f − f − ∆A_θ f ] )
  = ∆ µ_{θ_0}( a(θ)[ A_{θ_0} f − A_θ f + ½∆A²_{θ_0} f ] ) + O(∆^3)
  = (θ_0 − θ) ∆ µ_{θ_0}( a(θ_0) ∂_θ A_{θ_0} f ) + ½∆^2 µ_{θ_0}( a(θ_0) A²_{θ_0} f )
    + O(∆|θ − θ_0|^2) + O(∆^2|θ − θ_0|) + O(∆^3).

If we neglect all O-terms, we obtain that

θ̄ ≈ θ_0 + ∆ ½ µ_{θ_0}( a(θ_0) A²_{θ_0} f ) / µ_{θ_0}( a(θ_0) ∂_θ A_{θ_0} f ),
which indicates that when ∆ is small, the asymptotic bias is of order ∆. However, the bias
can be huge when ∆ is not sufficiently small as the following example shows.
Example 5.4 Consider again a diffusion with linear drift, b(x; θ) = −β(x − α). In this case
(5.9) with f (x) = x gives the estimating function
G_n(θ) = Σ_{i=1}^n a(X_{∆(i−1)}; θ)[ X_{∆i} − X_{∆(i−1)} + β( X_{∆(i−1)} − α )∆ ],
where a is 2-dimensional. For a diffusion with linear drift, we found in Example 3.7 that
F(x; α, β) = α + (x − α)e^{−β∆}. Using this, we obtain that
Q_{θ_0}(g(θ)) = c_1( e^{−β_0∆} − 1 + β∆ ) + c_2 β(α_0 − α),

where

c_1 = ∫_D a(x) x µ_{θ_0}(dx) − µ_{θ_0}(a) α_0,    c_2 = µ_{θ_0}(a)∆.

Thus ᾱ = α_0 and

β̄ = (1 − e^{−β_0∆})/∆ ≤ 1/∆.
We see that the estimator of α is consistent, while the estimator of β will tend to be small
if ∆ is large, irrespective of the value of β0 . We see that what determines how well β̂ works
is the magnitude of β0 ∆, so it is not enough to know that ∆ is small. Moreover, we cannot
use β̂∆ to evaluate whether there is a problem, because this quantity will always tend to be
smaller than one. If β0 ∆ actually is small, then the bias is proportional to ∆ as expected
β̄ = β0 − 21 ∆β02 + O(∆2 ).
We can get an impression of what can happen when estimating the parameter β by means
of the dangerous estimating function given by (5.9) from the simulation study in Bibby &
Sørensen (1995) for the square root process (3.27). The result is given in Table 5.1. For
the function a the approximately optimal weight function was used, cf. Example 3.7. For
different values of ∆ and the sample size, 500 independent datasets were simulated, and
the estimators were calculated for each dataset. The expectation of the estimator β̂ was
determined as the average of the simulated estimators. The parameter values were α = 10,
β = 1 and τ = 1, and the initial value was x0 = 10. When ∆ is large, the behaviour of the
estimator is bizarre. 2
∆ # obs. mean ∆ # obs. mean
0.5 200 0.81 1.5 200 0.52
500 0.80 500 0.52
1000 0.79 1000 0.52
1.0 200 0.65 2.0 200 0.43
500 0.64 500 0.43
1000 0.63 1000 0.43
Table 5.1: Empirical mean of 500 estimates of the parameter β in the CIR model. The true
parameter values are α = 10, β = 1, and τ = 1.
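The simulated means in Table 5.1 agree closely with the limit β̄ = (1 − e^{−β0∆})/∆ derived in Example 5.4, as a quick check shows (β_0 = 1 as in the simulations):

import numpy as np

beta0 = 1.0
for Delta in (0.5, 1.0, 1.5, 2.0):
    print(Delta, (1.0 - np.exp(-beta0 * Delta)) / Delta)
# approximately 0.79, 0.63, 0.52 and 0.43, close to the empirical means in the table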
The asymptotic bias given by (5.10) is small when ∆ is sufficiently small, and the results
in the following section on high frequency asymptotics show that in this case the approximate
martingale estimating functions work well. However, how small ∆ needs to be depends on
the parameter values, and without prior knowledge about the parameters, it is safer to use
an exact martingale estimating function, which gives consistent estimators at all sampling
frequencies.
6 High-frequency asymptotics
A large number of estimating functions have been proposed for diffusion models, and a large
number of simulation studies have been performed to compare their relative merits, but the
general picture has been rather confusing. By considering the high frequency scenario,
n → ∞, ∆n → 0, n∆n → ∞, (6.1)
Sørensen (2007) obtained simple conditions for rate optimality and efficiency for ergodic
diffusions, which allow identification of estimators that work well when the time between
observations, ∆n , is not too large. For financial data the speed of reversion is usually slow
enough that this type of asymptotics works for daily, sometimes even weekly observations.
A main result of this theory is that under weak conditions optimal martingale estimating
functions give rate optimal and efficient estimators.
To simplify the exposition, we restrict attention to a one-dimensional diffusion given by

dX_t = b(X_t; α)dt + σ(X_t; β)dW_t,    (6.2)

where θ = (α, β) ∈ Θ ⊆ IR^2. The results below can be generalized to multivariate diffusions
and parameters of higher dimension. We consider estimating functions of the general form
(2.1), where the two-dimensional function g = (g1 , g2 ) for some κ ≥ 2 and for all θ ∈ Θ
satisfies
Eθ (g(∆n , X∆n i , X∆n(i−1) ; θ) | X∆n(i−1) ) = ∆κn R(∆n , X∆n (i−1) ; θ). (6.3)
Here and later R(∆, y, x; θ) denotes a function such that |R(∆, y, x; θ)| ≤ F (y, x; θ), where
F is of polynomial growth in y and x uniformly for θ in a compact set1 . We assume that the
¹For any compact subset K ⊆ Θ, there exist constants C_1, C_2, C_3 > 0 such that sup_{θ∈K} |F(y, x; θ)| ≤
C_1(1 + |x|^{C_2} + |y|^{C_3}) for all x and y in the state space of the diffusion.
diffusion and the estimating functions satisfy the technical regularity Condition 6.2 given
below.
Martingale estimating functions obviously satisfy (6.3) with R = 0, but for instance the
approximate martingale estimating functions discussed at the end of the previous section
satisfy (6.3) too.
∂_y g_2(0, x, x; θ) = 0,    (6.4)
∂_y g_1(0, x, x; θ) = ∂_α b(x; α)/σ^2(x; β),    (6.5)
∂²_y g_2(0, x, x; θ) = ∂_β σ^2(x; β)/σ^2(x; β)^2,    (6.6)

for all x ∈ (ℓ, r) and θ ∈ Θ. Assume, moreover, that the following identifiability condition
is satisfied:

∫_ℓ^r [ b(x, α_0) − b(x, α) ] ∂_y g_1(0, x, x; θ) µ_{θ_0}(x) dx ≠ 0  when α ≠ α_0,

∫_ℓ^r [ σ^2(x, β_0) − σ^2(x, β) ] ∂²_y g_2(0, x, x; θ) µ_{θ_0}(x) dx ≠ 0  when β ≠ β_0,
and that

W_1 = ∫_ℓ^r ( (∂_α b(x; α_0))^2 / σ^2(x; β_0) ) µ_{θ_0}(x) dx ≠ 0,

W_2 = ∫_ℓ^r ( ∂_β σ^2(x; β_0) / σ^2(x; β_0) )^2 µ_{θ_0}(x) dx ≠ 0.
Then a consistent Gn –estimator θ̂n = (α̂n , β̂n ) exists and is unique in any compact subset of
Θ containing θ0 with probability approaching one as n → ∞. For a martingale estimating
function, or more generally if n∆_n^{2(κ−1)} → 0,

( √(n∆_n) (α̂_n − α_0), √n (β̂_n − β_0) )^T −→^D N_2( 0, diag( W_1^{−1}, W_2^{−1} ) ).    (6.7)
An estimator satisfying (6.7) is rate optimal and efficient, cf. Gobet (2002), who showed
that the model considered here is locally asymptotically normal. Note that the estimator
of the diffusion coefficient parameter, β, converges faster than the estimator of the drift
parameter, α. Condition (6.4) implies rate optimality. If this condition is not satisfied, the
estimator of the diffusion coefficient parameter converges at the slower rate √(n∆_n). This
condition is called the Jacobsen condition, because it appears in the theory of small ∆-optimal
estimation developed in Jacobsen (2001) and Jacobsen (2002). In this theory the asymptotic
covariance matrix in (3.15) is expanded in powers of ∆, the time between observations. The
leading term is minimal when (6.5) and (6.6) are satisfied. The same expansion of (3.15)
was used by Aı̈t-Sahalia & Mykland (2004).
The assumption n∆n → ∞ in (6.1) is needed to ensure that the drift parameter, α, can
be consistently estimated. If the drift is known and only the diffusion coefficient parameter,
β, needs to be estimated, this condition can be omitted, see Genon-Catalot & Jacod (1993).
Another situation where the infinite observation horizon, n∆n → ∞, is not needed for
consistent estimation of α is when the high frequency asymptotic scenario is combined with
the small diffusion scenario, where σ(x; β) = ǫn ζ(x; β) and ǫn → 0, see Genon-Catalot (1990),
Sørensen & Uchida (2003) and Gloter & Sørensen (2008).
The reader is reminded of the trivial fact that for any non-singular 2 × 2 matrix, Mn ,
the estimating functions Mn Gn (θ) and Gn (θ) give exactly the same estimator. We call them
versions of the same estimating function. The matrix Mn may depend on ∆n . Therefore a
given version of an estimating function need not satisfy (6.4) – (6.6). The point is that a
version must exist which satisfies these conditions.
It follows from results in Jacobsen (2002) that to obtain a rate optimal and efficient
estimator from an estimating function of the form (3.31), we need that N ≥ 2 and that the
matrix
D(x) = ( ∂_x f_1(x; θ)   ∂_x^2 f_1(x; θ)
         ∂_x f_2(x; θ)   ∂_x^2 f_2(x; θ) )
is invertible for µθ -almost all x. Under these conditions, Sørensen (2007) showed that
Godambe-Heyde optimal martingale estimating functions give rate optimal and efficient es-
timators. For a d-dimensional diffusion, Jacobsen (2002) gave the conditions N ≥ d(d+3)/2,
and that the N × (d + d2 )-matrix D(x) = (∂x f (x; θ) ∂x2 f (x; θ)) has full rank d(d + 3)/2.
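For a one-dimensional diffusion the rank condition is easy to check symbolically for a given choice of base functions; as an illustration (the choice f_1(x) = x, f_2(x) = x^2 is ours), a short sympy check:

import sympy as sp

x = sp.symbols('x')
f = [x, x ** 2]                                   # candidate base functions f_1, f_2
D = sp.Matrix([[sp.diff(fj, x), sp.diff(fj, x, 2)] for fj in f])
print(sp.det(D))    # equals 2, so D(x) is invertible for all x and the condition holds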
We conclude this section by stating technical conditions under which the results in this
section hold. The assumptions about polynomial growth are far too strong, but simplify the
proofs. These conditions can most likely be weakened very considerably in a way similar to
the proofs in Gloter & Sørensen (2008).
Condition 6.2 The diffusion is ergodic and the following conditions hold for all θ ∈ Θ:
(1) ∫_ℓ^r x^k µ_θ(x) dx < ∞ for all k ∈ IN.
(2) The function g can be expanded as

g(∆, y, x; θ) = g(0, y, x; θ) + ∆ g^{(1)}(y, x; θ) + ½∆^2 g^{(2)}(y, x; θ) + ∆^3 R(∆, y, x; θ),

where
We define C_{p,k_1,k_2,k_3}(IR_+ × (ℓ, r)^2 × Θ) as the class of real functions f(t, y, x; θ) satisfying that
(ii) f and all partial derivatives ∂_t^{i_1} ∂_y^{i_2} ∂_α^{i_3} ∂_β^{i_4} f, i_j = 1, . . . , k_j, j = 1, 2, i_3 + i_4 ≤ k_3, are of
polynomial growth in x and y uniformly for θ in a compact set (for fixed t).
The classes C_{p,k_1,k_2}((ℓ, r) × Θ) and C_{p,k_1,k_2}((ℓ, r)^2 × Θ) are defined similarly for functions
f(y; θ) and f(y, x; θ), respectively.
7 Non-Markovian models
In this section we consider estimating functions that can be used when the observed process
is not a Markov process. In this situation, it is usually not easy to find a tractable mar-
tingale estimating function. For instance a simple estimating function of the form (3.31)
is not a martingale. To obtain a martingale, the conditional expectation given X(i−1)∆ in
(3.31) must be replaced by the conditional expectation given all previous observations, which
can only very rarely be found explicitly, and which it is rather hopeless to find by simula-
tion. Instead we will consider a generalization of the martingale estimating functions, called
the prediction-based estimating functions, which can be interpreted as approximations to
martingale estimating functions.
To clarify our thoughts, we will consider a concrete model type. Let the D-dimensional
process X be the stationary solution to the stochastic differential equation
where Π_j^{(i−1)}(θ) is a p-dimensional vector, the coordinates of which belong to P_{i−1,j}, and
π̆_j^{(i−1)}(θ) is the minimum mean square error predictor in P_{i−1,j} of f_j(Y_i, . . . , Y_{i−s}; θ) under
P_θ. When s = 0 and P_{i−1,j} is the set of all functions of Y_1, . . . , Y_{i−1} with finite variance,
π̆_j^{(i−1)}(θ) is the conditional expectation under P_θ of f_j(Y_i; θ) given Y_1, . . . , Y_{i−1}, so in this
case we obtain a martingale estimating function. Thus for a Markov process, a martingale
estimating function of the form (3.31) is a particular case of a prediction-based estimating
function.
The minimum mean square error predictor of fj (Yi , . . . , Yi−s ; θ) is the projection of
fj (Yi , . . . , Yi−s ; θ) onto the subspace Pi−1,j of the L2 -space of all functions of Y1 , . . . , Yi with
finite variance under P_θ. Therefore π̆_j^{(i−1)}(θ) satisfies the normal equation

E_θ{ π_j^{(i−1)} ( f_j(Y_i, . . . , Y_{i−s}; θ) − π̆_j^{(i−1)}(θ) ) } = 0    (7.3)

for all π_j^{(i−1)} ∈ P_{i−1,j}. This implies that a prediction-based estimating function satisfies that
We can interpret the minimum mean square error predictor as an approximation to the
conditional expectation of fj (Yi , . . . , Yi−s ; θ) given X1 , . . . , Xi−1 , which is the projection of
fj (Yi , . . . , Yi−s ; θ) onto the subspace of all functions of X1 , . . . , Xi−1 with finite variance.
To obtain estimators that can relatively easily be calculated in practice, we will from
now on restrict attention to predictor sets, Pi−1,j , that are finite dimensional. Let hjk , j =
1, . . . , N, k = 0, . . . , qj be functions from IRr into IR (r ≥ s), and define (for i ≥ r + 1)
random variables by
Z_{jk}^{(i−1)} = h_{jk}(Y_{i−1}, Y_{i−2}, . . . , Y_{i−r}).

We assume that E_θ((Z_{jk}^{(i−1)})^2) < ∞ for all θ ∈ Θ, and let P_{i−1,j} denote the subspace spanned
by Z_{j0}^{(i−1)}, . . . , Z_{jq_j}^{(i−1)}. We set h_{j0} = 1 and make the natural assumption that the functions
h_{j0}, . . . , h_{jq_j} are linearly independent. We write the elements of P_{i−1,j} in the form a^T Z_j^{(i−1)},
where a^T = (a_0, . . . , a_{q_j}) and

Z_j^{(i−1)} = ( Z_{j0}^{(i−1)}, . . . , Z_{jq_j}^{(i−1)} )^T

are (q_j + 1)-dimensional vectors. With this specification of the predictors, the estimating
function can only include terms with i ≥ r + 1:
G_n(θ) = Σ_{i=r+1}^n Σ_{j=1}^N Π_j^{(i−1)}(θ) [ f_j(Y_i, . . . , Y_{i−s}; θ) − π̆_j^{(i−1)}(θ) ].    (7.5)
It is well-known that the minimum mean square error predictor, π̆_j^{(i−1)}(θ), is found by solving
the normal equations (7.3). We define C_j(θ) as the covariance matrix of (Z_{j1}^{(r)}, . . . , Z_{jq_j}^{(r)})^T
under P_θ, and b_j(θ) as the vector for which the ith coordinate is

b_j(θ)_i = Cov_θ( Z_{ji}^{(r)}, f_j(Y_{r+1}, . . . , Y_{r+1−s}; θ) ),    (7.6)

i = 1, . . . , q_j. Then we have

π̆_j^{(i−1)}(θ) = ă_j(θ)^T Z_j^{(i−1)},

where ă_j(θ)^T = ( ă_{j0}(θ), ă_{j*}(θ)^T ) with

ă_{j*}(θ) = C_j(θ)^{−1} b_j(θ)    (7.7)

and

ă_{j0}(θ) = E_θ( f_j(Y_{s+1}, . . . , Y_1; θ) ) − Σ_{k=1}^{q_j} ă_{jk}(θ) E_θ(Z_{jk}^{(r)}).    (7.8)
That Cj (θ) is invertible follows from the assumption that the functions hjk are linearly
independent. If f_j(Y_i, . . . , Y_{i−s}; θ) has mean zero under P_θ for all θ ∈ Θ, we need not include
a constant in the space of predictors, i.e. we need only the space spanned by Z_{j1}^{(i−1)}, . . . , Z_{jq_j}^{(i−1)}.
( φ_{ℓ,1}(θ)   )   ( φ_{ℓ−1,1}(θ)   )              ( φ_{ℓ−1,ℓ−1}(θ) )
(     ⋮       ) = (      ⋮         ) − φ_{ℓ,ℓ}(θ) (      ⋮         )
( φ_{ℓ,ℓ−1}(θ))   ( φ_{ℓ−1,ℓ−1}(θ) )              ( φ_{ℓ−1,1}(θ)   )

and

v_ℓ(θ) = v_{ℓ−1}(θ)( 1 − φ_{ℓ,ℓ}(θ)^2 ).
The algorithm is run for ℓ = 2, . . . , r. Then
ă∗ (θ) = (φr,1 (θ), . . . , φr,r (θ)),
while ă_0 can be found from (7.8), which simplifies to

ă_0(θ) = E_θ(Y_1)( 1 − Σ_{k=1}^r φ_{r,k}(θ) ).

The quantity v_r(θ) is the prediction error E_θ((Y_i − π̆^{(i−1)})^2). Note that if we want to include a
further lagged value of Y in the predictor, we just iterate the algorithm once more.
2
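A compact Python implementation of the recursion is given below (names ours). It assumes that the model autocovariances κ_ℓ(θ) = Cov_θ(Y_1, Y_{1+ℓ}) are available as a function; the initialisation and the formula for the reflection coefficient φ_{ℓ,ℓ}(θ) are the standard Durbin-Levinson steps, which are not reproduced in the text above.

import numpy as np

def durbin_levinson(kappa, r):
    """Return (phi_{r,1}, ..., phi_{r,r}) and the prediction error v_r.
    kappa(l) must return Cov(Y_1, Y_{1+l}) under the model."""
    phi = np.array([kappa(1) / kappa(0)])              # phi_{1,1} (standard initialisation)
    v = kappa(0) * (1.0 - phi[0] ** 2)                 # v_1
    for l in range(2, r + 1):
        refl = (kappa(l) - np.dot(phi, [kappa(l - k) for k in range(1, l)])) / v
        phi = np.append(phi - refl * phi[::-1], refl)  # the vector recursion in the text
        v = v * (1.0 - refl ** 2)                      # v_l = v_{l-1}(1 - phi_{l,l}^2)
    return phi, v
# a_*(theta) is then phi, and a_0(theta) = E(Y_1)(1 - sum(phi)) as in (7.8)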
We will now find the optimal prediction-based estimating function of the form (7.5) in
the sense explained in Section 9. First we express the estimating function in a more compact
way. The ℓth coordinate of the vector Π_j^{(i−1)}(θ) can be written as

π_{ℓ,j}^{(i−1)}(θ) = Σ_{k=0}^{q_j} a_{ℓjk}(θ) Z_{jk}^{(i−1)},    ℓ = 1, . . . , p,

so that the estimating function (7.5) can be written compactly as

G_n(θ) = A(θ) Σ_{i=r+1}^n H^{(i)}(θ),    (7.10)
where
A(θ) = ( a_{110}(θ) · · · a_{11q_1}(θ) · · · · · · a_{1N0}(θ) · · · a_{1Nq_N}(θ)
            ⋮                ⋮                       ⋮               ⋮
         a_{p10}(θ) · · · a_{p1q_1}(θ) · · · · · · a_{pN0}(θ) · · · a_{pNq_N}(θ) ),

and

H^{(i)}(θ) = Z^{(i−1)} ( F(Y_i, . . . , Y_{i−s}; θ) − π̆^{(i−1)}(θ) ),    (7.11)

with F = (f_1, . . . , f_N)^T, π̆^{(i−1)}(θ) = ( π̆_1^{(i−1)}(θ), . . . , π̆_N^{(i−1)}(θ) )^T, and

Z^{(i−1)} = ( Z_1^{(i−1)}   0_{q_1}       · · ·   0_{q_1}
              0_{q_2}       Z_2^{(i−1)}   · · ·   0_{q_2}
                 ⋮              ⋮                    ⋮
              0_{q_N}       0_{q_N}       · · ·   Z_N^{(i−1)} ).    (7.12)
Here 0_{q_j} denotes the q_j-dimensional zero-vector. When we have chosen the functions f_j and
the predictor spaces, the quantities H^{(i)}(θ) are completely determined, whereas we are free
to choose the matrix A(θ) in an optimal way, i.e. such that the asymptotic variance of the
estimators is minimized.
We will find an explicit expression for the optimal weight matrix, A*(θ), under the fol-
lowing condition, in which we need one further definition:

ă(θ) = ( ă_{10}(θ), . . . , ă_{1q_1}(θ), . . . , ă_{N0}(θ), . . . , ă_{Nq_N}(θ) )^T,    (7.13)

where the ă_{jk}s define the minimum mean square error predictor. Specifically, π̆^{(i−1)}(θ) =
(Z^{(i−1)})^T ă(θ).
Condition 7.2
(1) The function F (y1 , . . . , ys+1; θ) and the coordinates of ă(θ) are continuously differentiable
functions of θ.
(2) p ≤ p̄ = N + q1 + · · · + qN .
(3) The p̄ × p-matrix ∂_{θ^T} ă(θ) has rank p.
(4) The functions 1, f1 , . . . , fN are linearly independent (for fixed θ) on the support of the
conditional distribution of (Yi , . . . , Yi−s ) given (Xi−1 , . . . , Xi−r ).
(5) The p̄ × p-matrix

U(θ)^T = E_θ( Z^{(i−1)} ∂_{θ^T} F(Y_i, . . . , Y_{i−s}; θ) )    (7.14)

exists.
If we denote the optimal prediction-based estimating function by G∗n (θ), then
E_θ( G_n(θ) G*_n(θ)^T ) = (n − r) A(θ) M̄_n(θ) A*_n(θ)^T,

where

M̄_n(θ) = E_θ( H^{(r+1)}(θ) H^{(r+1)}(θ)^T )
  + Σ_{k=1}^{n−r−1} ((n − r − k)/(n − r)) { E_θ( H^{(r+1)}(θ) H^{(r+1+k)}(θ)^T ) + E_θ( H^{(r+1+k)}(θ) H^{(r+1)}(θ)^T ) },    (7.15)

which is the covariance matrix of Σ_{i=r+1}^n H^{(i)}(θ)/√(n − r). The sensitivity function (9.1) is
given by

S_{G_n}(θ) = (n − r) A(θ)( U(θ)^T − D(θ) ∂_{θ^T} ă(θ) ),
where the p̄ × p̄-matrix D(θ) is given by

D(θ) = E_θ( Z^{(i−1)} (Z^{(i−1)})^T ).    (7.16)
It follows from Theorem 9.1 that A*_n(θ) is optimal if E_θ( G_n(θ) G*_n(θ)^T ) = S_{G_n}(θ). Under
Condition 7.2 (4) the matrix M̄_n(θ) is invertible, see Sørensen (2000), so it follows that

A*_n(θ) = ( U(θ) − ∂_θ ă(θ)^T D(θ) ) M̄_n(θ)^{−1},    (7.17)

so that the estimating function

G*_n(θ) = A*_n(θ) Σ_{i=r+1}^n Z^{(i−1)} ( F(Y_i, . . . , Y_{i−s}; θ) − π̆^{(i−1)}(θ) )    (7.18)
is Godambe optimal. When the function F does not depend on θ, the expression for A∗n (θ)
simplifies slightly as in this case U(θ) = 0.
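Once U(θ), D(θ), M̄_n(θ), ∂_θ ă(θ)^T and the quantities entering H^{(i)}(θ) have been computed, (7.17) and (7.18) are plain linear algebra, as the following numpy sketch (names ours) indicates.

import numpy as np

def optimal_weight(U, D, dtheta_a, Mbar):
    """A*_n(theta) = (U - (d/dtheta a)^T D) Mbar^{-1}, cf. (7.17); dtheta_a is p x pbar."""
    return np.linalg.solve(Mbar.T, (U - dtheta_a @ D).T).T

def G_star(A_star, Z_list, F_list, pred_list):
    """G*_n(theta) = A*_n(theta) sum_i Z^{(i-1)} (F_i - predictor_i), cf. (7.18)."""
    H = sum(Z @ (F - pred) for Z, F, pred in zip(Z_list, F_list, pred_list))
    return A_star @ H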
Example 7.3 Consider again the type of prediction-based estimating function discussed in
Example 7.1. In order to calculate (7.15), we need mixed moments of the form
E_ψ[ Y_1^{k_1} Y_{t_1}^{k_2} Y_{t_2}^{k_3} Y_{t_3}^{k_4} ],    (7.19)

for 1 ≤ t_1 ≤ t_2 ≤ t_3 and k_1 + k_2 + k_3 + k_4 ≤ 4N, where k_i, i = 1, . . . , 4, are non-negative
2
7.2 Asymptotics
A prediction-based estimating function of the form (7.10) gives consistent and asymptotically
normal estimators under the following condition, where θ0 as usual is the true parameter
value.
Condition 7.4
(1) The diffusion process X is stationary and geometrically α-mixing.
and
E_{θ_0}( | Z_{jk}^{(r)} Z_{jℓ}^{(r)} |^{2+δ} ) < ∞,

for j = 1, . . . , N, k, ℓ = 0, . . . , q_j.
(3) The function F (y1, . . . , ys+1; θ) and the components of A(θ) and ă(θ), given by (7.13) are
continuously differentiable functions of θ.
(4) The matrix W = A(θ_0)( U(θ_0)^T − D(θ_0) ∂_{θ^T} ă(θ_0) ) has full rank p. The matrices U(θ) and
D(θ) are given by (7.14) and (7.16).
(5)
A(θ)( E_{θ_0}( Z^{(i−1)} F(Y_i, . . . , Y_{i−s}; θ) ) − D(θ_0) ă(θ) ) ≠ 0

for all θ ≠ θ_0.
Condition 7.4 (1) and (2) ensures that the central limit theorem (2.3) holds and that
M̄n (θ0 ) → M(θ0 ), where
M(θ) = E_θ( H^{(r+1)}(θ) H^{(r+1)}(θ)^T )
  + Σ_{k=1}^∞ { E_θ( H^{(r+1)}(θ) H^{(r+1+k)}(θ)^T ) + E_θ( H^{(r+1+k)}(θ) H^{(r+1)}(θ)^T ) }.
The concept of geometric α-mixing was explained in Subsection 5.1, where also conditions
for geometric α-mixing were discussed. It is not difficult to see that if the basic diffusion
process X is geometrically α-mixing, then the observed process Y inherits this property.
As explained in Subsection 5.1, we only need to check Condition 2.1 with θ̄ = θ0 to obtain
asymptotic results for prediction-based estimators. The condition (2.4) is satisfied because
of (7.4). It is easy to see that Condition 7.4 (3) and (4) imply that θ 7→ g(y_1, . . . , y_{r+1}; θ)
is continuously differentiable and that g as well as ∂_{θ^T} g are locally dominated integrable
under P_{θ_0}. Finally, the condition (2.7) is identical to Condition 7.4 (5). Therefore it follows
from Theorem 2.2 that a consistent Gn –estimator θ̂n exists and is the unique Gn –estimator
on any bounded subset of Θ containing θ0 with probability approaching one as n → ∞. The
estimator satisfies that
√n (θ̂_n − θ_0) −→^L N_p( 0, W^{−1} A(θ_0) M(θ_0) A(θ_0)^T (W^T)^{−1} ).
7.3 Integrated diffusions
Sometimes a diffusion process cannot be observed directly, but data of the form

Y_i = (1/∆) ∫_{(i−1)∆}^{i∆} X_s ds,    i = 1, . . . , n,    (7.20)
are available for some fixed ∆. Such observations might be obtained when the process X
is observed after passage through an electronic filter. Another example is provided by ice-
core records. The isotope ratio 18 O/16 O in the ice, measured as an average in pieces of ice,
each piece representing a time interval with time increasing as a function of the depth, is a
proxy for paleo-temperatures. The variation of the paleo-temperature can be modelled by a
stochastic differential equation, and it is natural to model the ice-core data as an integrated
diffusion process, see Ditlevsen, Ditlevsen & Andersen (2002). Estimation based on this
type of data was considered by Gloter (2000), Bollerslev & Wooldridge (1992), Ditlevsen &
Sørensen (2004), and Gloter (2006).
The model for data of the type (7.20) is a particular case of (7.1) with
b(x; θ) = ( b_1(x_1; θ), x_1 )^T,    σ(x; θ) = ( σ_1(x_1; θ)  0 ; 0  0 ),
with X2,0 = 0, where only the second coordinate is observed. A stochastic differential
equation of this form is called hypoelliptic. Clearly the second coordinate is not stationary,
but if the first coordinate is a stationary process, then the observed increments Yi = (X2,i∆ −
X2,(i−1)∆ )/∆ form a stationary sequence. In the following we will again denote the basic
diffusion by X (rather than X1 ).
Suppose that the 4N'th moment of X_t is finite. The moments (7.9) and (7.19) can be
calculated by

E[ Y_1^{k_1} Y_{t_1}^{k_2} Y_{t_2}^{k_3} Y_{t_3}^{k_4} ] = ∆^{−(k_1+k_2+k_3+k_4)} ∫_A E[ X_{v_1} · · · X_{v_{k_1}} X_{u_1} · · · X_{u_{k_2}} X_{s_1} · · · X_{s_{k_3}} X_{r_1} · · · X_{r_{k_4}} ] dt,

where 1 ≤ t_1 ≤ t_2 ≤ t_3, A = [0, ∆]^{k_1} × [(t_1 − 1)∆, t_1∆]^{k_2} × [(t_2 − 1)∆, t_2∆]^{k_3} × [(t_3 − 1)∆, t_3∆]^{k_4},
and dt = dr_{k_4} · · · dr_1 ds_{k_3} · · · ds_1 du_{k_2} · · · du_1 dv_{k_1} · · · dv_1. The domain of inte-
gration can be reduced considerably by symmetry arguments, but here the point is that we
need to calculate mixed moments of the type E(X_{t_1}^{κ_1} · · · X_{t_k}^{κ_k}), where t_1 < · · · < t_k. For
the Pearson diffusions discussed in Subsection 3.7, these mixed moments can be calculated
by a simple iterative formula obtained from (3.75) and (3.76). Moreover, for the Pearson
diffusions, E(X_{t_1}^{κ_1} · · · X_{t_k}^{κ_k}) depends on t_1, . . . , t_k through sums and products of exponential
functions, cf. (3.75). Therefore the integral above can be explicitly calculated, so that ex-
plicit optimal estimating functions of the type considered in Example 7.1 are available for
observations of integrated Pearson diffusions.
Example 7.5 Consider observation of an integrated square root process (3.27) and a prediction-
based estimating function with f_1(x) = x and f_2(x) = x^2 with predictors given by π_1^{(i−1)} =
α_{1,0} + α_{1,1} Y_{i−1} and π_2^{(i−1)} = α_{2,0}. Then the minimum mean square error predictors are

π̆_1^{(i−1)}(Y_{i−1}; θ) = α(1 − ă(β)) + ă(β) Y_{i−1},
π̆_2^{(i−1)}(θ) = α^2 + ατ^2 β^{−3} ∆^{−2}( e^{−β∆} − 1 + β∆ )

with

ă(β) = (1 − e^{−β∆})^2 / ( 2(β∆ − 1 + e^{−β∆}) ).
The optimal prediction-based estimating function is

Σ_{i=2}^n (1, Y_{i−1}, 0)^T [ Y_i − π̆_1^{(i−1)}(Y_{i−1}; θ) ] + Σ_{i=2}^n (0, 0, 1)^T [ Y_i^2 − π̆_2^{(i−1)}(θ) ],

σ̂^2 = β̂^3 ∆^2 Σ_{i=2}^n ( Y_i^2 − α̂^2 ) / ( (n − 1) α̂ ( e^{−β̂∆} − 1 + β̂∆ ) ).
The estimators are explicit apart from β̂, which can be found by solving a non-linear equation
in one variable. Details can be found in Ditlevsen & Sørensen (2004).
2
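Since ă(β) is the coefficient of Y_{i−1} in the best linear one-step predictor, i.e. the lag-one autocorrelation of the Y_i's under the model, one simple possibility, shown only as an illustration and not necessarily the exact equation used by Ditlevsen & Sørensen (2004), is to solve ă(β) = (empirical lag-one autocorrelation of Y) numerically:

import numpy as np
from scipy.optimize import brentq

def a_coef(beta, Delta):
    """a(beta) from Example 7.5."""
    e = np.exp(-beta * Delta)
    return (1.0 - e) ** 2 / (2.0 * (beta * Delta - 1.0 + e))

def beta_estimate(Y, Delta, upper=100.0):
    """Match a(beta) to the empirical lag-one autocorrelation of Y (illustrative only)."""
    Y = np.asarray(Y, dtype=float)
    rho1 = np.corrcoef(Y[:-1], Y[1:])[0, 1]
    return brentq(lambda b: a_coef(b, Delta) - rho1, 1e-8, upper)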
βi > 0, is found in many time series data. Examples are financial time series, Barndorff-
Nielsen & Shephard (2001), and turbulence, Barndorff-Nielsen, Jensen & Sørensen (1990)
and Bibby, Skovgaard & Sørensen (2005).
A simple model with autocorrelation function of the form (7.21) is the sum of diffusions
Yt = X1,t + . . . + XD,t
where
dX_{i,t} = −β_i(X_{i,t} − α_i)dt + σ_i(X_{i,t})dW_{i,t},    i = 1, . . . , D,

are independent. In this case

φ_i = Var(X_{i,t}) / ( Var(X_{1,t}) + · · · + Var(X_{D,t}) ).
Sums of diffusions of this type with a pre-specified marginal distribution of Y were considered
by Bibby & Sørensen (2003) and Bibby, Skovgaard & Sørensen (2005). The same type of
autocorrelation function is obtained for sums of independent Ornstein-Uhlenbeck processes
driven by Lévy processes. This class of models was introduced and studied in Barndorff-
Nielsen, Jensen & Sørensen (1998).
Example 7.6 Sum of square root processes. If σi2 (x) = 2βi bx and αi = κi b for some b >
0, then the stationary distribution of Yt is a gamma-distribution with shape parameter
κ1 + · · · + κD and scale parameter b. The weights in the autocorrelation function are φi =
κi /(κ1 + · · · + κD ).
2
For sums of the Pearson diffusions presented in Subsection 3.7, we have explicit formulae
that allow calculation of (7.9) and (7.19), provided these mixed moments exists. Thus for
sums of Pearson diffusions we have explicit optimal prediction-based estimating functions of
the type considered in Example 7.1. By the multinomial formula,
E( Y_{t_1}^κ Y_{t_2}^ν ) = Σ Σ (κ choose κ_1, . . . , κ_D)(ν choose ν_1, . . . , ν_D) E( X_{1,t_1}^{κ_1} X_{1,t_2}^{ν_1} ) · · · E( X_{D,t_1}^{κ_D} X_{D,t_2}^{ν_D} ),

where

(κ choose κ_1, . . . , κ_D) = κ!/(κ_1! · · · κ_D!)

is the multinomial coefficient, and where the first sum is over 0 ≤ κ_1, . . . , κ_D such that
κ_1 + · · · + κ_D = κ, and the second sum is analogous for the ν_i's. Higher order mixed moments
of the form (7.19) can be found by a similar formula with four sums and four multinomial
coefficients. Such formulae may appear daunting, but are easy to program. For a Pearson
diffusion, mixed moments of the form E(Xtκ11 · · · Xtκkk ) can be calculated by a simple iterative
formula obtained from (3.75) and (3.76).
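The multinomial expansion above, and its four-sum analogue for (7.19), are indeed easy to program. A sketch (names ours) for the two-factor case, assuming a user-supplied function mixed_moment(i, k1, k2, t1, t2) that returns E(X_{i,t1}^{k1} X_{i,t2}^{k2}) for the i'th component:

from math import comb
from itertools import product

def compositions(n, parts):
    """All tuples of non-negative integers of length 'parts' summing to n."""
    if parts == 1:
        yield (n,)
        return
    for k in range(n + 1):
        for rest in compositions(n - k, parts - 1):
            yield (k,) + rest

def multinomial(ks):
    out, total = 1, 0
    for k in ks:
        total += k
        out *= comb(total, k)
    return out

def sum_mixed_moment(kappa, nu, D, mixed_moment, t1, t2):
    """E(Y_{t1}^kappa Y_{t2}^nu) for Y_t = X_{1,t} + ... + X_{D,t}, independent components."""
    total = 0.0
    for ks, ns in product(compositions(kappa, D), compositions(nu, D)):
        term = multinomial(ks) * multinomial(ns)
        for i in range(D):
            term *= mixed_moment(i, ks[i], ns[i], t1, t2)
        total += term
    return total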
with

σ^2 = Var(Y_i) = (1 + ρ^2)( ν_1/(ν_1 − 2) + ν_2/(ν_2 − 2) ),

ζ_{21} = Cov(Y_{i−1}, Y_i^2)/Var(Y_i) = 4ρ{ (√ν_1/(ν_1 − 3)) φ_1 e^{−β_1∆} + (√ν_2/(ν_2 − 3)) φ_2 e^{−β_2∆} }.
Solving equation (7.22) for ζ_{21} and σ^2 we get

ζ̂_{21} = [ (1/(n−1)) Σ_{i=2}^n Y_{i−1}Y_i^2 − ( (1/(n−1)) Σ_{i=2}^n Y_{i−1} )( (1/(n−1)) Σ_{i=2}^n Y_i^2 ) ]
         / [ (1/(n−1)) Σ_{i=2}^n Y_{i−1}^2 − ( (1/(n−1)) Σ_{i=2}^n Y_{i−1} )^2 ],

σ̂^2 = (1/(n−1)) Σ_{i=2}^n Y_i^2 + ζ̂_{21} (1/(n−1)) Σ_{i=2}^n Y_{i−1},
and insert σ̂^2 for σ^2. Thus, we get a one-dimensional estimating equation, ζ_{21}(β, φ, σ̂^2, ρ) =
ζ̂_{21}, which can be solved numerically. Finally, by inverting φ_i = (1 + ρ^2)ν_i/(σ^2(ν_i − 2)) we
find the estimates

ν̂_i = 2φ_i σ̂^2 / ( φ_i σ̂^2 − (1 + ρ̂^2) ),    i = 1, 2.
2
where only a subset of the coordinates are observed. Here B(θ) is a D × D-matrix, b(θ)
is a D-dimensional vector, σ(x; θ) is a D × D-matrix, and W a D-dimensional standard
Wiener process. Compartment models are used to model the dynamics of the flow of a
certain substance between different parts (compartments) of, for instance, an ecosystem
or the body of a human being or an animal. The process Xt is the concentration in the
compartments, and flow from a given compartment into other compartments is proportional
to the concentration in the given compartment modified by the random perturbation given
by the diffusion term. The vector b(θ) represents input to or output from the system.
where all parameters are positive, was used by Bibby (1995) to model how a radioactive
tracer moved between the water and the biosphere in a certain ecosystem. Samples could
only be taken from the water, the first compartment, so Yi = X1,ti . The model is Gaussian,
so likelihood inference is feasible and was studied by Bibby (1995). All mixed moments (7.9)
and (7.19) can be calculated explicitly, so also an explicit optimal prediction-based estimating
function of the type considered in Example 7.1 is available to estimate the parameters and
was studied by Düring (2002).
2
Example 7.9 A non-Gaussian diffusion compartment model is obtained by the specification
σ(x, θ) = diag(τ_1 √x_1, . . . , τ_D √x_D). This multivariate version of the square root process was
studied by Düring (2002), who used methods in Down, Meyn & Tweedie (1995) to show
that the D-dimensional process is geometrically α-mixing and established the asymptotic
normality of prediction-based estimators of the type considered in Example 7.1 when the
first compartment is observed, i.e. when Yi = X1,ti . In this case, the mixed moments (7.9)
and (7.19) must be calculated numerically.
2
Definition 8.1 a) The domain of Gn -estimators (for a given n) is the set An of all obser-
vations x = (x1 , . . . , xn ) for which Gn (θ) = 0 for at least one value θ ∈ Θ.
b) A Gn -estimator, θ̂n (x) is any function of the data with values in Θδ , such that for
P –almost all observations we have either θ̂n (x) ∈ Θ and Gn (θ̂n (x), x) = 0 if x ∈ An , or
θ̂_n(x) = δ if x ∉ A_n.
We usually suppress the dependence on the observations in the notation and write θ̂n .
The following theorem gives conditions that ensure that, for n large enough, the estimat-
ing equation (1.1) has a solution that converges to a particular parameter value θ̄. When
the statistical model contains the true model, the estimating function should preferably be
chosen such that θ̄ = θ0 . To facilitate the following discussion, we will refer to an estimator
that converges to θ̄ in probability as a θ̄–consistent estimator, meaning that it is a (weakly)
consistent estimator of θ̄. We assume that Gn (θ) is differentiable with respect to θ and
denote by ∂θT Gn (θ) the p × p-matrix, where the ijth entry is ∂θj Gn (θ)i .
Theorem 8.2 Suppose the existence of a parameter value θ̄ ∈ int Θ (the interior of Θ),
a connected neighbourhood M of θ̄, and a (possibly random) function W on M taking its
values in the set of p × p matrices, such that the following holds:
P
(i) Gn (θ̄) → 0 (convergence in probability, w.r.t. the true measure P ) as n → ∞.
(ii) Gn (θ) is continuously differentiable on M for all n, and
sup_{θ∈M} ‖ ∂_{θ^T} G_n(θ) − W(θ) ‖ −→^P 0.    (8.1)
Note that (8.1) implies the existence of a subsequence {nk } such that ∂θT Gnk (θ) converges
uniformly to W (θ) on M with probability one. Hence W is continuous (up to a null set)
and it follows from elementary calculus that outside some P –null set there exists a unique
continuously differentiable function G satisfying ∂θT G(θ) = W (θ) for all θ ∈ M and G(θ̄) = 0.
When M is a bounded set, (8.1) implies that

sup_{θ∈M} | G_n(θ) − G(θ) | −→^P 0.    (8.2)
This observation casts light on the result of Theorem 8.2. Since Gn (θ) can be made arbitrarily
close to G(θ) by choosing n large enough, and since G(θ) has a zero at θ̄, it is intuitively
clear that Gn (θ) must have a zero near θ̄ when n is sufficiently large.
If we impose an identifiability condition, we can give a stronger result on any sequence
of Gn –estimators. By B̄ǫ (θ) we denote the closed ball with radius ǫ centered at θ.
Theorem 8.3 Assume (8.2) for some subset M of Θ containing θ̄, and that

P( inf_{θ ∈ M \ B̄_ǫ(θ̄)} |G(θ)| > 0 ) = 1    (8.3)
Theorem 8.4 Assume the estimating function Gn satisfies the conditions of Theorem 8.2
and that there is a sequence of real numbers an > 0 increasing to ∞ such that
( a_n G_n(θ̄), ∂_{θ^T} G_n(θ̄) ) −→^L ( Z, W(θ̄) ),    (8.5)
If moreover W (θ̄) is non-random, then the limit distribution is a normal distribution with
expectation zero and covariance matrix W (θ̄)−1 V W (θ̄)∗−1 .
the equation Gn (θ) = 0 tends to have a solution near the true parameter value, where the
expectation of Gn (θ) is equal to zero. Thus a good estimating function is one with a large
absolute value of the sensitivity.
Ideally, we would base the statistical inference on the likelihood function Ln (θ), and
hence use the score function Un (θ) = ∂θ log Ln (θ) as our estimating function. This usually
yields an efficient estimator. However, when Ln (θ) is not available or is difficult to calculate,
we might prefer to use an estimating function that is easier to obtain and is in some sense
close to the score function. Suppose that both Un (θ) and Gn (θ) have finite variance. Then
it can be proven under usual regularity conditions that
SGn (θ) = −Covθ (Gn (θ), Un (θ)).
Thus we can find an estimating function Gn (θ) that maximizes the absolute value of the
correlation between Gn (θ) and Un (θ) by finding one that maximizes the quantity
KGn (θ) = SGn (θ)2 /Varθ (Gn (θ)) = SGn (θ)2 /Eθ (Gn (θ)2 ), (9.2)
which is known as the Godambe information. This makes intuitive sense: the ratio KGn (θ)
is large when the sensitivity is large and when the variance of Gn (θ) is small. The Godambe
information is a natural generalization of the Fisher information. Indeed, KUn (θ) is the
Fisher information. For a discussion of information quantities in a stochastic process setting,
see Barndorff-Nielsen & Sørensen (1994). In a short while, we shall see that the Godambe
information has a large sample interpretation too. An estimating function G∗n ∈ Gn is called
Godambe-optimal in Gn if
KG∗n (θ) ≥ KGn (θ) (9.3)
for all θ ∈ Θ and for all Gn ∈ Gn .
When the parameter θ is multivariate (p > 1), the sensitivity function is the p × p-matrix
SGn (θ) = Eθ (∂θT Gn (θ)). (9.4)
For a multivariate parameter, the Godambe information is the p × p-matrix

K_{G_n}(θ) = S_{G_n}(θ)^T ( E_θ( G_n(θ) G_n(θ)^T ) )^{−1} S_{G_n}(θ),    (9.5)
and an optimal estimating function G∗n can be defined by (9.3) with the inequality referring to
the partial ordering of the set of positive semi-definite p × p-matrices. Whether a Godambe-
optimal estimating function exists and whether it is unique depends on the class Gn . In
any case, it is only unique up to multiplication by a regular matrix that might depend on
θ. Specifically, if G∗n (θ) satisfies (9.3), then so does Mθ G∗n (θ) where Mθ is an invertible
deterministic p × p-matrix. Fortunately, the two estimating functions give rise to the same
estimator(s), and we refer to them as versions of the same estimating function. For theoretical
purposes a standardized version of the estimating functions is useful. The standardized
version of Gn (θ) is given by
G_n^{(s)}(θ) = −S_{G_n}(θ)^T ( E_θ( G_n(θ) G_n(θ)^T ) )^{−1} G_n(θ).
an identity usually satisfied by the score function. The standardized estimating function
G_n^{(s)}(θ) is therefore more directly comparable to the score function. Note that when the
second Bartlett identity is satisfied, the Godambe information equals minus the sensitivity
matrix.
A Godambe-optimal estimating function is close to the score function U_n in an L^2-sense.
Suppose G*_n is Godambe-optimal in G_n. Then the standardized version G_n^{*(s)}(θ) satisfies the
inequality

E_θ( (G_n^{(s)}(θ) − U_n(θ))^T (G_n^{(s)}(θ) − U_n(θ)) ) ≥ E_θ( (G_n^{*(s)}(θ) − U_n(θ))^T (G_n^{*(s)}(θ) − U_n(θ)) )
for all θ ∈ Θ and for all Gn ∈ Gn , see Heyde (1988). In fact, if Gn is a closed subspace of the
L2 -space of all square integrable functions of the data, then the quasi-score function is the
orthogonal projection of the score function onto Gn . For further discussion of this Hilbert
space approach to estimating functions, see McLeish & Small (1988). The interpretation
of an optimal estimating function as an approximation to the score function is important.
By choosing a sequence of classes Gn that, as n → ∞, converges to a subspace containing
the score function Un , a sequence of estimators that is asymptotically fully efficient can be
constructed.
The following result by Heyde (1988) can often be used to find the optimal estimating
function.
The condition (9.7) can often be verified by showing that Eθ (Gn (θ)G∗n (θ)T ) = −Eθ (∂θT Gn (θ))
for all θ ∈ Θ and for all Gn ∈ Gn . In such situations, G∗n satisfies the second Bartlett-identity,
(9.6), so that
KG∗n (θ) = Eθ G∗n (θ)G∗n (θ)T .
estimating functions. Suppose the estimating function Gn (θ) satisfies the conditions of the
central limit theorem for martingales and let θ̂n be a solution of the equation Gn (θ) = 0.
Under the regularity conditions of the previous section, it can be proved that
⟨G(θ)⟩_n^{−1/2} Ḡ_n(θ)( θ̂_n − θ_0 ) −→^D N(0, I_p).    (9.8)
Here ⟨G(θ)⟩_n is the quadratic characteristic of G_n(θ) defined by

⟨G(θ)⟩_n = Σ_{i=1}^n E_θ( (G_i(θ) − G_{i−1}(θ))(G_i(θ) − G_{i−1}(θ))^T | F_{i−1} ),
using the extra assumption that Ḡ_n(θ)^{−1} ∂_{θ^T} G_n(θ) −→^{P_θ} I_p. Details can be found in Heyde
(1988). We see that the inverse of the data-dependent matrix
I_{G_n}(θ) = Ḡ_n(θ)^T ⟨G(θ)⟩_n^{−1} Ḡ_n(θ)    (9.9)
estimates the co-variance matrix of the asymptotic distribution of the estimator θ̂n . Therefore
IGn (θ) can be interpreted as an information matrix, called the Heyde-information. It gener-
alizes the incremental expected information of the likelihood theory for stochastic processes,
see Barndorff-Nielsen & Sørensen (1994). Since Ḡn (θ) estimates the sensitivity function,
and hG(θ)in estimates the variance of the asymptotic distribution of Gn (θ), the Heyde-
information has a heuristic interpretation similar to that of the Godambe-information. In
fact,
E_θ( Ḡ_n(θ) ) = S_{G_n}(θ)   and   E_θ( ⟨G(θ)⟩_n ) = E_θ( G_n(θ) G_n(θ)^T ).
We can thus think of the Heyde-information as an estimated version of the Godambe infor-
mation.
Let Gn be a class of martingale estimating functions with finite variance. We say that a
martingale estimating function G∗n is Heyde-optimal in Gn if
IG∗n (θ) ≥ IGn (θ) (9.10)
Pθ -almost surely for all θ ∈ Θ and for all Gn ∈ Gn .
The following useful result from Heyde (1988) is similar to Theorem 9.1. In order to
formulate it, we need the concept of the quadratic co-characteristic of two martingales, G
and G̃, both of which are assumed to have finite variance:
⟨G, G̃⟩_n = Σ_{i=1}^n E( (G_i − G_{i−1})(G̃_i − G̃_{i−1})^T | F_{i−1} ).    (9.11)
Since in many situations condition (9.12) can be verified by showing that ⟨G(θ), G*(θ)⟩_n =
−Ḡ_n(θ) for all Gn ∈ Gn, it is in practice often the case that Heyde-optimality implies
Godambe-optimality.
Example 9.3 Let us consider a common type of estimating functions. To simplify the
exposition we assume that the observed process is Markovian. For Markov processes it is
natural to base estimating functions on functions hij (y, x; θ), j = 1, . . . , N, i = 1, . . . , n
satisfying that
Eθ (hij (Xi , Xi−1 ; θ)|Fi−1 ) = 0. (9.13)
Such functions define relationships (dependent on θ) between consecutive observations Xi and
Xi−1 that are, on average, equal to zero. It is natural to use such relationships to estimate
θ by solving the equations Σ_{i=1}^n h_{ij}(X_i, X_{i−1}; θ) = 0. In order to estimate θ it is necessary
that N ≥ p, but if N > p we have too many equations. The theory of optimal estimating
functions tells us how to combine the N functions in an optimal way. We consider the class
of p-dimensional estimating functions of the form
G_n(θ) = Σ_{i=1}^n a_i(X_{i−1}; θ) h_i(X_i, X_{i−1}; θ),    (9.14)
where hi denotes the N-dimensional vector (hi1 , . . . , hiN )T , and ai (x; θ) is a function from
IR × Θ into the set of p × N-matrices that is differentiable with respect to θ. It follows from
(9.13) that Gn (θ) is a p-dimensional unbiased martingale estimating function.
We will now find the matrices ai that combine the N functions hij in an optimal way.
Let Gn be the class of martingale estimating functions of the form (9.14) that have finite
variance. Then

Ḡ_n(θ) = Σ_{i=1}^n a_i(X_{i−1}; θ) E_θ( ∂_{θ^T} h_i(X_i, X_{i−1}; θ) | F_{i−1} )
and

⟨G(θ), G*(θ)⟩_n = Σ_{i=1}^n a_i(X_{i−1}; θ) V_{h_i}(X_{i−1}; θ) a*_i(X_{i−1}; θ)^T,
where

G*_n(θ) = Σ_{i=1}^n a*_i(X_{i−1}; θ) h_i(X_i, X_{i−1}; θ),    (9.15)
and
V_{h_i}(X_{i−1}; θ) = E_θ( h_i(X_i, X_{i−1}; θ) h_i(X_i, X_{i−1}; θ)^T | F_{i−1} )
is the conditional covariance matrix of the random vector hi (Xi , Xi−1 ; θ) given Fi−1 . If we
assume that Vhi (Xi−1 ; θ) is invertible and define
a*_i(X_{i−1}; θ) = −E_θ( ∂_{θ^T} h_i(X_i, X_{i−1}; θ) | F_{i−1} )^T V_{h_i}(X_{i−1}; θ)^{−1},    (9.16)
then the condition (9.12) is satisfied. Hence by Theorem 9.2 the estimating function G∗n (θ)
with a∗i given by (9.16) is Heyde-optimal - provided, of course, that it has finite variance.
Since Ḡ*_n(θ)^{−1} ⟨G*(θ)⟩_n = −I_p is non-random, the estimating function G*_n(θ) is also Godambe-
optimal. If a*_i were defined without the minus, G*_n(θ) would obviously also be optimal. The
reason for the minus will be clear in the following.
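A sketch of how (9.14)–(9.16) translate into code is given below (Python, names ours). It assumes user-supplied functions h(y, x, theta) returning the N-vector h(y, x; θ), cond_mean_dh(x, theta) returning E_θ(∂_{θ^T} h(X_i, x; θ) | X_{i−1} = x) as an N × p matrix, and cond_cov_h(x, theta) returning V_h(x; θ); for simplicity these functions are taken not to depend on i.

import numpy as np

def optimal_weights(x, theta, cond_mean_dh, cond_cov_h):
    """a*(x; theta) = -E_theta(d/dtheta^T h | X_{i-1} = x)^T V_h(x; theta)^{-1}, cf. (9.16)."""
    return -np.linalg.solve(cond_cov_h(x, theta), cond_mean_dh(x, theta)).T

def G_star_martingale(theta, data, h, cond_mean_dh, cond_cov_h):
    """Optimal martingale estimating function (9.15) for observations X_0, ..., X_n."""
    total = 0.0
    for x_prev, x in zip(data[:-1], data[1:]):
        total = total + optimal_weights(x_prev, theta, cond_mean_dh, cond_cov_h) @ h(x, x_prev, theta)
    return total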
We shall now see, in exactly what sense the optimal estimating function (9.15) approxi-
mates the score function. The following result was first given by Kessler (1996). Let pi (y; θ|x)
denote the conditional density of Xi given that Xi−1 = x. Then the likelihood function for
θ based on the data (X1 , . . . , Xn ) is
n
Y
Ln (θ) = pi (Xi ; θ|Xi−1 )
i=1
(with p1 denoting the unconditional density of X1 ). If we assume that all pi s are differentiable
with respect to θ, the score function is
U_n(θ) = Σ_{i=1}^n ∂_θ log p_i(X_i; θ | X_{i−1}).    (9.17)
Let us fix i, x_{i−1} and θ and consider the L^2-space K_i(x_{i−1}, θ) of functions f : IR → IR for
which ∫ f(y)^2 p_i(y; θ | x_{i−1}) dy < ∞. We equip K_i(x_{i−1}, θ) with the usual inner product

⟨f, g⟩ = ∫ f(y) g(y) p_i(y; θ | x_{i−1}) dy,
and let Hi (xi−1 , θ) denote the N-dimensional subspace of Ki (xi−1 , θ) spanned by the functions
y 7→ hij (y, xi−1 ; θ), j = 1, . . . , N. That the functions are linearly independent in Ki (xi−1 , θ)
follows from the earlier assumption that the covariance matrix Vhi (xi−1 ; θ) is regular.
Now, assume that ∂_{θ_j} log p_i(y | x_{i−1}; θ) ∈ K_i(x_{i−1}, θ) for j = 1, . . . , p, denote by g_{ij}^* the
orthogonal projection with respect to ⟨·, ·⟩ of ∂_{θ_j} log p_i onto H_i(x_{i−1}, θ), and define a p-
dimensional function by g_i^* = (g_{i1}^*, . . . , g_{ip}^*)^T. Then (under weak regularity conditions)

g_i^*(y) = a_i^*(x_{i−1}; θ) h_i(y, x_{i−1}; θ),    (9.18)

where a_i^* is the matrix defined by (9.16). To see this, note that g^* must have the form (9.18)
with a∗i satisfying the normal equations
Acknowledgements
The research was supported by the Danish Center for Accounting and Finance funded by
the Danish Social Science Research Council and by the Center for Research in Econometric
Analysis of Time Series funded by the Danish National Research Foundation.
References
Aı̈t-Sahalia, Y. (2002). “Maximum likelihood estimation of discretely sampled diffusions: a
closed-form approximation approach”. Econometrica, 70:223–262.
Aı̈t-Sahalia, Y. & Mykland, P. (2003). “The effects of random and discrete sampling when
estimating continuous-time diffusions”. Econometrica, 71:483–549.
Beskos, A.; Papaspiliopoulos, O.; Roberts, G. O. & Fearnhead, P. (2006). “Exact and
computationally efficient likelihood-based estimation for discretely observed diffusion
processes”. J. Roy. Statist. Soc. B, 68:333–382.
Bibby, B. M. (1995). Inference for diffusion processes with particular emphasis on compart-
mental diffusion processes. PhD thesis, University of Aarhus.
Bibby, B. M.; Skovgaard, I. M. & Sørensen, M. (2005). “Diffusion-type models with given
marginals and autocorrelation function”. Bernoulli, 11:191–220.
Bibby, B. M. & Sørensen, M. (1995). “Martingale estimation functions for discretely observed
diffusion processes”. Bernoulli, 1:17–39.
Bibby, B. M. & Sørensen, M. (1996). “On estimation for discretely observed diffusions: a
review”. Theory of Stochastic Processes, 2:49–56.
Bibby, B. M. & Sørensen, M. (2003). “Hyperbolic processes in finance”. In Rachev, S., editor,
Handbook of Heavy Tailed Distributions in Finance, pages 211–248. Elsevier Science.
Billingsley, P. (1961). “The Lindeberg-Lévy theorem for martingales”. Proc. Amer. Math.
Soc., 12:788–792.
Brockwell, P. J. & Davis, R. A. (1991). Time Series: Theory and Methods. Springer-Verlag,
New York.
Chan, K. C.; Karolyi, G. A.; Longstaff, F. A. & Sanders, A. B. (1992). “An empirical
comparison of alternative models of the short-term interest rate”. Journal of Finance,
47:1209–1227.
De Jong, F.; Drost, F. C. & Werker, B. J. M. (2001). “A jump-diffusion model for exchange
rates in a target zone”. Statistica Neerlandica, 55:270–300.
Ditlevsen, P. D.; Ditlevsen, S. & Andersen, K. K. (2002). “The fast climate fluctuations
during the stadial and interstadial climate states”. Annals of Glaciology, 35:457–462.
Doukhan, P. (1994). Mixing, Properties and Examples. Springer, New York. Lecture Notes
in Statistics 85.
Down, D.; Meyn, S. & Tweedie, R. (1995). “Exponential and uniform ergodicity of Markov
processes”. Annals of Probability, 23:1671–1691.
Durham, G. B. & Gallant, A. R. (2002). “Numerical techniques for maximum likelihood
estimation of continuous-time diffusion processes”. J. Business & Econom. Statist.,
20:297–338.
Elerian, O.; Chib, S. & Shephard, N. (2001). “Likelihood inference for discretely observed
non-linear diffusions”. Econometrica, 69:959–993.
Fisher, R. A. (1935). “The logic of inductive inference”. J. Roy. Statist. Soc., 98:39–54.
Genon-Catalot, V. (1990). “Maximum contrast estimation for diffusion processes from dis-
crete observations”. Statistics, 21:99–116.
Genon-Catalot, V. & Jacod, J. (1993). “On the estimation of the diffusion coefficient for
multi-dimensional diffusion processes”. Ann. Inst. Henri Poincaré, Probabilités et Statis-
tiques, 29:119–151.
Gloter, A. & Sørensen, M. (2008). “Estimation for stochastic differential equations with a
small diffusion coefficient”. Stoch. Proc. Appl. To appear.
Gobet, E. (2002). “LAN property for ergodic diffusions with discrete observations”. Ann.
Inst. Henri Poincaré, Probabilités et Statistiques, 38:711–737.
Godambe, V. P. & Heyde, C. C. (1987). “Quasi likelihood and optimal estimation”. Inter-
national Statistical Review, 55:231–244.
Gourieroux, C. & Jasiak, J. (2006). “Multivariate Jacobi process with application to
smooth transitions”. Journal of Econometrics, 131:475–505.
Gradshteyn, I. S. & Ryzhik, I. M. (1965). Table of Integrals, Series, and Products, 4th
Edition. Academic Press, New-York.
Hall, P. & Heyde, C. C. (1980). Martingale Limit Theory and Its Applications. Academic
Press, New York.
Hansen, L. P. (1985). “A method for calculating bounds on the asymptotic covariance matri-
ces of generalized method of moments estimators”. Journal of Econometrics, 30:203–238.
Hansen, L. P.; Heaton, J. C. & Ogaki, M. (1988). “Efficiency bounds implied by multiperiod
conditional restrictions”. Journal of the American Statistical Association, 83:863–871.
Hansen, L. P. & Scheinkman, J. A. (1995). “Back to the future: generating moment impli-
cations for continuous-time Markov processes”. Econometrica, 63:767–804.
Hansen, L. P.; Scheinkman, J. A. & Touzi, N. (1998). “Spectral methods for identifying
scalar diffusions”. Journal of Econometrics, 86:1–32.
Heyde, C. C. (1988). “Fixed sample and asymptotic optimality for classes of estimating
functions”. Contemporary Mathematics, 80:241–247.
Jacod, J. & Sørensen, M. (2008). “Aspects of asymptotic statistical theory for stochastic
processes.”. Preprint, Department of Mathematical Sciences, University of Copenhagen.
In preparation.
Kelly, L.; Platen, E. & Sørensen, M. (2004). “Estimation for discretely observed diffusions
using transform functions”. J. Appl. Prob., 41:99–118.
Kessler, M. (1997). “Estimation of an ergodic diffusion from discrete observations”. Scand.
J. Statist., 24:211–229.
Kessler, M. (2000). “Simple and explicit estimating functions for a discretely observed
diffusion process”. Scand. J. Statist., 27:65–82.
Kessler, M. & Paredes, S. (2002). “Computational aspects related to martingale estimating
functions for a discretely observed diffusion”. Scand. J. Statist., 29:425–440.
Kessler, M. & Sørensen, M. (1999). “Estimating equations based on eigenfunctions for a
discretely observed diffusion process”. Bernoulli, 5:299–314.
Kimball, B. F. (1946). “Sufficient statistical estimation functions for the parameters of the
distribution of maximum values”. Ann. Math. Statist., 17:299–309.
Kloeden, P. E. & Platen, E. (1999). Numerical Solution of Stochastic Differential Equations.
3rd revised printing. Springer-Verlag, New York.
Kusuoka, S. & Yoshida, N. (2000). “Malliavin calculus, geometric mixing, and expansion of
diffusion functionals”. Probability Theory and Related Fields, 116:457–484.
Larsen, K. S. & Sørensen, M. (2007). “A diffusion model for exchange rates in a target
zone”. Mathematical Finance, 17:285–306.
Li, B. (1997). “On the consistency of generalized estimating equations”. In Basawa, I. V.;
Godambe, V. P. & Taylor, R. L., editors, Selected Proceedings of the Symposium on
Estimating Functions, pages 115–136. Hayward: Institute of Mathematical Statistics.
IMS Lecture Notes – Monograph Series, Vol. 32.
Liang, K.-Y. & Zeger, S. L. (1986). “Longitudinal data analysis using generalized linear
model”. Biometrika, 73:13–22.
McLeish, D. L. & Small, C. G. (1988). The Theory and Applications of Statistical Inference
Functions. Springer-Verlag, New York. Lecture Notes in Statistics 44.
Nagahara, Y. (1996). “Non-Gaussian distribution for stock returns and related stochastic
differential equation”. Financial Engineering and the Japanese Markets, 3:121–149.
Overbeck, L. & Rydén, T. (1997). “Estimation in the Cox-Ingersoll-Ross model”. Econo-
metric Theory, 13:430–461.
Ozaki, T. (1985). “Non-linear time series models and dynamical systems”. In Hannan, E. J.;
Krishnaiah, P. R. & Rao, M. M., editors, Handbook of Statistics, Vol. 5, pages 25–83.
Elsevier Science Publishers.
Pearson, K. (1895). “Contributions to the Mathematical Theory of Evolution II. Skew
Variation in Homogeneous Material”. Philosophical Transactions of the Royal Society
of London. A, 186:343–414.
Pedersen, A. R. (1994). “Quasi-likelihood inference for discretely observed diffusion pro-
cesses”. Research Report No. 295, Department of Theoretical Statistics, Institute of
Mathematics, University of Aarhus.
Pedersen, A. R. (1995). “A new approach to maximum likelihood estimation for stochastic
differential equations based on discrete observations”. Scand. J. Statist., 22:55–71.
Prakasa Rao, B. L. S. (1988). “Statistical inference from sampled data for stochastic pro-
cesses”. Contemporary Mathematics, 80:249–284.
Prentice, R. L. (1988). “Correlated binary regression with covariates specific to each binary
observation”. Biometrics, 44:1033–1048.
Roberts, G. O. & Stramer, O. (2001). “On inference for partially observed nonlinear diffusion
models using Metropolis-Hastings algorithms”. Biometrika, 88:603–621.
Sørensen, M. (2007). “Efficient estimation for ergodic diffusions sampled at high frequency”.
Preprint, Department of Mathematical Sciences, University of Copenhagen.
Veretennikov, A. Y. (1987). “Bounds for the mixing rate in the theory of stochastic equa-
tions”. Theory of Probability and its Applications, 32:273–281.
Yoshida, N. (1992). “Estimation for diffusion processes from discrete observations”. Journal
of Multivariate Analysis, 41:220–242.