Statistics Diffusions
Statistics Diffusions
Statistics Diffusions
Michael Sørensen
Department of Mathematical Sciences
University of Copenhagen
Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark
1
1 Introduction
In this chapter we consider parametric inference based on discrete time observations X0 , Xt1 ,
. . . , Xtn from a d-dimensional stochastic process. In most of the chapter the statistical model
for the data will be a diffusion model given by a stochastic differential equation. We shall,
however, also consider some examples of non-Markovian models, where we typically assume
that the data are partial observations of a multivariate stochastic differential equation. We
assume that the statistical model is indexed by a p-dimensional parameter θ.
The focus will be on estimating functions. An estimating function is a p-dimensional
function of the parameter θ and the data:
Usually we suppress the dependence on the observations in the notation and write Gn (θ).
We obtain an estimator by solving the equation
Gn (θ) = 0. (1.1)
Estimating functions provide a general framework for finding estimators and studying their
properties in many different kinds of statistical models. The estimating function approach
has turned out to be very useful for discretely sampled parametric diffusion-type models,
where the likelihood function is usually not explicitly known. Estimating functions are
typically constructed by combining relationships (dependent on the unknown parameter)
between an observation and one or more of the previous observations that are informative
about the parameters.
As an example, suppose the statistical model for the data X0 , X∆ , X2∆ , . . . , Xn∆ is the
one-dimensional stochastic differential equation
where θ > 0 and W is a Wiener process. The state-space is (−π/2, π/2). This model will
be considered in more detail in Subsection 3.6. For this process Kessler & Sørensen (1999)
proposed the estimating function
n
1
h i
sin(X(i−1)∆ ) sin(Xi∆ ) − e−(θ+ 2 )∆ sin(X(i−1)∆ ) ,
X
Gn (θ) =
i=1
which can be shown to be a martingale, when θ is the true parameter. For such martingale
estimating functions, asymptotic properties of the estimators as the number of observations
tends to infinity can be studied by means of martingale limit theory, see Subsection 3.2. An
explicit estimator θ̂n of the parameter θ is obtained by solving the estimating equation (1.1):
Pn !
−1 i=1 sin(X(i−1)∆ ) sin(Xi∆ ) 1
θ̂n = ∆ log Pn 2
− ,
i=1 sin(X(i−1)∆ ) 2
provided that
n
X
sin(X(i−1)∆ ) sin(Xi∆ ) > 0. (1.2)
i=1
2
If this condition is not satisfied, the estimating equation (1.1) has no solution, but fortunately
it can be shown that the probability that (1.2) holds tends to one as n tends to infinity. As
illustrated by this example, it is quite possible that the estimating equation (1.1) has no
solution. We shall give general conditions that ensure the existence of a unique solution
when enough data are available.
The idea of using estimating equations is an old one and goes back at least to Karl
Pearson’s introduction of the method of moments. The term estimating function may have
been coined by Kimball (1946).
3
The following condition ensures the existence of a consistent Gn –estimator. We denote
transposition of matrices by T , and ∂θT Gn (θ) denotes the p × p-matrix, where the ijth entry
is ∂θj Gn (θ)i .
Here and later Q(g(θ)) denotes the vector (Q(gj (θ)))j=1,...,p , where gj is the jth coordinate
of g, and Q (∂θT g(θ)) is the matrix {Q ∂θj gi (θ) }i,j=1,...,p .
To formulate the uniqueness result in the following theorem, we need the concept of
locally dominated integrability. A function f : D r × Θ 7→ IRq is called locally dominated
integrable with respect to Q if for each θ′ ∈ Θ there exists a neighbourhood Uθ′ of θ′ and a non-
negative Q-integrable function hθ′ : D r 7→ IR such that | f (x1 , . . . , xr ; θ) | ≤ hθ′ (x1 , . . . , xr )
for all (x1 , . . . , xr , θ) ∈ D r × Uθ′ .
Theorem 2.2 Assume Condition 2.1 and (2.3). Then a θ̄-consistent Gn –estimator θ̂n ex-
ists, and
√ L
−1
n(θ̂n − θ0 ) −→ Np 0, W −1V W T (2.6)
under P , where V = V (θ̄). If, moreover, the function g(x1 , . . . , xr ; θ) is locally dominated
integrable with respect to Q and
then the estimator θ̂n is the unique Gn –estimator on any bounded subset of Θ containing θ̄
with probability approaching one as n → ∞.
4
Lemma 2.3 Consider a function f : D r × K 7→ IRq , where K is a compact subset of Θ.
Suppose f is a continuous function of θ for all (x1 , . . . , xr ) ∈ D r , and that there exists a
Q-integrable function h : D r 7→ IR such that kf (x1 , . . . , xr ; θ)k ≤ h(x1 , . . . , xr ) for all θ ∈ K.
Then θ 7→ Q(f (θ)) is continuous, and
n
1X P
sup k f (Xi−r+1 , . . . , Xi ; θ) − Q(f (θ)) k → 0. (2.8)
θ∈K n i=r
Proof: That Q(f (θ)) is continuous follows from the dominated convergence theorem.
To prove (2.8), define for η > 0:
and let k(η) denote the function (x1 , . . . , xr ) 7→ k(η; x1 , . . . , xr ). Since k(η) ≤ 2h, it follows
from the dominated convergence theorem that Q(k(η)) → 0 as η → 0. Moreover, Q(f (θ)) is
uniformly continuous on the compact set K. Hence for any given ǫ > 0 we can find η > 0
such that Q(k(η)) ≤ ǫ and kθ − θ′ k < η implies that kQ(f (θ)) − Q(f (θ′ ))k ≤ ǫ. Define the
balls Bη (θ) = {θ′ : kθ − θ′ k < η}. Since K is compact, there exists a finite covering
m
[
K⊆ Bη (θj ),
j=1
≤ kFn (θ) − Fn (θℓ )k + kFn (θℓ ) − Q(f (θℓ ))k + kQ(f (θℓ )) − Q(f (θ))k
n
1X
≤ k(η; Xν−r+1, . . . , Xν ) + kFn (θℓ ) − Q(f (θℓ ))k + ǫ
n ν=r
n
1X
≤ k(η; Xν−r+1 , . . . , Xν ) − Q(k(η))
n ν=r
+Q(k(η)) + kFn (θℓ ) − Q(f (θℓ ))k + ǫ
≤ Zn + 2ǫ,
where
n
1X
Zn = k(η; Xν−r+1, . . . , Xν ) − Q(k(η))
n ν=r
+ max kFn (θℓ ) − Q(f (θℓ ))k.
1≤ℓ≤m
5
By (2.2), P (Zn > ǫ) → 0 as n → ∞, so
!
P sup kFn (θ) − Q(f (θ))k > 3ǫ → 0
θ∈K
for all ǫ > 0, where B̄ǫ (θ) is the closed ball with radius ǫ centered at θ. By Theorem 8.3
it follows that (8.4) holds with M = K for every ǫ > 0. Let θ̂n′ be a Gn –estimator, and
define a Gn –estimator by θ̂n′′ = θ̂n′ 1{θ̂n′ ∈ K} + θ̂n 1{θ̂n′ ∈
/ K}, where 1 denotes an indicator
function, and θ̂n is the consistent Gn –estimator we know exists. By (8.4) the estimator θ̂n′′
is consistent, so by Theorem 8.2, P (θ̂n 6= θ̂n′′ ) → 0 as n → ∞. Hence θ̂n is eventually the
unique Gn –estimator on K.
2
6
We shall, in this section, be concerned with statistical inference based on estimating
functions of the form n X
Gn (θ) = g(∆i , Xti−1 , Xti ; θ). (3.2)
i=1
for all ∆ > 0, x ∈ D and all θ ∈ Θ. Thus, by the Markov property, the stochastic
process {Gn (θ)}n∈IN is a martingale with respect to {Fn }n∈IN under Pθ . Here and later
Fn = σ(Xti : i ≤ n). An estimating function with this property is called a martingale
estimating function.
where y 7→ p(s, x, y; θ) is the transition density and t0 = 0. Under weak regularity conditions
the maximum likelihood estimator is efficient, i.e. it has the smallest asymptotic variance
among all estimators. The transition density is only rarely explicitly known, but several
numerical approaches and accurate approximations make likelihood inference feasible for
diffusion models. We shall return to the problem of calculating the likelihood function in
Subsection 4.
The vector of partial derivatives of the log-likelihood function with respect to the coor-
dinates of θ,
n
X
Un (θ) = ∂θ log Ln (θ) = ∂θ log p(∆i , Xti−1 , Xti ; θ), (3.5)
i=1
where ∆i = ti − ti−1 , is called the score function (or score vector). Here it is obviously as-
sumed that the transition density is a differentiable function of θ. The maximum likelihood
estimator usually solves the estimating equation Un (θ) = 0. The score function is a mar-
tingale with respect to {Fn }n∈IN under Pθ , which is easily seen provided that the following
interchange of differentiation and integration is allowed:
Eθ ∂θ log p(∆i , Xti−1 , Xti ; θ) Xt1 , . . . , Xti−1
Z
∂θ p(∆i , Xti−1 , y; θ)
= p(∆i , Xti−1 , y, θ)dy
D p(∆i , Xti−1 , y; θ)
Z
= ∂θ p(∆i , Xti−1 , y; θ)dy = 0.
D
Since the score function is a martingale estimating function, the asymptotic results presented
in the next subsection applies to the maximum likelihood estimator. Asymptotic results
7
for the maximum likelihood estimator in the fixed ∆ (low frequency) asymptotic scenario
considered in this section were established by Dacunha-Castelle & Florens-Zmirou (1986).
Asymptotic results when the observations are made at random time points were obtained
by Aı̈t-Sahalia & Mykland (2003).
A simple approximation to the likelihood function is obtained by approximating the tran-
sition density by a Gaussian density with the correct first and second conditional moments.
For a one-dimensional diffusion we get
(y − F (∆, x; θ))2
" #
1
p(∆, x, y; θ) ≈ q(∆, x, y; θ) = q exp −
2πφ(∆, x; θ) 2φ(∆, x; θ)
where Z r
F (∆, x; θ) = Eθ (X∆ |X0 = x) = yp(∆, x, y; θ)dy. (3.6)
ℓ
and
φ(∆, x; θ) = (3.7)
Z r
Varθ (X∆ |X0 = x) = [y − F (∆, x; θ)]2 p(∆, x, y; θ)dy.
ℓ
and by differentiation with respect to the parameter vector, we obtain the quasi-score func-
tion
n
(
X ∂θ F (∆i , Xti−1 ; θ)
∂θ log QLn (θ) = [Xti − F (∆i , Xti−1 ; θ)] (3.8)
i=1 φ(∆i , Xti−1 ; θ)
)
∂θ φ(∆i , Xti−1 ; θ) h 2
i
+ (X ti
− F (∆i , X ti−1
; θ)) − φ(∆i , X ti−1
; θ) .
2φ(∆i , Xti−1 ; θ)2
It is clear from (3.6) and (3.7) that {∂θ log QLn (θ)}n∈IN is a martingale with respect to
{Fn }n∈IN under Pθ . This quasi-score function is a particular case of the quadratic martin-
gale estimating functions considered by Bibby & Sørensen (1995) and Bibby & Sørensen
(1996). Maximum quasi-likelihood estimation for diffusions was considered by Bollerslev &
Wooldridge (1992).
3.2 Asymptotics
In this subsection we give asymptotic results for estimators obtained from martingale esti-
mating functions as the number of observations goes to infinity. To simplify the exposition
the observation time points are assumed to be equidistant, i.e., ti = i∆, i = 0, 1, . . . , n. Since
∆ is fixed, we will in most cases suppress ∆ in the notation and write for example p(x, y; θ)
and g(x, y; θ).
It is assumed that the diffusion is ergodic, that its invariant probability measure has
density function µθ for all θ ∈ Θ, and that X0 ∼ µθ under Pθ . Thus the diffusion is
stationary.
8
When the observed process, X, is a one-dimensional diffusion, the following simple con-
ditions ensure ergodicity, and an explicit expression exists for the density of the invariant
probability measure. The scale measure of X has Lebesgue density
!
x b(y; θ)
Z
s(x; θ) = exp −2 dy , x ∈ (ℓ, r), (3.9)
x# σ 2 (y; θ)
and Z r
[s(x; θ)σ 2 (x; θ)]−1 dx = A(θ) < ∞.
ℓ
Under Condition 3.1 the process X is ergodic with an invariant probability measure with
Lebesgue density
µθ (x) = [A(θ)s(x; θ)σ 2 (x; θ)]−1 , x ∈ (ℓ, r). (3.10)
For details see e.g. Skorokhod (1989). For general one-dimensional diffusions, the measure
with Lebesgue density proportional to s(x; θ)σ 2 (x; θ)]−1 is called the speed measure.
Let Qθ denote the probability measure on D 2 given by
This is the distribution of two consecutive observations (X∆(i−1) , X∆i ). Under the assumption
of ergodicity the law of large numbers (2.2) is satisfied for any function f : D 2 7→ IR such
that Q(|f |) < ∞, see e.g. Skorokhod (1989).
We impose the following condition on the function g in the estimating function (3.2)
Qθ g(θ)T g(θ) = (3.12)
Z
g(y, x; θ)T g(y, x; θ)µθ (x)p(x, y; θ)dydx < ∞,
D2
Since the estimating function Gn (θ) is a martingale under Pθ , the asymptotic normality
in (2.3) follows without further conditions from the central limit theorem for martingales,
see Hall & Heyde (1980). This result goes back to Billingsley (1961). In the martingale case
the asymptotic covariance matrix V (θ) in (2.3) is given by
V (θ) = Qθ0 g(θ)g(θ)T . (3.14)
9
Theorem 3.2 Assume Condition 2.1 is satisfied with r = 2, θ̄ = θ0 , and Q = Qθ0 , where θ0
is the true parameter value, and that (2.3) holds for θ = θ0 with V (θ) given by (3.14). Then
a θ0 -consistent Gn –estimator θ̂n exists, and
√ L
−1
n(θ̂n − θ0 ) −→ Np 0, W −1V W T (3.15)
under Pθ0 , where W is given by (2.5) with θ̄ = θ0 and V = V (θ0 ). If, moreover, the function
g(x, y; θ) is locally dominated integrable with respect to Qθ0 and
Qθ0 (g(θ)) 6= 0 for all θ 6= θ0 , (3.16)
then the estimator θ̂n is the unique Gn –estimator on any bounded subset of Θ containing θ0
with probability approaching one as n → ∞.
In practice we do not know the value of θ0 , so it is necessary to check that the conditions
of Theorem 3.2 hold for a neighbourhood of any value of θ0 ∈ int Θ.
The asymptotic covariance matrix of the estimator θ̂n can be estimated consistently by
means of the following theorem.
Theorem 3.3 Under Condition 2.1 (2) – (4) (with r = 2, θ̄ = θ0 , and Q = Qθ0 ),
n
1X Pθ0
Wn = ∂θT g(X(i−1)∆ , Xi∆ ; θ̂n ) −→ W, (3.17)
n i=1
where θ̂n is a θ0 -consistent estimator. The probability that Wn is invertible approaches one
as n → ∞. If, moreover, the function (x, y) 7→ kg(x, y; θ)k is dominated for all θ ∈ N by
a function which is square integrable with respect to Qθ0 , then
n
1X Pθ0
Vn = g(X(i−1)∆ , Xi∆ ; θ̂n )g(X(i−1)∆ , Xi∆ ; θ̂n )T −→ V. (3.18)
n i=1
Proof: Let C be a compact subset of N such that θ0 ∈ int C. By Lemma 2.3,
1 Pn
n i=1 ∂θ T g(X(i−1)∆ , Xi∆ ; θ) converges to Qθ0 (∂θ T g(θ)) in probability uniformly for θ ∈ C.
This implies (3.17) because θ̂n converges in probability to θ0 . The result about invertibility
follows because W is invertible. Also the uniform convergence in probability for θ ∈ C of
1 Pn T T
n i=1 g(X(i−1)∆ , Xi∆ ; θ) g(X(i−1)∆ , Xi∆ ; θ) to Qθ0 (g(θ)g(θ) ) follows from Lemma 2.3.
2
In the case of likelihood inference, the function Qθ0 (g(θ)) appearing in the identifiability
condition (3.16) is related to the Kullback-Leibler divergence between the models. Specifi-
cally, if the following interchange of differentiation and integration is allowed,
Qθ0 (∂θ log p(x, y, θ)) = ∂θ Qθ0 (log p(x, y, θ)) = −∂θ K̄(θ, θ0 ),
where K̄(θ, θ0 ) is the average Kullback-Leibler divergence between the transition distribu-
tions under Pθ0 and Pθ given by
Z
K̄(θ, θ0 ) = K(θ, θ0 ; x) µθ0 (dx),
D
with Z
K(θ, θ0 ; x) = log[p(x, y; θ0 )/p(x, y; θ)]p(x, y; θ0) dy.
D
Thus the identifiability condition can be written in the form ∂θ K̄(θ, θ0 ) 6= 0 for all θ 6= θ0 .
The quantity K̄(θ, θ0 ) is sometimes referred to as the Kullback-Leibler divergence between
the two Markov chain models for the observed process {Xi∆ } under Pθ0 and Pθ .
10
3.3 Godambe-Heyde optimality
In this section we present a general way of approximating the score function by means
of martingales of a similar form. Suppose we have a collection of real valued functions
hj (x, y, ; θ), j = 1, . . . , N satisfying
Z
hj (x, y; θ)p(x, y; θ)dy = 0 (3.19)
D
for all x ∈ D and θ ∈ Θ. Each of the functions hj could be used separately to define an
estimating function of the form (2.1), but a better approximation to the score function,
and hence a more efficient estimator, is obtained by combining them in an optimal way.
Therefore we consider estimating functions of the form
n
X
Gn (θ) = a(X(i−1)∆ , θ)h(X(i−1)∆ , Xi∆ ; θ), (3.20)
i=1
where h = (h1 , . . . , hN )T , and the p × N weight matrix a(x, θ) is a function of x such that
(3.20) is Pθ -integrable. It follows from (3.19) that Gn (θ) is a martingale estimating function,
i.e., it is a martingale under Pθ for all θ ∈ Θ.
The matrix a determines how much weight is given to each of the hj s in the estimation
procedure. This weight matrix can be chosen in an optimal way using the theory of optimal
estimating functions reviewed in Section 9. The optimal weight matrix a∗ gives the estimating
function of the form (3.20) that provides the best possible approximation to the score function
(3.5) in a mean square sense. Moreover, the optimal g ∗(x, y; θ) = a∗ (x; θ)h(x, y; θ) is obtained
from ∂θ log p(x, y; θ) by projection in a certain space of square integrable functions, for details
see Section 9.
The choice of the functions hj , on the other hand, is an art rather than a science. The
ability to tailor these functions to a given model or to particular parameters of interest is a
considerable strength of the estimating functions methodology. It is, however, also a source
of weakness, since it is not always clear how best to choose the hj s. In the following and in
the Sections 3.6 and 3.7, we shall present ways of choosing these functions that usually work
well in practice.
Example 3.4 The martingale estimating function (3.8) is of the type (3.20) with N = 2
and
where F and φ are given by (3.6) and (3.7). The weight matrix is
!
∂θ F (∆, x; θ) ∂θ φ(∆, x; θ)
, , (3.21)
φ(∆, x; θ) 2φ2 (∆, x; θ)∆
In the econometrics literature, a popular way of using functions like hj (x, y, ; θ), j =
1, . . . , N, to estimate the parameter θ is the generalized method of moments (GMM) of
11
Hansen (1982). In practice, the method is often implemented as follows, see e.g. Campbell,
Lo & MacKinlay (1997). Consider
n
1X
Fn (θ) = h(X(i−1)∆ , Xi∆ ; θ).
n i=1
where θ̃n is a θ0 -consistent estimator (for instance obtained by minimizing Fn (θ)T Fn (θ)).
The GMM-estimator is obtained by minimizing the function
where by (2.2)
n
1X Pθ0
Dn (θ) = ∂θ h(X(i−1)∆ , Xi∆ ; θ)T −→ Qθ0 ∂θ h(θ)T .
n i=1
and we see that GMM-estimators are covered by the theory for martingale estimating func-
tions presented in this section.
We now return to the problem of finding the optimal estimating function G∗n (θ), i.e. the
estimating functions of the form (3.20) with the optimal weight matrix. We assume that the
functions hj satisfy the following condition.
Condition 3.5
(1) The functions hj , j = 1, . . . N, are linearly independent.
(2) The functions y 7→ hj (x, y; θ), j = 1, . . . N, are square integrable with respect to p(x, y; θ)
for all x ∈ D and θ ∈ Θ.
(3) h(x, y; θ) is differentiable with respect to θ.
(4) The functions y 7→ ∂θi hj (x, y; θ) are integrable with respect to p(x, y; θ) for all x ∈ D and
θ ∈ Θ.
The class of estimating functions considered here is a particular case of the class treated
in detail in Example 9.3. By (9.16), the optimal choice of the weight matrix a is given by
12
where Z
Bh (x; θ) = ∂θ h(x, y; θ)T p(x, y; θ)dy (3.23)
D
and Z
Vh (x; θ) = h(x, y; θ)h(x, y; θ)T p(x, y; θ)dy. (3.24)
D
The matrix Vh (x; θ) is invertible because the functions hj , j = 1, . . . N are linearly inde-
pendent.Compared to (9.16), we have omitted a minus here. This can be done because
an optimal estimating function multiplied by an invertible p × p-matrix is also an optimal
estimating function and yields the same estimator.
The asymptotic variance of an optimal estimator, i.e. a G∗n –estimator, is simpler than
the general expression in (3.15) because in this case the matrices W and V given by (2.5)
and (3.14) are equal and given by (3.25). This is a general property of optimal estimating
functions as discussed in Section 9. The result can easily be verified under the assumption
that a∗ (x; θ) is a differentiable function of θ: by (3.19)
Z
[∂θi a∗ (x; θ)] h(x, y; θ)p(x, y; θ)dy = 0,
D
so that
Z
W = ∂θT [a∗ (x; θ0 )h(x, y; θ0 )]Qθ0 (dx, dy)
D2
= µθ0 (a∗ (θ0 )Bh (θ0 )T ) = µθ0 Bh (θ0 )Vh (θ0 )−1 Bh (θ0 )T ,
and by direct calculation
V = µθ0 Bh (θ0 )Vh (θ0 )−1 Bh (θ0 )T . (3.25)
Thus we have as a corollary to Theorem 2.2 that is g ∗(x, y, θ) = a∗ (x; θ)h(x, y; θ) satisfies
the conditions of Theorem 2.2, then a sequence θ̂n of G∗n –estimators has the asymptotic
distribution √ D
n(θ̂n − θ0 ) −→ Np 0, V −1 . (3.26)
Example 3.6 Consider the martingale estimating function of form (3.20) with N = 2 and
with h1 and h2 as in Example 3.4, where the diffusion is one-dimensional. The optimal
weight matrix has columns given by
∂θ φ(x; θ)η(x; θ) − ∂θ F (x; θ)ψ(x; θ)
a∗1 (x; θ) =
φ(x; θ)ψ(x; θ) − η(x; θ)2
∂θ F (x; θ)η(x; θ) − ∂θ φ(x; θ)φ(x; θ)
a∗2 (x; θ) = ,
φ(x; θ)ψ(x; θ) − η(x; θ)2
where
η(x; θ) = Eθ ([X∆ − F (x; θ)]3 |X0 = x)
and
ψ(x; θ) = Eθ ([X∆ − F (x; θ)]4 |X0 = x) − φ(x; θ)2 .
For the square-root diffusion (the CIR-model)
q
dXt = −β(Xt − α)dt + τ Xt dWt , (3.27)
13
where β, τ > 0, the optimal weights can be found explicitly. For this model
In Subsections 3.6 and 3.7 we shall present martingale estimating functions for which the
matrices Bh (x; θ) and Vh (x; θ) can be found explicitly, but for most models these matrices
must be found by simulation, a problem considered in Subsection 3.5. In situations where a∗
must be determined by a relatively time consuming numerical method, it might be preferable
to use the estimating function
n
G•n (θ) = a∗ (X(i−1)∆ ; θ̃n )h(X(i−1)∆ , Xi∆ ; θ),
X
(3.29)
i=1
where θ̃n is a weakly θ0 -consistent estimator, for instance obtained by some simple choice of
the weight matrix a. In this way a∗ needs to be calculated only once per observation point.
Under weak regularity conditions, the G•n -estimator has the same efficiency as the optimal
G∗n -estimator; see e.g. Jacod & Sørensen (2008).
Most martingale estimating functions proposed in the literature are of the form (3.20)
with
θ
hj (x, y; θ) = fj (y; θ) − π∆ (fj (θ))(x), (3.30)
or more specifically,
n h i
θ
X
Gn (θ) = a(X(i−1)∆ , θ) f (Xi∆ ; θ) − π∆ (f (θ))(X(i−1)∆ ) . (3.31)
i=1
14
applied to each coordinate of f . The polynomial estimating functions given by fj (y) = y j ,
j = 1, . . . , N, are an example. For martingale estimating functions of the special form (3.31),
the expression for the optimal weight matrix simplifies a bit because
θ θ
Bh (x; θ)ij = π∆ (∂θi fj (θ))(x) − ∂θi π∆ (fj (θ))(x), (3.33)
i = 1, . . . p, j = 1, . . . , N, and
θ θ θ
Vh (x; θ)ij = π∆ (fi (θ)fj (θ))(x) − π∆ (fi (θ))(x)π∆ (fj (θ))(x), (3.34)
A useful approximations to the optimal weight matrix can be obtained by applying the
formula
k
si i
πsθ (f )(x) = Aθ f (x) + O(sk+1),
X
(3.36)
i=0 i!
where Aθ denotes the generator of the diffusion
d d
Ckℓ (x; θ)∂x2k xℓ f (x),
X X
1
Aθ f (x) = bk (x; θ)∂xk f (x) + 2 (3.37)
k=1 k,ℓ=1
where C = σσ T . The formula (3.36) holds for 2(k + 1) times continuously differentiable
functions under weak conditions which ensure that the remainder term has the correct or-
der, see Kessler (1997) and Subsection 3.4. It is often enough to use the approximation
θ
π∆ (fj )(x) ≈ fj (x) + ∆Aθ fj (x). When f does not depend on θ this implies that for d = 1
h i
Bh (x; θ) ≈ ∆ ∂θ b(x; θ)f ′ (x) + 12 ∂θ σ 2 (x; θ)f ′′ (x) (3.38)
Example 3.7 If we simplify the optimal weight matrix found in Example 3.6 by the expan-
sion (3.36) and the Gaussian approximation (3.28), we obtain the approximately optimal
quadratic martingale estimating function
n
(
∂θ b(X(i−1)∆ ; θ)
G◦n (θ)
X
= [Xi∆ − F (X(i−1)∆ ; θ)] (3.40)
i=1 σ 2 (X(i−1)∆ ; θ)
∂θ σ 2 (X(i−1)∆ ; θ) h
)
i
2
+ (X i∆ − F (X (i−1)∆ ; θ)) − φ(X (i−1)∆ ; θ) .
2σ 4 (X(i−1)∆ ; θ)∆
15
As in Example 3.6 the diffusion is assumed to be one-dimensional.
Consider a diffusion with linear drift, b(x; θ) = −β(x − α). Diffusion models with linear
drift and a given marginal distribution were studied in Bibby, Skovgaard & Sørensen (2005).
If σ 2 (x; θ)µθ (x)dx < ∞, then the Ito-integral in
R
Z t Z t
Xt = X0 − β(Xs − α)ds + σ(Xs ; θ)dWs
0 0
is a proper martingale with mean zero, so the function f (t) = Eθ (Xt | X0 = x) satisfies that
Z t
f (t) = x − β f (s)ds + βαt
0
or
f ′ (t) = −βf (t) + βα, f (0) = x.
Hence
f (t) = xe−βt + α(1 − e−βt )
or
F (x; α, β) = xe−β∆ + α(1 − e−β∆ )
If only estimates of drift parameters are needed, we can use the linear martingale estimating
function of the form (3.20) with N = 1 and h1 (x, y; θ) = y − F (∆, x; θ). If σ(x; θ) = τ κ(x)
for τ > 0 and κ a positive function, then the approximately optimal estimating function of
this form is
n
1
h i
−β∆ −β∆
X
Xi∆ − X(i−1)∆ e − α(1 − e )
i=1 κ2 (X(i−1)∆ )
G◦n (α, β) =
,
n
X X(i−1)∆ h i
Xi∆ − X(i−1)∆ e−β∆ − α(1 − e −β∆
)
2
i=1 κ (X(i−1)∆ )
where multiplicative constants have been omitted. To solve the estimating equation G◦n (α, β) =
0 we introduce the weights
n
wiκ −2
κ(X(j−1)∆ )−2 ,
X
= κ(X(i−1)∆ ) /
j=1
Pn Pn
and let X̄ κ = i=1 wiκ Xi∆ and X̄−1 κ
= i=1 wiκ X(i−1)∆ be conditional precision weighted
sample averages of Xi∆ and X(i−1)∆ , respectively. The equation G◦n (α, β) = 0 has a unique
explicit solution provided that the weighted sample autocorrelation
Pn
i=1 wiκ (Xi∆ − X̄ κ )(X(i−1)∆ − X̄−1
κ
)
rnκ = Pn κ κ 2
i=1 wi (X(i−1)∆ − X̄−1 )
is positive. By the law of large numbers for ergodic processes, the probability that rnκ > 0
tends to one as n tends to infinity. Specifically, we obtain the explicit estimators
X̄ κ − rnκ X̄−1
κ
α̂n =
1 − rnκ
1
β̂n = − log (rnκ ) .
∆
16
A slightly simpler and asymptotically equivalent estimator may be obtained by substituting
X̄ κ for X̄−1
κ
everywhere, in which case α is estimated by the precision weighted sample√
average X̄ κ . For the square-root process (CIR-model) given by (3.27), where κ(x) = x,
a simulation study and an investigation of the asymptotic variance of these estimators in
Bibby & Sørensen (1995) show that they are not much less efficient than the estimators from
the optimal estimating function; see also the simulation study in Overbeck & Rydén (1997).
To obtain an explicit approximately optimal quadratic estimating function, we need
an expression for the conditional variance φ(x; θ). As we saw in Example 3.6, φ(x; θ) is
explicitly known for the square-root process (CIR-model) given by (3.27). For this model the
approximately optimal quadratic martingale estimating function is
n
X 1 h
−β∆ −β∆
i
Xi∆ − X(i−1)∆ e − α(1 − e )
i=1 X(i−1)∆
n
Xh i
Xi∆ − X(i−1)∆ e−β∆ − α(1 − e−β∆ )
i=1
.
n
1
X 2
Xi∆ − X(i−1)∆ e−β∆ − α(1 − e−β∆ )
i=1 X(i−1)∆
τ 2 n
#
o
−2β∆ −β∆
− α/2 − X(i−1)∆ e − (α − X(i−1)∆ )e + α/2
β
where
ψ(x; α, β) = ( 12 α − x)e−2β∆ − (α − x)e−β∆ + 21 α /β.
It is obviously necessary for this solution to the estimating equation to exist that the ex-
pression for e−β̂n ∆ is strictly positive, an event that happens with a probability tending to
one as n → ∞. Again this follows from the law of large numbers for ergodic processes. 2
When the optimal weight matrix is approximated by means of (3.36), there is a certain
loss of efficiency, which as in the previous example is often quite small; see Bibby & Sørensen
(1995) and Section 6 on high frequency asymptotics below. Therefore the relatively simple
estimating function (3.40) is often a good choice in practice.
17
θ
It is tempting to go on to approximate π∆ (fj (θ))(x) in (3.31) by (3.36) in order to obtain
an explicit estimating function, but as will be demonstrated in Subsection 3.6, this can be
θ
a dangerous procedure. In general the conditional expectation in π∆ should therefore be
approximated by simulations. Fortunately, Kessler & Paredes (2002) have established that,
provided the simulation is done with sufficient accuracy, this does not cause any bias, only a
minor loss of efficiency that can be made arbitrarily small; see Subsection 3.5. Moreover, as
θ
we shall also see in Subsection 3.6, π∆ (fj (θ))(x) can be found explicitly for a quite flexible
class of diffusions.
where θ = (α, β) ∈ Θ ⊆ IR2 . This is the simplest model type for which the essential features
of the theory appear. Note that the drift and the diffusion coefficient depend on different
parameters. It is assumed that the diffusion is ergodic, that its invariant probability measure
has density function µθ for all θ ∈ Θ, and that X0 ∼ µθ under Pθ . Thus the diffusion is
stationary.
Throughout this subsection, we shall assume that the observation times are equidis-
tant, i.e. ti = i∆, i = 0, 1, . . . , n, where ∆ is fixed, and that the martingale estimat-
ing function (3.2) satisfies the conditions of Theorem 3.2, so that we know that (even-
tually) a Gn -estimator θ̂n exists, which is asymptotically normal with covariance matrix
−1
M(g) = W −1 V W T , where W is given by (2.5) with θ̄ = θ0 and V = V (θ0 ) with V (θ) given
by (3.14).
The main idea of small ∆-optimality is to expand the asymptotic covariance matrix in
powers of ∆
1
M(g) = v−1 (g) + v0 (g) + o(1). (3.42)
∆
Small ∆-optimal estimating functions minimize the leading term in (3.42). Jacobsen (2001)
obtained (3.42) by Ito-Taylor expansions, see Kloeden & Platen (1999), of the random ma-
trices that appear in the expressions for W and V under regularity conditions that will be
given below. A similar expansion was used in Aı̈t-Sahalia & Mykland (2003) and Aı̈t-Sahalia
& Mykland (2004).
18
To formulate the conditions, we define the differential operator Aθ , θ ∈ Θ. Its domain, Γ
is the set of continuous real-valued functions (s, x, y) 7→ ϕ(s, x, y) of s ≥ 0 and (x, y) ∈ D 2
that are continuous differentiable in s and twice continuously differentiable in y. The operator
Aθ is given by
Aθ ϕ(s, x, y) = ∂s ϕ(s, x, y) + Aθ ϕ(s, x, y), (3.43)
where Aθ is the generator (3.37) , which for every s and x is applied to the function y 7→
ϕ(s, x, y). The operator Aθ acting on functions in Γ that do not depend on x is the generator
of the space-time process (t, Xt )t<geq0 . We also need the probability measure Q∆θ given by
(3.11). Note that in this section the dependence on ∆ is explicit in the notation.
Condition 3.8 The function ϕ belongs to Γ and satisfies that
Z
ϕ(s, x, y)Qsθ0 (dx, dy) < ∞
ZD2
for all s ≥ 0.
As usual θ0 = (α0 , β0 ) denotes the true parameter value. We will say that a function with
values in IRk or IRk×ℓ satisfies Condition 3.8 is each component of the functions satisfies this
condition.
Suppose ϕ satisfies Condition 3.8. Then by Ito’s formula
ϕ(t, X0 , Xt ) = (3.44)
Z t Z t
ϕ(0, X0, X0 ) + Aθ0 ϕ(s, X0 , Xs )ds + ∂y ϕ(s, X0 , Xs )dWs
0 0
under Pθ0 . A significant consequence of Condition 3.8 is that the Ito-integral in (3.44) is a
true Pθ0 -martingale, and thus has expectation zero under Pθ0 . If the function Aθ0 ϕ satisfies
Condition 3.8, a similar result holds for this functions, which we can insert in the Lebesgue
integral in (3.44). By doing so and then taking the conditional expectation given X0 = x on
both sides of (3.44), we obtain
Note that A0θ is the identity operator. The previously used expansion (3.36) is a particular
case of (3.46). In the case where ϕ does not depend on x (or y) the integrals in Condition
3.8 are with respect to the invariant measure µθ0 . If, moreover, ϕ does not depend on time
s, the conditions do not depend on s.
19
Theorem 3.9 Suppose that the function g(∆, x, y; θ0) in (3.2) is such that g, ∂θT g, gg T and
Aθ0 g satisfy Condition 3.8. Assume, moreover, that we have the expansion
g(∆, x, y; θ0) = g(∆, x, y; θ0) + ∆∂∆ g(0, x, y; θ0) + oθ0 ,x,y (∆).
If the matrix Z r
S= Bθ0 (x)µθ0 (x)dx (3.47)
ℓ
is invertible, where
Bθ (x) = (3.48)
1 2
∂α b(x; α)∂y g1 (0, x, x; θ) 2 ∂β v(x; β)∂y g1 (0, x, x; θ)
,
1 2
∂α b(x; α)∂y g2 (0, x, x; θ) 2 ∂β v(x; β)∂y g2 (0, x, x; θ)
for all x ∈ (ℓ, r). In this case, the second term in (3.42) satisfies that
Z r 2 −1
2 4
v0 (g)22 ≥ 2 ∂β σ (x; β0 ) /σ (x; β0 )µθ0 (x)dx
ℓ
with equality if
∂y2 g2 (0, x, x; θ0 ) = ∂β σ 2 (x; β0 )/σ 2 (x; β0 )2 , (3.52)
for all x ∈ (ℓ, r).
Thus the conditions for small ∆-optimality are (3.50), (3.51) and (3.52). For a proof of
Theorem 3.9, see Jacobsen (2001). The condition (3.51) ensures that all entries of v−1 (g)
involving the diffusion coefficient parameter, β, are zero. Since v−1 (g) is the ∆−1 -order term
in the expansion (3.42) of the asymptotic covariance matrix, this dramatically decreases the
asymptotic variance of the estimator of β when ∆ is small. We refer to the condition (3.51)
as Jacobsen’s condition.
The reader is reminded of the trivial fact that for any non-singular 2 × 2 matrix, Mn ,
the estimating functions Mn Gn (θ) and Gn (θ) give exactly the same estimator. We call them
versions of the same estimating function. The matrix Mn may depend on ∆n . Therefore a
given version of an estimating function needs not satisfy (3.50) – (3.52). The point is that
a version must exist which satisfies these conditions.
20
Example 3.10 Consider a quadratic martingale estimating function of the form
where F and φ are given by (3.6) and (3.7). By (3.36), F (∆, x; θ) = x+O(∆) and φ(∆, x; θ) =
O(∆), so
a1 (x, 0; θ)(y − x)
!
g(0, y, x; θ) = . (3.54)
a2 (x, 0; θ)(y − x)2
Since ∂y g2 (0, y, x; θ) = 2a2 (x, ∆; θ)(y − x), the Jacobsen condition (3.51) is satisfied for all
quadratic martingale estimating functions. Using again (3.36), it is not difficult to see that
the two other conditions (3.50) and (3.52) are satisfied in three particular cases: the optimal
estimating function given in Example 3.6 and the approximations (3.8) and (3.40). 2
The following theorem gives conditions ensuring, for given functions f1 , . . . , fN , that a
small ∆-optimal estimating function of the form (3.20) and (3.30) exists. This not always
the case. We assume that the functions f1 (·; θ), . . . , fN (·; θ) are of full affine rank for all θ,
i.e., for any θ ∈ Θ, the identity
N
aθj fj (x; θ) + aθ0 = 0,
X
x ∈ (ℓ, r),
j=1
Theorem 3.11 Suppose that N ≥ 2, that the functions fj are twice continuously differen-
tiable and satisfies that the matrix
D(x) = (3.55)
∂x f2 (x; θ) ∂x2 f2 (x; θ)
is invertible for µθ -almost all x. Moreover, assume that the coefficients b and σ are con-
tinuously differentiable with respect to the parameter. Then a specification of the weight
matrix a(x; θ), independent of ∆, exists such that the estimating function (3.20) satisfies the
conditions (3.51), (3.50) and (3.52). When N = 2, these conditions are satisfy for
For a proof of Theorem 3.11, see Jacobsen (2002). In Section 6, we shall see that the
Godambe-Heyde optimal choice (3.22) of the weight-matrix in (3.20) gives an estimating
function which has a version that satisfies the conditions for small ∆-optimality, (3.50) –
(3.52).
We have focused on one-dimensional diffusions to simplify the exposition. The situ-
ation becomes more complicated for multi-dimensional diffusions, as we shall now briefly
21
describe. Details can be found in Jacobsen (2002). For a d-dimensional diffusion, b(x; α)
is d-dimensional and v(x; β) = σ(x; β)σ(x; β)T is a d × d-matrix. The Jacobsen condition
is unchanged (except that ∂y g2 (0, x, x; θ0 ) is now a d-dimensional vector). The other two
conditions for small ∆-optimality are
and −1
vec ∂y2 g2 (0, x, x; θ0 ) = vec (∂β v(x; β0 )) v ⊗2 (x; β0 ) .
In the latter equation, vec(M) denotes for a d × d matrix M the d2 -dimensional row vector
consisting of the rows of M placed one after the other, and M ⊗2 is the d2 × d2 -matrix with
(i′ , j ′ ), (ij)th entry equal to Mi′ i Mj ′ j . Thus if M = ∂β v(x; β) and M • = (v ⊗2 (x; β))−1 , then
the (i, j)th coordinate of vec(M) M • is i′ j ′ Mi′ j ′ M(i• ′ j ′ ),(i,j) .
P
For a d-dimensional diffusion process, the conditions analogous to those in Theorem 3.11
ensuring the existence of a small ∆-optimal estimating function of the form (3.20) is that
N ≥ d(d + 3)/2, and that the N × (d + d2 )-matrix
∂xT f (x; θ) ∂x2T f (x; θ)
The variance of the error can be estimated in the traditional way, and by the cental limit
theorem, the error is approximately normal distributed. This simple approach can be im-
proved by applying variance reduction methods, for instance methods that take advantage
of the fact that πθ∆ f (x) can be approximated by (3.36). Methods for numerical simulation
of diffusion models can be found in Kloeden & Platen (1999).
The approach just described is sufficient when calculating the conditional expectation
appearing in (3.30), although it is important to use the same random numbers (seed) when
calculating the estimating functions for different values of the parameter θ, for instance
when using a search algorithm to find a solution to the estimating equation. More care is
needed if the optimal weight functions are calculated numerically. The problem is that the
22
optimal weight matrix typically contain derivatives with respect to θ of functions that must
be determined numerically, see e.g. Example 3.6. Pedersen (1994) proposed a procedure for
determining ∂θ πθ∆ f (x; θ) by simulations based on results in Friedman (1975). However, it
is often preferable to use an approximation to the optimal weight matrix obtained by using
(3.36), possibly supplemented by Gaussian approximations, as explained in Subsection 3.3.
This is not only much simpler, but also avoids potentially serious problems of numerical
instability, and by results in Section 6 the loss of efficiency is often very small. The approach
outlined here, where martingale estimating functions are approximated by simulation, is
closely related to the simulated method of moments, see Duffie & Singleton (1993) and
Clement (1997).
One might be worried that when approximating a martingale estimating function by
simulation of conditional moments, the resulting estimator might have considerably smaller
efficiency or even be inconsistent. The asymptotic properties of the estimators obtained
when the conditional moments are approximated by simulation were investigated by Kessler
& Paredes (2002), who found that if the simulations are done with sufficient care, there is
no need to worry. However, their results also show that care is needed: if the discretization
used in the simulation method is too crude, the estimator behaves badly. Kessler & Paredes
(2002) considered martingale estimating functions of the general form
n h
X i
Gn (θ) = f (Xi∆ , X(i−1)∆ ; θ) − F (X(i−1)∆ ; θ) , (3.57)
i=1
23
for all θ ∈ Θ, for all x in the state space of X and for δ sufficiently small. Here R(x; θ)
is of polynomial growth in x uniformly for θ in compact sets, i.e., for any compact subset
K ⊆ Θ, there exist constants C1 , C2 > 0 such that supθ∈K |R(x; θ)| ≤ C1 (1 + |x|C2 ) for all x
in the state space of the diffusion. The inequality (3.60) is assumed to hold for any function
g(y, x; θ) which is 2(β + 1) times differentiable with respect to x, and satisfies that g and its
partial derivatives (with respect to x) up to order 2(β + 1) are of polynomial growth in x
uniformly for θ in compact sets. This definition of weak order is stronger than the definition
in Kloeden & Platen (1999) in that control of the polynomial order with respect to the
initial value x is added, but Kessler & Paredes (2002) point out that theorems in Kloeden
& Platen (1999) that give the order of approximation schemes can be modified in a tedious,
but straightforward, way to ensure that the schemes satisfy the stronger condition (3.60). In
particular, the Euler scheme (3.58) is of weak order one if the coefficients of the stochastic
differential equation (3.1) are smooth enough.
Under a number of further regularity conditions, Kessler & Paredes (2002) showed the
following results about a GM,δ M,δ
n -estimator, θ̂n , with Gn
M,δ
given by (3.57). We shall not go
into these rather technical conditions. Not surprisingly, they include conditions that ensure
the eventual existence of a consistent and asymptotically
√ normal Gn -estimator, cf. Theorem
3.2. If δ goes to zero sufficiently fast that nδ β → 0 as n → ∞, then
√ M,δ
D
n θ̂n − θ0 −→ N 0, (1 + M −1 )Σ ,
where Σ denotes the asymptotic covariance matrix of a Gn -estimator, see Theorem 3.2. Thus
for δ sufficiently small and M sufficiently large, it does not matter much that the conditional
moment F (x; θ) has been determined by simulation in (3.59). Moreover,√ β we can control the
loss of efficiency by our choice of M. However, when 0 < limn→∞ nδ < ∞,
√ M,δ
D
n θ̂n − θ0 −→ N m(θ0 ), (1 + M −1 )Σ ,
√
and when nδ β → ∞,
δ −β θ̂nN,δ − θ0 → m(θ0 )
in probability. Here the p-dimensional vector m(θ0 ) depends on f and is generally different
from zero. Thus it is essential that a sufficiently small value of δ is used.
where the real number λj (θ) ≥ 0 is called the eigenvalue corresponding to fj (x; θ). Under
weak regularity conditions, fj is also an eigenfunction for the transition operator πtθ , i.e.
24
Theorem 3.12 Let φ(x; θ) be an eigenfunction for the generator (3.37) with eigenvalue
λ(θ). Suppose Z r
[∂x φ(x; θ)σ(x; θ)]2 µθ (dx) < ∞ (3.62)
ℓ
for all t > 0. Then
πtθ (φ(θ))(x) = e−λ(θ)t φ(x; θ). (3.63)
for all t > 0.
Example 3.13 For the square-root model (CIR-model) defined by (3.27) with α > 0, β > 0,
(ν) (ν)
and τ > 0, the eigenfunctions are φi (x) = Li (2βxτ −2 ) with ν = 2αβτ −2 − 1, where Li is
the ith order Laguerre polynomial
i
xm
!
(ν) X m i+ν
Li (x) = (−1) ,
m=0
i−m m!
(ν)
and the eigenvalues are {iβ : i = 0, 1, · · ·}. It is easily seen by direct calculation that Li
solves the differential equation
By Theorem 3.12, (3.63) holds, so we can calculate all conditional polynomial moments, of
which the first four were given in Example 3.6. Thus all polynomial martingale estimating
functions are explicit.
2
is an ergodic diffusion on the interval (−π/2, π/2) provided that θ ≥ 1/2, so that Condition
3.1 is satisfied. This process was introduced by Kessler & Sørensen (1999), who called it an
Ornstein-Uhlenbeck process on (−π/2, π/2) because tan x ∼ x near zero. The generalization
to other finite intervals is obvious. The invariant measure has a density proportional to
cos(x)2θ .
The eigenfunctions are
25
where Ciθ is a Gegenbauer polynomial of order i, and the eigenvalues are i(θ + i/2), i =
1, 2, . . .. This follows because the Gegenbauer polynomial Ciθ solves the differential equation
Condition (3.62) in Theorem 3.12 is obviously satisfied because the state space is bounded,
so (3.63) holds.
The first non-trivial eigenfunction is sin(x) (a constant is omitted) with eigenvalue θ+1/2.
From the martingale estimating function
n
sin(X(i−1)∆ )[sin(Xi∆ )) − e−(θ+1/2)∆ sin(X(i−1)∆ ))],
X
Ǧn (θ) (3.65)
i=1
(a,b)
with eigenvalues λi (ρ, ϕ, σ) = i ρ + 21 nσ 2 , i = 1, 2, . . .. Here Pi (x) denotes the Jacobi
polynomial of order i.
2
26
For most diffusion models where explicit expressions for eigenfunctions can be found,
including the examples above, the eigenfunctions are of the form
i
ai,j (θ) κ(y)j
X
φi (y; θ) = (3.66)
j=0
where κ is a real function defined on the state space and is independent of θ. For martingale
estimating functions based on eigenfunctions of this form, the optimal weight matrix (3.22)
can be found explicitly too.
Theorem 3.15 Suppose 2N eigenfunctions are of the form (3.66) for i = 1, . . . , 2N, where
the coefficients ai,j (θ) are differentiable with respect to θ. If a martingale estimating functions
is defined by (3.30) using the first N eigenfunctions, then
j
∂θi aj,k (θ)νk (x; θ) − ∂θi [e−λj (θ)∆ φj (x; θ)]
X
Bh (x, θ)ij = (3.67)
k=0
and
r=0 s=0
θ
where νi (x; θ) = π∆ (κi )(x), i = 1, . . . , 2N, solve the following triangular system of linear
equations
i
−λi (θ)∆
X
e φi (x; θ) = ai,j (θ)νj (x; θ) i = 1, . . . , 2N, (3.69)
j=0
with ν0 (x; θ) = 1.
Proof: The expressions for Bh and Vh follow from (3.33) and (3.34) when the eigenfunc-
θ
tions are of the form (3.66), and (3.69) follows by applying π∆ to both sides of (3.66).
2
Example 3.16 Consider again the diffusion (3.64) in Example 3.14. We will find the opti-
mal martingale estimating function based on the first non-trivial eigenfunction, sin(x) (where
we have neglected a non-essential multiplicative function of θ) with eigenvalue θ + 1/2. It
follows from (3.33) that
Bh (x; θ) = ∆e−(θ+1/2)∆ sin(x)
because sin(x) does not depend on θ. To find Vh we need Theorem 3.15. The second non-
trivial eigenfunction is 2(θ + 1) sin2 (x) − 1 with eigenvalue 2(θ + 1), so
1 1
ν2 (x; θ) = e−2(θ+1)∆ [sin2 (x) − (θ + 1)−1 ] + (θ + 1)−1 .
2 2
Hence the optimal estimating function is
n 1
sin(X(i−1)∆ )[sin(Xi∆ ) − e−(θ+ 2 )∆ sin(X(i−1)∆ )]
G◦n (θ)
X
= 1 2(θ+1)∆
i=1 2
(e − 1)/(θ + 1) − (e∆ − 1) sin2 (X(i−1)∆ )
27
where a constant has been omitted. When ∆ is small it is a good idea to multiply G◦n (θ) by
∆ because the denominator is then of order ∆.
Note that when ∆ is sufficiently small, we can expand the exponential function in the
numerator to obtain (after multiplication by ∆) the approximately optimal estimating func-
tion 1
n
X sin(X(i−1)∆ )[sin(Xi∆ ) − e−(θ+ 2 )∆ sin(X(i−1)∆ )]
G̃n (θ) = ,
i=1 cos2 (X(i−1)∆ )
which has the explicit solution
Pn !
−1 i=1 tan(X(i−1)∆ ) sin(Xi∆ ))/ cos(X(i−1)∆ ) 1
θ̃n = −∆ log Pn 2 − .
i=1 tan (X(i−1)∆ ) 2
The explicit estimator θ̃ can, for instance, be used as a starting value when finding the
optimal estimator by solving G◦n (θ) = 0 numerically. Note however that for G̃n the square
integrability under Qθ0 (3.12) required in Theorem 3.2 (to ensure the central limit theorem)
is only satisfied when θ0 > 1.5. This problem can be avoided by replacing cos2 (X(i−1)∆ ) in
the numerator by 1, which it is close to when the process is not near the boundaries. In that
way we arrive at the simple estimating function (3.65), which is thus also approximately
optimal.
2
where β > 0, and a, b and c are such that the square root is well defined when Xt is in the
state space. The parameter β > 0 is a scaling of time that determines how fast the diffusion
moves. The parameters α, a, b, and c determine the state space of the diffusion as well as
the shape of the invariant distribution. In particular, α is the expectation of the invariant
distribution. We define θ = (α, β, a, b, c).
In the context of martingale estimating functions, an important property of the Pearson
diffusions is that the generator (3.37) maps polynomials into polynomials. It is therefore
easy to find eigenfunctions among the polynomials
n
pn,j xj .
X
pn (x) =
j=0
or
n n−1 n−2
{λn − aj }pn,j xj + bj+1 pn,j+1xj + cj+2 pn,j+2xj = 0.
X X X
28
where aj = j{1 − (j − 1)a}β, bj = j{α + (j − 1)b}β, and cj = j(j − 1)cβ for j = 0, 1, 2, . . ..
Without loss of generality, we assume pn,n = 1. Thus, equating the coefficients we find that
the eigenvalue is given by
λn = an = n{1 − (n − 1)a}β. (3.71)
If we define pn,n+1 = 0, then the coefficients {pn,j }j=0,...,n−1 solve the linear system
29
The Pearson system is defined as the class of probability densities obtained by solving a
differential equation of this form, see Pearson (1895).
In the following we present a full classification of the ergodic Pearson diffusions, which
shows that all distributions in the Pearson system can be obtained as invariant distributions
for a model in the class of Pearson diffusions. We consider six cases according to whether
the squared diffusion coefficient is constant, linear, a convex parabola with either zero, one
or two roots, or a concave parabola with two roots. The classification problem can be
reduced by first noting that the Pearson class of diffusions is closed under location and
scale-transformations. To be specific, if X is an ergodic Pearson diffusion, then so is X̃
where X̃t = γXt + δ. The parameters of the stochastic differential equation (3.70) for X̃
are ã = a, b̃ = bγ − 2aδ, c̃ = cγ 2 − bγδ + aδ 2 , β̃ = β, and α̃ = γα + δ. Hence, up to
transformations of location and scale, the ergodic Pearson diffusions can take the following
forms. Note that we consider scale transformations in a general sense where multiplication
by a negative real number is allowed, so that to each case of a diffusion with state space
(0, ∞) there corresponds a diffusion with state space (−∞, 0).
Case 1: σ 2 (x) = 2β. The solution to (3.70) is an Ornstein-Uhlenbeck process. The state
space is IR, and the invariant distribution is the normal distribution with mean α and variance
1. The eigenfunctions are the Hermite polynomials.
Case 2: σ 2 (x) = 2βx. The solution to (3.70) is the square root process (CIR process)
(3.27) with state space (0, ∞). Condition 3.1 that ensures ergodicity is satisfied if and only
if α > 1. If 0 < α ≤ 1, the boundary 0 can with positive probability be reached at a finite
time point, but if the boundary is made instantaneously reflecting, we obtain a stationary
process. The invariant distribution is the gamma distribution with scale parameter 1 and
shape parameter α. The eigenfunctions are the Laguerre polynomials.
Case 3: a > 0 and σ 2 (x) = 2βa(x2 +1). The state space is the real line, and the scale density
1
is given by s(x) = (x2 +1) 2a exp(− αa tan−1 x). By Condition 3.1, the solution is ergodic for all
1
a > 0 and all α ∈ IR. The invariant density is given by µθ (x) ∝ (x2 + 1)− 2a −1 exp( αa tan−1 x)
If α = 0 the invariant distribution is a scaled t-distribution with ν = 1 + a−1 degrees of
1
freedom and scale parameter ν − 2 . If α 6= 0 the invariant distribution is skew and has tails
decaying at the same rate as the t-distribution with 1 + a−1 degrees of freedom. A fitting
name for this distribution is the skew t-distribution. It is also known as Pearson’s type IV
distribution. In either case the mean is α and the invariant distribution has moments of
order k for k < 1 + a−1 . With its skew and heavy tailed marginal distribution, the class of
diffusions with α 6= 0 is potentially very useful in many applications, e.g. finance. It was
studied and fitted financial data by Nagahara (1996) using the local linearization method of
Ozaki (1985). We consider this process in more detail below.
Case 4: a > 0 and σ 2 (x) = 2βax2 . The state space is (0, ∞) and the scale density is
1 α
s(x) = x a exp( ax ). Condition 3.1 holds if and only if α > 0. The invariant distribution is
1
α
given by µθ (x) ∝ x− a −2 exp(− ax ), and is thus an inverse gamma distribution with shape
1
parameter 1 + a and scale parameter αa . The invariant distribution has moments of order k
for k < 1 + a1 . This process is sometimes referred to as the GARCH diffusion model. The
polynomial eigenfunctions are known as the Bessel polynomials.
Case 5: a > 0 and σ 2 (x) = 2βax(x + 1).The state space is (0, ∞) and the scale density is
α+1 α
s(x) = (1 + x) a x− a . The ergodicity Condition 3.1 holds if and only if αa ≥ 1. Hence, for all
a > 0 and all µ ≥ a a unique ergodic solution to (3.70) exists. If 0 < α < 1, the boundary 0
30
can with positive probability be reached at a finite time point, but if the boundary is made
instantaneously reflecting, a stationary process is obtained. The density of the invariant
α+1 α
distribution is given by µθ (x) ∝ (1 + x)− a −1 x a −1 . This is a scaled F-distribution with
2α
a
and a2 + 2 degrees of freedom and scale parameter 1+a α
. The invariant distribution has
1
moments of order k for k < 1 + a .
Case 6: a < 0 and σ 2 (x) = 2βax(x − 1). The state space is (0, ∞) and the scale density
1−α α
is s(x) = (1 − x) a x a . Condition 3.1 holds if and only if αa ≤ −1 and 1−α a
≤ −1. Hence,
for all a < 0 and all α > 0 such that min(α, 1 − α) ≥ −a a unique ergodic solution to (3.70)
exists. If 0 < α < −a, the boundary 0 can with positive probability be reached at a finite
time point, but if the boundary is made instantaneously reflecting, a stationary process is
obtained. Similar remarks apply to the boundary 1 when 0 < 1 − α < −a. The invariant
1−α α
distribution is given by µθ (x) ∝ (1 − x)− a −1 x− a −1 and is thus the Beta distribution with
α
shape parameters −a , 1−α
−a
. This class of diffusions will be discussed in more detail below.
It is often referred to as the Jacobi diffusions because the related eigenfunctions are Jacobi
polynomials. Multivariate Jacobi diffusions were considered by Gourieroux & Jasiak (2006).
Example 3.17 The skew t-distribution with mean zero, ν degrees of freedom, and skewness
parameter ρ has (unnormalized) density
f (z) ∝
√ n √ o
{(z/ ν + ρ)2 + 1}−(ν+1)/2 exp ρ(ν − 1) tan−1 z/ ν + ρ ,
√
which is the invariant density of the diffusion Zt = ν(Xt − ρ) with ν = 1 + a−1 and
ρ = α, where X is as in Case 3. An expression for the normalizing constant when ν is
integer valued was derived in Nagahara (1996). By the transformation result above, the
corresponding stochastic differential equation is
q
1
dZt = −βZt dt + 2β(ν − 1)−1 {Zt2 + 2ρν 2 Zt + (1 + ρ2 )ν}dWt . (3.73)
p1 (z) = z,
1
4ρν 2
2 (1 + ρ2 )ν
p2 (z) = z − z− ,
ν−3 ν −2
1 3
3 12ρν 2 2 24ρ2 ν + 3(1+ρ2)ν(ν − 5) 8ρ(1+ρ2 )ν 2
p3 (z) = z − z + z+ ,
ν−5 (ν − 5)(ν − 4) (ν −5)(ν −3)
and
1
24ρν 2 3 144ρ2 ν − 6(1 + ρ2 )ν(ν − 7) 2
p4 (z) = z 4 − z + z
ν−7 (ν − 7)(ν − 6)
3 3 3
8ρ(1 + ρ2 )ν 2 (ν − 7) + 48ρ(1 + ρ2 )ν 2 (ν − 6) − 192ρ3 ν 2
+ z
(ν − 7)(ν − 6)(ν − 5)
3(1 + ρ2 )2 ν(ν − 7) − 72ρ2 (1 + ρ2 )ν 2
+ ,
(ν − 7)(ν − 6)(ν − 4)
31
provided that ν > 4 If ν > 2i the first i eigenfunctions are square integrable and thus
satisfy (3.62). Hence (3.63) holds, and the eigenfunctions can be used to construct explicit
martingale estimating functions. 2
where β > 0 and γ ∈ (−1, 1) has been proposed as a model for the random variation of the
logarithm of an exchange rate in a target zone between realignments by De Jong, Drost &
Werker (2001) (γ = 0) and Larsen & Sørensen (2007). This is a diffusion on the interval
(m − z, m + z) with mean reversion around m + γz. It is a Jacobi diffusion obtained by a
location-scale transformation of the diffusion in Case 6 above. The parameter γ quantifies
the asymmetry of the model. When β(1 − γ) ≥ σ 2 and β(1 + γ) ≥ σ 2 , X is an ergodic
diffusion, for which the stationary distribution is a Beta-distribution on (m − z, m + z) with
parameters κ1 = β(1 − γ)σ −2 and κ2 = β(1 + γ)σ −2 . If the parameter restrictions are not
satisfied, one or both of the boundaries can be hit in finite time, but if the boundaries are
made instantaneously reflecting, a stationary process is obtained.
The eigenfunctions for the generator of the diffusion (3.74) are
(κ1 −1, κ2 −1)
φi (x; β, γ, σ, m, z) = Pi ((x − m)/z), i = 1, 2, . . .
(a,b)
where Pi (x) denotes the Jacobi polynomial of order i given by
i
! !
(a,b) n+a a+b+n+j
2−j (x − 1)j ,
X
Pi (x) = −1 < x < 1.
j=0 n−j j
The eigenvalue of φi is i(β + 21 σ 2 (i − 1)). Since (3.62) is obviously satisfied, (3.63) holds, so
that the eigenfunctions can be used to construct explicit martingale estimating functions. 2
Explicit formulae for the conditional moments of a Pearson diffusion can be obtained
from the eigenfunctions by means on (3.61). Specifically,
n
n X
E(Xtn | X0 = x) = qn,k,ℓ · e−λℓ t · xk ,
X
(3.75)
k=0 ℓ=0
for k, ℓ = 0, . . . , n − 1 with λℓ and pn,j given by (3.71) and (3.72). For details see Forman &
Sørensen (2008).
Also the moments of the Pearson diffusions can, when they exist, be found explicitly by
using the fact that the integral of the eigenfunctions with respect to the invariant probability
measure is zero.We have seen above that E(|Xt |κ ) < ∞ if and only if a < (κ − 1)−1 . Thus if
a ≤ 0 all moments exist, while for a > 0 only the moments satisfying that κ < a−1 + 1 exist.
32
In particular, the expectation always exists. The moments of the invariant distribution can
be found by the recursion
E(Xtn ) = a−1 n−1
n {bn · E(Xt ) + cn · E(Xtn−2 )} (3.76)
where an = n{1 − (n − 1)a}β, bn = n{α + (n − 1)b}β, and cn = n(n − 1)cβ for n = 0, 1, 2, . . ..
The initial conditions are given by E(Xt0 ) = 1, and E(Xt ) = α. This can be found from the
expressions for the eigenfunctions, but is more easily seen as follows. By Ito’s formula
dXtn = −βnXtn−1 (Xt − µ)dt + βn(n − 1)Xtn−2(aXt2 + bXt + c)dt
+nXtn−1 σ(Xt )dWt ,
and if E(Xt2n ) is finite, i.e. if a < (2n − 1)−1 , the integral of the last term is a martingale
with expectation zero.
Example 3.19 Equation (3.76) allows us to find the moments of the skewed t-distribution,
in spite of the fact that the normalizing constant of the density is unknown. In particular,
for the diffusion (3.73),
E(Zt ) = 0,
(1 + ρ2 )ν
E(Zt2 ) = ,
ν −2
3
3 4ρ(1 + ρ2 )ν 2
E(Zt ) = ,
(ν − 3)(ν − 2)
24ρ2 (1 + ρ2 )ν 2 + 3(ν − 3)(1 + ρ2 )2 ν 2
E(Zt4 ) = .
(ν − 4)(ν − 3)(ν − 2)
2
For a diffusion T (X) obtained from a solution X to (3.70) by a twice differentiable and
invertible transformation T , the eigenfunctions of the generator are pn {T −1 (x)}, where pn
is an eigenfunction of the generator of X. The eigenvalues are the same as for the original
eigenfunctions. Since the original eigenfunctions are polynomials, the eigenfunctions of T (X)
are of the form (3.66) with κ = T −1 . Hence explicit optimal martingale estimating functions
are also available for transformations of Pearson diffusions, which is a very large and flexible
class of diffusion processes. Their stochastic differential equations can, of course, be found
by Ito’s formula.
33
which has invariant distribution with density f = F ′ . A particular example is the logistic
distribution
ex
F (x) = x ∈ IR,
1 + ex
for which n o q
4
dYt = −β sinh(x) + 8 cosh (x/2) dt + 2 β cosh(x/2)dWt .
If the same transformation F −1 (y) = log(y/(1 −y)) is applied to the general Jacoby diffusion
(case 6), then we obtain
n o
dXt = −β 1 − 2µ + (1 − µ)ex − µe−1 − 8a cosh4 (x/2) dt
q
+2 −aβ cosh(x/2)dWt ,
a diffusion for which the invariant distribution is the generalized logistic distribution with
density
eκ1 x
f (x) = , x ∈ IR,
(1 + ex )κ1 +κ2 B(κ1 , κ2 )
where κ1 = −(1 − α)/a, κ2 = α/a and B denotes the Beta-function. This distribution was
introduced and studied in Barndorff-Nielsen, Kent & Sørensen (1982).
Example 3.21 Let again X be a general Jacobi-diffusion (case 6). If we apply the trans-
formation T (x) = sin−1 (2x − 1) to Xt we obtain the diffusion
sin(Yt ) − ϕ q
dYt = −ρ dt + −aβ/2dWt ,
cos(Yt )
where ρ = β(1 + a/4) and ϕ = (2α − 1)/(1 + a/4). The state space is (−π/2, π/2). Note
that Y has dynamics that are very different from those of the Jacobi diffusion: the drift is
non-linear and the diffusion coefficient is constant. This model was considered in Example
3.14.
generator of the diffusion. Bayesian estimators with the same asymptotic properties as the
maximum likelihood estimator can be obtained by Markov chain Monte Carlo methods, see
Elerian, Chib & Shephard (2001), Eraker (2001), and Roberts & Stramer (2001). Finally,
exact and computationally efficient likelihood-based estimation methods were presented by
Beskos et al. (2006).
converges and is strictly positive definite, and that Qθ0 (gi (θ)2+ǫ ) < ∞, i = 1, . . . , p for some
ǫ > 0, see e.g. Doukhan (1994). To define the concept of α-mixing, let Ft denote the σ-
field generated by {Xs | s ≤ t} and let F t denote the σ-field generated by {Xs | s ≥ t}. A
stochastic process X is said to be α-mixing if

sup{ |P(A ∩ B) − P(A)P(B)| : A ∈ F_t, B ∈ F^{t+u} } ≤ α(u)

for all t > 0 and u > 0, where α(u) → 0 as u → ∞. This means that X_t and X_{t+u} are
almost independent when u is large. If there exist positive constants c_1 and c_2 such that

α(u) ≤ c_1 e^{−c_2 u},
for all u > 0, then the process X is called geometrically α-mixing. For one-dimensional
diffusions there are simple conditions for geometric α-mixing. If all non-zero eigenvalues of
the generator (3.37) are larger than λ > 0, then the diffusion is geometrically α-mixing with
c2 = λ. This is for instance the case if the spectrum of the generator is discrete. Ergodic
diffusions with a linear drift −β(x − α), β > 0, as for instance the Pearson diffusions, are
geometrically α-mixing with c2 = β; see Hansen, Scheinkman & Touzi (1998).
Genon-Catalot, Jeantheau & Larédo (2000) gave the following simple sufficient condition
for the one-dimensional diffusion that solves (3.1) to be geometrically α-mixing.
Condition 5.1
(i) The function b is continuously differentiable with respect to x and σ is twice continuously
differentiable with respect to x, σ(x; θ) > 0 for all x ∈ (ℓ, r), and there exists a constant
Kθ > 0 such that |b(x; θ)| ≤ Kθ (1 + |x|) and σ 2 (x; θ) ≤ Kθ (1 + x2 ) for all x ∈ (ℓ, r).
(ii) σ(x; θ)µθ (x) → 0 as x ↓ ℓ and x ↑ r.
(iii) 1/γ(x; θ) has a finite limit as x ↓ ℓ and x ↑ r, where γ(x; θ) = ∂_x σ(x; θ) − 2b(x; θ)/σ(x; θ).
Other conditions for geometric α-mixing were given by Veretennikov (1987), Hansen &
Scheinkman (1995), and Kusuoka & Yoshida (2000).
For geometrically α-mixing diffusion processes and estimating functions Gn satisfying
Condition 2.1 the existence of a θ̄-consistent and asymptotically normal Gn -estimator follows
from Theorem 2.2, which also contains a result about eventual uniqueness of the estimator.
and

W = µ_{θ_0}( ∂_{θ^T} h(θ_0) ) = ∫_ℓ^r ∂_{θ^T} h(x; θ_0) µ_{θ_0}(x) dx.
The condition for eventual uniqueness of the Gn -estimator (2.7) is here that θ0 is the only
root of µθ0 (h(θ)).
Kessler (2000) proposed
h(x; θ) = ∂θ log µθ (x), (5.4)
which is the score function (the derivative of the log-likelihood function) if we pretend
that the observations are an i.i.d. sample from the stationary distribution. If ∆ is large, this
might be a reasonable approximation. That (5.3) is satisfied for this specification of h follows
under standard conditions that allow the interchange of differentiation and integration:

∫_ℓ^r ( ∂_θ log µ_θ(x) ) µ_θ(x) dx = ∫_ℓ^r ∂_θ µ_θ(x) dx = ∂_θ ∫_ℓ^r µ_θ(x) dx = 0.
Hansen & Scheinkman (1995) and Kessler (2000) proposed and studied the generally
applicable specification
hj (x; θ) = Aθ fj (x; θ), (5.5)
where Aθ is the generator (3.37), and fj , j = 1, . . . , d are twice differentiable functions chosen
such that Condition 2.1 holds. The estimating function with h given by (5.5) can easily be
applied to multivariate diffusions, because an explicit expression for the invariant density
µθ is not needed. The following lemma for one-dimensional diffusions shows that only weak
conditions are needed to ensure (5.3).
lim_{x→r} f′(x)σ^2(x; θ)µ_θ(x) = lim_{x→ℓ} f′(x)σ^2(x; θ)µ_θ(x).    (5.6)

Then

∫_ℓ^r (A_θ f)(x) µ_θ(x) dx = 0.
Proof: Note that by (3.10), the function ν(x; θ) = ½σ^2(x; θ)µ_θ(x) satisfies ν′(x; θ) =
b(x; θ)µ_θ(x). In this proof all derivatives are with respect to x. It follows that

∫_ℓ^r (A_θ f)(x) µ_θ(x) dx = ∫_ℓ^r ( b(x; θ)f′(x) + ½σ^2(x; θ)f′′(x) ) µ_θ(x) dx
  = ∫_ℓ^r ( f′(x)ν′(x; θ) + f′′(x)ν(x; θ) ) dx = ∫_ℓ^r ( f′(x)ν(x; θ) )′ dx
  = ½ lim_{x→r} f′(x)σ^2(x; θ)µ_θ(x) − ½ lim_{x→ℓ} f′(x)σ^2(x; θ)µ_θ(x) = 0.
Example 5.3 Consider the square-root process (3.27) with σ = 1. For f1 (x) = x and
f_2(x) = x^2, we see that

A_θ f(x) = ( −β(x − α), −2β(x − α)x + x )^T,

which gives the simple estimators

α̂_n = (1/n) Σ_{i=1}^n X_{i∆},    β̂_n = ( (1/n) Σ_{i=1}^n X_{i∆} ) / ( 2[ (1/n) Σ_{i=1}^n X_{i∆}^2 − ( (1/n) Σ_{i=1}^n X_{i∆} )^2 ] ).
The condition (5.6) is obviously satisfied because the invariant distribution is a gamma
distribution.
2
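The two estimators in Example 5.3 only require sample moments. A minimal Python sketch, assuming the observations X_∆, . . . , X_{n∆} are stored in a numpy array obs (names ours):

import numpy as np

def simple_estimators(obs):
    """alpha and beta estimators from Example 5.3 (f_1(x) = x, f_2(x) = x^2)."""
    m1, m2 = np.mean(obs), np.mean(obs ** 2)
    return m1, m1 / (2.0 * (m2 - m1 ** 2))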
Sørensen (2001) derived the estimating function of the form (5.2) with
As mentioned above, an estimating function of the form (5.2) cannot be expected to
yield as efficient estimators as an estimating function that depends on pairs of consecutive
observations, and thus can use the information contained in the transitions. Hansen &
Scheinkman (1995) proposed non-martingale estimating functions of the form (3.2) with g
given by
gj (∆, x, y; θ) = hj (y)Aθ fj (x) − fj (x)Âθ hj (y), (5.8)
where the functions fj and hj satisfy weak regularity conditions ensuring that (2.4) holds
for θ̄ = θ0 . The differential operator Âθ is the generator of the time reversal of the observed
diffusion X. For a multivariate diffusion it is given by
Â_θ f(x) = Σ_{k=1}^d b̂_k(x; θ) ∂_{x_k} f(x) + ½ Σ_{k,ℓ=1}^d C_{kℓ}(x; θ) ∂²_{x_k x_ℓ} f(x),

where C = σσ^T and

b̂_k(x; θ) = −b_k(x; θ) + (1/µ_θ(x)) Σ_{ℓ=1}^d ∂_{x_ℓ}( µ_θ C_{kℓ} )(x; θ).
For one-dimensional ergodic diffusions, Âθ = Aθ . Obviously, the estimating function of the
form (5.2) with hj (x; θ) = Aθ fj (x) is a particular case of (5.8) with hj (y) = 1.
We assume that the solution is unique. Using the expansion (3.36), we find that
Q_{θ_0}(g(θ)) = µ_{θ_0}( a(θ)[ π_∆^{θ_0} f − f − ∆A_θ f ] )
  = ∆ µ_{θ_0}( a(θ)[ A_{θ_0} f − A_θ f + ½∆A²_{θ_0} f ] ) + O(∆^3)
  = (θ_0 − θ) ∆ µ_{θ_0}( a(θ_0) ∂_θ A_{θ_0} f ) + ½∆^2 µ_{θ_0}( a(θ_0) A²_{θ_0} f )
    + O(∆|θ − θ_0|^2) + O(∆^2|θ − θ_0|) + O(∆^3).

If we neglect all O-terms, we obtain that

θ̄ ≈ θ_0 + ∆ ½ µ_{θ_0}( a(θ_0) A²_{θ_0} f ) / µ_{θ_0}( a(θ_0) ∂_θ A_{θ_0} f ),
which indicates that when ∆ is small, the asymptotic bias is of order ∆. However, the bias
can be huge when ∆ is not sufficiently small as the following example shows.
Example 5.4 Consider again a diffusion with linear drift, b(x; θ) = −β(x − α). In this case
(5.9) with f (x) = x gives the estimating function
G_n(θ) = Σ_{i=1}^n a(X_{∆(i−1)}; θ)[ X_{∆i} − X_{∆(i−1)} + β( X_{∆(i−1)} − α )∆ ],
where a is 2-dimensional. For a diffusion with linear drift, we found in Example 3.7 that
F(x; α, β) = α + (x − α)e^{−β∆}. Using this, we obtain that
Q_{θ_0}(g(θ)) = c_1( e^{−β_0∆} − 1 + β∆ ) + c_2 β(α_0 − α),

where

c_1 = ∫_D a(x) x µ_{θ_0}(dx) − µ_{θ_0}(a) α_0,    c_2 = µ_{θ_0}(a)∆.

Thus ᾱ = α_0 and

β̄ = (1 − e^{−β_0∆})/∆ ≤ 1/∆.
We see that the estimator of α is consistent, while the estimator of β will tend to be small
if ∆ is large, irrespective of the value of β0 . We see that what determines how well β̂ works
is the magnitude of β0 ∆, so it is not enough to know that ∆ is small. Moreover, we cannot
use β̂∆ to evaluate whether there is a problem, because this quantity will always tend to be
smaller than one. If β0 ∆ actually is small, then the bias is proportional to ∆ as expected
β̄ = β0 − 21 ∆β02 + O(∆2 ).
We can get an impression of what can happen when estimating the parameter β by means
of the dangerous estimating function given by (5.9) from the simulation study in Bibby &
Sørensen (1995) for the square root process (3.27). The result is given in Table 5.1. For
the function a the approximately optimal weight function was used, cf. Example 3.7. For
different values of ∆ and the sample size, 500 independent datasets were simulated, and
the estimators were calculated for each dataset. The expectation of the estimator β̂ was
determined as the average of the simulated estimators. The parameter values were α = 10,
β = 1 and τ = 1, and the initial value was x0 = 10. When ∆ is large, the behaviour of the
estimator is bizarre. 2
∆ # obs. mean ∆ # obs. mean
0.5 200 0.81 1.5 200 0.52
500 0.80 500 0.52
1000 0.79 1000 0.52
1.0 200 0.65 2.0 200 0.43
500 0.64 500 0.43
1000 0.63 1000 0.43
Table 5.1: Empirical mean of 500 estimates of the parameter β in the CIR model. The true
parameter values are α = 10, β = 1, and τ = 1.
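The simulated means in Table 5.1 agree closely with the limit β̄ = (1 − e^{−β0∆})/∆ derived in Example 5.4, as a quick check shows (β_0 = 1 as in the simulations):

import numpy as np

beta0 = 1.0
for Delta in (0.5, 1.0, 1.5, 2.0):
    print(Delta, (1.0 - np.exp(-beta0 * Delta)) / Delta)
# approximately 0.79, 0.63, 0.52 and 0.43, close to the empirical means in the table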
The asymptotic bias given by (5.10) is small when ∆ is sufficiently small, and the results
in the following section on high frequency asymptotics show that in this case the approximate
martingale estimating functions work well. However, how small ∆ needs to be depends on
the parameter values, and without prior knowledge about the parameters, it is safer to use
an exact martingale estimating function, which gives consistent estimators at all sampling
frequencies.
6 High-frequency asymptotics
A large number of estimating functions have been proposed for diffusion models, and a large
number of simulation studies have been performed to compare their relative merits, but the
general picture has been rather confusing. By considering the high frequency scenario,
n → ∞, ∆n → 0, n∆n → ∞, (6.1)
Sørensen (2007) obtained simple conditions for rate optimality and efficiency for ergodic
diffusions, which allow identification of estimators that work well when the time between
observations, ∆n , is not too large. For financial data the speed of reversion is usually slow
enough that this type of asymptotics works for daily, sometimes even weekly observations.
A main result of this theory is that under weak conditions optimal martingale estimating
functions give rate optimal and efficient estimators.
To simplify the exposition, we restrict attention to a one-dimensional diffusion given by

dX_t = b(X_t; α)dt + σ(X_t; β)dW_t,    (6.2)

where θ = (α, β) ∈ Θ ⊆ IR^2. The results below can be generalized to multivariate diffusions
and parameters of higher dimension. We consider estimating functions of the general form
(2.1), where the two-dimensional function g = (g1 , g2 ) for some κ ≥ 2 and for all θ ∈ Θ
satisfies
Eθ (g(∆n , X∆n i , X∆n(i−1) ; θ) | X∆n(i−1) ) = ∆κn R(∆n , X∆n (i−1) ; θ). (6.3)
Here and later R(∆, y, x; θ) denotes a function such that |R(∆, y, x; θ)| ≤ F (y, x; θ), where
F is of polynomial growth in y and x uniformly for θ in a compact set1 . We assume that the
¹For any compact subset K ⊆ Θ, there exist constants C_1, C_2, C_3 > 0 such that sup_{θ∈K} |F(y, x; θ)| ≤
C_1(1 + |x|^{C_2} + |y|^{C_3}) for all x and y in the state space of the diffusion.
diffusion and the estimating functions satisfy the technical regularity Condition 6.2 given
below.
Martingale estimating functions obviously satisfy (6.3) with R = 0, but for instance the
approximate martingale estimating functions discussed at the end of the previous section
satisfy (6.3) too.
∂_y g_2(0, x, x; θ) = 0,    (6.4)
∂_y g_1(0, x, x; θ) = ∂_α b(x; α)/σ^2(x; β),    (6.5)
∂²_y g_2(0, x, x; θ) = ∂_β σ^2(x; β)/σ^2(x; β)^2,    (6.6)

for all x ∈ (ℓ, r) and θ ∈ Θ. Assume, moreover, that the following identifiability condition
is satisfied:

∫_ℓ^r [ b(x, α_0) − b(x, α) ] ∂_y g_1(0, x, x; θ) µ_{θ_0}(x) dx ≠ 0  when α ≠ α_0,

∫_ℓ^r [ σ^2(x, β_0) − σ^2(x, β) ] ∂²_y g_2(0, x, x; θ) µ_{θ_0}(x) dx ≠ 0  when β ≠ β_0,
and that

W_1 = ∫_ℓ^r ( (∂_α b(x; α_0))^2 / σ^2(x; β_0) ) µ_{θ_0}(x) dx ≠ 0,

W_2 = ∫_ℓ^r ( ∂_β σ^2(x; β_0) / σ^2(x; β_0) )^2 µ_{θ_0}(x) dx ≠ 0.
Then a consistent Gn –estimator θ̂n = (α̂n , β̂n ) exists and is unique in any compact subset of
Θ containing θ0 with probability approaching one as n → ∞. For a martingale estimating
function, or more generally if n∆_n^{2(κ−1)} → 0,

( √(n∆_n) (α̂_n − α_0), √n (β̂_n − β_0) )^T −→^D N_2( 0, diag( W_1^{−1}, W_2^{−1} ) ).    (6.7)
An estimator satisfying (6.7) is rate optimal and efficient, cf. Gobet (2002), who showed
that the model considered here is locally asymptotically normal. Note that the estimator
of the diffusion coefficient parameter, β, converges faster than the estimator of the drift
parameter, α. Condition (6.4) implies rate optimality. If this condition is not satisfied, the
estimator of the diffusion coefficient parameter converges at the slower rate √(n∆_n). This
condition is called the Jacobsen condition, because it appears in the theory of small ∆-optimal
estimation developed in Jacobsen (2001) and Jacobsen (2002). In this theory the asymptotic
covariance matrix in (3.15) is expanded in powers of ∆, the time between observations. The
leading term is minimal when (6.5) and (6.6) are satisfied. The same expansion of (3.15)
was used by Aı̈t-Sahalia & Mykland (2004).
The assumption n∆n → ∞ in (6.1) is needed to ensure that the drift parameter, α, can
be consistently estimated. If the drift is known and only the diffusion coefficient parameter,
β, needs to be estimated, this condition can be omitted, see Genon-Catalot & Jacod (1993).
Another situation where the infinite observation horizon, n∆n → ∞, is not needed for
consistent estimation of α is when the high frequency asymptotic scenario is combined with
the small diffusion scenario, where σ(x; β) = ǫn ζ(x; β) and ǫn → 0, see Genon-Catalot (1990),
Sørensen & Uchida (2003) and Gloter & Sørensen (2008).
The reader is reminded of the trivial fact that for any non-singular 2 × 2 matrix, Mn ,
the estimating functions Mn Gn (θ) and Gn (θ) give exactly the same estimator. We call them
versions of the same estimating function. The matrix Mn may depend on ∆n . Therefore a
given version of an estimating function need not satisfy (6.4) – (6.6). The point is that a
version must exist which satisfies these conditions.
It follows from results in Jacobsen (2002) that to obtain a rate optimal and efficient
estimator from an estimating function of the form (3.31), we need that N ≥ 2 and that the
matrix
D(x) = ( ∂_x f_1(x; θ)   ∂_x^2 f_1(x; θ)
         ∂_x f_2(x; θ)   ∂_x^2 f_2(x; θ) )
is invertible for µθ -almost all x. Under these conditions, Sørensen (2007) showed that
Godambe-Heyde optimal martingale estimating functions give rate optimal and efficient es-
timators. For a d-dimensional diffusion, Jacobsen (2002) gave the conditions N ≥ d(d+3)/2,
and that the N × (d + d2 )-matrix D(x) = (∂x f (x; θ) ∂x2 f (x; θ)) has full rank d(d + 3)/2.
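For a one-dimensional diffusion the rank condition is easy to check symbolically for a given choice of base functions; as an illustration (the choice f_1(x) = x, f_2(x) = x^2 is ours), a short sympy check:

import sympy as sp

x = sp.symbols('x')
f = [x, x ** 2]                                   # candidate base functions f_1, f_2
D = sp.Matrix([[sp.diff(fj, x), sp.diff(fj, x, 2)] for fj in f])
print(sp.det(D))    # equals 2, so D(x) is invertible for all x and the condition holds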
We conclude this section by stating technical conditions under which the results in this
section hold. The assumptions about polynomial growth are far too strong, but simplify the
proofs. These conditions can most likely be weakened very considerably in a way similar to
the proofs in Gloter & Sørensen (2008).
Condition 6.2 The diffusion is ergodic and the following conditions hold for all θ ∈ Θ:
(1) ∫_ℓ^r x^k µ_θ(x) dx < ∞ for all k ∈ IN.
(2) The function g can be expanded as

g(∆, y, x; θ) = g(0, y, x; θ) + ∆ g^{(1)}(y, x; θ) + ½∆^2 g^{(2)}(y, x; θ) + ∆^3 R(∆, y, x; θ),

where
We define C_{p,k_1,k_2,k_3}(IR_+ × (ℓ, r)^2 × Θ) as the class of real functions f(t, y, x; θ) satisfying that
(ii) f and all partial derivatives ∂_t^{i_1} ∂_y^{i_2} ∂_α^{i_3} ∂_β^{i_4} f, i_j = 1, . . . , k_j, j = 1, 2, i_3 + i_4 ≤ k_3, are of
polynomial growth in x and y uniformly for θ in a compact set (for fixed t).
The classes C_{p,k_1,k_2}((ℓ, r) × Θ) and C_{p,k_1,k_2}((ℓ, r)^2 × Θ) are defined similarly for functions
f(y; θ) and f(y, x; θ), respectively.
7 Non-Markovian models
In this section we consider estimating functions that can be used when the observed process
is not a Markov process. In this situation, it is usually not easy to find a tractable mar-
tingale estimating function. For instance a simple estimating function of the form (3.31)
is not a martingale. To obtain a martingale, the conditional expectation given X(i−1)∆ in
(3.31) must be replaced by the conditional expectation given all previous observations, which
can only very rarely be found explicitly, and which it is rather hopeless to find by simula-
tion. Instead we will consider a generalization of the martingale estimating functions, called
the prediction-based estimating functions, which can be interpreted as approximations to
martingale estimating functions.
To clarify our thoughts, we will consider a concrete model type. Let the D-dimensional
process X be the stationary solution to the stochastic differential equation
where Π_j^{(i−1)}(θ) is a p-dimensional vector, the coordinates of which belong to P_{i−1,j}, and
π̆_j^{(i−1)}(θ) is the minimum mean square error predictor in P_{i−1,j} of f_j(Y_i, . . . , Y_{i−s}; θ) under
P_θ. When s = 0 and P_{i−1,j} is the set of all functions of Y_1, . . . , Y_{i−1} with finite variance,
π̆_j^{(i−1)}(θ) is the conditional expectation under P_θ of f_j(Y_i; θ) given Y_1, . . . , Y_{i−1}, so in this
case we obtain a martingale estimating function. Thus for a Markov process, a martingale
estimating function of the form (3.31) is a particular case of a prediction-based estimating
function.
The minimum mean square error predictor of fj (Yi , . . . , Yi−s ; θ) is the projection of
fj (Yi , . . . , Yi−s ; θ) onto the subspace Pi−1,j of the L2 -space of all functions of Y1 , . . . , Yi with
finite variance under P_θ. Therefore π̆_j^{(i−1)}(θ) satisfies the normal equation

E_θ{ π_j^{(i−1)} ( f_j(Y_i, . . . , Y_{i−s}; θ) − π̆_j^{(i−1)}(θ) ) } = 0    (7.3)

for all π_j^{(i−1)} ∈ P_{i−1,j}. This implies that a prediction-based estimating function satisfies that
We can interpret the minimum mean square error predictor as an approximation to the
conditional expectation of fj (Yi , . . . , Yi−s ; θ) given X1 , . . . , Xi−1 , which is the projection of
fj (Yi , . . . , Yi−s ; θ) onto the subspace of all functions of X1 , . . . , Xi−1 with finite variance.
To obtain estimators that can relatively easily be calculated in practice, we will from
now on restrict attention to predictor sets, Pi−1,j , that are finite dimensional. Let hjk , j =
1, . . . , N, k = 0, . . . , qj be functions from IRr into IR (r ≥ s), and define (for i ≥ r + 1)
random variables by
Z_{jk}^{(i−1)} = h_{jk}(Y_{i−1}, Y_{i−2}, . . . , Y_{i−r}).

We assume that E_θ((Z_{jk}^{(i−1)})^2) < ∞ for all θ ∈ Θ, and let P_{i−1,j} denote the subspace spanned
by Z_{j0}^{(i−1)}, . . . , Z_{jq_j}^{(i−1)}. We set h_{j0} = 1 and make the natural assumption that the functions
h_{j0}, . . . , h_{jq_j} are linearly independent. We write the elements of P_{i−1,j} in the form a^T Z_j^{(i−1)},
where a^T = (a_0, . . . , a_{q_j}) and

Z_j^{(i−1)} = ( Z_{j0}^{(i−1)}, . . . , Z_{jq_j}^{(i−1)} )^T

are (q_j + 1)-dimensional vectors. With this specification of the predictors, the estimating
function can only include terms with i ≥ r + 1:
G_n(θ) = Σ_{i=r+1}^n Σ_{j=1}^N Π_j^{(i−1)}(θ) [ f_j(Y_i, . . . , Y_{i−s}; θ) − π̆_j^{(i−1)}(θ) ].    (7.5)
It is well-known that the minimum mean square error predictor, π̆_j^{(i−1)}(θ), is found by solving
the normal equations (7.3). We define C_j(θ) as the covariance matrix of (Z_{j1}^{(r)}, . . . , Z_{jq_j}^{(r)})^T
under P_θ, and b_j(θ) as the vector for which the ith coordinate is

b_j(θ)_i = Cov_θ( Z_{ji}^{(r)}, f_j(Y_{r+1}, . . . , Y_{r+1−s}; θ) ),    (7.6)

i = 1, . . . , q_j. Then we have

π̆_j^{(i−1)}(θ) = ă_j(θ)^T Z_j^{(i−1)},

where ă_j(θ)^T = ( ă_{j0}(θ), ă_{j*}(θ)^T ) with

ă_{j*}(θ) = C_j(θ)^{−1} b_j(θ)    (7.7)

and

ă_{j0}(θ) = E_θ( f_j(Y_{s+1}, . . . , Y_1; θ) ) − Σ_{k=1}^{q_j} ă_{jk}(θ) E_θ(Z_{jk}^{(r)}).    (7.8)
That Cj (θ) is invertible follows from the assumption that the functions hjk are linearly
independent. If f_j(Y_i, . . . , Y_{i−s}; θ) has mean zero under P_θ for all θ ∈ Θ, we need not include
a constant in the space of predictors, i.e. we need only the space spanned by Z_{j1}^{(i−1)}, . . . , Z_{jq_j}^{(i−1)}.
( φ_{ℓ,1}(θ)   )   ( φ_{ℓ−1,1}(θ)   )              ( φ_{ℓ−1,ℓ−1}(θ) )
(     ⋮       ) = (      ⋮         ) − φ_{ℓ,ℓ}(θ) (      ⋮         )
( φ_{ℓ,ℓ−1}(θ))   ( φ_{ℓ−1,ℓ−1}(θ) )              ( φ_{ℓ−1,1}(θ)   )

and

v_ℓ(θ) = v_{ℓ−1}(θ)( 1 − φ_{ℓ,ℓ}(θ)^2 ).
The algorithm is run for ℓ = 2, . . . , r. Then
ă∗ (θ) = (φr,1 (θ), . . . , φr,r (θ)),
while ă_0 can be found from (7.8), which simplifies to

ă_0(θ) = E_θ(Y_1)( 1 − Σ_{k=1}^r φ_{r,k}(θ) ).

The quantity v_r(θ) is the prediction error E_θ((Y_i − π̆^{(i−1)})^2). Note that if we want to include a
further lagged value of Y in the predictor, we just iterate the algorithm once more.
2
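A compact Python implementation of the recursion is given below (names ours). It assumes that the model autocovariances κ_ℓ(θ) = Cov_θ(Y_1, Y_{1+ℓ}) are available as a function; the initialisation and the formula for the reflection coefficient φ_{ℓ,ℓ}(θ) are the standard Durbin-Levinson steps, which are not reproduced in the text above.

import numpy as np

def durbin_levinson(kappa, r):
    """Return (phi_{r,1}, ..., phi_{r,r}) and the prediction error v_r.
    kappa(l) must return Cov(Y_1, Y_{1+l}) under the model."""
    phi = np.array([kappa(1) / kappa(0)])              # phi_{1,1} (standard initialisation)
    v = kappa(0) * (1.0 - phi[0] ** 2)                 # v_1
    for l in range(2, r + 1):
        refl = (kappa(l) - np.dot(phi, [kappa(l - k) for k in range(1, l)])) / v
        phi = np.append(phi - refl * phi[::-1], refl)  # the vector recursion in the text
        v = v * (1.0 - refl ** 2)                      # v_l = v_{l-1}(1 - phi_{l,l}^2)
    return phi, v
# a_*(theta) is then phi, and a_0(theta) = E(Y_1)(1 - sum(phi)) as in (7.8)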
We will now find the optimal prediction-based estimating function of the form (7.5) in
the sense explained in Section 9. First we express the estimating function in a more compact
way. The ℓth coordinate of the vector Π_j^{(i−1)}(θ) can be written as

π_{ℓ,j}^{(i−1)}(θ) = Σ_{k=0}^{q_j} a_{ℓjk}(θ) Z_{jk}^{(i−1)},    ℓ = 1, . . . , p,

so that the estimating function (7.5) can be written compactly as

G_n(θ) = A(θ) Σ_{i=r+1}^n H^{(i)}(θ),    (7.10)
where
A(θ) = ( a_{110}(θ) · · · a_{11q_1}(θ) · · · · · · a_{1N0}(θ) · · · a_{1Nq_N}(θ)
            ⋮                ⋮                       ⋮               ⋮
         a_{p10}(θ) · · · a_{p1q_1}(θ) · · · · · · a_{pN0}(θ) · · · a_{pNq_N}(θ) ),

and

H^{(i)}(θ) = Z^{(i−1)} ( F(Y_i, . . . , Y_{i−s}; θ) − π̆^{(i−1)}(θ) ),    (7.11)

with F = (f_1, . . . , f_N)^T, π̆^{(i−1)}(θ) = ( π̆_1^{(i−1)}(θ), . . . , π̆_N^{(i−1)}(θ) )^T, and

Z^{(i−1)} = ( Z_1^{(i−1)}   0_{q_1}       · · ·   0_{q_1}
              0_{q_2}       Z_2^{(i−1)}   · · ·   0_{q_2}
                 ⋮              ⋮                    ⋮
              0_{q_N}       0_{q_N}       · · ·   Z_N^{(i−1)} ).    (7.12)
Here 0_{q_j} denotes the q_j-dimensional zero-vector. When we have chosen the functions f_j and
the predictor spaces, the quantities H^{(i)}(θ) are completely determined, whereas we are free
to choose the matrix A(θ) in an optimal way, i.e. such that the asymptotic variance of the
estimators is minimized.
We will find an explicit expression for the optimal weight matrix, A*(θ), under the fol-
lowing condition, in which we need one further definition:

ă(θ) = ( ă_{10}(θ), . . . , ă_{1q_1}(θ), . . . , ă_{N0}(θ), . . . , ă_{Nq_N}(θ) )^T,    (7.13)

where the ă_{jk}s define the minimum mean square error predictor. Specifically, π̆^{(i−1)}(θ) =
(Z^{(i−1)})^T ă(θ).
Condition 7.2
(1) The function F (y1 , . . . , ys+1; θ) and the coordinates of ă(θ) are continuously differentiable
functions of θ.
(2) p ≤ p̄ = N + q1 + · · · + qN .
(3) The p̄ × p-matrix ∂_{θ^T} ă(θ) has rank p.
(4) The functions 1, f1 , . . . , fN are linearly independent (for fixed θ) on the support of the
conditional distribution of (Yi , . . . , Yi−s ) given (Xi−1 , . . . , Xi−r ).
(5) The p̄ × p-matrix

U(θ)^T = E_θ( Z^{(i−1)} ∂_{θ^T} F(Y_i, . . . , Y_{i−s}; θ) )    (7.14)

exists.
If we denote the optimal prediction-based estimating function by G∗n (θ), then
E_θ( G_n(θ) G*_n(θ)^T ) = (n − r) A(θ) M̄_n(θ) A*_n(θ)^T,

where

M̄_n(θ) = E_θ( H^{(r+1)}(θ) H^{(r+1)}(θ)^T )
  + Σ_{k=1}^{n−r−1} ((n − r − k)/(n − r)) { E_θ( H^{(r+1)}(θ) H^{(r+1+k)}(θ)^T ) + E_θ( H^{(r+1+k)}(θ) H^{(r+1)}(θ)^T ) },    (7.15)

which is the covariance matrix of Σ_{i=r+1}^n H^{(i)}(θ)/√(n − r). The sensitivity function (9.1) is
given by

S_{G_n}(θ) = (n − r) A(θ)( U(θ)^T − D(θ) ∂_{θ^T} ă(θ) ),
where the p̄ × p̄-matrix D(θ) is given by

D(θ) = E_θ( Z^{(i−1)} (Z^{(i−1)})^T ).    (7.16)
It follows from Theorem 9.1 that A*_n(θ) is optimal if E_θ( G_n(θ) G*_n(θ)^T ) = S_{G_n}(θ). Under
Condition 7.2 (4) the matrix M̄_n(θ) is invertible, see Sørensen (2000), so it follows that

A*_n(θ) = ( U(θ) − ∂_θ ă(θ)^T D(θ) ) M̄_n(θ)^{−1},    (7.17)

so that the estimating function

G*_n(θ) = A*_n(θ) Σ_{i=r+1}^n Z^{(i−1)} ( F(Y_i, . . . , Y_{i−s}; θ) − π̆^{(i−1)}(θ) )    (7.18)
is Godambe optimal. When the function F does not depend on θ, the expression for A∗n (θ)
simplifies slightly as in this case U(θ) = 0.
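Once U(θ), D(θ), M̄_n(θ), ∂_θ ă(θ)^T and the quantities entering H^{(i)}(θ) have been computed, (7.17) and (7.18) are plain linear algebra, as the following numpy sketch (names ours) indicates.

import numpy as np

def optimal_weight(U, D, dtheta_a, Mbar):
    """A*_n(theta) = (U - (d/dtheta a)^T D) Mbar^{-1}, cf. (7.17); dtheta_a is p x pbar."""
    return np.linalg.solve(Mbar.T, (U - dtheta_a @ D).T).T

def G_star(A_star, Z_list, F_list, pred_list):
    """G*_n(theta) = A*_n(theta) sum_i Z^{(i-1)} (F_i - predictor_i), cf. (7.18)."""
    H = sum(Z @ (F - pred) for Z, F, pred in zip(Z_list, F_list, pred_list))
    return A_star @ H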
Example 7.3 Consider again the type of prediction-based estimating function discussed in
Example 7.1. In order to calculate (7.15), we need mixed moments of the form
E_ψ[ Y_1^{k_1} Y_{t_1}^{k_2} Y_{t_2}^{k_3} Y_{t_3}^{k_4} ],    (7.19)

for 1 ≤ t_1 ≤ t_2 ≤ t_3 and k_1 + k_2 + k_3 + k_4 ≤ 4N, where k_i, i = 1, . . . , 4, are non-negative
2
7.2 Asymptotics
A prediction-based estimating function of the form (7.10) gives consistent and asymptotically
normal estimators under the following condition, where θ0 as usual is the true parameter
value.
Condition 7.4
(1) The diffusion process X is stationary and geometrically α-mixing.
and
E_{θ_0}( | Z_{jk}^{(r)} Z_{jℓ}^{(r)} |^{2+δ} ) < ∞,

for j = 1, . . . , N, k, ℓ = 0, . . . , q_j.
(3) The function F (y1, . . . , ys+1; θ) and the components of A(θ) and ă(θ), given by (7.13) are
continuously differentiable functions of θ.
(4) The matrix W = A(θ_0)( U(θ_0)^T − D(θ_0) ∂_{θ^T} ă(θ_0) ) has full rank p. The matrices U(θ) and
D(θ) are given by (7.14) and (7.16).
(5)
A(θ)( E_{θ_0}( Z^{(i−1)} F(Y_i, . . . , Y_{i−s}; θ) ) − D(θ_0) ă(θ) ) ≠ 0

for all θ ≠ θ_0.
Condition 7.4 (1) and (2) ensures that the central limit theorem (2.3) holds and that
M̄n (θ0 ) → M(θ0 ), where
M(θ) = E_θ( H^{(r+1)}(θ) H^{(r+1)}(θ)^T )
  + Σ_{k=1}^∞ { E_θ( H^{(r+1)}(θ) H^{(r+1+k)}(θ)^T ) + E_θ( H^{(r+1+k)}(θ) H^{(r+1)}(θ)^T ) }.
The concept of geometric α-mixing was explained in Subsection 5.1, where also conditions
for geometric α-mixing were discussed. It is not difficult to see that if the basic diffusion
process X is geometrically α-mixing, then the observed process Y inherits this property.
As explained in Subsection 5.1, we only need to check Condition 2.1 with θ̄ = θ0 to obtain
asymptotic results for prediction-based estimators. The condition (2.4) is satisfied because
of (7.4). It is easy to see that Condition 7.4 (3) and (4) imply that θ 7→ g(y_1, . . . , y_{r+1}; θ)
is continuously differentiable and that g as well as ∂_{θ^T} g are locally dominated integrable
under P_{θ_0}. Finally, the condition (2.7) is identical to Condition 7.4 (5). Therefore it follows
from Theorem 2.2 that a consistent Gn –estimator θ̂n exists and is the unique Gn –estimator
on any bounded subset of Θ containing θ0 with probability approaching one as n → ∞. The
estimator satisfies that
√n (θ̂_n − θ_0) −→^L N_p( 0, W^{−1} A(θ_0) M(θ_0) A(θ_0)^T (W^T)^{−1} ).
7.3 Integrated diffusions
Sometimes a diffusion process cannot be observed directly, but data of the form

Y_i = (1/∆) ∫_{(i−1)∆}^{i∆} X_s ds,    i = 1, . . . , n,    (7.20)
are available for some fixed ∆. Such observations might be obtained when the process X
is observed after passage through an electronic filter. Another example is provided by ice-
core records. The isotope ratio 18 O/16 O in the ice, measured as an average in pieces of ice,
each piece representing a time interval with time increasing as a function of the depth, is a
proxy for paleo-temperatures. The variation of the paleo-temperature can be modelled by a
stochastic differential equation, and it is natural to model the ice-core data as an integrated
diffusion process, see Ditlevsen, Ditlevsen & Andersen (2002). Estimation based on this
type of data was considered by Gloter (2000), Bollerslev & Wooldridge (1992), Ditlevsen &
Sørensen (2004), and Gloter (2006).
The model for data of the type (7.20) is a particular case of (7.1) with
b(x; θ) = ( b_1(x_1; θ), x_1 )^T,    σ(x; θ) = ( σ_1(x_1; θ)  0 ; 0  0 ),
with X2,0 = 0, where only the second coordinate is observed. A stochastic differential
equation of this form is called hypoelliptic. Clearly the second coordinate is not stationary,
but if the first coordinate is a stationary process, then the observed increments Yi = (X2,i∆ −
X2,(i−1)∆ )/∆ form a stationary sequence. In the following we will again denote the basic
diffusion by X (rather than X1 ).
Suppose that the 4N'th moment of X_t is finite. The moments (7.9) and (7.19) can be
calculated by

E[ Y_1^{k_1} Y_{t_1}^{k_2} Y_{t_2}^{k_3} Y_{t_3}^{k_4} ] = ∆^{−(k_1+k_2+k_3+k_4)} ∫_A E[ X_{v_1} · · · X_{v_{k_1}} X_{u_1} · · · X_{u_{k_2}} X_{s_1} · · · X_{s_{k_3}} X_{r_1} · · · X_{r_{k_4}} ] dt,

where 1 ≤ t_1 ≤ t_2 ≤ t_3, A = [0, ∆]^{k_1} × [(t_1 − 1)∆, t_1∆]^{k_2} × [(t_2 − 1)∆, t_2∆]^{k_3} × [(t_3 − 1)∆, t_3∆]^{k_4},
and dt = dr_{k_4} · · · dr_1 ds_{k_3} · · · ds_1 du_{k_2} · · · du_1 dv_{k_1} · · · dv_1. The domain of inte-
gration can be reduced considerably by symmetry arguments, but here the point is that we
need to calculate mixed moments of the type E(X_{t_1}^{κ_1} · · · X_{t_k}^{κ_k}), where t_1 < · · · < t_k. For
the Pearson diffusions discussed in Subsection 3.7, these mixed moments can be calculated
by a simple iterative formula obtained from (3.75) and (3.76). Moreover, for the Pearson
diffusions, E(X_{t_1}^{κ_1} · · · X_{t_k}^{κ_k}) depends on t_1, . . . , t_k through sums and products of exponential
functions, cf. (3.75). Therefore the integral above can be explicitly calculated, so that ex-
plicit optimal estimating functions of the type considered in Example 7.1 are available for
observations of integrated Pearson diffusions.
Example 7.5 Consider observation of an integrated square root process (3.27) and a prediction-
based estimating function with f_1(x) = x and f_2(x) = x^2 with predictors given by π_1^{(i−1)} =
α_{1,0} + α_{1,1} Y_{i−1} and π_2^{(i−1)} = α_{2,0}. Then the minimum mean square error predictors are

π̆_1^{(i−1)}(Y_{i−1}; θ) = α(1 − ă(β)) + ă(β) Y_{i−1},
π̆_2^{(i−1)}(θ) = α^2 + ατ^2 β^{−3} ∆^{−2}( e^{−β∆} − 1 + β∆ )

with

ă(β) = (1 − e^{−β∆})^2 / ( 2(β∆ − 1 + e^{−β∆}) ).
The optimal prediction-based estimating function is

Σ_{i=2}^n (1, Y_{i−1}, 0)^T [ Y_i − π̆_1^{(i−1)}(Y_{i−1}; θ) ] + Σ_{i=2}^n (0, 0, 1)^T [ Y_i^2 − π̆_2^{(i−1)}(θ) ],

σ̂^2 = β̂^3 ∆^2 Σ_{i=2}^n ( Y_i^2 − α̂^2 ) / ( (n − 1) α̂ ( e^{−β̂∆} − 1 + β̂∆ ) ).
The estimators are explicit apart from β̂, which can be found by solving a non-linear equation
in one variable. Details can be found in Ditlevsen & Sørensen (2004).
2
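Since ă(β) is the coefficient of Y_{i−1} in the best linear one-step predictor, i.e. the lag-one autocorrelation of the Y_i's under the model, one simple possibility, shown only as an illustration and not necessarily the exact equation used by Ditlevsen & Sørensen (2004), is to solve ă(β) = (empirical lag-one autocorrelation of Y) numerically:

import numpy as np
from scipy.optimize import brentq

def a_coef(beta, Delta):
    """a(beta) from Example 7.5."""
    e = np.exp(-beta * Delta)
    return (1.0 - e) ** 2 / (2.0 * (beta * Delta - 1.0 + e))

def beta_estimate(Y, Delta, upper=100.0):
    """Match a(beta) to the empirical lag-one autocorrelation of Y (illustrative only)."""
    Y = np.asarray(Y, dtype=float)
    rho1 = np.corrcoef(Y[:-1], Y[1:])[0, 1]
    return brentq(lambda b: a_coef(b, Delta) - rho1, 1e-8, upper)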
βi > 0, is found in many time series data. Examples are financial time series, Barndorff-
Nielsen & Shephard (2001), and turbulence, Barndorff-Nielsen, Jensen & Sørensen (1990)
and Bibby, Skovgaard & Sørensen (2005).
A simple model with autocorrelation function of the form (7.21) is the sum of diffusions
Yt = X1,t + . . . + XD,t
where
dX_{i,t} = −β_i(X_{i,t} − α_i)dt + σ_i(X_{i,t})dW_{i,t},    i = 1, . . . , D,

are independent. In this case

φ_i = Var(X_{i,t}) / ( Var(X_{1,t}) + · · · + Var(X_{D,t}) ).
Sums of diffusions of this type with a pre-specified marginal distribution of Y were considered
by Bibby & Sørensen (2003) and Bibby, Skovgaard & Sørensen (2005). The same type of
autocorrelation function is obtained for sums of independent Ornstein-Uhlenbeck processes
driven by Lévy processes. This class of models was introduced and studied in Barndorff-
Nielsen, Jensen & Sørensen (1998).
Example 7.6 Sum of square root processes. If σi2 (x) = 2βi bx and αi = κi b for some b >
0, then the stationary distribution of Yt is a gamma-distribution with shape parameter
κ1 + · · · + κD and scale parameter b. The weights in the autocorrelation function are φi =
κi /(κ1 + · · · + κD ).
2
For sums of the Pearson diffusions presented in Subsection 3.7, we have explicit formulae
that allow calculation of (7.9) and (7.19), provided these mixed moments exists. Thus for
sums of Pearson diffusions we have explicit optimal prediction-based estimating functions of
the type considered in Example 7.1. By the multinomial formula,
E( Y_{t_1}^κ Y_{t_2}^ν ) = Σ Σ (κ choose κ_1, . . . , κ_D)(ν choose ν_1, . . . , ν_D) E( X_{1,t_1}^{κ_1} X_{1,t_2}^{ν_1} ) · · · E( X_{D,t_1}^{κ_D} X_{D,t_2}^{ν_D} ),

where

(κ choose κ_1, . . . , κ_D) = κ!/(κ_1! · · · κ_D!)

is the multinomial coefficient, and where the first sum is over 0 ≤ κ_1, . . . , κ_D such that
κ_1 + · · · + κ_D = κ, and the second sum is analogous for the ν_i's. Higher order mixed moments
of the form (7.19) can be found by a similar formula with four sums and four multinomial
coefficients. Such formulae may appear daunting, but are easy to program. For a Pearson
diffusion, mixed moments of the form E(Xtκ11 · · · Xtκkk ) can be calculated by a simple iterative
formula obtained from (3.75) and (3.76).
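The multinomial expansion above, and its four-sum analogue for (7.19), are indeed easy to program. A sketch (names ours) for the two-factor case, assuming a user-supplied function mixed_moment(i, k1, k2, t1, t2) that returns E(X_{i,t1}^{k1} X_{i,t2}^{k2}) for the i'th component:

from math import comb
from itertools import product

def compositions(n, parts):
    """All tuples of non-negative integers of length 'parts' summing to n."""
    if parts == 1:
        yield (n,)
        return
    for k in range(n + 1):
        for rest in compositions(n - k, parts - 1):
            yield (k,) + rest

def multinomial(ks):
    out, total = 1, 0
    for k in ks:
        total += k
        out *= comb(total, k)
    return out

def sum_mixed_moment(kappa, nu, D, mixed_moment, t1, t2):
    """E(Y_{t1}^kappa Y_{t2}^nu) for Y_t = X_{1,t} + ... + X_{D,t}, independent components."""
    total = 0.0
    for ks, ns in product(compositions(kappa, D), compositions(nu, D)):
        term = multinomial(ks) * multinomial(ns)
        for i in range(D):
            term *= mixed_moment(i, ks[i], ns[i], t1, t2)
        total += term
    return total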
with

σ^2 = Var(Y_i) = (1 + ρ^2)( ν_1/(ν_1 − 2) + ν_2/(ν_2 − 2) ),

ζ_{21} = Cov(Y_{i−1}, Y_i^2)/Var(Y_i) = 4ρ{ (√ν_1/(ν_1 − 3)) φ_1 e^{−β_1∆} + (√ν_2/(ν_2 − 3)) φ_2 e^{−β_2∆} }.
Solving equation (7.22) for ζ_{21} and σ^2 we get

ζ̂_{21} = [ (1/(n−1)) Σ_{i=2}^n Y_{i−1}Y_i^2 − ( (1/(n−1)) Σ_{i=2}^n Y_{i−1} )( (1/(n−1)) Σ_{i=2}^n Y_i^2 ) ]
         / [ (1/(n−1)) Σ_{i=2}^n Y_{i−1}^2 − ( (1/(n−1)) Σ_{i=2}^n Y_{i−1} )^2 ],

σ̂^2 = (1/(n−1)) Σ_{i=2}^n Y_i^2 + ζ̂_{21} (1/(n−1)) Σ_{i=2}^n Y_{i−1},
and insert σ̂^2 for σ^2. Thus, we get a one-dimensional estimating equation, ζ_{21}(β, φ, σ̂^2, ρ) =
ζ̂_{21}, which can be solved numerically. Finally, by inverting φ_i = (1 + ρ^2)ν_i/(σ^2(ν_i − 2)) we
find the estimates

ν̂_i = 2φ_i σ̂^2 / ( φ_i σ̂^2 − (1 + ρ̂^2) ),    i = 1, 2.
2
where only a subset of the coordinates are observed. Here B(θ) is a D × D-matrix, b(θ)
is a D-dimensional vector, σ(x; θ) is a D × D-matrix, and W a D-dimensional standard
Wiener process. Compartment models are used to model the dynamics of the flow of a
certain substance between different parts (compartments) of, for instance, an ecosystem
or the body of a human being or an animal. The process Xt is the concentration in the
compartments, and flow from a given compartment into other compartments is proportional
to the concentration in the given compartment modified by the random perturbation given
by the diffusion term. The vector b(θ) represents input to or output from the system.
where all parameters are positive, was used by Bibby (1995) to model how a radioactive
tracer moved between the water and the biosphere in a certain ecosystem. Samples could
only be taken from the water, the first compartment, so Yi = X1,ti . The model is Gaussian,
so likelihood inference is feasible and was studied by Bibby (1995). All mixed moments (7.9)
and (7.19) can be calculated explicitly, so also an explicit optimal prediction-based estimating
function of the type considered in Example 7.1 is available to estimate the parameters and
was studied by Düring (2002).
2
Example 7.9 A non-Gaussian diffusion compartment model is obtained by the specification
σ(x, θ) = diag(τ_1 √x_1, . . . , τ_D √x_D). This multivariate version of the square root process was
studied by Düring (2002), who used methods in Down, Meyn & Tweedie (1995) to show
that the D-dimensional process is geometrically α-mixing and established the asymptotic
normality of prediction-based estimators of the type considered in Example 7.1 when the
first compartment is observed, i.e. when Yi = X1,ti . In this case, the mixed moments (7.9)
and (7.19) must be calculated numerically.
2
Definition 8.1 a) The domain of Gn -estimators (for a given n) is the set An of all obser-
vations x = (x1 , . . . , xn ) for which Gn (θ) = 0 for at least one value θ ∈ Θ.
b) A Gn -estimator, θ̂n (x) is any function of the data with values in Θδ , such that for
P –almost all observations we have either θ̂n (x) ∈ Θ and Gn (θ̂n (x), x) = 0 if x ∈ An , or
θ̂_n(x) = δ if x ∉ A_n.
We usually suppress the dependence on the observations in the notation and write θ̂n .
The following theorem gives conditions that ensure that, for n large enough, the estimat-
ing equation (1.1) has a solution that converges to a particular parameter value θ̄. When
the statistical model contains the true model, the estimating function should preferably be
chosen such that θ̄ = θ0 . To facilitate the following discussion, we will refer to an estimator
that converges to θ̄ in probability as a θ̄–consistent estimator, meaning that it is a (weakly)
consistent estimator of θ̄. We assume that Gn (θ) is differentiable with respect to θ and
denote by ∂θT Gn (θ) the p × p-matrix, where the ijth entry is ∂θj Gn (θ)i .
Theorem 8.2 Suppose the existence of a parameter value θ̄ ∈ int Θ (the interior of Θ),
a connected neighbourhood M of θ̄, and a (possibly random) function W on M taking its
values in the set of p × p matrices, such that the following holds:
P
(i) Gn (θ̄) → 0 (convergence in probability, w.r.t. the true measure P ) as n → ∞.
(ii) Gn (θ) is continuously differentiable on M for all n, and
sup_{θ∈M} ‖ ∂_{θ^T} G_n(θ) − W(θ) ‖ −→^P 0.    (8.1)
Note that (8.1) implies the existence of a subsequence {nk } such that ∂θT Gnk (θ) converges
uniformly to W (θ) on M with probability one. Hence W is continuous (up to a null set)
and it follows from elementary calculus that outside some P –null set there exists a unique
continuously differentiable function G satisfying ∂θT G(θ) = W (θ) for all θ ∈ M and G(θ̄) = 0.
When M is a bounded set, (8.1) implies that

sup_{θ∈M} | G_n(θ) − G(θ) | −→^P 0.    (8.2)
This observation casts light on the result of Theorem 8.2. Since Gn (θ) can be made arbitrarily
close to G(θ) by choosing n large enough, and since G(θ) has a zero at θ̄, it is intuitively
clear that Gn (θ) must have a zero near θ̄ when n is sufficiently large.
If we impose an identifiability condition, we can give a stronger result on any sequence
of Gn –estimators. By B̄ǫ (θ) we denote the closed ball with radius ǫ centered at θ.
Theorem 8.3 Assume (8.2) for some subset M of Θ containing θ̄, and that

P( inf_{θ ∈ M \ B̄_ǫ(θ̄)} |G(θ)| > 0 ) = 1    (8.3)
Theorem 8.4 Assume the estimating function Gn satisfies the conditions of Theorem 8.2
and that there is a sequence of real numbers an > 0 increasing to ∞ such that
( a_n G_n(θ̄), ∂_{θ^T} G_n(θ̄) ) −→^L ( Z, W(θ̄) ),    (8.5)
If moreover W (θ̄) is non-random, then the limit distribution is a normal distribution with
expectation zero and covariance matrix W (θ̄)−1 V W (θ̄)∗−1 .
the equation Gn (θ) = 0 tends to have a solution near the true parameter value, where the
expectation of Gn (θ) is equal to zero. Thus a good estimating function is one with a large
absolute value of the sensitivity.
Ideally, we would base the statistical inference on the likelihood function Ln (θ), and
hence use the score function Un (θ) = ∂θ log Ln (θ) as our estimating function. This usually
yields an efficient estimator. However, when Ln (θ) is not available or is difficult to calculate,
we might prefer to use an estimating function that is easier to obtain and is in some sense
close to the score function. Suppose that both Un (θ) and Gn (θ) have finite variance. Then
it can be proven under usual regularity conditions that
SGn (θ) = −Covθ (Gn (θ), Un (θ)).
Thus we can find an estimating function Gn (θ) that maximizes the absolute value of the
correlation between Gn (θ) and Un (θ) by finding one that maximizes the quantity
KGn (θ) = SGn (θ)2 /Varθ (Gn (θ)) = SGn (θ)2 /Eθ (Gn (θ)2 ), (9.2)
which is known as the Godambe information. This makes intuitive sense: the ratio KGn (θ)
is large when the sensitivity is large and when the variance of Gn (θ) is small. The Godambe
information is a natural generalization of the Fisher information. Indeed, KUn (θ) is the
Fisher information. For a discussion of information quantities in a stochastic process setting,
see Barndorff-Nielsen & Sørensen (1994). In a short while, we shall see that the Godambe
information has a large sample interpretation too. An estimating function G∗n ∈ Gn is called
Godambe-optimal in Gn if
KG∗n (θ) ≥ KGn (θ) (9.3)
for all θ ∈ Θ and for all Gn ∈ Gn .
When the parameter θ is multivariate (p > 1), the sensitivity function is the p × p-matrix
SGn (θ) = Eθ (∂θT Gn (θ)). (9.4)
For a multivariate parameter, the Godambe information is the p × p-matrix

K_{G_n}(θ) = S_{G_n}(θ)^T ( E_θ( G_n(θ) G_n(θ)^T ) )^{−1} S_{G_n}(θ),    (9.5)
and an optimal estimating function G∗n can be defined by (9.3) with the inequality referring to
the partial ordering of the set of positive semi-definite p × p-matrices. Whether a Godambe-
optimal estimating function exists and whether it is unique depends on the class Gn . In
any case, it is only unique up to multiplication by a regular matrix that might depend on
θ. Specifically, if G∗n (θ) satisfies (9.3), then so does Mθ G∗n (θ) where Mθ is an invertible
deterministic p × p-matrix. Fortunately, the two estimating functions give rise to the same
estimator(s), and we refer to them as versions of the same estimating function. For theoretical
purposes a standardized version of the estimating functions is useful. The standardized
version of Gn (θ) is given by
G_n^{(s)}(θ) = −S_{G_n}(θ)^T ( E_θ( G_n(θ) G_n(θ)^T ) )^{−1} G_n(θ).
an identity usually satisfied by the score function. The standardized estimating function
G_n^{(s)}(θ) is therefore more directly comparable to the score function. Note that when the
second Bartlett identity is satisfied, the Godambe information equals minus the sensitivity
matrix.
A Godambe-optimal estimating function is close to the score function U_n in an L^2-sense.
Suppose G*_n is Godambe-optimal in G_n. Then the standardized version G_n^{*(s)}(θ) satisfies the
inequality

E_θ( (G_n^{(s)}(θ) − U_n(θ))^T (G_n^{(s)}(θ) − U_n(θ)) ) ≥ E_θ( (G_n^{*(s)}(θ) − U_n(θ))^T (G_n^{*(s)}(θ) − U_n(θ)) )
for all θ ∈ Θ and for all Gn ∈ Gn , see Heyde (1988). In fact, if Gn is a closed subspace of the
L2 -space of all square integrable functions of the data, then the quasi-score function is the
orthogonal projection of the score function onto Gn . For further discussion of this Hilbert
space approach to estimating functions, see McLeish & Small (1988). The interpretation
of an optimal estimating function as an approximation to the score function is important.
By choosing a sequence of classes Gn that, as n → ∞, converges to a subspace containing
the score function Un , a sequence of estimators that is asymptotically fully efficient can be
constructed.
The following result by Heyde (1988) can often be used to find the optimal estimating
function.
The condition (9.7) can often be verified by showing that Eθ (Gn (θ)G∗n (θ)T ) = −Eθ (∂θT Gn (θ))
for all θ ∈ Θ and for all Gn ∈ Gn . In such situations, G∗n satisfies the second Bartlett-identity,
(9.6), so that
KG∗n (θ) = Eθ G∗n (θ)G∗n (θ)T .
estimating functions. Suppose the estimating function Gn (θ) satisfies the conditions of the
central limit theorem for martingales and let θ̂n be a solution of the equation Gn (θ) = 0.
Under the regularity conditions of the previous section, it can be proved that
⟨G(θ)⟩_n^{−1/2} Ḡ_n(θ)( θ̂_n − θ_0 ) −→^D N(0, I_p).    (9.8)
Here ⟨G(θ)⟩_n is the quadratic characteristic of G_n(θ) defined by

⟨G(θ)⟩_n = Σ_{i=1}^n E_θ( (G_i(θ) − G_{i−1}(θ))(G_i(θ) − G_{i−1}(θ))^T | F_{i−1} ),
using the extra assumption that Ḡ_n(θ)^{−1} ∂_{θ^T} G_n(θ) −→^{P_θ} I_p. Details can be found in Heyde
(1988). We see that the inverse of the data-dependent matrix
I_{G_n}(θ) = Ḡ_n(θ)^T ⟨G(θ)⟩_n^{−1} Ḡ_n(θ)    (9.9)
estimates the co-variance matrix of the asymptotic distribution of the estimator θ̂n . Therefore
IGn (θ) can be interpreted as an information matrix, called the Heyde-information. It gener-
alizes the incremental expected information of the likelihood theory for stochastic processes,
see Barndorff-Nielsen & Sørensen (1994). Since Ḡn (θ) estimates the sensitivity function,
and hG(θ)in estimates the variance of the asymptotic distribution of Gn (θ), the Heyde-
information has a heuristic interpretation similar to that of the Godambe-information. In
fact,
E_θ( Ḡ_n(θ) ) = S_{G_n}(θ)   and   E_θ( ⟨G(θ)⟩_n ) = E_θ( G_n(θ) G_n(θ)^T ).
We can thus think of the Heyde-information as an estimated version of the Godambe infor-
mation.
Let Gn be a class of martingale estimating functions with finite variance. We say that a
martingale estimating function G∗n is Heyde-optimal in Gn if
IG∗n (θ) ≥ IGn (θ) (9.10)
Pθ -almost surely for all θ ∈ Θ and for all Gn ∈ Gn .
The following useful result from Heyde (1988) is similar to Theorem 9.1. In order to
formulate it, we need the concept of the quadratic co-characteristic of two martingales, G
and G̃, both of which are assumed to have finite variance:
⟨G, G̃⟩_n = Σ_{i=1}^n E( (G_i − G_{i−1})(G̃_i − G̃_{i−1})^T | F_{i−1} ).    (9.11)
Since in many situations condition (9.12) can be verified by showing that ⟨G(θ), G*(θ)⟩_n =
−Ḡ_n(θ) for all Gn ∈ Gn, it is in practice often the case that Heyde-optimality implies
Godambe-optimality.
Example 9.3 Let us consider a common type of estimating functions. To simplify the
exposition we assume that the observed process is Markovian. For Markov processes it is
natural to base estimating functions on functions hij (y, x; θ), j = 1, . . . , N, i = 1, . . . , n
satisfying that
Eθ (hij (Xi , Xi−1 ; θ)|Fi−1 ) = 0. (9.13)
Such functions define relationships (dependent on θ) between consecutive observations Xi and
Xi−1 that are, on average, equal to zero. It is natural to use such relationships to estimate
θ by solving the equations Σ_{i=1}^n h_{ij}(X_i, X_{i−1}; θ) = 0. In order to estimate θ it is necessary
that N ≥ p, but if N > p we have too many equations. The theory of optimal estimating
functions tells us how to combine the N functions in an optimal way. We consider the class
of p-dimensional estimating functions of the form
G_n(θ) = Σ_{i=1}^n a_i(X_{i−1}; θ) h_i(X_i, X_{i−1}; θ),    (9.14)
where hi denotes the N-dimensional vector (hi1 , . . . , hiN )T , and ai (x; θ) is a function from
IR × Θ into the set of p × N-matrices that is differentiable with respect to θ. It follows from
(9.13) that Gn (θ) is a p-dimensional unbiased martingale estimating function.
We will now find the matrices ai that combine the N functions hij in an optimal way.
Let Gn be the class of martingale estimating functions of the form (9.14) that have finite
variance. Then

Ḡ_n(θ) = Σ_{i=1}^n a_i(X_{i−1}; θ) E_θ( ∂_{θ^T} h_i(X_i, X_{i−1}; θ) | F_{i−1} )
and

⟨G(θ), G*(θ)⟩_n = Σ_{i=1}^n a_i(X_{i−1}; θ) V_{h_i}(X_{i−1}; θ) a*_i(X_{i−1}; θ)^T,
where

G*_n(θ) = Σ_{i=1}^n a*_i(X_{i−1}; θ) h_i(X_i, X_{i−1}; θ),    (9.15)
and
V_{h_i}(X_{i−1}; θ) = E_θ( h_i(X_i, X_{i−1}; θ) h_i(X_i, X_{i−1}; θ)^T | F_{i−1} )
is the conditional covariance matrix of the random vector hi (Xi , Xi−1 ; θ) given Fi−1 . If we
assume that Vhi (Xi−1 ; θ) is invertible and define
a*_i(X_{i−1}; θ) = −E_θ( ∂_{θ^T} h_i(X_i, X_{i−1}; θ) | F_{i−1} )^T V_{h_i}(X_{i−1}; θ)^{−1},    (9.16)
then the condition (9.12) is satisfied. Hence by Theorem 9.2 the estimating function G∗n (θ)
with a∗i given by (9.16) is Heyde-optimal - provided, of course, that it has finite variance.
Since Ḡ*_n(θ)^{−1} ⟨G*(θ)⟩_n = −I_p is non-random, the estimating function G*_n(θ) is also Godambe-
optimal. If a*_i were defined without the minus, G*_n(θ) would obviously also be optimal. The
reason for the minus will be clear in the following.
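A sketch of how (9.14)–(9.16) translate into code is given below (Python, names ours). It assumes user-supplied functions h(y, x, theta) returning the N-vector h(y, x; θ), cond_mean_dh(x, theta) returning E_θ(∂_{θ^T} h(X_i, x; θ) | X_{i−1} = x) as an N × p matrix, and cond_cov_h(x, theta) returning V_h(x; θ); for simplicity these functions are taken not to depend on i.

import numpy as np

def optimal_weights(x, theta, cond_mean_dh, cond_cov_h):
    """a*(x; theta) = -E_theta(d/dtheta^T h | X_{i-1} = x)^T V_h(x; theta)^{-1}, cf. (9.16)."""
    return -np.linalg.solve(cond_cov_h(x, theta), cond_mean_dh(x, theta)).T

def G_star_martingale(theta, data, h, cond_mean_dh, cond_cov_h):
    """Optimal martingale estimating function (9.15) for observations X_0, ..., X_n."""
    total = 0.0
    for x_prev, x in zip(data[:-1], data[1:]):
        total = total + optimal_weights(x_prev, theta, cond_mean_dh, cond_cov_h) @ h(x, x_prev, theta)
    return total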
We shall now see, in exactly what sense the optimal estimating function (9.15) approxi-
mates the score function. The following result was first given by Kessler (1996). Let pi (y; θ|x)
denote the conditional density of Xi given that Xi−1 = x. Then the likelihood function for
θ based on the data (X1 , . . . , Xn ) is
n
Y
Ln (θ) = pi (Xi ; θ|Xi−1 )
i=1
(with p1 denoting the unconditional density of X1 ). If we assume that all pi s are differentiable
with respect to θ, the score function is
U_n(θ) = Σ_{i=1}^n ∂_θ log p_i(X_i; θ | X_{i−1}).    (9.17)
Let us fix i, x_{i−1} and θ and consider the L^2-space K_i(x_{i−1}, θ) of functions f : IR → IR for
which ∫ f(y)^2 p_i(y; θ | x_{i−1}) dy < ∞. We equip K_i(x_{i−1}, θ) with the usual inner product

⟨f, g⟩ = ∫ f(y) g(y) p_i(y; θ | x_{i−1}) dy,
and let Hi (xi−1 , θ) denote the N-dimensional subspace of Ki (xi−1 , θ) spanned by the functions
y 7→ hij (y, xi−1 ; θ), j = 1, . . . , N. That the functions are linearly independent in Ki (xi−1 , θ)
follows from the earlier assumption that the covariance matrix Vhi (xi−1 ; θ) is regular.
Now, assume that ∂_{θ_j} log p_i(y | x_{i−1}; θ) ∈ K_i(x_{i−1}, θ) for j = 1, . . . , p, denote by g_{ij}^* the
orthogonal projection with respect to ⟨·, ·⟩ of ∂_{θ_j} log p_i onto H_i(x_{i−1}, θ), and define a p-
dimensional function by g_i^* = (g_{i1}^*, . . . , g_{ip}^*)^T. Then (under weak regularity conditions)

g_i^*(y) = a_i^*(x_{i−1}; θ) h_i(y, x_{i−1}; θ),    (9.18)

where a_i^* is the matrix defined by (9.16). To see this, note that g^* must have the form (9.18)
with a∗i satisfying the normal equations
Acknowledgements
The research was supported by the Danish Center for Accounting and Finance funded by
the Danish Social Science Research Council and by the Center for Research in Econometric
Analysis of Time Series funded by the Danish National Research Foundation.
References
Aı̈t-Sahalia, Y. (2002). “Maximum likelihood estimation of discretely sampled diffusions: a
closed-form approximation approach”. Econometrica, 70:223–262.
Aı̈t-Sahalia, Y. & Mykland, P. (2003). “The effects of random and discrete sampling when
estimating continuous-time diffusions”. Econometrica, 71:483–549.
Beskos, A.; Papaspiliopoulos, O.; Roberts, G. O. & Fearnhead, P. (2006). “Exact and
computationally efficient likelihood-based estimation for discretely observed diffusion
processes”. J. Roy. Statist. Soc. B, 68:333–382.
Bibby, B. M. (1995). Inference for diffusion processes with particular emphasis on compart-
mental diffusion processes. PhD thesis, University of Aarhus.
Bibby, B. M.; Skovgaard, I. M. & Sørensen, M. (2005). “Diffusion-type models with given
marginals and autocorrelation function”. Bernoulli, 11:191–220.
Bibby, B. M. & Sørensen, M. (1995). “Martingale estimation functions for discretely observed
diffusion processes”. Bernoulli, 1:17–39.
Bibby, B. M. & Sørensen, M. (1996). “On estimation for discretely observed diffusions: a
review”. Theory of Stochastic Processes, 2:49–56.
Bibby, B. M. & Sørensen, M. (2003). “Hyperbolic processes in finance”. In Rachev, S., editor,
Handbook of Heavy Tailed Distributions in Finance, pages 211–248. Elsevier Science.
Billingsley, P. (1961). “The Lindeberg-Lévy theorem for martingales”. Proc. Amer. Math.
Soc., 12:788–792.
Brockwell, P. J. & Davis, R. A. (1991). Time Series: Theory and Methods. Springer-Verlag,
New York.
Chan, K. C.; Karolyi, G. A.; Longstaff, F. A. & Sanders, A. B. (1992). “An empirical
comparison of alternative models of the short-term interest rate”. Journal of Finance,
47:1209–1227.
De Jong, F.; Drost, F. C. & Werker, B. J. M. (2001). “A jump-diffusion model for exchange
rates in a target zone”. Statistica Neerlandica, 55:270–300.
Ditlevsen, P. D.; Ditlevsen, S. & Andersen, K. K. (2002). “The fast climate fluctuations
during the stadial and interstadial climate states”. Annals of Glaciology, 35:457–462.
Doukhan, P. (1994). Mixing, Properties and Examples. Springer, New York. Lecture Notes
in Statistics 85.
Down, D.; Meyn, S. & Tweedie, R. (1995). “Exponential and uniform ergodicity of Markov
processes”. Annals of Probability, 23:1671–1691.
Durham, G. B. & Gallant, A. R. (2002). “Numerical techniques for maximum likelihood
estimation of continuous-time diffusion processes”. J. Business & Econom. Statist.,
20:297–338.
Elerian, O.; Chib, S. & Shephard, N. (2001). “Likelihood inference for discretely observed
non-linear diffusions”. Econometrica, 69:959–993.
Fisher, R. A. (1935). “The logic of inductive inference”. J. Roy. Statist. Soc., 98:39–54.
Genon-Catalot, V. (1990). “Maximum contrast estimation for diffusion processes from dis-
crete observations”. Statistics, 21:99–116.
Genon-Catalot, V. & Jacod, J. (1993). “On the estimation of the diffusion coefficient for
multi-dimensional diffusion processes”. Ann. Inst. Henri Poincaré, Probabilités et Statis-
tiques, 29:119–151.
Gloter, A. & Sørensen, M. (2008). “Estimation for stochastic differential equations with a
small diffusion coefficient”. Stoch. Proc. Appl. To appear.
Gobet, E. (2002). “LAN property for ergodic diffusions with discrete observations”. Ann.
Inst. Henri Poincaré, Probabilités et Statistiques, 38:711–737.
Godambe, V. P. & Heyde, C. C. (1987). “Quasi likelihood and optimal estimation”. Inter-
national Statistical Review, 55:231–244.
Gourieroux, C. & Jasiak, J. (2006). “Multivariate Jacobi process with application to
smooth transitions”. Journal of Econometrics, 131:475–505.
Gradshteyn, I. S. & Ryzhik, I. M. (1965). Table of Integrals, Series, and Products, 4th
Edition. Academic Press, New-York.
Hall, P. & Heyde, C. C. (1980). Martingale Limit Theory and Its Applications. Academic
Press, New York.
Hansen, L. P. (1985). “A method for calculating bounds on the asymptotic covariance matri-
ces of generalized method of moments estimators”. Journal of Econometrics, 30:203–238.
Hansen, L. P.; Heaton, J. C. & Ogaki, M. (1988). “Efficiency bounds implied by multiperiod
conditional restrictions”. Journal of the American Statistical Association, 83:863–871.
Hansen, L. P. & Scheinkman, J. A. (1995). “Back to the future: generating moment impli-
cations for continuous-time Markov processes”. Econometrica, 63:767–804.
Hansen, L. P.; Scheinkman, J. A. & Touzi, N. (1998). “Spectral methods for identifying
scalar diffusions”. Journal of Econometrics, 86:1–32.
Heyde, C. C. (1988). “Fixed sample and asymptotic optimality for classes of estimating
functions”. Contemporary Mathematics, 80:241–247.
Jacod, J. & Sørensen, M. (2008). “Aspects of asymptotic statistical theory for stochastic
processes.”. Preprint, Department of Mathematical Sciences, University of Copenhagen.
In preparation.
Kelly, L.; Platen, E. & Sørensen, M. (2004). “Estimation for discretely observed diffusions
using transform functions”. J. Appl. Prob., 41:99–118.
Kessler, M. (1997). “Estimation of an ergodic diffusion from discrete observations”. Scand.
J. Statist., 24:211–229.
Kessler, M. (2000). “Simple and explicit estimating functions for a discretely observed
diffusion process”. Scand. J. Statist., 27:65–82.
Kessler, M. & Paredes, S. (2002). “Computational aspects related to martingale estimating
functions for a discretely observed diffusion”. Scand. J. Statist., 29:425–440.
Kessler, M. & Sørensen, M. (1999). “Estimating equations based on eigenfunctions for a
discretely observed diffusion process”. Bernoulli, 5:299–314.
Kimball, B. F. (1946). “Sufficient statistical estimation functions for the parameters of the
distribution of maximum values”. Ann. Math. Statist., 17:299–309.
Kloeden, P. E. & Platen, E. (1999). Numerical Solution of Stochastic Differential Equations.
3rd revised printing. Springer-Verlag, New York.
Kusuoka, S. & Yoshida, N. (2000). “Malliavin calculus, geometric mixing, and expansion of
diffusion functionals”. Probability Theory and Related Fields, 116:457–484.
Larsen, K. S. & Sørensen, M. (2007). “A diffusion model for exchange rates in a target
zone”. Mathematical Finance, 17:285–306.
Li, B. (1997). “On the consistency of generalized estimating equations”. In Basawa, I. V.;
Godambe, V. P. & Taylor, R. L., editors, Selected Proceedings of the Symposium on
Estimating Functions, pages 115–136. Hayward: Institute of Mathematical Statistics.
IMS Lecture Notes – Monograph Series, Vol. 32.
Liang, K.-Y. & Zeger, S. L. (1986). “Longitudinal data analysis using generalized linear
model”. Biometrika, 73:13–22.
McLeish, D. L. & Small, C. G. (1988). The Theory and Applications of Statistical Inference
Functions. Springer-Verlag, New York. Lecture Notes in Statistics 44.
Nagahara, Y. (1996). “Non-Gaussian distribution for stock returns and related stochastic
differential equation”. Financial Engineering and the Japanese Markets, 3:121–149.
Overbeck, L. & Rydén, T. (1997). “Estimation in the Cox-Ingersoll-Ross model”. Econo-
metric Theory, 13:430–461.
Ozaki, T. (1985). “Non-linear time series models and dynamical systems”. In Hannan, E. J.;
Krishnaiah, P. R. & Rao, M. M., editors, Handbook of Statistics, Vol. 5, pages 25–83.
Elsevier Science Publishers.
Pearson, K. (1895). “Contributions to the Mathematical Theory of Evolution II. Skew
Variation in Homogeneous Material”. Philosophical Transactions of the Royal Society
of London. A, 186:343–414.
Pedersen, A. R. (1994). “Quasi-likelihood inference for discretely observed diffusion pro-
cesses”. Research Report No. 295, Department of Theoretical Statistics, Institute of
Mathematics, University of Aarhus.
Pedersen, A. R. (1995). “A new approach to maximum likelihood estimation for stochastic
differential equations based on discrete observations”. Scand. J. Statist., 22:55–71.
Prakasa Rao, B. L. S. (1988). “Statistical inference from sampled data for stochastic pro-
cesses”. Contemporary Mathematics, 80:249–284.
Prentice, R. L. (1988). “Correlated binary regression with covariates specific to each binary
observation”. Biometrics, 44:1033–1048.
Roberts, G. O. & Stramer, O. (2001). “On inference for partially observed nonlinear diffusion
models using Metropolis-Hastings algorithms”. Biometrika, 88:603–621.
Sørensen, M. (2007). “Efficient estimation for ergodic diffusions sampled at high frequency”.
Preprint, Department of Mathematical Sciences, University of Copenhagen.
Veretennikov, A. Y. (1987). “Bounds for the mixing rate in the theory of stochastic equa-
tions”. Theory of Probability and its Applications, 32:273–281.
Yoshida, N. (1992). “Estimation for diffusion processes from discrete observations”. Journal
of Multivariate Analysis, 41:220–242.