Learning Hessian Matrix PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 100

Derivatives, Higher Derivatives, Hessian Matrix And its Application in Numerical Optimization Dr M Zulfikar Ali Professor Department of Mathematics

University of Rajshahi

1. First-, Second- and Higher-Order Derivatives


11. Differentiable Real-Valued Function Let f be a real valued function defined on an open set in R , and let x be in . f is said to be differentiable at x if for all x R such that x + x we have that
n

f ( x + x) = f ( x ) + t ( x ) x + ( x , x) x

where t ( x ) is an n-dimensional bounded vector, and is a real-valued function of x such that lim ( x , x) = 0 f is said to be differentiable on if it is differentiable at each x in . [Obviously, if f is differentiable on the open set , it is also differentiable on any subset (open or not) of . Hence when we say that f is differentiable on some set (open or not), we shall mean that f is differentiable on some open set containing .] Theorem 1 Let f ( x) be a real-valued function defined on an open set in R , and let x in . (i) If f ( x) is differentiable at x , then f ( x) is continuous at x , and f ( x ) exists (but not conversely), and f ( x + x) = f ( x ) + ( x ) x + ( x , x) x , lim ( x , x) = 0 for x + x (ii) If f ( x) has continuous partial derivatives at x with respect to x , x ,K, x , that is f ( x ) exists and f is continuous at x , then f is differentiable at x .
x 0

x 0

Theorem 2 Let f ( x) be a real-valued function defined on an open set in R , and let x in . (i) If f ( x) is differentiable at x , then f ( x) is continuous at x , and f ( x ) exists(but not conversely), and f ( x + x) = f ( x ) + ( x ) x + ( x , x) x , lim ( x , x) = 0 for x + x (ii) If f ( x) has continuous partial derivatives at x with respect to x , x ,K, x , that is f ( x ) exists and f is continuous at x , then f is differentiable at x . 1.2. Twice Differentiable Real-Valued Function Let f be a real-valued function defined on an open set in R , and let x be in . f is said to be twice differentiable at x if for all x R such that x + x we have that
n x 0

x f ( x ) x f ( x + x ) = f ( x ) + f ( x ) x + + ( x , x)( x ) 2
2

where f ( x ) is an n n matrix of bounded elements, and is a real-valued function of x such that lim ( x , x) = 0 The n n matrix f ( x ) is called the Hessian (matrix) of f at x and its ij-th element is written as
2
x 0

[ f ( x )] = f ( x ) , i, j = 1, 2, L, n x x Obviously, if f is twice differentiable at x , it must also be differentiable at x .


2 2 ij i j

1.3. First Partial Derivative Suppose f ( x, y ) is a real-valued function of two independent variables x and y. Then the partial derivative of f ( x, y ) with respect to x is defined as f f ( x + x, y ) f ( x, y ) = lim x x
x 0 y

(1)

Similarly, the partial derivative of f ( x, y ) with respect to y is defined as f f ( x, y + y ) f ( x, y ) = lim . y y


x 0 x
2 2

(2)

Example1 If f ( x, y ) = x 2 y . Then f [( x + x) 2 y ] ( x 2 y ) f = = lim x x = 2x Similarly


2 2 2 2 x x 0 y

f [ x 2( y + y ) ] ( x 2 y ) f = = lim y y
2 2 2 2 y x 0 x

= 4 y . 1.4. Gradient of Real-Valued Functions Let f be a real-valued function defined on an open set in R , and let x be in . The n-dimensional vector of the
n

partial derivatives of f with respect to x , x ,K, x at x is called the gradient of f at x and is denoted by f ( x ) , that is,
1 2 n

f ( x ) = (f ( x ) x ,K, f ( x ) x )
1 n

Example2 Let f ( x, y ) = ( x y ) + y .Then the gradient of f,


2 2

f ( x ) = (f ( x ) x , f ( x ) x ) = (2 x 2 y, 2 x + 4 y )
1 2

1.5. Function of a Function It is well-known propert of functions of one independent variable that if f is a function of a variable u, and u is a function of a variable x, then df df du = . (3) dx du dx This result may be immediately extended to the case when f is a function of two or more independent variables. Suppose f = f (u ) and u = u ( x, y ) . Then, by the definition of a partial derivative, f df u (4) f = = , x du x f df u f = = . (5) y du y
x y y y x x

Example 3 If

f ( x, y ) = tan

y x

Then putting u = y x we have y d u f f = = (tan u ) = x +y x x du f u d x f = = (tan u ) = . y du y x + y


1

1.6. Higher Partial Derivatives Provided the first partial derivatives of function are differentiable we may differentiate them partially to obtain the second partial derivatives. The four second partial derivatives of f ( x, y ) are therefore f f f = f = , = x x x x f f f = f = = y y y y f f f = , f = = x y xy x
2

xx

(6) (7) (8)

yy

xy

and

f f = (9) f = f = . yx y y x Higher partial derivatives than the second may be obtained in a similar way.
2

yx

Example 4 If f ( x, y ) = tan

y . Then x

y x f f . , = = + + x x y x x y Hence differentiating these first derivatives partially, we obtain


2 2 2 2

y 2 xy f = ( f = )= x x x + y (x + y )
2 xx 2 2 2 2 2

and

f x 2 xy = ( f = )= y y x + y (x + y )
2 yy 2 2 2 2 2

also

f x y x = ( f = )= xy x x + y (x + y )
2 2 2 xy 2 2 2 2 2 2

and

f y y x = ( . f = )= yx y x + y (x + y )
2 yx 2 2 2 2 2

We see that

f f = xy yx
2 2

(10)
and are x y

which shows that the operators commutative. We also see that

f f + = 0. x y
2 2 2 2

(11)

The above equation is called Laplace equation in two variables. In general, any function satisfying this equation is called a harmonic function. 1.7. Total Derivatives Suppose f ( x, y ) is a continuous function defined in a f f region R of the xy-plane, and that both and are x y continuous in this region. We now consider the change in the value of the function brought about by allowing small changes in x and y.
y x

If f is the change in f ( x, y ) due to changes x and y in x and y then f = f ( x + x, y + y ) f ( x, y ) (12) = f ( x + x, y + y ) f ( x, y + y ) (13) + f ( x, y + y ) f ( x, y ) Now, by definition (1) and (2) f ( x + x, y + y ) f ( x, y + y ) f ( x, y + y ) = lim (14) x x f ( x, y + y ) f ( x, y ) f ( x, y ) = lim and (15) y y Consequently ,
x 0 y 0

f ( x + x, y + y ) f ( x, y + y ) = f ( x, y + y ) + x x

(16) and f ( x, y + y ) f ( x, y ) = f ( x, y ) + y y where and satisfy the conditions


lim = 0 and lim = 0.
x 0 y 0

(17)

(18)

Using (16) and (17) in (13) we now find ( , + ) + + ( , ) + f = f x y y x f x y y . (19) x y Furthermore, since all first derivatives are continuous by assumption, the first term of (19) may be written as f ( x, y ) f ( x, y + y ) = + x x where satisfies the condition (20)

lim = 0 .
y 0

(21)

Hence, using (20) and (19) now becomes f ( x, y ) f ( x, y ) f = x + y + ( + )x + y . (22) x y The expression f ( x, y ) f ( x, y ) f x + y (23) x y obtained by neglecting the small terms ( + )x and y in (22) represents, to the first order in x and y . The change

f in f(x, y) due to changes x and y in x and y


respectively is called the total differential of f. In case of a function of n independent variables f ( x , x ,L, x ) we have f f f f f x + x + L + x = x . x x x x (24)
1 2 n n 1 2 n r =1 r 1 2 n r

Example5. To find the change in


f ( x, y ) = xe
xy

when the values of x and y are slightly changed from 1 and 0 to 1+ x and y respectively. We first use (23) to obtain

f ( xye + e )x + x e y
xy xy

xy

Hence putting x=1, y=0 in the above expression we have

f x + y
For example, if x = 0.10 and y = 0.05 , then f 0.15 We now return to the exact expression for f given in (22). Suppose u=f(x, y) and that both x and y are differentiable functions of a variable t so that x=x(t), y=y(t) (25)

and u=u(t). (26) Hence dividing (22) by t and proceeding to the limit t 0 (which implies x 0 , y 0 and consequently , , 0 ) we have
du f dx f dy = + . (27) dt x dt y dt This expression is called the total derivative of u (t ) with respect to t. It is easily seen that if (28) u = f ( x , x ,L , x )
1 2
n

where x , x ,L, x are all differentiable functions of a variable t, then u=u(t) and
1 2
n

du f dx f dx f dx f dx +L+ = + dt x dt x dt x dt x dt (29)
n

r =1

Example6. Suppose And Then

u = f ( x, y ) = x + y
2

x = sinh t , y = t .
2

f = 2 x, x

f = 2y y

dx = cosh t , dt

dy = 2t dt

Hence
du f dx f dy = + = 2 x cosh t + 4 yt = 2 sinh t cosh t + 4t dt x dt y dt .
3

1.8. Implicit Differentiation In special case of the total derivative (27) aries when y is itself a function of x (i.e. t=x). Consequently u is a function of x only and du f f dy = + (30) dx x y dx Example7. Suppose x u = f ( x, y ) = tan y and y = sin x Then by (30), we have du y x cos x = dx x + y x + y sin x x cos x = x + sin x
1
2 2 2 2 2

When y is defined as a function of x by the equation


u = f ( x, y ) = 0

(31)

Y is called an implicit function of x. Since u is identically zero its total derivative must vanish, and consequently from (30)
dy f = dx x f y

(32)
x

Example8. Suppose

f ( x, y ) = ax + 2hxy + by + 2 gx + 2 fy + c = 0
2 2

(where a, h, b, g, f and c are constants) Then dy 2ax + 2hy + 2 g = . dx 2by + 2hx + 2 f 1.9. Higher Total Derivatives We have already seen that if u = f ( x, y ) and x and y are differentiable functions of t then du f dx f dy = + . (33) dt x dt y dt d u d To find we note (33) that the operator can be dt dt written as d dx dy + . (34) dt dt x dt y Hence d u d du dx dy f dx f dy = = + + dt dt dt dt x dt y x dt y dt
2 2 2 2

f dx dy f dy dx + +2 dt x y dt dt y dt f d x f d y + + (35) x dt y dt where we have assumed that f = f .Higher total derivatives may be obtained in a similar way. f = x
2 2 2 2 2 2 2 2 2 2 2 xy yx

A special case of (35) is when dx dy = h, = k, dt dt where h and k are constants. We then have f f f d u =h + 2hk +k , x xy dt y which, if we define the differential operator D by
2 2 2 2 2 2 2 2 2 *

+k , x y may be written symbolically as d u = h + k f = D f . y dt x Similarly we find f f du f f +k =h + 3h k + 3hk dt xy y x x y = h + k f = D f , y x assuming the commutative properties of partial differentiation. In general,
*

D=h
2

d u = h + k f = D f , y dt x where the operator h + k is to be expanded by y x means of the binomial theorem.


n n * n n
n

1.10 . Taylors Theorem for functions of Two Independent Variables Theorem3 (Taylors Theorem) If f ( x, y ) is defined in a region R of xy-plane all its partial derivatives of orders up to and including the (n+1)th are continuous in R , then for any point (a, b) in this region 1 f (a + h, b + k ) = f ( a, b) + Df ( a, b) + D f ( a, b) + L 2! 1 + D f ( a, b) + E , n! where D is the differential operator defined by D=h +k x y and
* * 2 * n n * *

D f (a, b) means h + k f ( x, y ) y x evaluated at the point (a, b) . The Lagrange error term E is given b
r

1 E = D f (a + h, b + k ) (n + 1)! where 0 < < 1.


n +1 n

Example9. Expand the function f ( x, y ) = sin xy about the point (1, ) neglecting terms of degree three and 3 higher. Here 3 f (1, ) = , 3 2
f ( x, y ) = y cos xy,
x

f ( x, y ) = x cos xy,
y

f (1, ) = 3 6 1 f (1, ) = 3 2
x y

f ( x, y ) = y sin xy,
2 xx

f (1, ) = 3 18
xx

f ( x, y ) = xy sin xy + cos xy,


xy 2 yy

3 1 + f (1, ) = 3 6 2 3 f ( x, y ) = x sin xy, f (1, ) = . 3 2


xy yy

Hence
3 1 1 sin xy = + ( x 1) + ( y )( ) + x 1) 2 6 3 2 2!
2

3 18
2

3 1 3 +y + 2( x 1) y + 3 6 2 3 2 + terms of degree 3 and higher.


2

2. Derivatives by Linearization Technique 2.1. Linear Part of a Function. Consider the function, where f ( x, y ) = x 3 xy y ; and try to approximate f ( x, y ) near the point ( x, y ) = (2,1) by simpler function. To do this, set x = 2 + and y = 1 + . Then
2 3

f ( x, y ) = f (2 + ,1 + )

= (4 + 4 + ) 3(2 + 2 + )
2

(1 + 3 3 + 3 )
2 3

= 11 + (7 9 ) + ( 3 + 3 + ). Here 11 = f (2,1) ; 7 9 is a linear function of the variables and ; and v = 3 + 3 + is small, compared to the linear function 7 9 , if and are both small enough. This means that, if and are both small enough, then f ( x, y ) 11 + (7 9 )
2 2 3 2 2 3

is a good approximation, the terms omitted being small in comparision to the terms retained. To make the idea of small enough precise, denote by d the distance from (2,1) to (2 + ,1 + ) ; then

d = [ + ] . Then d and d , so that, as d 0 , d d d = d 0 ; 3 d 3d d = 3d 0 ; and similarly for the remaining terms in v . This motivates the following definition.
2 2

1 2

Definition 1. The function f of two variables x, y is differentiable at the point (a, b) if . f ( x, y ) = f ( a , b ) + p ( x a ) + q ( y b ) + ( x, y ) , where p and q are constants, and

d 0 as d = [( x a ) + ( y b) ] 0 .
2 2

1 2

The linear function p ( x a ) + q ( y b) will be called the linear part of f at (a, b) . (some books call it the differential.) The numbers p and q can be directly calculated. If y is fixed at the value b , then d = x a , and f ( x, b ) f ( a , b ) ( x, b ) = p+0+ p+0 xa xa as d = x a 0 , thus as x a . Thus p equals the partial derivative of f with respect to x at a , holding y fixed at b . This will be written

p=
f or D f
x x

f f ( a, b) = f (a, b) = D f ( a, b) , or as or x x
x x

for short. Similarly q equals

f ( a, b) = f ( a, b) , the partial y derivative of f with respect to y at b , holding x fixed at a . Thus, in the above example, f = 2 x 3 y , and so, at ( x, y ) = (2,1) , f = 2(2) 3(1) = 7 = p . Similarly, f = 3 x 3 y = 9 = q (at ( 2,1)) .
y x x

Example 1. Calculate the linear part of f at ( ,0) where 4 f ( x, y ) = cos(2 x + 3 y ) . f ( x, y ) = cos(2 x + 3 y ) Solution. f ( x, y ) = 2 sin( 2 x + 3 y ) = 2 at ( ,0) 4
x

f ( x, y ) = 3 sin( 2 x + 3 y ) = 3 at ( ,0) 4
y

Hence the linear part of f at ( ,0) is 4


2( x ) 3( y 0) 4 x = (2 3) 4 y 0

Example 2. Calculate the linear part of g at (0, ) , where 2 g ( x, y ) = sin( x + y ).


2 2

Solution. g (0,
x

) = 0, g (0,
y

) = cos(

Hence the linear part of g at (0,

) is

0.( x 0) + cos( = cos( 2.2. Vector Viewpoint

)( y ) 4 2

)( y ) 4 2

It will be convenient to regard f as a function of a vector variable w whose components are x and y . Denote x w = , as a column vector, in matrix language; denote y a also c = . Define also row vector b f (c) = ( p, q ) = ( f (a, b), f (a, b) ).
x y

( The notation f (c) suggests a derivative; we have here a sort of vector derivative.). Then, since f is differentiable,

f ( w) = f (c) + f (c)( w c) + ( w),

where the product x a f (c)( w c) = ( p, q ) = p ( x a ) + q ( y b) y b is the usual matrix product. Now let w c denote the length of the vector w c ; then the previous d = w c , and so ( w) w c 0 Example 3. For f ( x, y ) = x 3 xy y , a = 2, b = 1, set 2 c = ; then 1 f (c) = (7,9) .
2 3

2.3. Directional Derivative. Suppose that the point (x, y) moves along a straight line through the point (a, b); thus x = l , y = m , where is the distance from (a, b) to (x, y) and l and m are constants specifying the direction of the line (with l + m = 1). In x a l vector language, let w = , c = , t = ; then y b m = w c ; and the line has equation w = c + t . The rate of increase of f ( w) , measured at w = c , as w moves along the line w = c + t , is called the directional derivative of f at c in the direction t .This may be calculated, assuming f differentiable, as follows: f (c + t ) f (c) f (c)( t ) = = f (c)t + .
2 2

The required directional derivative is the limit of this ratio l as 0 , namely f (c)t = ( p q ) = pl + qm (since m 0 as 0 ). Note that t is here a unit vector. Example 4(a) Let f ( x, y ) = x 3 xy y . The directional 2 cos derivative of at in the direction is 1 sin cos cos f (2, 1) = (7 9 ) = 7 cos 9 sin . sin sin
2 3

(b) Find the directional derivative of f ( x, y, z ) = 2 x + xy + yz at (1,-1, 2) in the direction of the vector A=(1,-2,2).
2 2

Solution. f ( x, y, z ) = (4 x + y, x + z ,2 yz ) , f (1, 1, 2) = (3, 5, 4).


2

1 2 2 Unit vector in the direction of A= (1,-2, 2) is ( , , ). 3 3 3 Therefore the directional derivative at (1,-1, 2) in the direction of A is 13 (3 5 4) 2 3 = 5 . 23

2.4. Vector Functions Let and each be differentiable real functions of the two real variables x and y . The pair of equations u = ( x, y )

v = ( x, y ) defines a mapping from the point ( x, y ) to the point (u , v) . If, instead of considering points, we consider a vector w , with components x, y , and a vector s , with components u, v, then the two equations define a mapping from the vector w to the vector s . This mapping is then specified by the vector function f = .
Definition. The vector function f is differentiable at c if there is a matrix f (c) such that

f ( w) = f (c) + f (c)( w c) + ( w)
holds, with

(1)

0 as w c 0 . wc The term f (c)( w c) is called the linear part of f at c . Example 5. Let ( x, y ) = x + y and ( x, y ) = 2 xy .these functions are differentiable at (1, 2) , and calculation shows that
2 2

( w)

u = 5 + 2( x 1) + 4( y 2) + (( x 1) + 4( y 2) ); v = 4 + 4( x 1) + 2( y 2) + ( 2( x 1)( y 2)). This pair of equations combines into a single matrix equation:
2 2

u 5 2 4 x 1 ( x 1) + 4( y 2) = + . + v y x y 4 4 2 2 2 ( 1 )( 2 ) In vector notation, this may be written as f ( w) = f (c) + f (c)( w c) + ( w) , where now f (c) is a 2 2 matrix. Since the components , of satisfy ( w) w c 0 and ( w) w c 0 as w c 0 , it follows that ( w) w c 0 as w c 0 .
2 2 1 2 1 2

Definition 2. The vector function f is differentiable at c if there is a matrix f (c) such that Equation (1) holds, with

( w)
wc

0 as w c 0 .

(2) The term f (c)( w c) is called the linear part of f at c. Example 6. For the function x + y f = = , 2 xy
2 2

1 the linear part at is 2 2 4 x 1 . 4 2 y 2


x x y 1 Example7. Let f = . Calculate f . y xy 2
2 2 2

x 2 x 2 y 1 2 4 Here f = ; f = . 2 xy y y 2 4 4
2

2.5. Functions of Functions Let the differentiable function f map the vector w to the vector s; let the differentiable function g map the vector s to the vector t. Diagrammatically,

f g
w s t

Then the composition h = g o f of the functions g and f maps w to t. Since f and g are differentiable, f ( w) f (c) = A( w c) + ; g ( f ( w)) g ( f (c)) = B( f ( w) f (c)) + ; where A and B are suitable matrices, and and can be neglected, then approximately

g ( f ( w)) g ( f (c)) BA( w c) .


The linear part of h, is in fact, BA( w c) . From the chain rule we have h(c) = g ( f (c)) f (c) .
u ( x, y ) x + y Example8. Let f = = , = v ( x , y ) 2 xy 1 g (u , v) = 3u v , c = . 2 1 1 + 2 5 f (c) = f = = 2 2 1 2 4 2x 2 y 2 4 f ( x, y ) = , , f (c) = 2 y 2x 4 2 g (u , v) = (3, 2v) = (3, 2v) , g (c) = (3, 8) The chain rules then gives 1 2 4 h = (3 8) = ( 26 4). 2 4 2 Example9 . z = 2 x y , x = cos t , y = sin t . Here x cos t f (t ) = , g = 2 x y . t sin y Taking partial derivatives, x sin t f (t ) = ; g = ( 2 x 2 y ) cos t y Hence
2 2 2 2 2 2 2 2 2

sin t dz = ( 2 x 2 y ) cos t dt sin t = ( 2 cos t 2 sin t ) =0 t cos as it should.

3. Gateaux and Frechet Derivatives Nonlinear Operators can be investigated by establishing a connection between them and linear operators-more precisely, by the technique of local approximation to the nonlinear operator by a linear one. The differential calculus for nonlinear operators is needed for this purpose.Differentiation is a technique that helps us to approximate a nonlinear operator locally. In Banach spaces there are different types of derivatives. Among them Gateaux and Frechet Derivatives are very important for applications.One of them is more general than the other but in some special circustances they are equivalent. 3.1. Gateaux Derivative The Gateaux derivative is the generalization of directional derivative. Definition.Let X and Y are Banach spaces and let P be an operator such that P : X Y . Then P is said to be Gateaux differentiable at x X if there exists a continuous linear operator U : X Y (in general depends on x ) such that
0 0

lim
t 0

P ( x + tx) P ( x ) = U ( x) t
0 0

for every x X . The above is clearly equivalent to 1 lim [ P( x + tx) P( x ) tU ( x)] = 0 t


t 0
0 0

(1)

for every x X . In the above situation, U is called the Gareaux derivative of P at x , writtenU = P ( x ) , and its value at x X is denoted by P ( x )( x) or simply P ( x ) x .
0 0 0 0

Theorem1. The Gateaux derivative, if it exists, is unique. Proof. Let U and V be two Gateaux derivatives of P at x . From the relation, for t 0 1 V ( x) U ( x) = [ P( x + tx ) P( x ) tU ( x)] t 1 [ P ( x + tx) P( x ) tV ( x)], t we obtain for t > 0 1 V ( x) U ( x) P ( x + tx ) P ( x ) tU ( x) t 1 + P ( x + tx ) P ( x ) tV ( x) t
0 0 0 0 0 0 0 0 0

and as t 0 , because of (1), both the expressions in the right hand side tend to zero. Since this is true for each x X , we see that U = V and rhe theorem is proved. We now suppose that X and Y are finite dimensional, say X = R and Y = R . Let us analyse the representation of the Gareaux derivative of an operator from X into Y. We know that if U is a linear operator from X into Y then U is given by the matrix
n m

a a L a
11 21 m1 1 2

L a a L a L L L a L a a
12 1n 22 2n m2 mn n 1 2 m

where y = U ( x), x = ( , ,L, ) X , y = ( , ,L, ) Y and = a , i = 1, 2 , L, m . (1)


n i k =1 ik k

Let P be an operator mapping an open subset G of X into Y and x = ( , ,L, ) G y = ( , ,L, ) Y and y = P ( x) . Then we see that there exist numerical functions , ,L, such that (2) = ( , ,L, ), i = 1, 2 , L, m ; Suppose that the Gateaux derivative of P exists at x = ( , ,L, ) and P ( x ) = U . Let U be given by the above matrix, If the equation
1 2 n 1 2 m 1 2 m i i 1 2 n (0) (0) (0) 0 1 2 n 0

P( x + tx) P( x ) = U ( x) t is written in full then with the help of the relation (1) and (2) we obtain m relations
lim
t 0 0 0

lim
t 0

( + t , + t ,L, + t ) ( , ,L, )
(0) (0) (0) (0) (0) (0) i 1 1 2 2 n n i 1 2 n

t
= a , i = 1, 2 , L, m
n k =1 ik k 1 2 n

(3)

The relation (3) holds for all x = ( , ,L, ) X and therefore taking in turn an x whose all coordinates are zero except one which is equal to unity, we see that the functions , ,L, have partial derivatives with respect to , ,L, and ( , ,L, ) =a where i = 1, 2 , L, m and k = 1, 2 , L, n .
1 2 m 1 2 n (0) (0) (0) i 1 2 n ik k

The derivative P ( x ) = U is therefore given by the matrix of partial derivatives of the functions , ,L,
0 1 2 m

L L P ( x ) = L L L L L which is known as the Jacobian matrix and is denoted by J (x ) .


1 1 1 2 1 n 2 1 2 2 2 n 0 m 1 m 2 m n 0

Example1. In this example, we show that the existence of the partial derivatives of , ,L, need not guarantee the existence of P ( x ) . Let m=1 and n=2 and
1 2 m 0

( , ) =
1 1 2

( + )
1 2 2 2 1 2 1

, (0, 0) = 0 , x = (0, 0) .
1 0

(0,0) (0 + h,0) (0,0) (h,0) (0,0) = lim = lim h h h.0 = lim =0 (h ) h


1 1 1 1 h0 h 0 1 h 0 2 2

(0,0) = 0. Therefore, if the derivative P ( x ) were to exist then it must be the zero operator and then (3) should give (t , t ) lim = 0. t

and similarly,

t 0

P ( x ) = lim
0 t 0

(0 + th) (0,0)
1 1 1

But

t th ( th ) (0,0) = lim t th .th 1 = lim t {(th ) + (th ) }


1 1 2 t 0 1 2 t 0 2 2 2 1 2 2 1 2 t 0 5 2 2 2

t hh = lim t {( h ) + ( h ) } hh which does not = lim t {(h ) + (h ) }


1 2 1 2 t 0 3 2 2 2 1 2

exist. (b) Let m=n=2 and P ( x , x ) = ( x , x ) . Let z = ( z , z ) be any point then we see that 0 3z P ( z ) = 0 2z .
3 2 1 2 1 2 1 2 2 1 2

3.2. Frechet Derivative The derivative of a real function of a real variable is defined by f ( x + h) f ( x ) f ( x) = lim (1) h This definition cannot be used in the case of mapping defined in Banach space because h
h0

is then a vector and division by a vector is meaningless. On the other hand, the division by a vector can be easily avoided by rewriting (1) in the form

f ( x + h) = f ( x) + f ( x)h + (h)

(2)

where is a function of h such that (h) 0 as. h 0 Equivalently, we can now say that f ( x) is the derivative of f at x if

f ( x + h) f ( x) = f ( x)h + (h)
where (h) h 0 as h 0 .

(3)

The definition based on (3) can be generalized to include mappings from a Banach space into a Banach space. This leads to the concept of the Frechet differentiability and Frechet derivative. Definition. Let f map a ball X = {w R : w c < }(with center c) into R
n
0

3.3. Higher Order Derivatives Let U = {x R : x < }.If f C (U , R) and a U , then , for each u R with a + u U , f (a + u ) f (a ) has linear part f (a )u = D f (a )u
n
1

i =1

Here D f ( a ) denotes
i n

f ( a ) . If f C (U , R) then, for each x v R with a + v U , the linear part of


2 i n i i i

[ f (a + v) f (a )]u = [ D f ( a + v ) D f (a )]u
i =1

is

f ( a )(u , v) = D f (a )u v
n i , j =1 ij i

where D f (a ) denotes
ij

f (a) . x x This process may be continued. If f C (U , R ) , denote D f (a) = LL f (a) . x x x Define then, for w , w ,LL, w R ,
j i k i1 i 2 LLLi k ik i k 1 i1 n 1 2 k

(k )

(a)( w , w ,LL, w ) = D
1 2 k i1 , i 2 ,L, i k =1 1 , i1

i1 i 2 LLLi k

f (a ) w w L w
1 , i1 2 ,i2

k ,ik

where w denotes the i component of w . If all w = w , we abbreviate f (a )( w , w ,LL, w ) to f ( a )( w) .


1 1
i

(k )

(k )

The derivative f (a ) is representd by a 1 n matrix, whose components are D f ( a ) . Also f (a) is represented by an n n matrix, M say, whose i, j element is D f (a ) ; if u and v are regarded as colums, then
i i. j

f (a)(u , v) = v Mu .
T

It is shown below that M is a symmetric matrix.

Example Define f : R R by the polynomial


n

f ( x, y ) = 3 + 7 x + 4 y + x + 3xy + 2 y + 6 x y .
2 2 2

Then
f (0 + v) f (0) = [2v + 3v + 12v v , 3v + 4v + 6v ] Taking the linear part of this expression, and applying it to u u = u , u f (0)(u , v) = (2v + 3v , 3v + 4v ) u 2 3 u ( ) = v v 3 4 u where the 2 2 matrix consists of second partial derivatives.
2 1 2 1 2 1 2 1 1 2

In more abstract terms, let L( R , R ) denote the vector space of continuous linear maps from R into R . Then, in terms of continuous linear maps, f (a ) L( R , R ) , f (a) L( L( R , R), R) and so on. Also f (a) is a bilinear map from R R into R ; this means that f (a)(u , v) is linear in u for each fixed v, and also linear in v for each fixed u.
n n n n n

Theorem 2 (Taylors) Let f C (U , R ); let a U and let a + x U . Then


k

1 1 f (a ) x + f (a )( x) + L 2! 1! 1 1 + f (a )( x) + f ( a )( x) , (k 1)! k! where c = a + x for some in 0 < < 1. f (a + x) = f (a) +


2 ( k 1 )
k 1

(k )

Example 2 For the previous polynomial example, 0 x 1 0 x 0 f + f + f 0 y 2 0 y 0 x 1 2 3 x = 3 + (7 4) + ( x y ) , y 2 3 4 y which agrees with the given polynomial function up to quadratic term.
2

Theorem3 Let f C (U , R), where U = {x R : x < }; let a U .Then f (a ) is represented by a symmetric matrix M .
2
n

Proof In order to show that M is a symmetric matrix, it is enough to consider a function f of two real variables x and f y, and then to show that f = f ; here f = and y x f f = . Hence let f C (U , R ) where x y x x U = R : < , and take a = 0 .Define y y
xy yx xy

yx

( x, y ) =

f ( x, y ) f (0, y ) f ( x,0) + f (0,0) xy

(i) and, for fixed y, let ( x) = f ( x, y ) f ( x,0) . Then the mean-value theorem shows that, for fixed y,

( x, y ) =

f (x, y ) f ( x,0) xy xy y For some in 0 < < 1. A second application of the mean-value theorem shows that ( x, y ) = f (x, y ) for some in 0 < < 1. Since f ( , ) is assumed continuous,

( x ) ( 0)

(x) x

xy

xy

( x, y ) f (0,0) < whenever x < ( ) and y < ( ) . (ii)


xy

Let y 0 ; from equation (1), 1 f ( x, y ) f ( x,0) f (0, y ) f (0,0) lim ( x, y ) = lim x y y = x [ f ( x,0) f (0,0)] . (iii) From equations (ii) and (iii), f ( x,0) f (0,0) f (0,0) < whenever x < ( ) . x Therefore f (0,0) = f (0,0) .
y 0 y 0
1

xy

yx

xy

3.4. Example of Bilinear Operator Assume that c = (c ) R is a real m-vector, that A = (a ) is a real (m, m) matrix and that B= (b ) is a bilinear operator from R R to R . Then the mapping f : R R defined by
m i ij ijk m m m m m

f ( z ) = c + Az + Bz , z R ( where Bz = Bzz )
2
m

(b1)

is called a quadratic operator. The equation f ( z ) = 0 , that is Bz + Az + c = 0


2

(b2)
m

is called a quadratic equation in R

As a simple but very important example to the quadratic equation (b2), we consider the algebraic-eigenvalue problem (b3) Tx = x where T = (t ) is a real (n,n) matrix. We assume that the eigenvector x = ( x ) has Eulidean length one
ij i

x = x =1
2 2 2
i =1 i T

(b4)
n

If we set z = ( x , x ,L, x , ) , then (b3) and (b4) can be written as a system of nonlinear equations, namely (T I ) x =0 (b5) f ( z) = 1 (1 x ) 2
1 2

2 2

It is well known that (b5) is a quadratic equation of the form (b1) where m=n+1 and 1 c = 0 0 L (b6) , c R 2
T

0 T M A= 0 0 L 0 0

(b7)

1 0 1 O 0 0 M 1 O O O M B= . 0 M L 2 O 1 0 0 1 1 L 0 0 0 L 0 1 0 0 L 0 0
2

(q8)

For the mapping (5), we get by using (q6, q7) and (q8) that f ( z ) = c + Az + Bz , f ( z ) A + 2 Bz , f ( z ) = 2 B . Therefore f ( z ) has the matrix representation x T f ( z ) = 0 x and f ( z ) is the bilinear operator defined in (b8), multiplied by the factor two.
T

0 x x For, n = 2 , x = x , z = x , c = 0 . 1 2
1 1 2 2

+ ( ) t x t x (T I ) x = t x + (t ) x f ( z) = 1 (1 x ) 2 1 (1 x x ) 2 t x t f ( z) = t t x x 0 x T I x = x 0 0 0 1 0 0 0 1 0 0 f ( z ) = 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 and for n = 3 ,
11 1 12 2 2 2 21 1 22 2 2 2 1 2
11 12 1 21 22 2 1 2
T

0 0 f ( z ) = 0 1

0 0 0 0 0 0 0 0

1 0 0 0

0 0 0 0 0 0 0 0 0 0 1 0

0 1 0 0

0 0 0 0 0 0

0 0

1 0 0 1

0 0 0 1 0 0 1 0

0 0 0 0 1 0 0 0 0 0 0 0

4. Hessian Matrix and Unconstraint Optimization In mathematics, the Hessian matrix(or simply the Hessian) is the square matrix of second-order partial derivatives of a function, that is , it describes the local curvature of a function of many variables.The Hessian matrix was developed in 19th century by the German mathematician

Ludwig Otto Hesse and later named after him. Hesse himself had used the term functional determinants. Given the real-valued function f ( x , x ,L, x ) If all second partial derivatives of f exist, then the Hessian matrix of f is the matrix H ( f ) ( x) = D D f ( x) where x = ( x , x ,L, x ) and D is the differentiation operator with respect to the ith argument and the Hessian becomes
1 2
n ij i j

2 ( f ) 2 x1 2 ( f ) x2 x1 H( f ) = M 2 ( f ) x x n 1

2 ( f ) x1x2 2 ( f ) 2 x2 M 2 ( f ) xn x2

L 2 ( f ) x1xn L 2 ( f ) x2 xn O M L 2 ( f ) 2 xn

H ( f )( x) is frequently shortened to simply H ( x) . Some mathematicians define the Hessian as the determinant of the above matrix. Hessian matrices are used in large-scale optimization problems within Newton-type methods because they are the coefficient of the quadratic term of a local Taylor expansion of a function.That is, 1 y = f ( x + x) f ( x) + J ( x)x + x H ( x)x 2 Where J is the Jacobian matrix, which is a vector (the gradient for a scalar valued function). The full Hessian matrix can be difficult to compute in practice, in such situations, quasi-Newton algorithms have been developed that use approximations to the Hessian. The best known quasi-Newton algorithm is the BFGS algorithm.
T

4.1. Mixed derivatives and Symmetry of the Hessian The mixed derivatives of f are the entries off the main diagonal in the Hessian. Assuming that they are continuous, the order of differentiation does not matter (Clairaut theorem). For example, f f = . x y y x This can also be written as f = f . In a formal statement: if the second derivatives of f are all continuous in a neiborhood D, then the Hessian of f is a symmetric matrix throughout D.
yx xy

Example1 Consider the real-valued function f ( x, y ) = 5 xy . Then f ( x, y ) = (5 y ,15 xy ) and the Hessian matrix, f f 15 y 0 x x y = H = f ( x, y ) = . f 15 y 30 xy f yx x Example2 Let f ( x, y ) = x 3 xy y . f f Then f ( x, y ) = = (2 x 3 y 3 x 3 y ) x y
3 3 2 2 2 2 2 2 2 2 2 2

The Hessian matrix, f f 3 2 x x y = H = f ( x, y ) = . f 3 6y f yx x Example3 Let a function, f : R R is given by x + y f ( x, y ) = . 2 xy Then 2x 2 y f ( x , y ) = and 2 2 y x 2 0 0 2 f ( x, y ) = 0 2 2 0 which is not a Hessian matrix but a bilinear operator.
2 2 2 2 2 2 2

4.2. Critical points and discriminant If the gradient of f is zero at some point x, then f has a critical point (or a stationary point) at x. The determinant of the Hessian at x is then called the discriminant. If this determinant is zero then x is called a degenerate critical point of f, this is also called a non-Morse critical point of f. Otherwise, it is non-degenerate, this is called a Morse critical point of f. For a real-valued function f ( x, y ) of two variables x and y and let and be the eigenvalues of the corresponding Hessian matrix of f, then + = trace (H) and = det(H)
1 2 1 2 1 2

Example4 For the previous example, the critical point of f is given by f ( x, y ) = (2 x 3 y 3 x 3 y ) = 0 , whence we have ( x, y ) = (0, 0) 2 3 Thus, H = . The eigenvalues of H are 3 0 ( , ) = (1 + 10 , 1 10 ) Hence + = 2 =trace(H) and = 9 = det(H)
2 1 2 1 2 1 2

4.3. Functions of one Variable. Definitions. Suppose f ( x) is a real-valued function defined on some interval (The interval I may be finite or infinite, open or closed, or half-open.). A point x in I is: a global minimizer for f ( x) on I if (a) f ( x ) f ( x) for all x in I; a stritct global minimizer for f ( x) on I if (b) f ( x ) < f ( x) for all x in I such that x x ; (c) a local minimizer for f ( x ) if there is a positive number such that f ( x ) f ( x) for all x in I for which x < x < x + ; a strict local minimizer for f ( x ) if there is a (d) positive number such that f ( x ) < f ( x) for all x in I for which x < x < x + and x x ;
*

(e) a critical point of f ( x) if f ( x ) exists and is equal to zero.


*

The Taylors formula (single variable) Theorem1. Suppose that f ( x) , f ( x) , f ( x) exist on the closed interval [a, b] = {x R : a x b}.If x , x are any two different points of [a, b], then there exists a point z strictly between x and x such that
*

f ( x) = f ( x ) + f ( x )( x x ) +
* * *

f ( z ) (x x ) . 2
* 2

If x is the critical point of f ( x) then the above formula reduces to f ( x) = f ( x ) + 0 +


*

f ( z ) (x x ) , 2
* 2 * 2

or f ( x) f ( x ) =
*

for all x x .
*

f ( z ) (x x ) 2

Theorem 2. Suppose that f ( x) , f ( x) , f ( x) are all continuous on an interval I and that x I is a critical point of f ( x) . If f ( x) 0 for all x I , then x is a global (a) minimizer of f ( x) on I .
*
*

(b) If f ( x) > 0 for all x I , such that x x , then x is a strict global minimizer of f ( x) on I . If f ( x) > 0 , then x is a strict local minimizer (c) of f ( x) .
* * *

Example5. Consider f ( x) = 3x 4 x + 1. Since f ( x) = 12 x 12 x = 12 x ( x 1), the only critical points of f ( x) are x = 0 and x = 1. Also, since f ( x) = 36 x 24 x = 12 x(3x 2) we see that f (0) = 0 and f (1) = 12 . Therefore, x = 1 is a strict local minimizer of f ( x) Definition. Suppose that f ( x) be a numerical function defined on a subset D of R . A point x in D is (i) a global minimizer for f ( x) on D if f ( x ) f ( x) for all x D ; (ii) a strictly global minimizer for f ( x) on D if f ( x ) < f ( x) for all x D such that x x ; (iii) a local minimizer for f ( x) if there is a positive number such that f ( x ) f ( x) for all x D for which x B( x , ) ; (iv) a strictly local minimizer for f ( x) if there is a positive number such that f ( x ) < f ( x) for all x D for which x B ( x , ) and x x ; (v) a critical point for f ( x) if the first partial derivatives of f ( x) exist at x and
4 3 3 2 2 2 n

f ( x ) x = 0 for i = 1, 2 ,K, n ,
i

that is f ( x ) = 0 Example 6. Consider f ( xy ) = 40 + x ( x 4) + 3( y 5) . f = (4 x 12 x , 6( y 5)) . f = 0 gives two critical points ( x, y ) = {(3,5), (0,5)}
3 2 3 2

4.4. Taylors Formula (several variables). Theorem 5. Suppose that x , x are points in R and that f ( x) is a function of n variables with continuous first and second partial derivatives on some open set containing the line segment [x , x] = {w R : w = x + t ( x x ) ; 0 t 1}joining x and x . Then there exists a z [x , x ]such that
n n

1 f ( x) = f ( x ) + f ( x ) ( x x ) + ( x x ) Hf ( z )( x x ) 2 Here H denotes the Hessian Matrix This formula is used to develop tests for maximizers and minimizers among the critial points of a function. Theorem 6. Suppose that x is a critical point of a function f ( x) with continuous first and second partial derivatives on R . Then: (a) x is a global minimizer for f ( x) if ( x x ) Hf ( z )( x x ) 0 for all x R and all z [ x , x] ;
n * n *

(b) x is a strict global minimizer for f ( x ) if ( x x ) Hf ( z )( x x ) > 0 for all x R such that x x and all z [ x , x] ; (c) x is a global maximizer for f ( x) if ( x x ) Hf ( z )( x x ) 0 for all x R and all z [ x , x] ; (d) x is a strict global maximizer for f ( x ) if ( x x ) Hf ( z )( x x ) < 0 for all x R such that x x and all z [ x , x] .
n * * * n * * n * *

4.5. Quadratic Forms: We have already observed that the Hessian Hf ( x) of a function f(x) of n variables with continuous first and second partial derivatives is an n n -symmetric matrix. Any n n -symmetric matrix A determines a function Q ( y ) on R called the quadratic form associated with A.
n A

Example 6. If A is the 3 3 -symmetric matrix


2 1 2 A = 1 3 0 2 0 5 then the quadratic form Q ( y) associated with A is 2 1 2 y Q ( y ) = y Ay = y 1 3 0 y 2 0 5 y = ( y , y , y ) ( 2 y y + 2 y , y + 3 y ,2 y + 5 y ) =2 y + 3y + 5y 2 y y + 4 y y
A

In general, Q ( y ) is a sum of terms of the form c y y where i, j = 1,K, n and c is the coefficient which may be zero, that is, every term in Q ( y ) is of second degree in the variables y , y ,K, y . On the other hand, any function q ( y ,K, y ) that is the sum of second-degree terms in y , y ,K, y can be expressed as the quadratic form associated with an n n -symmetric matrix A by splitting the coefficient of y y between the (i, j ) and ( j , i ) entries of A.
A ij i j ij A

Example 7(a). The function q( y , y , y ) = y y + 4 y 2 y y + 4 y y is the sum of second degree terms in y , y , y . Splitting the coefficients of y y , we get q( y , y , y ) = y y + 4 y y y y y + 0 y y
2 2 2 1 2 3 1 2 3 1 2 2 1 2 3
i j

+ 0 y y +2 y y + 2 y y
1 3 2 3 2 2 2 2 1 2 3 1 2

= y y + 4 y y y y y + 0 y y + 0 y y +2 y y + 2 y y 1 1 0 y = ( y , y , y ) 1 1 2 y 0 2 4 y where d d 1 1 0 d A = d d d = 1 1 2 0 2 4 d d d with
2 1 1 3 3 1 2 3 3 1 1 2 3 2 3 11 12 13 21 22 23 31 32 33

d = coefficient of y 1 d or d = (coefficient of y y or y ) , i j . 2 (b) q ( y , y , y ) = y + 3 y + y


2
ii i ij ji i j ji

= y + 3 y + y + 0. y y + 0. y y + 0. y y + 0. y y 0. y y + 0. y y
2 2 2 1 2 3 1 2 1 2 1 3 1 3 2 3 2

1 0 ( y y y ) 0 3 0 0 (c) q ( y , y , y ) = ( y 2 y ) + y = y + 4y + y 4y
1 2 3 2 2 1 2 3 1 2 1 2 3 2 2 2 3 2 2 2 1 2 3 1 2 1 2 1

0 y 0 y 1 y
y
3 2

= y + 4 y + y 2 y y 2 y y + 0. y y + 0. y y
1

+ 0. y y + 0. y y
2 3 2

= (y

1 2 0 y y ) 2 4 0 y 0 1 0 y
1 3 2 3

The Hessian H = Hf ( z ) of f ( x) evaluated at a point z is an n n -symmetric matrix. For x, x R , the quadratic form Q ( y ) associated with H evaluated at x x is Q ( x x ) = ( x x ) Hf ( z )( x x )
*
n
H

4.6. Positive and Negative Semidefinite and Definiteness Definitions. Suppose that A is an n n -symmetric matrix and that Q ( y ) = y Ay is the quadratic form associated with A. Then A and Q are called: (a) positive semidefinite if Q ( y ) = y Ay 0 for all yR ; (b) positive definite if Q ( y ) = y Ay > 0 for all y R , y 0; (c) negative semidefinite if Q ( y ) = y Ay 0 for all y R , y 0; (d) negative definite if Q ( y ) = y Ay < 0 for all y R , y 0; (e) indefinite if Q ( y ) = y Ay > 0 for some y R and Q ( y ) < 0 for other y R .
A A A n A n A

With this terminology established, we can now reformulate Theorem 6 as follows: Theorem 7. Suppose x is a critical point of a function f ( x) with continuous first and second partial derivatives on R and that Hf ( x) is the Hessian of f ( x) . Then x is: (a) x is a global minimizer for f ( x) if Hf ( x) is positive semidefinite on R ; (b) x is a strict global minimizer for f ( x ) if Hf ( x) is positive definite on R ; (c) x is a global maximizer for f ( x) if Hf ( x) is negative semidefinite on R
n

(d) x is a strict global maximizer for f ( x) if Hf ( x) is negative definite on R .


n

Here are some examples. Example 8. (a) A symmetric matrix whose entries are all positive need not be positive definite. For example, the matrix 1 4 A= 4 1 is not positive definite. For if x = (1,1) , then

1 4 1 3 Q ( x) = (1,1) = (1,1) = 6 < 0. 4 1 1 3


A

(b) A symmetric matrix with some negative entries may be positive definite. For example, the matrix 1 1 A= 1 4 corresponds to the quadratic form
Q ( x) = x Ax = x 2 x x + 4 x = ( x x ) + 3x is always positive if x = ( x , x ) (0,0) .
2
A

2 2

(c) The matrix

1 0 0 A = 0 3 0 0 0 2 is positive definite because the associated quadratic form Q ( x) is


A

Q ( x) = x Ax = x + 3 x + 2 x
2 2
A

2 3

and so Q ( x) > 0 unless x = x = x = 0 .


A

(d) A 3 3 -diagonal matrix


d A= 0 0
i

0 d
2

0 0 d
3

is (1) positive definite if d > 0 for i = 1,2,3 ; (2) positive semidefinite if d 0 for i = 1,2,3 ; (3) negative definite if d < 0 for i = 1,2,3 ; (4) negative definite if d 0 for i = 1,2,3 ; (5) indefinite if at least one d is positive and at least one d is negative for i = 1,2,3 .
i i i i i

For example, in case (2), if d > 0, d > 0, d = 0, then Q ( x) = d x + d x 0 for all x 0 since d > 0, d > 0 , but if x = (0,0,1) , then Q ( x) = 0 even though x 0 .
1 2 3 2 2
A

(e) If a 2 2 -symmetric matrix a b A= b c is positive definite , then a > 0 and c > 0 . For if x = (1, 0) , then x 0 and so 0 < Q ( x) = a.1 + 2b.0.0 + c.0 = a. Similarly, if x = (0,1) , then 0 < Q ( x) = c. However, (a)shows that there are 2 2 -symmetric matrices with a > 0, c > 0 that are not positive definite. We can see that the size of b relative to the size of the product ac is the determining factor for positive definiteness.
2 2
A A

The examples show that for general symmetric matrices there is little relationship between the signs of the matrix entries and the positive or negative definite features of the matrix. They also show that for diagonal matrices, these features are completely transparent. Here are some examples of positive definite, positive semidefinite, negative definite and negative semi-definite matrices in real field: Example 9.
2 1 0 (a) A = 1 2 1 , X = ( x 0 1 2
T

>0

X AX = ( x + x ) + ( x + x ) + x + x > 0 .
2 2 2 2 1 2 2 3 1 3

The matrix A is PD
25 15 5 (b) A = 15 18 0 , X = ( x 5 0 11
T

>0

X AX = 25 x + 18 x + 11x > 0 . The matrix A is PD


2 2 2 1 2 3

1 1 1 1 1 5 5 5 , X = (x x x x ) > 0 (c) A = 1 5 14 14 1 5 14 15 X AX = ( x + x + x + x ) + 4( x + x + x ) . + 9( x + x ) + x > 0 The matrix A is PD


T

2 1 0 0 1 2 1 0 (d) A = , X = (x 0 1 2 1 0 0 1 1
T

>0

X AX = x + ( x x ) + ( x x ) + ( x x ) > 0 . The matrix A is PD


2 2 2 2 1 1 2 2 3 3 4

(e) q ( x) = x + x , x R , for any nonzero x R and x = x = 0 , 0 (0 0 x ) R gives q ( x) = x + x = 0 1 0 0 The matrix A = 0 1 0 is PSD 0 0 0 (f) q ( x) = ( x + 3 x + x ) 0 x 1 0 = ( x x x ) 0 3 0 x . 0 0 1 x
2 2 3 1 2 3 3 1 2 3 2 2 1 2 2 2 2 1 2 3 1 1 2 3 2 3

1 The matrix A = 0 0 (g) q ( x) = (( x 2 x


1

= (x

0 3 0 is ND. 0 1 ) +x ) 0 x 1 2 x ) 2 4 0 x 0 1 0 x 0
2 2 2 3 3

0 1 2 The matrix A = 2 4 0 is NSD 0 0 1 We will develop two basic tests for positive and negative definiteness-one in terms of determinant, and another in terms of eigenvalues.

4.7. Determinant Approach We begin by looking at functions of two variables. If A is a 2 2 -symmetric matrix a a A= a a then the associated quadratic form is Q ( x) = x. Ax = a x + 2a x x + a x . For any x 0 in R , either x = ( x ,0) with x 0 x = ( x , x ) with x 0 . Let us analyze the sign of Q ( x) in terms of entries of A in each of these two cases Case 1. x = ( x ,0) with x 0 . In this case, Q ( x) = a x so Q ( x) > 0 if and only if a > 0 , while Q ( x) < 0 if and only if a < 0 .
11 12 12 22 2 2
A

11

12

22

11

Case 2. x = ( x , x ) with x2 0. In this case, x = tx for some real number t and Q ( x) = [a t + 2a t + a ]x = (t ) x ,


1 2 1 2 2 2 2

11

11

where (t ) = a t + 2a t + a .Since x2 0, we see that Q ( x) > 0 for all such x if and only if (t ) > 0 for all t R .
2 11 12 22
A

11

12

22

Note that (t ) = 2a t + 2a , (t ) = 2a , so that t = a / a is a critical point of (t ) and this critical point is a strict minimizer if a > 0 and strict maximizer if a < 0 . If a > 0 and if t R , then
11 12 11 * 12 11 11 11 11

. Thus, if a > 0 and 1 1 a a = (a a a ) = a a a a a a det > 0 , then (t ) > 0 for all t R and so a a Q ( x) > 0 for all x = ( x , x ) with x2 0.On the other hand, if Q ( x) > 0 for all such x, then (t ) > 0 for all t R and so a > 0 and the discriminant of (t ) a a det 4a 4a a = 4 a a is negative , a a that is a > 0 and det > 0 . An entirely similar a a
11 11 11 2 11 12 11 22 12 11 11 12 22 11 12 12 22
A

a a (t ) (t ) = ( ) = + a a a
* 12 12

22

11

11

12

12

11

22

12

22

11

12

11

analysis shows that Q ( x) < 0 for all x = ( x , x ) with x2 0 a a if and only if a < 0 and det > 0 . This proves the a a following results:
A

12

22

11

12

11

12

22

Theorem.8. A 2 2 -symmetric matrix a a A= is


11 12

a12

22

a (a) positive definite if and if a > 0 , det a


11

11

12

a a

12

22

> 0;

(b) negative definite if and only if a < 0 , a a det > 0 . The 2 2 case and a little imagination a a suggest the correct formulation of the general case. Suppose A is an n n -symmetric matrix. Define to be the determinant of the upper left-hand corner k k -submatrix of A for 1 k n .The determinant is called the kth principal minor of A
11 11 12 12 22
k k

a a A = a M a

11

a a

12

a a

13

12

22

23

13

a M a

23

a M a

33

1n

2n

3n

L a L a L a M M L a
1n 2n 3n nn

a = a , = det a
1 11 2

11

12

a a

12

22

, L , = det A .
n

The general theorem can be formulated as follows: Theorem.9. If A is an n n -symmetric matrix and if is the kth principal minor of A for 1 k n , then:
k

(a) A is positive definite if and only if > 0 for k=1, 2 , ... , n;


k

(b) A is negative definite if and only if (1) > 0 for k=1, 2 , ... , n(that is, the principal minors alternate in sign with < 0 ).
k k 1

Mathematical induction can be used to establish this result. However, the formal inductive proof is somewhat complicated by the notation required for the step from n = k to n = k + 1.It is quite illuminating to show how this inductive step works from n = 2 (Theorem 8)to n = 3 , because this step lays bare the essential features of the general inductive step. Consequently, we include the proof of this special case at this point. Proof. For n=3. Suppose that
a A = a a
11

a a a

12

12

22

13

23

a a a
13 23 33 1 2 3

is a 3 3 -symmetric matrix and that x = ( x x x ) is a nonzero vector in R . Then one of the following two cases must hold: Either x = 0 or else x 0 and consequently x = tx , x = sx for some real numbers s , t.
3 3 3 2 3 1 3

Case 1 If x = 0 , then a brief computation shows that a a x x. Ax = ( x x ). a a x and ( x x ) (0 0) , so Theorem 8 shows that:
3 11 12 1 1 2 12 22 2 1 2

(a) x. Ax > 0 for all x 0 such that x = 0 if and only if > 0, > 0; (b) x. Ax < 0 for all x 0 such that x = 0 if and only if < 0, > 0.
3 3 1 2 3 3 1 2

Case 2 If x 0 and x = tx , x = sx for real numbers s , t , then


3 2 3 1 3

x. Ax = x (a s + a t + a + 2a st + 2a s + 2a t ). Consequently, since x 0 , it follows that x. Ax > 0 for all x 0 such that x 0 if and only if
2 2 2 3 11 22 33 12 13 23 3 3

( s, t ) = a s + a t + a + 2a st + 2a s + 2a t > 0
2 2 11 22 33 12 13 23

for real numbers s , t . In addition, x. Ax < 0 for all x 0 such that x 0 if and only if ( s, t ) < 0 for real numbers s , t .
3

The critical points of ( s, t ) are the solutions of the system

that is,

= 2a s + 2a t + 2a s 0= = 2a s + 2a t + 2a t a s + a t = a
0=
11 12 12 22 11 12 13

13

23

a s + a t = a
12 22

23

This system has unique solution ( s , t ) if and only if

a a = det a a 0, and this unique solution is given by Cramers Rule as a a a a 1 1 s = det a a , t = det a a . (1) if we multiply the equation a s +a t +a =0 by s , and multiply the equation a s +a t +a =0 by t and add the results, we obtain a ( s ) + a (t ) + 2a s t + a s + a t = 0 . Cosequently, (s , t ) = a s + a t + a , and so (1) implies that if a a a det A 1 a = . = 0, then ( s , t ) = det a a a a a (2) Since 2a 2a H ( s, t ) = det 2a 2a = 4 , it follows from Theorem 8 and Theorem 7 that ( s , t ) is a strict global minimizer for ( s, t ) if and only if > 0, > 0 . Similarly, ( s , t ) is a strict global maximizer for ( s, t ) if and only if < 0, > 0 . If > 0, > 0, > 0 , then the conclusion (a) of Case 1 shows that if x 0 and x = 0 , then x. Ax > 0 ; on the other
11 12 2 12 22 * 13 12

11

13

23

22

12

23

11

12

13

12

22

23

11

22

12

13

23

13

23

33

11

12

13

12

22

23

13

23

33

11

12

12

22

hand , the considerations in Case 2 show that if x 0, x 0, x = tx , x = sx , then x. Ax = x ( s, t ) x ( s , t ) = x > 0. Therefore x. AX > 0 for all x 0 if > 0, > 0, > 0 . On the other hand , if x. Ax > 0 for all x 0 , then the conclusion (a) of Case 1 shows that > 0, > 0 . Also, if x = ( s , t ,1) , then (2) yields = ( s , t ) = x Ax > 0, so > 0 . This proves part (a) of the theorem for n = 3
3 2 3 1 3 2 2 * * 2 3 3 3 3 2 1 2 3 1 2 * * * 3 * * * * 3 1

Example.9 (a) Minimize the function f (x , x , x ) = x + x + x x x + x x x x . The critical point of f ( x , x , x ) are the solutions of the system 2 x x x = 0,
2 2 2 1 2 3 1 2 3 1 2 2 3 1 3 1 2 3 1 2 3

x + 2 x + x = 0,
1 2 3

or

x + x + 2 x = 0. 2 1 1 x 0 1 2 1 x = 0 . 2 1 1 x 0
1 2 3 1 2 3

2 1 1 Sincedet 1 2 1 0 , x = 0, x = 0, x = 0 is the 2 1 1 one and only solution. The Hessian of f ( x , x , x ) is the constant matrix
1 2 3 1 2 3

2 1 1 Hf ( x , x , x ) = 1 2 1 1 1 2 Note that = 2, = 3, = 4, so Hf ( x , x , x ) is positive definite everywhere on R . It follows from Theorem 7 that the critical point (0, 0, 0) is a strict global minimizer for f ( x , x , x ) and this is the only one critical point of f ( x , x , x ) .
1 2 3 1 2 3 1 2 3 3 1 2 3 1 2 3

(b) Find the global minimizer of f ( x, y , z ) = e

x y

+e
x y

yx

+e + z .
x2 2
yx x2

e e 2 xe To this end, compute f ( x, y, z ) = e + e 2z and


x y yx

0 e e e + e + 4 x e + 2e Hf ( x, y, z ) = e +e 0 e e 0 0 2 Clearly, > 0 for all x, y, z because all the terms of it are positive. Also
x y yx 2 x2 x2 x y yx x y yx x y yx 1

= ( e + e ) + (e
2 x y yx 2

x y

+ e )(4 x e + 2e ) (e
yx 2 x2 x2 2 x2 x2

x y

+e )
yx

= (e

x y

+ e )(4 x e + 2e ) >0
yx

because both the factors are always positive. Finally, = 2 > 0 . Hence Hf ( x, y, z ) is positive definite at all points. Therefore by Theorem 7, f ( x, y, z ) is strictly globally minimized at any critical point ( x , y , z ). To find ( x , y , z ) , solve
3 2 * * * * * *

e e 2x e f ( x , y , z ) = e + e =0 2 z This leads to z = 0, e = e , hence 2 x e = 0 . Accordingly, x y = y x ; that is, x = y and x = 0 . Therefore ( x , y , z ) = (0, 0, 0) is the global minimizer of f ( x, y, z ) .
x* y* y* x* * x*
2

x* y*

y* x*

x* y *

y * x*

( x* ) 2

(c) Find the global minimizers of

f ( x, y ) = e + e . To this end, compute e e f ( x , y ) = e + e and


x y yx x y x y

yx

yx


x y yx

e +e Hf ( x, y ) = e e
x y x y x y yx

yx

yx

e e e +e
x y yx

Since e + e > 0 for all x, y and det Hf ( x, y ) = 0 , the Hessian Hf ( x, y ) is positive semidefinite for all x, y. Therefore by Theorem 7, f ( x, y ) is minimized at any critical point ( x , y ) of f ( x, y ) .
* *

To find ( x , y ) , solve
e 0 = f ( x , y ) = e This gives
* *
x* y *

y* x8

x* y *

+e

y * x*

e or that is

x* y *

=e
*

y* x8

x y = y x ;
* * * * *

2x = 2 y . This shows that all points of the line y = x are the global minimizers of f ( x, y ) .

(d) Find the global minimizes of f ( x, y ) = e In this case,


x y x+ y

x y

+e .
x+ y

e +e f ( x , y ) = e + e e +e e +e Hf ( x, y ) = . e e e e + + Since e + e > 0 for all x, y and det Hf ( x, y ) > 0 , then Hf ( x, y ) is positive definite for all x, y. Therefore by Theorem 7, f ( x, y ) is minimized at any critical point ( x , y ) . To find ( x , y ) , e +e 0 = f ( x , y ) = . e + e Thus e + e =0 and e +e = 0. > 0 and e > 0 for all x , y . Therefore the But e equality e + e = 0 is impossible. Thus f ( x, y ) has no critical points and hence f ( x, y ) has no global minimizers.
x y x+ y x y x+ y x y x+ y x y x+ y x y x+ y x y x+ y * * * * x* y* x8 + y*

x* y *

x8 + y*

x* y *

x8 + y*

x* y *

x8 + y*

x* y *

x* + y *

x* y *

x8 + y*

There is no disputing that global minimization is far more important than mere local minimization. Still there are certain situations in which scientists want knowledge of local minimizers of a function. Since we are in an excellent

position to understand local minimization, let us get on with it. The basic fact to understand is the next theorem. Theorem.10. Suppose that f ( x) is a function with continuous first and second partial derivatives on some set D in R . Suppose x is an interior point of D and that x is a critical point of f ( x) . Then x is:
n

(a) a strict local minimizer of f ( x ) if Hf ( x ) is positive definite; (b) a strict local maximizer of f ( x ) if Hf ( x ) is negative definite.
*

Now we briefly investigate the meaning of an indefinite Hessian at a critical point of a function. Suppose that f ( x) has continuous second partial derivatives on a set D in R , that x is an interior point of D which is a critical point of f ( x) , and that Hf ( x ) is indefinite. This means that there are nonzero vectors y, w in R such that y.Hf ( x ) y > 0, w.Hf ( x ) w < 0. Since f ( x) has continuous second partial derivatives on D, there is an > 0 such that for all t with t <. But then if Y(t), W(t) are defined Y (t ) = f ( x + ty ), W (t ) = f ( x + tw),
n

then Y (0) = 0 = W (0) and Y (0) = y.Hf ( x ) y > 0, W (0) = w.Hf ( x ) w < 0 .
* *

Therefore, t=0 is a strict local minimizer for Y(t) and a strict local maximizer for W(t). Thus, if we move from x in the direction of y or y, the values of f ( x) increase, but if we move from x in the direction of w or w, the values of f ( x) decrease. For this reason, we call the critical point x a saddle point.
*

The following result summerizes this little discussion: Theorem.11. If f ( x) is a function with continuous second partial derivatives on a set D in R , if x is an interior point of D that is a critical point of f ( x) , and if the Hessian Hf ( x ) is indefinite, then x is a saddle point for f ( x) .
n

Example.10. Let us look for the global and minimizers and maximizers (if any) of the function f ( x , x ) = x 12 x x + 8 x . In this case, the critical points are the solutions of the system f 0= = 3 x 12 x , x
3 3 1 2 1 1 2 2 2 1 2 1

0=

f = 12 x + 24 x . x
2 1 2 2

This system can be readily solved to identify the critical points (2,1) and (0,0). The Hessian of f ( x , x ) is
1 2

12 6x Hf ( x , x ) = 12 48 x .
1 1 2 2

12 12 Hf (2,1) = 12 48 and since = 12 and = 432 , it follows that the critical point (2,1) is a strict local minimizer. Now let us see whether (2,1) is a global minimizer. Observe that Hf ( x , x ) is not positive definite for all ( x , x ) ; for example, 12 0 Hf (0,1) = 12 48 is indefinite. In view of Theorem 7, this leads us to suspect that (2,1) may not be a global minimizer. The fact that lim f ( x ,0) =
1 2 1 2 1 2
x1

Since

shows conclusively that f ( x , x ) has no global minimizer. Moreover, since lim f ( x ,0) = + ,
1 2

x1 +

we see that there are no global maximizers or global minimizers. How about the critical point (0, 0)? Well, since 12 0 Hf (0,0) = , 12 0 this matrix miserably fails the tests for positive definiteness and this leads us to expect trouble at (0, 0).The Theorem 11 tells us that there is a saddle point at (0, 0).

4.8. Eigenvalues and Positive Definite Matrices If A is an n n -matrix and if x is a nonzero vector in R such that Ax = x for some real or complex number , then is called an eigenvalue of A. If is an eigenvalue of A, then any nonzero vector x that satisfies the equation Ax = x is an eigenvector of A corresponding to . Since is an eigenvalue of an n n -matrix A if and only if the homogeneous system ( A I ) x = 0 of n equations in n unknowns has a nonzero solution x, it follows that the eigenvalues of A are just the roots of the characteristic equation det( A I ) = 0 . Since det( A I ) is a polynomial of degree n in , the characteristic equation has n real or complex roots if we count the multiple roots according to heir multiplicities, so an n n -matrix A has n real or complex eigenvalues counting their multiplicities.
n

Symmetric matrices have the following special properties with respect to eigenvalues and eigenvectors: (1) All of the eigenvalues of a symmetric matrix are real numbers. (2) Eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are orthogonal. (3) If is an eigenvalue of multiplicity k for a symmetric matrix A, there are k linearly independent eigenvectors corresponding to . By applying the Gram-Schmidt Orthogonalization Process, we can always replace these k linearly independent eigenvectors with a set of k mutually orthogonal eitgenvectors of unit length.

Combining (2) and (3), we see that if $A$ is an $n \times n$ symmetric matrix, there are $n$ mutually orthogonal unit eigenvectors $u^{(1)}, \ldots, u^{(n)}$ corresponding to the $n$ eigenvalues $\lambda_1, \ldots, \lambda_n$. If $P$ is the $n \times n$ matrix whose $i$-th column is the unit eigenvector $u^{(i)}$ corresponding to $\lambda_i$, and if $D$ is the diagonal matrix with the eigenvalues $\lambda_1, \ldots, \lambda_n$ down the main diagonal, then the following matrix equation holds:
$$AP = PD,$$
because $Au^{(i)} = \lambda_i u^{(i)}$ for $i = 1, \ldots, n$. Since the matrix $P$ is orthogonal (that is, its columns are mutually orthogonal unit vectors), $P$ is invertible and the inverse $P^{-1}$ of $P$ is just the transpose $P^T$ of $P$. It follows that $P^T A P = D$, that is, the orthogonal matrix $P$ diagonalizes $A$. If $Q(x) = x \cdot Ax$ is the quadratic form associated with the symmetric matrix $A$ and if $x = Py$, then
$$Q(x) = x \cdot Ax = (Py)^T A (Py) = y^T P^T A P y = y^T D y = \lambda_1 y_1^2 + \lambda_2 y_2^2 + \cdots + \lambda_n y_n^2.$$
Moreover, since $P$ is invertible, $x \neq 0$ if and only if $y \neq 0$. Also, if $y^{(i)}$ is the vector in $R^n$ with the $i$-th component equal to 1 and all other components equal to zero, and if $x^{(i)} = P y^{(i)}$, then $Q(x^{(i)}) = \lambda_i$ for all $i = 1, \ldots, n$. These considerations yield the following eigenvalue test for definite, semidefinite, and indefinite matrices.

Theorem 12. If $A$ is a symmetric matrix, then: (a) the matrix $A$ is positive definite (resp. negative definite) if and only if all the eigenvalues of $A$ are positive (resp. negative); (b) the matrix $A$ is positive semidefinite (resp. negative semidefinite) if and only if all the eigenvalues of $A$ are nonnegative (resp. nonpositive); (c) the matrix $A$ is indefinite if and only if $A$ has at least one positive eigenvalue and at least one negative eigenvalue.
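A short illustrative sketch of the eigenvalue test of Theorem 12, assuming NumPy; the matrix used is the Hessian that appears in Example 11 below.

```python
import numpy as np

def classify(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues (Theorem 12)."""
    eig = np.linalg.eigvalsh(A)          # real eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig >= -tol):
        return "positive semidefinite"
    if np.all(eig <= tol):
        return "negative semidefinite"
    return "indefinite"

A = np.array([[2.0, -4.0, 0.0],
              [-4.0, 2.0, 0.0],
              [0.0, 0.0, 2.0]])          # Hessian from Example 11 below
print(np.linalg.eigvalsh(A))             # [-2.  2.  6.]
print(classify(A))                       # indefinite
```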

Example 11. Let us locate all maximizers, minimizers, and saddle points of
$$f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 - 4 x_1 x_2.$$
The critical points of $f(x_1, x_2, x_3)$ are the solutions of the system of equations
$$0 = \frac{\partial f}{\partial x_1} = 2x_1 - 4x_2, \qquad 0 = \frac{\partial f}{\partial x_2} = -4x_1 + 2x_2, \qquad 0 = \frac{\partial f}{\partial x_3} = 2x_3.$$
$(0,0,0)$ is the one and only solution of this system. The Hessian of $f(x_1, x_2, x_3)$ is the constant matrix
$$Hf(x_1, x_2, x_3) = \begin{pmatrix} 2 & -4 & 0 \\ -4 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$
The eigenvalues of the Hessian matrix are $\lambda = -2, 6, 2$, so the Hessian is indefinite at the critical point $(0,0,0)$ and hence it is a saddle point for $f(x_1, x_2, x_3)$.

4.9. Problems Posed as Minimization Problems
(a) Consider the system of equations $Ax = b$ with $A$ of order $(m, n)$ and $m > n$. The problem becomes that of finding $x$ which minimizes $\|Ax - b\|^2$; this is a quadratic function of $x$, and hence the vector $g = \nabla f(x) = 2A^T(Ax - b)$ containing the first-order partial derivatives is linear in $x$. The solution is found to be that of the normal equations
$$A^T A x = A^T b.$$
Note that the Hessian of $f$ is $2 A^T A$, which is seen to be a positive definite matrix when the rank of $A$ is $n$, meaning that the problem is posed as a minimization problem.
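As a rough illustration of part (a), the sketch below (not from the notes; it assumes NumPy and uses randomly generated data) solves the normal equations and checks that the Hessian $2A^T A$ is positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))             # m = 8 equations, n = 3 unknowns
b = rng.normal(size=8)

x = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations  A^T A x = A^T b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True

H = 2 * A.T @ A                         # Hessian of ||Ax - b||^2
print(np.all(np.linalg.eigvalsh(H) > 0))   # positive definite when rank(A) = n
```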

(b) Another example is the least-squares method for fitting the straight line $y = ax + b$ to the set of points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. This problem becomes that of finding the constants $a$ and $b$ which minimize the function
$$f(a, b) = \sum_{i=1}^{n} (y_i - a x_i - b)^2.$$
They are given by
$$\frac{\partial f}{\partial a} = 0 = 2 \sum_{i=1}^{n} (y_i - a x_i - b)(-x_i) \qquad \text{and} \qquad \frac{\partial f}{\partial b} = 0 = 2 \sum_{i=1}^{n} (y_i - a x_i - b)(-1).$$
The above equations can be organized in the convenient form
$$\begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} \begin{pmatrix} b \\ a \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}.$$
We can easily justify that the matrix on the left-hand side is nonsingular, because
$$n \sum_{i=1}^{n} x_i^2 > \Big( \sum_{i=1}^{n} x_i \Big)^2$$
when the $x_i$ are not all equal.

[ Hölder's inequality:
$$\sum_{i=1}^{n} a_i b_i \le \Big( \sum_{i=1}^{n} a_i^p \Big)^{1/p} \Big( \sum_{i=1}^{n} b_i^q \Big)^{1/q}, \qquad \frac{1}{p} + \frac{1}{q} = 1.$$
Letting $a_i = x_i$, $b_i = 1$, and $p = q = 2$, we get
$$\sum_{i=1}^{n} x_i \le \Big( \sum_{i=1}^{n} x_i^2 \Big)^{1/2} (n)^{1/2}, \qquad \text{i.e.,} \qquad \Big( \sum_{i=1}^{n} x_i \Big)^2 \le n \sum_{i=1}^{n} x_i^2. \; ]$$

The matrix on the left-hand side is also (up to a factor of 2) the Hessian matrix of the function $f(a, b)$; this being positive definite means that we are again dealing with a minimization problem. Many other nonlinear functions which seem difficult to deal with, except by nonlinear techniques, can be dealt with using the same technique explained above if they can be formulated as a quadratic objective function. For instance, for a model of the form $y = a e^{bx}$, the procedure is to take the logarithm of both sides and formulate the problem as
$$\min_{a, b} \; \sum_{i=1}^{n} (\ln y_i - \ln a - b x_i)^2.$$
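A minimal sketch of the straight-line fit just described, assuming NumPy and using made-up data points; it solves the $2 \times 2$ system above directly.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])      # roughly y = 2x + 1
n = len(x)

# Normal equations [[n, sum x], [sum x, sum x^2]] [b, a]^T = [sum y, sum x*y]^T
M = np.array([[n, x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
b, a = np.linalg.solve(M, rhs)
print(a, b)                                  # slope ~ 2.0, intercept ~ 1.0

# M is (half) the Hessian of f(a, b); it is positive definite because
# n * sum(x_i^2) > (sum x_i)^2 when the x_i are not all equal.
print(np.linalg.eigvalsh(M))
```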

If the function is not quadratic, the above method fails, for $g(x, y) = \nabla f(x, y) = 0$ generates a set of nonlinear simultaneous equations. The procedure will then be to choose a guess point $x_0$ and improve on it until the solution is reached. The procedure is as follows. Expand $f(x)$ around $x_0$ by Taylor's series:
$$f(x) = f(x_0) + g_{x_0}^T (x - x_0) + \frac{1}{2} (x - x_0)^T H_{x_0} (x - x_0) + o(\|x - x_0\|^2).$$
Now $\hat{x}$ is defined as the point at which
$$\left. \frac{\partial f}{\partial x} \right|_{\hat{x}} = 0.$$
Differentiating the Taylor expansion with respect to $x$ (or $x - x_0$) gives
$$0 \approx g_{x_0} + H_{x_0} (\hat{x} - x_0),$$
where the higher-order term is neglected if $x_0$ is rightly chosen near $\hat{x}$. Hence
$$\hat{x} \approx x_0 - H_{x_0}^{-1} g_{x_0},$$
where the vector $y = H_{x_0}^{-1} g_{x_0}$ is obtained by solving the linear equations $H_{x_0} y = g_{x_0}$. Because the above correction for $\hat{x}$ is only approximate, the solution $\hat{x}$ can be improved by iteration.
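A minimal sketch of this Newton iteration, assuming NumPy; the test function and starting point are chosen only for illustration and are not from the notes. Each step solves $H_{x_0} y = g_{x_0}$ rather than forming the inverse explicitly.

```python
import numpy as np

# Illustrative function: f(x1, x2) = x1**4 + x1*x2 + (1 + x2)**2
def grad(x):
    return np.array([4 * x[0]**3 + x[1],
                     x[0] + 2 * (1 + x[1])])

def hess(x):
    return np.array([[12 * x[0]**2, 1.0],
                     [1.0, 2.0]])

x = np.array([1.0, 1.0])                 # initial guess x0
for _ in range(10):
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) < 1e-10:
        break
    y = np.linalg.solve(H, g)            # solve H y = g instead of inverting H
    x = x - y                            # Newton correction  x <- x - H^{-1} g
print(x, grad(x))                        # gradient is (numerically) zero at the end
```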

The convergence of the above method is guaranteed if $H$ after every iteration is found to be positive definite for a minimization problem, or negative definite for a maximization problem. For example, for a minimization problem,
$$f(\hat{x}) = f(x_0) + g_{x_0}^T (\hat{x} - x_0) + \frac{1}{2} (\hat{x} - x_0)^T H_{x_0} (\hat{x} - x_0)$$
$$= f(x_0) + g_{x_0}^T \{ -H_{x_0}^{-1} g_{x_0} \} + \frac{1}{2} (H_{x_0}^{-1} g_{x_0})^T H_{x_0} (H_{x_0}^{-1} g_{x_0})$$
$$= f(x_0) - g_{x_0}^T H_{x_0}^{-1} g_{x_0} + \frac{1}{2} g_{x_0}^T H_{x_0}^{-1} g_{x_0}$$
$$= f(x_0) - \frac{1}{2} g_{x_0}^T H_{x_0}^{-1} g_{x_0},$$
and since $H_{x_0}$ is positive definite, $H_{x_0}^{-1}$ is also positive definite. Hence
$$g_{x_0}^T H_{x_0}^{-1} g_{x_0} > 0.$$
From this we obtain $f(\hat{x}) < f(x_0)$, which proves the convergence of the method.

One main disadvantage of using the above method for calculating $\hat{x}$ is that $H$ must be calculated at each iteration and must also be found positive definite. It also requires the solution of a set of linear equations. Among older methods which require the evaluation of $H$ but which avoid the solution of linear equations are the descent methods. An illustration of those follows. If at the $i$-th iteration a correction $\lambda_i d_i$ is required for $x_i$, such that $x_{i+1} = x_i + \lambda_i d_i$, the direction of search $d_i$ is chosen to make $f(x_{i+1}) < f(x_i)$. As for the magnitude of search $\lambda_i$, it is obtained by minimizing $f(x_i + \lambda d_i)$.

Expanding $f(x_{i+1})$ around $x_i$, we obtain
$$f(x_{i+1}) \approx f(x_i) + \lambda_i (d_i)^T g_{x_i} < f(x_i),$$
i.e., $d_i$ should be chosen such that $(d_i)^T g_{x_i} < 0$. As for $\lambda_i$, it is obtained by minimizing the function
$$f(x_i + \lambda d_i) = f(x_i) + \lambda (d_i)^T g_{x_i} + \frac{1}{2} \lambda^2 (d_i)^T H_{x_i} d_i + o(\lambda^2),$$
which gives approximately
$$\lambda_i = -\frac{\langle d_i, g_{x_i} \rangle}{\langle d_i, H_{x_i} d_i \rangle}.$$
The function $f(x_{i+1})$ will then be less than $f(x_i)$ if $H_{x_i}$ is positive definite. Note that $\lambda_i$ could equally well be obtained from the geometric relation
$$0 = \langle d_i, g_{x_{i+1}} \rangle = \langle d_i, g_{x_i} + \lambda_i H_{x_i} d_i \rangle.$$
As the choice of $d_i$ is not unique, since it should only satisfy $(d_i)^T g_{x_i} < 0$, then in order not to leave it a matter of preference, $d_i$ is chosen simply as the downhill gradient vector
$$d_i = -g_{x_i},$$
i.e., the direction is chosen as the direction of steepest descent, and the method is then called the STEEPEST DESCENT METHOD.

In practice, the Steepest descent method usually improves f ( x) rapidly on the first few iterations and then gives rise to oscillatory progress and becomes unsatisfactory.

5. Numerical Optimization of Unconstrained Problem: Gradient Methods

5.1. Steepest Descent Method

Example 1. Let us compute the first three terms of the steepest descent sequence for
$$f(x_1, x_2) = 4x_1^2 - 4x_1 x_2 + 2x_2^2 \qquad \text{with initial point} \qquad x^{(0)} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}.$$
Note that $\nabla f(x_1, x_2) = (8x_1 - 4x_2, \; -4x_1 + 4x_2)$. Consequently,
$$\nabla f(2, 3) = (16 - 12, \; -8 + 12) = (4, 4) = g_{x^{(0)}}.$$
Let $d^0 = -g_{x^{(0)}} = (-4, -4)$, and
$$H = \nabla^2 f(x_1, x_2) = \begin{pmatrix} 8 & -4 \\ -4 & 4 \end{pmatrix}.$$
Hence
$$\lambda_0 = -\frac{\langle d^0, g_{x^{(0)}} \rangle}{\langle d^0, H d^0 \rangle} = -\frac{\langle (-4,-4), (4,4) \rangle}{\langle (-4,-4), (-16, 0) \rangle} = \frac{32}{64} = \frac{1}{2},$$
$$x^{(1)} = x^{(0)} + \lambda_0 d^0 = \begin{pmatrix} 2 \\ 3 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} -4 \\ -4 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$
$$f(x^{(0)}) = 4(2)^2 - 4(2)(3) + 2(3)^2 = 10 > 2 = f(x^{(1)}).$$
Next,
$$d^1 = -g_{x^{(1)}} = -\nabla f(0, 1) = -(-4, 4) = (4, -4),$$
$$\lambda_1 = -\frac{\langle d^1, g_{x^{(1)}} \rangle}{\langle d^1, H d^1 \rangle} = -\frac{\langle (4,-4), (-4,4) \rangle}{\langle (4,-4), (48,-32) \rangle} = \frac{32}{320} = \frac{1}{10},$$
$$x^{(2)} = x^{(1)} + \lambda_1 d^1 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \frac{1}{10}\begin{pmatrix} 4 \\ -4 \end{pmatrix} = \begin{pmatrix} 4/10 \\ 6/10 \end{pmatrix} = \begin{pmatrix} 2/5 \\ 3/5 \end{pmatrix},$$
$$f(x^{(2)}) = \frac{2}{5}, \qquad f(x^{(2)}) < f(x^{(1)}) < f(x^{(0)}).$$
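The iterates above can be reproduced with a few lines of code. This is a sketch only (not part of the original notes), assuming NumPy, and it uses the exact step length $\lambda_i = -\langle d_i, g_{x_i} \rangle / \langle d_i, H d_i \rangle$, which is valid for this quadratic.

```python
import numpy as np

H = np.array([[8.0, -4.0],
              [-4.0, 4.0]])              # Hessian of f(x, y) = 4x^2 - 4xy + 2y^2

def f(x):
    return 4 * x[0]**2 - 4 * x[0] * x[1] + 2 * x[1]**2

def grad(x):
    return np.array([8 * x[0] - 4 * x[1], -4 * x[0] + 4 * x[1]])

x = np.array([2.0, 3.0])
for i in range(3):
    g = grad(x)
    d = -g                               # steepest descent direction
    lam = -(d @ g) / (d @ H @ d)         # exact step for a quadratic
    x = x + lam * d
    print(i, lam, x, f(x))
# iteration 0: lam = 1/2,  x = (0, 1),     f = 2
# iteration 1: lam = 1/10, x = (0.4, 0.6), f = 0.4
```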

Another popular method which does not rely on solving a set of linear equations is the method of conjugate directions, described next.

5.2. Method of Conjugate Directions

A set of vectors $d_1, d_2, \ldots, d_n$ is called a set of conjugate directions with respect to a positive definite matrix $H$ if
$$\langle d_i, H d_j \rangle = 0, \qquad i \neq j.$$
From an initial approximation $x_1$, a direction of search is chosen as $d_1 = -g_1$, and $\lambda_1$ is obtained as before, to be
$$\lambda_1 = -\frac{\langle d_1, g_1 \rangle}{\langle d_1, H d_1 \rangle} = \frac{\langle g_1, g_1 \rangle}{\langle d_1, H d_1 \rangle}.$$
This determines a better approximation $x_2 = x_1 + \lambda_1 d_1$ of the solution. The second direction of search is chosen conjugate to $d_1$ as follows:
$$d_2 = -g_2 + \frac{\langle g_2, g_2 \rangle}{\langle g_1, g_1 \rangle} d_1.$$
The reason why $d_2$ is conjugate to $d_1$ is that
$$\langle d_1, H d_2 \rangle = \Big\langle d_1, H\Big(-g_2 + \frac{\langle g_2, g_2 \rangle}{\langle g_1, g_1 \rangle} d_1\Big)\Big\rangle = -\langle H d_1, g_2 \rangle + \frac{\langle g_2, g_2 \rangle}{\langle g_1, g_1 \rangle} \langle d_1, H d_1 \rangle$$
$$= -\Big\langle \frac{g_2 - g_1}{\lambda_1}, g_2 \Big\rangle + \frac{\langle g_2, g_2 \rangle}{\langle g_1, g_1 \rangle} \cdot \frac{\langle g_1, g_1 \rangle}{\lambda_1} \qquad (\because \; g_2 = g_1 + \lambda_1 H d_1)$$
$$= -\frac{\langle g_2, g_2 \rangle}{\lambda_1} + \frac{\langle g_1, g_2 \rangle}{\lambda_1} + \frac{\langle g_2, g_2 \rangle}{\lambda_1} = 0 \qquad (\because \; \langle g_1, g_2 \rangle = 0).$$
Minimizing next in the direction $d_2$ from $x_2$, we obtain
$$\lambda_2 = -\frac{\langle d_2, g_2 \rangle}{\langle d_2, H d_2 \rangle} = \frac{\langle g_2, g_2 \rangle}{\langle d_2, H d_2 \rangle}.$$
In general,
$$d_{i+1} = -g_{i+1} + \frac{\langle g_{i+1}, g_{i+1} \rangle}{\langle g_i, g_i \rangle} d_i \qquad \text{and} \qquad \lambda_{i+1} = \frac{\langle g_{i+1}, g_{i+1} \rangle}{\langle d_{i+1}, H d_{i+1} \rangle}.$$
To show that $d_3$ is conjugate to $d_2$ and $d_1$, we rely mainly on the property $\langle g_i, g_j \rangle = 0$ for $i \neq j$. This property holds because $g_1$ and $g_2$ are linear combinations of $d_1$ and $d_2$, and $g_3$ is orthogonal to both $d_1$ and $d_2$; indeed
$$\langle g_3, d_2 \rangle = \langle g_2 + \lambda_2 H d_2, d_2 \rangle = \langle g_2, d_2 \rangle + \lambda_2 \langle d_2, H d_2 \rangle = 0.$$
In working the above proof, it was assumed that the directions $d_1, d_2, \ldots, d_n$ form a set of conjugate directions with respect to a single positive definite Hessian $H$, whereas in general the Hessian varies from point to point. This is why the conjugate directions method works ideally for the case when $f(x)$ is quadratic, i.e., of the form
$$f(x) = a + b^T x + \frac{1}{2} (x \cdot Ax),$$
where $A$ is positive definite. In this case we reach the solution in exactly $n$ iterative steps.

Example 2. Find the minimizer of the function
$$f(x, y) = 4x^2 - 4xy + 2y^2.$$

Solution. The function can be rewritten as
$$f(x, y) = (x \;\; y) \begin{pmatrix} 4 & -2 \\ -2 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \nabla f(x, y) = (8x - 4y, \; -4x + 4y), \qquad H = \nabla^2 f(x, y) = \begin{pmatrix} 8 & -4 \\ -4 & 4 \end{pmatrix}.$$
Let us start with $x_1 = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$. Then
$$d_1 = -g_1 = -(16 - 12, \; -8 + 12) = (-4, -4),$$
$$\lambda_1 = -\frac{\langle d_1, g_1 \rangle}{\langle d_1, H d_1 \rangle} = -\frac{\langle (-4,-4), (4,4) \rangle}{\langle (-4,-4), (-16, 0) \rangle} = \frac{32}{64} = \frac{1}{2},$$
$$x_2 = x_1 + \lambda_1 d_1 = \begin{pmatrix} 2 \\ 3 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} -4 \\ -4 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix},$$
$$g_2 = \nabla f(0, 1) = (-4, 4), \qquad \text{or we can use } g_2 = g_1 + \lambda_1 H d_1.$$
$$d_2 = -g_2 + \frac{\langle g_2, g_2 \rangle}{\langle g_1, g_1 \rangle} d_1 = (4, -4) + \frac{32}{32}(-4, -4) = (0, -8),$$
$$\lambda_2 = -\frac{\langle d_2, g_2 \rangle}{\langle d_2, H d_2 \rangle} = -\frac{\langle (0,-8), (-4, 4) \rangle}{\langle (0,-8), (32, -32) \rangle} = \frac{32}{256} = \frac{1}{8},$$
$$x_3 = x_2 + \lambda_2 d_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} + \frac{1}{8}\begin{pmatrix} 0 \\ -8 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
$x = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ is the global minimizer of the given function.
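A compact sketch of the conjugate direction iteration on the same quadratic, assuming NumPy; it reproduces the two steps of Example 2 and is not part of the original notes.

```python
import numpy as np

H = np.array([[8.0, -4.0],
              [-4.0, 4.0]])                  # Hessian of f(x, y) = 4x^2 - 4xy + 2y^2

def grad(x):
    return H @ x                             # for this quadratic, grad f = Hx

x = np.array([2.0, 3.0])
g = grad(x)
d = -g
for i in range(2):                           # n = 2 steps suffice for a quadratic
    lam = (g @ g) / (d @ H @ d)              # exact line search
    x = x + lam * d
    g_new = grad(x)
    d = -g_new + (g_new @ g_new) / (g @ g) * d   # next conjugate direction
    g = g_new
    print(i, lam, x)
# step 0: lam = 1/2, x = (0, 1)
# step 1: lam = 1/8, x = (0, 0)  -- the global minimizer
```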

5.3. Quasi-Newton Method

Suppose that we are given $n$ functions $f_1, f_2, \ldots, f_n$ defined on a domain in $R^n$, where $x \in R^n$ is $x = (x_1, x_2, \ldots, x_n)$. The problem of solving the system of $n$ nonlinear equations $F(x) = (f_1, f_2, \ldots, f_n) = 0$, i.e.,
$$f_1(x_1, x_2, \ldots, x_n) = 0,$$
$$f_2(x_1, x_2, \ldots, x_n) = 0,$$
$$\cdots$$
$$f_n(x_1, x_2, \ldots, x_n) = 0,$$
means seeking the roots of $F(x) = 0$. We specify the norm in $R^n$ to be
$$\|x\| = \sum_{i=1}^{n} |x_i|.$$

The Jacobian matrix $J$ of the system has as its $ik$-th entry
$$J(x)_{ik} = \frac{\partial f_i(x_1, x_2, \ldots, x_n)}{\partial x_k}.$$
Newton's method is the iteration of the following linear system of equations:
$$J(x_n)(x_{n+1} - x_n) = -F(x_n).$$
The computation of $J(x_n)$ at each step may, in practice, be too expensive, so some approximation is substituted for $J(x_n)$. Methods of this type are called variously Newton-like methods (Dennis), quasi-Newton methods (Broyden), and update methods (Rheinboldt and Vandergraft). These methods are often considered in the context of optimization. The descent method and conjugate direction method have enabled one to avoid the solution of a set of linear simultaneous equations, which is necessary if one uses the formula
$$x_{i+1} = x_i - H(x_i)^{-1} g_i, \qquad (1)$$
where $H(x_i)$ is the Hessian matrix of second derivatives of $f$, so that (1) is just the Newton-Raphson technique for solving the system of equations
$$g(x) = 0. \qquad (2)$$
This method is unreliable; it will fail if, for example, $H(x_i)$ is singular or if $H(x_i)^{-1} g_i$ and $g_i$ are orthogonal. Many safeguards, including the use of a line search along $x_i + \lambda s_i$, where $s_i = -H(x_i)^{-1} g_i$, and the use of the steepest descent direction if $H(x_i)$ is singular, can be incorporated into Newton's method. Some of these modified Newton methods are described in detail by Wolfe (1978). The strategy of the quasi-Newton methods is that a sequence of positive definite symmetric matrices $H_i$ is generated and these are used to define the search directions by
$$s_i = -H_i g_i. \qquad (3)$$
The idea behind this is that the $H_i$'s are generated so that the search directions are similar to the steepest descent direction in the early steps and to the Newton direction in the final stages. Thus the $H_i$ can be generated as approximations to $H(x_i)^{-1}$ which, we hope, will improve as $i$ increases.

We now use
$$\delta_i = x_{i+1} - x_i \qquad \text{and} \qquad \gamma_i = g_{i+1} - g_i \qquad (4)$$
to denote the changes in position and gradient on the $i$-th iteration. Then if the objective function is the quadratic given by
$$f(x) = a + x^T w + \frac{1}{2} x^T G x, \qquad (5)$$
where $G$ is a positive definite symmetric matrix and $a \in R$ and $w \in R^n$ are constant, then we have
$$\gamma_i = G \delta_i. \qquad (6)$$
In this case the Hessian matrix is $G$, and so if $H_{i+1}$ is to be an approximation to $G^{-1}$ it is natural to ask that $H_{i+1}$ satisfy the quasi-Newton equation
$$H_{i+1} \gamma_i = \delta_i. \qquad (7)$$
We also want $H_{i+1}$ to be positive definite symmetric and to be obtained in a simple way from the information already available. The other objective of the quasi-Newton methods mentioned above is that they should generate conjugate directions when applied to a quadratic objective function. If the line searches are all performed exactly then all the requirements are satisfied, for a quadratic function, if the $H_i$'s are generated by the famous

5.3(a). Davidon-Fletcher-Powell (DFP) formula:
$$H_{i+1} = H_i + \frac{\delta_i \delta_i^T}{\delta_i^T \gamma_i} - \frac{H_i \gamma_i (H_i \gamma_i)^T}{\gamma_i^T H_i \gamma_i}. \qquad (8)$$

We summarize this in the following theorem.

Theorem. Let the objective function $f$ be given by (5) and let $H_0$ be a positive definite symmetric matrix. Let $s_i = -H_i g_i$ and let the $H_i$'s be generated by (8). If the line searches are performed exactly then (i) $H_i$ is positive definite symmetric for each $i$, (ii) the search directions are $G$-conjugate: $s_i^T G s_j = 0$ $(i \neq j)$, and (iii) if $g_i \neq 0$ $(i = 0, 1, \ldots, n-1)$ then $g_n = 0$, $x_n$ is the minimum point of $f$, and $H_n = G^{-1}$. (If $g_i = 0$ for some $i < n$ then of course $x_i$ is the minimum point.)

The DFP formula is just one from a large class of quasi-Newton formulas discovered by Broyden (1967). The general formula is given in Fletcher's (1970) parametrization by
$$H_{i+1} = H_i + \frac{\delta_i \delta_i^T}{\delta_i^T \gamma_i} - \frac{H_i \gamma_i (H_i \gamma_i)^T}{\gamma_i^T H_i \gamma_i} + \phi \, v_i v_i^T, \qquad (9)$$
where
$$v_i = (\gamma_i^T H_i \gamma_i)^{1/2} \left( \frac{\delta_i}{\delta_i^T \gamma_i} - \frac{H_i \gamma_i}{\gamma_i^T H_i \gamma_i} \right) \qquad (10)$$
and $\phi \ge 0$ is the free parameter.


Provided that $\phi \ge 0$, it can be shown that if $H_i$ is positive definite symmetric then so is any $H_{i+1}$ given by (9).

5.3(b). Broyden, Fletcher, Goldfarb and Shanno Formula
The particular formula which has the widest support so far is the BFGS formula due to Broyden (1970), Fletcher (1970), Goldfarb (1970) and Shanno (1970). This is obtained by putting $\phi = 1$ in (9):

$$v_i v_i^T = (\gamma_i^T H_i \gamma_i)\left[ \frac{\delta_i \delta_i^T}{(\delta_i^T \gamma_i)^2} - \frac{\delta_i (H_i \gamma_i)^T + (H_i \gamma_i)\delta_i^T}{(\delta_i^T \gamma_i)(\gamma_i^T H_i \gamma_i)} + \frac{H_i \gamma_i (H_i \gamma_i)^T}{(\gamma_i^T H_i \gamma_i)^2} \right],$$
so that, substituting in (9) with $\phi = 1$, the two terms in $H_i \gamma_i (H_i \gamma_i)^T$ cancel and we obtain
$$H_{i+1} = H_i + \left( 1 + \frac{\gamma_i^T H_i \gamma_i}{\delta_i^T \gamma_i} \right) \frac{\delta_i \delta_i^T}{\delta_i^T \gamma_i} - \frac{\delta_i (H_i \gamma_i)^T + (H_i \gamma_i)\delta_i^T}{\delta_i^T \gamma_i}.$$

Example 3. Minimize the function
$$f(x, y, z) = 5x^2 + 2y^2 + 2z^2 + 2xy + 2yz - 2zx - 6z,$$
starting with $x_0 = (0, 0, 0)$.

DFP:
Here
$$g(x) = \begin{pmatrix} 10x + 2y - 2z \\ 2x + 4y + 2z \\ -2x + 2y + 4z - 6 \end{pmatrix}.$$
$$x_0 = (0, 0, 0), \quad H_0 = I, \quad g_0 = (0, 0, -6), \quad s_0 = -g_0 = (0, 0, 6), \quad x_0 + \lambda_0 s_0 = (0, 0, 6\lambda_0).$$
[ $f(\lambda_0) = 2(6\lambda_0)^2 - 6(6\lambda_0) = 72\lambda_0^2 - 36\lambda_0$, $f'(\lambda_0) = 144\lambda_0 - 36$, $f''(\lambda_0) = 144$; $f'(\lambda_0) = 0$ gives $\lambda_0 = \tfrac{1}{4}$, and $f''(\lambda_0) = 144 > 0$. ]
$$x_1 = x_0 + \lambda_0 s_0 = (0, 0, \tfrac{3}{2}), \qquad g_1 = (-3, 3, 0),$$
$$\delta_0 = (0, 0, \tfrac{3}{2}), \qquad \gamma_0 = (-3, 3, 6), \qquad \gamma_0^T H_0 \gamma_0 = 54, \qquad \delta_0^T \gamma_0 = 9,$$
$$\delta_0 \delta_0^T = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \tfrac{9}{4} \end{pmatrix}, \qquad (H_0\gamma_0)(H_0\gamma_0)^T = \begin{pmatrix} -3 \\ 3 \\ 6 \end{pmatrix} (-3 \;\; 3 \;\; 6) = \begin{pmatrix} 9 & -9 & -18 \\ -9 & 9 & 18 \\ -18 & 18 & 36 \end{pmatrix}.$$
Hence, by (8),
$$H_1 = \begin{pmatrix} \tfrac{5}{6} & \tfrac{1}{6} & \tfrac{1}{3} \\ \tfrac{1}{6} & \tfrac{5}{6} & -\tfrac{1}{3} \\ \tfrac{1}{3} & -\tfrac{1}{3} & \tfrac{7}{12} \end{pmatrix}, \qquad s_1 = -H_1 g_1 = (2, -2, 2), \qquad \lambda_1 = \tfrac{1}{2}.$$
Hence we get
$$x_2 = x_1 + \lambda_1 s_1 = (1, -1, \tfrac{5}{2}), \qquad g_2 = (3, 3, 0), \qquad \delta_1 = (1, -1, 1), \qquad \gamma_1 = (6, 0, 0),$$
$$H_2 = \begin{pmatrix} \tfrac{1}{6} & -\tfrac{1}{6} & \tfrac{1}{6} \\ -\tfrac{1}{6} & \tfrac{29}{30} & -\tfrac{17}{30} \\ \tfrac{1}{6} & -\tfrac{17}{30} & \tfrac{37}{60} \end{pmatrix},$$
giving
$$s_2 = -H_2 g_2 = (0, -\tfrac{12}{5}, \tfrac{6}{5}), \qquad \lambda_2 = \tfrac{5}{12},$$
and
$$x_3 = (1, -2, 3).$$

BFGS:
$$g_0 = (0, 0, -6), \quad s_0 = (0, 0, 6), \quad \lambda_0 = \tfrac{1}{4}, \quad x_1 = (0, 0, \tfrac{3}{2}), \quad g_1 = (-3, 3, 0),$$
$$\delta_0 = (0, 0, \tfrac{3}{2}), \quad \gamma_0 = (-3, 3, 6), \quad H_0 = I, \quad \gamma_0^T H_0 \gamma_0 = 54, \quad \delta_0^T \gamma_0 = 9, \quad \delta_0 \delta_0^T = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \tfrac{9}{4} \end{pmatrix},$$
$$\delta_0 (H_0\gamma_0)^T + (H_0\gamma_0)\delta_0^T = \begin{pmatrix} 0 & 0 & -\tfrac{9}{2} \\ 0 & 0 & \tfrac{9}{2} \\ -\tfrac{9}{2} & \tfrac{9}{2} & 18 \end{pmatrix}.$$
By the BFGS formula,
$$H_1 = I + \Big(1 + \tfrac{54}{9}\Big)\tfrac{1}{9} \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \tfrac{9}{4} \end{pmatrix} - \tfrac{1}{9}\begin{pmatrix} 0 & 0 & -\tfrac{9}{2} \\ 0 & 0 & \tfrac{9}{2} \\ -\tfrac{9}{2} & \tfrac{9}{2} & 18 \end{pmatrix} = \begin{pmatrix} 1 & 0 & \tfrac{1}{2} \\ 0 & 1 & -\tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{3}{4} \end{pmatrix},$$
$$s_1 = -H_1 g_1 = (3, -3, 3), \qquad \lambda_1 = \tfrac{1}{3},$$
$$x_2 = (1, -1, \tfrac{5}{2}), \qquad g_2 = (3, 3, 0), \qquad \delta_1 = (1, -1, 1), \qquad \gamma_1 = (6, 0, 0).$$
Hence we get
$$H_2 = \begin{pmatrix} \tfrac{1}{6} & -\tfrac{1}{6} & \tfrac{1}{6} \\ -\tfrac{1}{6} & \tfrac{13}{6} & -\tfrac{7}{6} \\ \tfrac{1}{6} & -\tfrac{7}{6} & \tfrac{11}{12} \end{pmatrix},$$
giving
$$s_2 = -H_2 g_2 = (0, -6, 3), \qquad \lambda_2 = \tfrac{1}{6}, \qquad \text{and} \qquad x_3 = (1, -2, 3).$$

Thus we see that although the sequences of $H_i$'s are not the same, the $s_i$'s are parallel and the same points $x_1$, $x_2$, $x_3$ are generated by the two methods.
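The two updates can also be compared in code. The sketch below (not from the notes; it assumes NumPy) implements the DFP update (8) and the BFGS update above, applies both to the quadratic of Example 3 with exact line searches, and reaches the same minimizer $(1, -2, 3)$ in three steps.

```python
import numpy as np

# f(x, y, z) = 5x^2 + 2y^2 + 2z^2 + 2xy + 2yz - 2zx - 6z  (Example 3)
G = np.array([[10.0, 2.0, -2.0],
              [2.0, 4.0, 2.0],
              [-2.0, 2.0, 4.0]])
w = np.array([0.0, 0.0, -6.0])

def grad(x):
    return G @ x + w

def dfp(H, delta, gamma):
    Hg = H @ gamma
    return H + np.outer(delta, delta) / (delta @ gamma) \
             - np.outer(Hg, Hg) / (gamma @ Hg)

def bfgs(H, delta, gamma):
    Hg = H @ gamma
    dg = delta @ gamma
    return H + (1 + gamma @ Hg / dg) * np.outer(delta, delta) / dg \
             - (np.outer(delta, Hg) + np.outer(Hg, delta)) / dg

for update in (dfp, bfgs):
    x, H = np.zeros(3), np.eye(3)
    for _ in range(3):                     # n = 3 steps for this quadratic
        g = grad(x)
        s = -H @ g
        lam = -(s @ g) / (s @ G @ s)       # exact line search, valid for a quadratic
        x_new = x + lam * s
        H = update(H, x_new - x, grad(x_new) - g)
        x = x_new
    print(update.__name__, x)              # both reach the minimizer (1, -2, 3)
```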

6. Constrained Optimization: Bordered Hessian
A bordered Hessian is used for the second-derivative test in certain constrained optimization problems. Given the function $f(x_1, x_2, \ldots, x_n)$, but adding a constraint function such that
$$g(x_1, x_2, \ldots, x_n) = c,$$
the bordered Hessian appears as
$$H(f, g) = \begin{pmatrix} 0 & \dfrac{\partial g}{\partial x_1} & \dfrac{\partial g}{\partial x_2} & \cdots & \dfrac{\partial g}{\partial x_n} \\ \dfrac{\partial g}{\partial x_1} & \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial g}{\partial x_2} & \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial g}{\partial x_n} & \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}.$$
If there are, say, $m$ constraints, then the zero in the north-west corner is an $m \times m$ block of zeros, and there are $m$ border rows at the top and $m$ border columns at the left.

The rules for positive definiteness and negative definiteness cannot apply here, since a bordered Hessian cannot be definite (either positive or negative): we have $z^T H z = 0$ if the vector $z$ has a nonzero first element followed by zeros. The second-derivative test here consists of sign restrictions on the determinants of a certain set of $n - m$ submatrices of the bordered Hessian. Intuitively, think of the $m$ constraints as reducing the problem to one with $n - m$ free variables (for example, the maximization of $f(x_1, x_2, x_3)$ subject to the constraint $x_1 + x_2 + x_3 = 1$ can be reduced to the maximization of $f(x_1, x_2, 1 - x_1 - x_2)$ without constraint).

6.1. Vector-Valued Functions
If $f$ is instead a function from $R^n \to R^m$, i.e., $f(x_1, x_2, \ldots, x_n) = (f_1, f_2, \ldots, f_m)$, then the array of second partial derivatives is not a two-dimensional matrix of size $n \times n$, but rather a tensor of order 3. This can be thought of as a multi-dimensional array with dimensions $m \times n \times n$, which degenerates to the usual Hessian matrix for $m = 1$.

6.2. Optimization with Equality Constraint: Sufficient Conditions for a Local Optimum for a Function of Two Variables
Consider the problem: Max $f(x, y)$ subject to $g(x, y) = c$. Assume that $g_2(x, y) = \partial g(x, y)/\partial y \neq 0$. By substituting for $y$ using the constraint, we can reduce the problem to one in a single variable, $x$. Let $h$ be implicitly defined by $g(x, h(x)) = c$. Then the problem is $\max_x f(x, h(x))$. Define $F(x) = f(x, h(x))$. Then
$$F'(x) = f_1(x, h(x)) + f_2(x, h(x)) h'(x).$$
Let $x^*$ be a stationary point of $F$ (i.e., $F'(x^*) = 0$). A sufficient condition for $x^*$ to be a local maximizer of $F$ is that $F''(x^*) < 0$.

We have
$$F''(x^*) = f_{11}(x^*, h(x^*)) + 2 f_{12}(x^*, h(x^*)) h'(x^*) + f_{22}(x^*, h(x^*)) (h'(x^*))^2 + f_2(x^*, h(x^*)) h''(x^*).$$
Now, since $g(x, h(x)) = c$ for all $x$, we have
$$g_1(x, h(x)) + g_2(x, h(x)) h'(x) = 0, \qquad \text{so that} \qquad h'(x) = -\frac{g_1(x, h(x))}{g_2(x, h(x))}.$$
Using this expression, we can find $h''(x^*)$ and substitute it into the expression for $F''(x^*)$. After some manipulation, we find that
$$F''(x^*) = -\frac{D(x^*, y^*, \lambda^*)}{(g_2(x^*, y^*))^2},$$
where
$$D(x^*, y^*, \lambda^*) = \begin{vmatrix} 0 & g_1(x^*, y^*) & g_2(x^*, y^*) \\ g_1(x^*, y^*) & f_{11}(x^*, y^*) - \lambda^* g_{11}(x^*, y^*) & f_{12}(x^*, y^*) - \lambda^* g_{12}(x^*, y^*) \\ g_2(x^*, y^*) & f_{21}(x^*, y^*) - \lambda^* g_{21}(x^*, y^*) & f_{22}(x^*, y^*) - \lambda^* g_{22}(x^*, y^*) \end{vmatrix},$$
and $\lambda^*$ is the value of the Lagrange multiplier at the solution (i.e., $f_2(x^*, y^*) = \lambda^* g_2(x^*, y^*)$). The matrix of which $D(x^*, y^*, \lambda^*)$ is the determinant is known as the bordered Hessian of the Lagrangian. In summary, we have the following result:

Proposition. Consider the problems
Max $f(x, y)$ subject to $g(x, y) = c$,
and
Min $f(x, y)$ subject to $g(x, y) = c$.
Suppose that $(x^*, y^*)$ and $\lambda^*$ satisfy the first-order conditions
$$f_1(x^*, y^*) - \lambda^* g_1(x^*, y^*) = 0, \qquad f_2(x^*, y^*) - \lambda^* g_2(x^*, y^*) = 0,$$
and the constraint $g(x^*, y^*) = c$. If $D(x^*, y^*, \lambda^*) > 0$ then $(x^*, y^*)$ is a local maximizer of $f$ subject to the constraint $g(x, y) = c$. If $D(x^*, y^*, \lambda^*) < 0$ then $(x^*, y^*)$ is a local minimizer of $f$ subject to the constraint $g(x, y) = c$.

Example. Find the maximum value of $z = xy$ subject to $x + y = 6$. Check the second-order conditions by examining the bordered Hessian determinant.

Solution. Set up the Lagrangian
$$L = xy + \lambda(6 - x - y).$$
The necessary conditions for $L$ are
$$L_\lambda = 6 - x - y = 0, \qquad L_x = y - \lambda = 0, \qquad L_y = x - \lambda = 0.$$
Solving the above equations we have $x = 3$, $y = 3$, $z_{\max} = 9$, $\lambda = 3$.
$$D(x^*, y^*, \lambda^*) = D(3, 3, 3) = \begin{vmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{vmatrix} = 2 > 0.$$
Hence $z$ is maximized.
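A small numerical check of this bordered Hessian determinant, assuming NumPy; it is only an illustration of the computation done by hand above.

```python
import numpy as np

# Bordered Hessian for max xy subject to x + y = 6, at x = y = 3, lambda = 3
g1, g2 = 1.0, 1.0                       # partial derivatives of g(x, y) = x + y
L = np.array([[0.0, 1.0],               # entries f_ij - lambda*g_ij with f = xy, g linear
              [1.0, 0.0]])
D = np.array([[0.0, g1, g2],
              [g1, L[0, 0], L[0, 1]],
              [g2, L[1, 0], L[1, 1]]])
print(np.linalg.det(D))                 # 2.0 > 0, so (3, 3) is a local maximizer
```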

References
1. J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables.
2. Lothar Collatz, Functional Analysis and Numerical Mathematics.
3. L. V. Kantorovich and G. P. Akilov, Functional Analysis.
4. Arthur Wouk, A Course of Applied Functional Analysis.
5. Lokenath Debnath and Piotr Mikusinski, Introduction to Hilbert Spaces with Applications.
6. J. Stoer and R. Bulirsch, Introduction to Numerical Analysis.
7. Frank S. Budnick, Dennis McLeavey and Richard Mojena, Principles of Operations Research for Management.
8. Anthony L. Peressini, Francis E. Sullivan and J. J. Uhl, Jr., The Mathematics of Nonlinear Programming.
9. B. K. Lahiri, Elements of Functional Analysis.
10. K. B. Datta, Matrix and Linear Algebra.
