
Chapter 5

Vector and Matrix Norms

5.1 Vector Norms


A vector norm is a measure for the size of a vector.

Definition 5.1. A norm on a real or complex vector space $V$ is a mapping
$\|\cdot\| : V \to \mathbb{R}$ with properties

(a) $\|v\| \ge 0$ for all $v$

(b) $\|v\| = 0 \iff v = 0$

(c) $\|\alpha v\| = |\alpha| \, \|v\|$

(d) $\|v + w\| \le \|v\| + \|w\|$ (triangle inequality)

Definition 5.2. The vector p-norm, $1 \le p < \infty$, is given by
$$\|v\|_p = \Big( \sum_i |v_i|^p \Big)^{1/p}.$$

Special cases:
$$\|v\|_1 = \sum_i |v_i|$$
$$\|v\|_2 = \sqrt{\sum_i |v_i|^2} \qquad \text{(Euclidean norm)}$$
$$\|v\|_\infty = \max_i |v_i|.$$

The ∞-norm gets its name from
$$\lim_{p\to\infty} \|v\|_p = \|v\|_\infty \quad \text{for all } v.$$

It is easy to verify that conditions (a), (b), (c) are satisfied for all p. The
triangle inequality is only satisfied for p ≥ 1. In fact, it goes the other way for
p < 1.

Theorem 5.3. (Hölder Inequality)
$$|\langle v, w\rangle| \le \|v\|_p \, \|w\|_q \qquad \text{if } \frac{1}{p} + \frac{1}{q} = 1.$$

The unit balls for these various norms are nested: the 1-norm unit ball sits inside the 2-norm unit ball, which sits inside the ∞-norm unit ball.

Sideline: I defined before
$$\|x\|_0 = \text{number of nonzero entries in } x.$$
The reason for that notation is
$$\lim_{p\to 0} \|x\|_p^p = \|x\|_0.$$

As I just said, for p < 1 these are not norms, because the triangle inequality fails.
Strangely enough, $\|x\|_0$ satisfies the triangle inequality again, but it is still not a norm
because (c) fails.

Slightly more generally, we have weighted p-norms
$$\|v\|_{p,w} = \Big( \sum_i w_i |v_i|^p \Big)^{1/p}.$$
Here the $w_i > 0$ are some fixed weights. This could be useful if some measurements
in your data are more reliable than others, or some parts of a solution vector are
more important than others.
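A minimal sketch of such a weighted norm (illustration only, with a hypothetical weight vector w marking the more trusted entries):

    import numpy as np

    def weighted_p_norm(v, w, p=2):
        """Weighted p-norm (sum_i w_i |v_i|^p)^(1/p); the weights w_i must be > 0."""
        return (w * np.abs(v) ** p).sum() ** (1.0 / p)

    v = np.array([1.0, -2.0, 0.5])
    w = np.array([1.0, 10.0, 0.1])   # hypothetical weights: trust the middle entry most
    print(weighted_p_norm(v, w))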

Theorem 5.4. If A is positive definite, then
$$\langle x, y\rangle_A = \langle x, Ay\rangle$$
defines an inner product, and
$$\|x\|_A = \sqrt{\langle x, x\rangle_A}$$
defines a norm.

Example: In the numerical solution of elliptic partial differential equations by finite
elements, you can show that the error (difference between the numerical solution $x_N$
and the true solution $x$) satisfies
$$\|x - x_N\|_A \le \text{some estimate},$$
but you can't get a direct estimate for the standard norm $\|x - x_N\|$.
Here the A is actually a positive definite differential operator, not a matrix,
but the idea is the same.
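To make the A-inner product concrete, here is a small NumPy sketch (not from the notes), using an arbitrary symmetric positive definite matrix as an example:

    import numpy as np

    # An arbitrary symmetric positive definite example matrix
    A = np.array([[4.0, 1.0],
                  [1.0, 3.0]])

    def inner_A(x, y):
        """A-inner product <x, y>_A = <x, A y>."""
        return x @ (A @ y)

    def norm_A(x):
        """A-norm (energy norm) ||x||_A = sqrt(<x, x>_A)."""
        return np.sqrt(inner_A(x, x))

    x = np.array([1.0, -2.0])
    print(norm_A(x))                           # > 0, since A is positive definite
    print(np.all(np.linalg.eigvalsh(A) > 0))   # confirms positive definiteness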

5.1.1 Equivalence of Norms

Sideline: A relation R on a set S is a subset of S × S (ordered pairs):

R ⊂ S × S = {(a, b) : a, b ∈ S}.

A relation is
• transitive if (a, b) and (b, c) ⇒ (a, c)
• symmetric if (a, b) ⇔ (b, a)
• antisymmetric if (a, b) and (b, a) ⇒ a = b.
• reflexive if (a, a) for all a.
Example: The ordering ≤ on a set of real numbers is transitive, antisymmetric and
reflexive. Likewise for the partial set ordering ⊂.
A relation is called an equivalence relation if it is transitive, symmetric and
reflexive.
Example: Think of the identity =.

Definition 5.5. Two norms $\|\cdot\|$ and $|||\cdot|||$ are equivalent if there are constants
$0 < A \le B$ so that
$$A\|v\| \le |||v||| \le B\|v\| \quad \text{for all } v.$$

Fact: This is an equivalence relation.


Applications:

• Two equivalent norms have the same notion of convergence. If a sequence
converges in one norm, it converges in the other, and vice versa.

• Error estimates are the same, up to a constant factor.

Theorem 5.6. (Main Theorem in this section) All vector norms in finite
dimensions are equivalent.

Lemma 5.7. For every norm $\|\cdot\|$ on $\mathbb{C}^n$, there exists M so that
$$\|v\| \le M \cdot \|v\|_1.$$

Proof. Let $e_i$ be the ith basis vector. Then $v = \sum_i v_i e_i$, $\|v\|_1 = \sum_i |v_i|$, and
$$\|v\| = \Big\| \sum_i v_i e_i \Big\| \le \sum_i |v_i| \, \|e_i\| \le M \|v\|_1$$
with $M = \max_i \|e_i\|$.

This implies that the norm
$$\|\cdot\| : (\mathbb{C}^n, \|\cdot\|_1) \to \mathbb{R}$$
is a continuous function (Lipschitz continuous, actually).

Proof. (of theorem)
We prove that an arbitrary norm is equivalent to the 1-norm.
A set (in a topological space, whatever that is) is compact if every open
cover has a finite subcover (whatever that means). A major theorem says "A
real-valued continuous function on a compact set takes on its maximum and
minimum values".
In $\mathbb{R}^n$ or $\mathbb{C}^n$, compact means the same as "closed and bounded". The unit
sphere under the 1-norm is a closed and bounded set, and by the lemma, the
other norm is a continuous function on it. The constants A and B are the
minimum and maximum values of this norm on the unit sphere; the minimum is
positive because a norm cannot vanish at a nonzero vector.

In infinite dimensions, the unit sphere is not compact, and the p-norms are
not equivalent.

Example: For the 1, 2, and ∞ norms we have
$$\|v\|_2 \le \|v\|_1 \le \sqrt{n}\,\|v\|_2$$
$$\|v\|_\infty \le \|v\|_2 \le \sqrt{n}\,\|v\|_\infty$$
$$\|v\|_\infty \le \|v\|_1 \le n\,\|v\|_\infty$$

I believe some of these inequalities were assigned as a qualifier problem in Summer 2016.
Sample Proof: Let $v = (v_1, v_2, \ldots, v_n)^T$.
Let $\alpha_i = |v_i|/v_i$ if $v_i \ne 0$, $\alpha_i = 1$ otherwise. Then $|\alpha_i| = 1$ and $|v_i| = \alpha_i v_i$.
Then
$$\|v\|_1 = \sum_i \alpha_i v_i = \langle v, \bar\alpha \rangle \le \|v\|_2 \, \|\bar\alpha\|_2 \le \sqrt{n}\,\|v\|_2.$$
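A quick numerical sanity check of these inequality chains (an illustration, not part of the notes), using a random example vector:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    v = rng.standard_normal(n)

    n1 = np.linalg.norm(v, 1)
    n2 = np.linalg.norm(v, 2)
    ninf = np.linalg.norm(v, np.inf)

    # The three equivalence chains from the example above
    print(n2 <= n1 <= np.sqrt(n) * n2)       # True
    print(ninf <= n2 <= np.sqrt(n) * ninf)   # True
    print(ninf <= n1 <= n * ninf)            # True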

Question: If all (finite-dimensional) norms are equivalent, why do we even bother
defining different ones?
Answer: Sometimes we can prove estimates for one type of norm much more easily
than for another. Also, just because we can.

5.2 The Singular Value Decomposition, Part 1


For any (rectangular) matrix A, the matrix A∗ A is square, Hermitian, and
positive semidefinite.
Definition 5.8. The singular values of A are the square roots of the
nonzero eigenvalues of A∗ A. It is customary to sort them by size:

σ1 ≥ σ2 ≥ · · · ≥ σr > 0.

Here r is the rank of A.

Theorem 5.9. Any matrix A can be factored as
$$A = U \Sigma V^*,$$
where U, V are unitary and Σ is a diagonal matrix with the $\sigma_i$ on the diagonal,
extended by an appropriate number of 0 if necessary.
This is called the singular value decomposition (SVD).

If A is of size m × n, then U is m × m, V is n × n, and Σ is m × n.


Facts:

• The $\sigma_i$ are also the square roots of the nonzero eigenvalues of $AA^*$. $A^*A$
and $AA^*$ are of different sizes in general, but they have the same nonzero
eigenvalues.

• The columns of V are the eigenvectors of $A^*A$. The columns of U are the
eigenvectors of $AA^*$.
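An illustrative NumPy check of these facts (not from the notes), using a random rectangular example matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 3))            # random rectangular example

    U, s, Vh = np.linalg.svd(A)                # A = U diag(s) V*
    eigs = np.linalg.eigvalsh(A.conj().T @ A)  # eigenvalues of A*A (ascending order)

    print(s)                                   # singular values, descending
    print(np.sqrt(eigs[::-1]))                 # same numbers
    print(np.allclose(A, (U[:, :3] * s) @ Vh)) # reconstruct A from the thin SVD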

5.3 Matrix Norms

Definition 5.10. A matrix norm must satisfy

(a) $\|A\| \ge 0$ for all A

(b) $\|A\| = 0 \iff A = 0$

(c) $\|\alpha A\| = |\alpha| \, \|A\|$

(d) $\|A + B\| \le \|A\| + \|B\|$

(e) $\|A \cdot B\| \le \|A\| \cdot \|B\|$ (submultiplicativity)

Note:

• All the matrix norms we consider are defined for matrices of all sizes.
Properties (d) and (e) only apply if the sizes are compatible.

• Some books only require (a)–(d). For me, it does not deserve to be called
a matrix norm if it does not satisfy (e) also.

• Notice that (e) implies $\|A^n\| \le \|A\|^n$. That will be useful later.

• As with vector norms, all matrix norms are equivalent.

Definition 5.11. A matrix norm and a vector norm are compatible if
$$\|Av\| \le \|A\| \cdot \|v\|.$$

This is a desirable property. Note that this definition requires two norms to
work together. Typically, a particular matrix norm is compatible with one or
more vector norms, but not with all of them.
There are three main sources of matrix norms: (1) vector-based norms; (2)
induced matrix norms; (3) norms based on eigenvalues.
We will now look at all of those in turn.

5.3.1 Vector-Based Norms

For a given matrix A, consider the vector vec(A) (the columns of A stacked on
top of one another), and apply a standard vector p-norm.
This produces
$$p = 1: \quad \|A\|_{\mathrm{sum}} = \sum_{ij} |a_{ij}|$$
$$p = 2: \quad \|A\|_F = \sqrt{\sum_{ij} |a_{ij}|^2}$$
$$p = \infty: \quad \|A\|_{\max} = \max_{ij} |a_{ij}|$$

The p = 2 norm is called the Frobenius or Hilbert-Schmidt norm.
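A small sketch (illustration only) computing these three vec-based norms with NumPy for an arbitrary example matrix:

    import numpy as np

    A = np.array([[1.0, -2.0],
                  [3.0,  4.0]])

    a = A.flatten(order='F')                  # vec(A): columns stacked on top of one another

    norm_sum = np.abs(a).sum()                # p = 1: sum norm
    norm_F   = np.sqrt((np.abs(a)**2).sum())  # p = 2: Frobenius norm
    norm_max = np.abs(a).max()                # p = infinity: max norm

    print(norm_sum, norm_F, norm_max)         # 10.0, 5.477..., 4.0
    print(np.isclose(norm_F, np.linalg.norm(A, 'fro')))  # matches NumPy's Frobenius norm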


All of them satisfy (a)–(d) automatically. We need to check (e).
Recall two special cases of the Hölder inequality for vector norms:
$$|\langle x, y\rangle| \le \|x\|_2 \cdot \|y\|_2 \qquad \text{(Cauchy-Schwarz)}$$
$$|\langle x, y\rangle| \le \|x\|_1 \cdot \|y\|_\infty \le \|x\|_1 \cdot \|y\|_1 \qquad \text{(obvious)}$$

Theorem 5.12. (a) The sum norm satisfies (e).
(b) The sum norm is compatible with the vector 1-norm.

Proof. (a) Let $r_i^*$ be the ith row of A, $c_j$ the jth column of B. Then
$$\|AB\|_{\mathrm{sum}} = \sum_{ij} |(AB)_{ij}| = \sum_{ij} |\langle c_j, r_i\rangle|
\le \sum_{ij} \|r_i\|_1 \cdot \|c_j\|_1 = \|A\|_{\mathrm{sum}} \cdot \|B\|_{\mathrm{sum}}.$$

(b) Essentially the same as (a):
$$\|Av\|_1 = \sum_i |(Av)_i| = \sum_i |\langle v, r_i\rangle| \le \sum_i \|r_i\|_1 \cdot \|v\|_1 = \|A\|_{\mathrm{sum}} \cdot \|v\|_1.$$

Theorem 5.13. (a) The Frobenius norm satisfies (e).
(b) The Frobenius norm is compatible with the vector 2-norm.

Proof. Basically the same proof as for the sum norm, except we use Cauchy-Schwarz.

Lemma 5.14. $\|A\|_F^2 = \operatorname{trace}(A^*A)$.



Proof. Write out what $\operatorname{trace}(A^*A)$ is, and observe it is equal to $\|A\|_F^2$.

Theorem 5.15. If U, V are unitary, then
$$\|UAV\|_F = \|A\|_F.$$

Proof.
$$\|UA\|_F^2 = \operatorname{trace}((UA)^*(UA)) = \operatorname{trace}(A^*U^*UA) = \operatorname{trace}(A^*A) = \|A\|_F^2.$$
Similarly for V.

There will be more properties of the Frobenius norm in section 5.3.3.
Fact: The max-norm does not satisfy (e).
Exercise: Find a counterexample.

5.3.2 Induced Matrix Norms

Definition 5.16. Given any vector norm, the induced matrix norm is
given by
$$\|A\| = \sup_{v \ne 0} \frac{\|Av\|}{\|v\|} = \sup_{\|v\|=1} \|Av\|.$$

It is easy to check that (a)–(e) are satisfied, and that these norms are auto-
matically compatible with the vector norm that produced them.
Theorem 5.17.
$$\|A\|_1 = \max_j \sum_i |a_{ij}| \qquad \text{(largest column sum)}$$
$$\|A\|_\infty = \max_i \sum_j |a_{ij}| \qquad \text{(largest row sum)}$$
$$\|A\|_2 = \text{largest singular value}$$

Proof.
$$\|Av\|_1 = \sum_i |(Av)_i| \le \sum_i \sum_j |a_{ij}| \cdot |v_j|
= \sum_j \Big( \sum_i |a_{ij}| \Big) |v_j|
\le \Big( \max_k \sum_i |a_{ik}| \Big) \cdot \sum_j |v_j|
= \Big( \max_k \sum_i |a_{ik}| \Big) \cdot \|v\|_1.$$

This proves that
$$\|A\|_1 \le \max_j \sum_i |a_{ij}|.$$

To complete the proof, we need to find one particular v for which we get equality.
Assume that the largest column sum is in column j0 , then v = ej0 (standard
basis vector) will work.
The proof for p = ∞ is similar (exercise).
The proof for p = 2 will be done later, in corollary 5.21.
Example: Let
$$A = \begin{pmatrix} 3 & -1 & 4 \\ 1 & 5 & -9 \\ 2 & 6 & 5i \end{pmatrix}.$$
The row sums are 8, 15, 13. The column sums are 6, 12, 18.
$$\|A\|_1 = 18, \qquad \|A\|_\infty = 15, \qquad \|A\|_2 \approx 13.5824.$$
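A short NumPy check of this example (illustration only); NumPy's matrix norm with ord 1, ∞, and 2 computes exactly these induced norms:

    import numpy as np

    A = np.array([[3, -1,  4],
                  [1,  5, -9],
                  [2,  6,  5j]])

    print(np.linalg.norm(A, 1))       # largest absolute column sum -> 18.0
    print(np.linalg.norm(A, np.inf))  # largest absolute row sum    -> 15.0
    print(np.linalg.norm(A, 2))       # largest singular value      -> ~13.58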

Induced Norms of Special Matrices


For a few types of matrices, some of the induced matrix norms are easy to
calculate.
Theorem 5.18. If
$$D = \begin{pmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_n \end{pmatrix}$$
is a diagonal matrix, then $\|D\|_p = \max_i |d_i|$ for all $p \ge 1$.

Proof.
$$Dv = \begin{pmatrix} d_1 & & \\ & \ddots & \\ & & d_n \end{pmatrix}
\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}
= \begin{pmatrix} d_1 v_1 \\ \vdots \\ d_n v_n \end{pmatrix}.$$
Then
$$\|Dv\|_p^p = \sum_i |d_i|^p |v_i|^p \le \Big( \max_i |d_i|^p \Big) \sum_i |v_i|^p = \Big( \max_i |d_i| \Big)^p \|v\|_p^p,$$
so
$$\|D\|_p \le \max_i |d_i|.$$
To show equality, you need to find one particular vector for which you have
equality. For example, the standard basis vector $e_{i_0}$ for the index $i_0$ which
corresponds to the maximum $|d_i|$.

Theorem 5.19. If U is unitary, then $\|U\|_2 = 1$.

Proof. $\|Uv\|_2^2 = \langle Uv, Uv\rangle = \langle v, U^*Uv\rangle = \langle v, v\rangle = \|v\|_2^2$.



Theorem 5.20. If U, V are unitary, then $\|UAV\|_2 = \|A\|_2$.

Proof.
$$\|UA\|_2 \le \|U\|_2 \|A\|_2 = \|A\|_2,$$
$$\|A\|_2 = \|U^*UA\|_2 \le \|U^*\|_2 \|UA\|_2 = \|UA\|_2.$$
Likewise for V.

Corollary 5.21. $\|A\|_2$ = the largest singular value.

Proof. Consider the singular value decomposition
$$A = U\Sigma V^*.$$
By theorem 5.20, $\|A\|_2 = \|\Sigma\|_2$. By theorem 5.18, $\|\Sigma\|_2 = \sigma_1$.

5.3.3 Matrix Norms Based on Eigenvalues

There cannot be any norms based on the eigenvalues of A itself, because there
are non-zero matrices with only zero eigenvalues, for example
$$A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}.$$
Instead, you need to base the norms on the singular values. We will get to that
in a bit. First, we prove a few theorems about eigenvalues and norms in general.

The Spectral Radius and Norms

Theorem 5.23. $\rho(A) \le \|A\|$ for any matrix norm.

Proof. If $\|\cdot\|$ is compatible with any vector norm, this is easy: Take v to be
an eigenvector for the largest eigenvalue (in modulus), then
$$|\lambda| \, \|v\| = \|Av\| \le \|A\| \, \|v\|.$$
For a general norm (which satisfies (e)) we use a little trick: Let V be the matrix
all of whose columns are equal to v. Then
$$|\lambda| \, \|V\| = \|AV\| \le \|A\| \, \|V\|.$$

40 CHAPTER 5. VECTOR AND MATRIX NORMS

Theorem 5.24. For any (square) matrix A and any $\varepsilon > 0$ there exists a
matrix norm so that
$$\rho(A) \le \|A\| \le \rho(A) + \varepsilon.$$

Note: This does not say that there is a single matrix norm that works for all
matrices A. It says that for each fixed A and fixed $\varepsilon$, there is such a norm.

Corollary 5.25. If $\rho(A) < 1$, then $A^n \to 0$.

Proof. Let $\rho(A) = 1 - \varepsilon$, and find a matrix norm so that $\|A\| < 1 - (\varepsilon/2) < 1$.
Then $\|A^n\| \le \|A\|^n \to 0$.

Theorem 5.26. For any matrix norm,
$$\rho(A) = \lim_{n\to\infty} \|A^n\|^{1/n}.$$

Proof. $\rho(A^n) = \rho(A)^n \le \|A^n\|$, so
$$\rho(A) \le \|A^n\|^{1/n}$$
for all n, so therefore also in the limit.


For the opposite direction, choose any $\varepsilon > 0$. Let
$$A_\varepsilon = \frac{1}{\rho(A) + \varepsilon} A,$$
then
$$\rho(A_\varepsilon) = \frac{\rho(A)}{\rho(A) + \varepsilon} < 1.$$
There is some matrix norm for which $\|A_\varepsilon\| < 1$, so $\|A_\varepsilon^n\| \to 0$. Since all matrix
norms are equivalent, this also applies to whatever matrix norm we are dealing
with. For large enough n, $\|A_\varepsilon^n\| < 1$, which implies
$$\|A^n\|^{1/n} \le \rho(A) + \varepsilon.$$

Remark: If a matrix A has spectral radius ρ, can $\|Av\|$ ever be larger than
$\rho \cdot \|v\|$? The answer is yes, for example
$$\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \qquad \rho(A) = 1.$$
What the theorem says is that in the long run, if you keep applying A over and
over again, on average the vector cannot grow by any factor larger than ρ.
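A small numerical illustration (not from the notes) of Theorem 5.26 for the 2×2 matrix in the remark, whose spectral radius is 1 even though it stretches some vectors:

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

    rho = np.abs(np.linalg.eigvals(A)).max()   # spectral radius = 1
    print(rho)

    # ||A^n||^(1/n) decreases toward rho(A) as n grows (Theorem 5.26)
    for n in (1, 10, 100, 1000):
        An = np.linalg.matrix_power(A, n)
        print(n, np.linalg.norm(An, 2) ** (1.0 / n))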

Sideline: A concept related to the spectral radius is the numerical radius of a matrix.
The numerical range of A is
$$W(A) = \{\langle Av, v\rangle : \langle v, v\rangle = 1\}.$$
This is a subset of the complex plane which includes the spectrum. The numerical
radius is
$$r(A) = \max_{z \in W(A)} |z|.$$
Obviously, $r(A) \ge \rho(A)$.

The numerical radius does better than the spectral radius: it satisfies conditions
(a)–(d) of a matrix norm, just not (e). A counterexample is
$$A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}$$
with
$$r(A) = r(B) = 1/2, \qquad r(AB) = 1.$$

The Schatten Norms

Definition 5.27. The Schatten p-norm of A is the vector p-norm applied
to the vector of singular values
$$\sigma = (\sigma_1, \ldots, \sigma_r)^T.$$

As usual, the interesting cases are p = 1, p = 2, p = ∞.
$$p = 1: \quad \|A\|_* = \sum_i \sigma_i \qquad \text{(the nuclear norm)}$$
$$p = 2: \quad \|A\|_F = \sqrt{\sum_i \sigma_i^2}$$
$$p = \infty: \quad \|A\|_2 = \max_i \sigma_i$$

The nuclear norm is new. The Schatten 2-norm turns out to be the Frobenius
norm. The Schatten ∞-norm is the matrix 2-norm.

Proof. We have $A = U\Sigma V^*$ (SVD), and for both Frobenius norm and 2-norm
we proved earlier that $\|A\| = \|\Sigma\|$. The rest is then obvious.

Sideline: It is interesting to consider what the Schatten 0-norm (which is not really
a norm) would be. This is the number of nonzero singular values, which is the rank
of A.
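An illustrative NumPy sketch (not from the notes) computing the Schatten norms directly from the singular values of a random example matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 3))

    s = np.linalg.svd(A, compute_uv=False)      # singular values

    nuclear   = s.sum()                         # Schatten 1-norm (nuclear norm)
    frobenius = np.sqrt((s**2).sum())           # Schatten 2-norm = Frobenius norm
    spectral  = s.max()                         # Schatten infinity-norm = matrix 2-norm

    print(np.isclose(nuclear,   np.linalg.norm(A, 'nuc')))
    print(np.isclose(frobenius, np.linalg.norm(A, 'fro')))
    print(np.isclose(spectral,  np.linalg.norm(A, 2)))
    print(np.count_nonzero(s > 1e-12), np.linalg.matrix_rank(A))  # "Schatten 0-norm" = rank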

5.4 Applications of Matrix Norms


5.4.1 Fun with Matrix Power Series
Recall from calculus a few facts about power series. We will only consider power
series about 0.

• A sequence $\{x_0, x_1, \ldots\}$ converges to a limit L if $|x_n - L| \to 0$ as $n \to \infty$.
You can phrase that in terms of epsilons and deltas yourself.

• A power series is an infinite series $\sum_{k=0}^{\infty} a_k x^k$. It converges if the
sequence of partial sums $\sum_{k=0}^{n} a_k x^k$ converges. It converges absolutely
if $\sum_{k=0}^{\infty} |a_k| \, |x|^k$ converges.

• Every power series has a radius of convergence R. If $|x| < R$, the
series converges absolutely; if $|x| > R$, it diverges. If $|x| = R$, anything
can happen. Here x can be real or complex.
We can also consider power series of matrices, of the form $\sum_{k=0}^{\infty} a_k A^k$. Because
of $\|A^k\| \le \|A\|^k$, we immediately get

Lemma 5.28. If $\|A\| < R$ for any matrix norm, or if $\rho(A) < R$, then the
power series converges.

Note: If $\rho(A) > R$, the series will diverge. If $\|A\| > R$ for some norm, that is
inconclusive, since there may be some other norm which is less than R.
Example: The power series for $1/(1-x)$ is
$$\frac{1}{1-x} = \sum_{k=0}^{\infty} x^k.$$
The radius of convergence is 1.

If $\rho(A) < 1$, then $I - A$ is invertible, and
$$(I - A)^{-1} = \sum_{k=0}^{\infty} A^k.$$
If $\|A\| < 1$ in some norm, then in the same norm
$$\|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|}.$$
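A short NumPy sketch (illustration only) of this Neumann series: partial sums of $\sum_k A^k$ approximating $(I-A)^{-1}$ for an example matrix with spectral radius below 1:

    import numpy as np

    A = np.array([[0.2, 0.4],
                  [0.1, 0.3]])          # example matrix with spectral radius < 1
    I = np.eye(2)

    exact = np.linalg.inv(I - A)

    # Partial sums of the Neumann series I + A + A^2 + ...
    S, term = np.zeros_like(A), I.copy()
    for k in range(50):
        S += term
        term = term @ A

    print(np.allclose(S, exact))                                        # True
    print(np.linalg.norm(exact, 2) <= 1 / (1 - np.linalg.norm(A, 2)))   # bound holds since ||A||_2 < 1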

Sideline: If x(s) is a function on some interval [a, b],
$$x(s) = f(s) + \int_a^b k(s, t)\, x(t)\, dt$$
is called a Fredholm integral equation. f is a known function, and k is called the
kernel.
The discrete counterpart is the linear equation x = f + Kx. By the previous exam-
ple, if ρ(K) < 1, this has a unique solution x for every f, and x depends continuously
on f.
Similar statements are true for the original integral equation.

5.4.2 The Condition Number


This is a topic from numerical analysis.
Consider a mathematical problem with input x, output y. We calculate y
from x in some fashion: y = F (x).
Example: Find the zeros of a polynomial
$$p(t) = a_n t^n + \cdots + a_1 t + a_0.$$
Here the input x is the coefficient vector $(a_n, \ldots, a_0)^T$, and the output y is the
vector of zeros $(t_1, \ldots, t_n)$.
If we change the input a little, from x to x + ∆x, we get a different output
y+∆y. ∆x could be measurement error, or roundoff error from putting numbers
on a computer. The question is: How sensitive is the output to small changes
in input?

Definition 5.29. The absolute error is $\|\Delta x\|$ or $\|\Delta y\|$. The relative
error is $\|\Delta x\|/\|x\|$ or $\|\Delta y\|/\|y\|$.

Usually, the relative error is more meaningful. A relative error of $10^{-6}$ means
that we can trust 6 decimals in the number.

Definition 5.30. The magnification factor for the (relative) error is
$$\frac{\|\Delta y\|/\|y\|}{\|\Delta x\|/\|x\|}.$$

This magnification factor depends on Δx.

The condition number of the problem is the worst possible case of error
magnification:
$$\mathrm{cond} = \max_{\text{small } \Delta x} \frac{\|\Delta y\|/\|y\|}{\|\Delta x\|/\|x\|}.$$

This requires some comments:

• The condition number depends on the norm used. Different norms give
different condition numbers, but usually of the same order of magnitude.

• The condition number is an interesting concept, but most of the time you
cannot actually calculate it. One exception is in linear algebra.

• I am being deliberately vague about the meaning of "small Δx". In linear
algebra, it does not really matter, since constant multiples cancel out. You
can just put "any Δx" there.

• A problem with large condition number is called ill-posed. This concept
was introduced by Hadamard, who argued that anything coming up in
real life had to be well-posed (small condition number). He was wrong.
There are many problems that are ill-posed, but nevertheless can provide
meaningful results. Examples include CAT scans, and weather prediction.
The trick is not to require too much accuracy.

• There is also a difference between the condition number of the problem
(assuming you can find the mathematically exact solution), and the con-
dition number of a particular algorithm on the computer. If the problem
itself is ill-posed, no algorithm can fix that, but it is possible to have an al-
gorithm that has a worse condition number than the underlying problem.
Don't use that one.

Example: This example explains how you would use the condition number.
Suppose you are calculating something on a standard computer. The com-
puter works with fixed accuracy, usually equivalent to about 15 decimals of
accuracy.
Suppose you know that the condition number of your problem is 1010 .
Whenever you put your numbers on the computer, you have to assume a
k∆xk of at least 10−15 , because your input gets rounded to 15 decimals. In
the worst case, that error will be magnified by 1010 , so your final relative error
could be 10−5 . That means you can only trust 5 decimals in your answer.
Caution: This is assuming the worst case, both in the original rounding
and in the error propagation. Most of the time, your answer will be correct to
more decimals. The point is that you cannot trust any more than 5 decimals.

Let us consider one special case: solving a system of linear equations Ax = b.


We will only consider the condition number for changes in the right-hand side
b.
So: A is fixed, the input is b, and the output is x. We consider

$$Ax = b$$
$$A(x + \Delta x) = b + \Delta b$$
which implies
$$A\,\Delta x = \Delta b \quad \Rightarrow \quad \Delta x = A^{-1} \Delta b.$$
The condition number is
$$\kappa = \max_{\Delta b} \frac{\|\Delta x\|/\|x\|}{\|\Delta b\|/\|b\|}.$$

Now
$$\|b\| \le \|A\| \cdot \|x\| \;\Rightarrow\; \|x\| \ge \frac{\|b\|}{\|A\|}, \qquad \frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|},$$
$$\|\Delta x\| \le \|A^{-1}\| \, \|\Delta b\|.$$
Together we find that
$$\kappa \le \|A\| \cdot \|A^{-1}\|.$$
It is possible to actually find a specific Δb which achieves this bound (exercise
for the reader), so
$$\kappa = \|A\| \cdot \|A^{-1}\|.$$
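A brief NumPy illustration (not from the notes) of $\kappa = \|A\|\,\|A^{-1}\|$ for an example matrix; for the 2-norm this agrees with the ratio $\sigma_1/\sigma_n$ mentioned in the comments below:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.001]])      # nearly singular example matrix

    kappa = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
    s = np.linalg.svd(A, compute_uv=False)

    print(kappa)                       # kappa = ||A||_2 ||A^{-1}||_2
    print(s[0] / s[-1])                # same number: sigma_1 / sigma_n
    print(np.isclose(kappa, np.linalg.cond(A, 2)))  # NumPy's built-in condition number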
Comments:

• This is the condition number for changes in the right-hand side b. You can
also consider the condition number for changes in A. That turns out to
be the same number, but that is a coincidence. If you consider the least
squares solution of an overdetermined Ax = b, the condition numbers for
changes in b and changes in A are different.

• The condition number for the forward problem "compute y = Ax" also
happens to be the same.

• As I said above, for different norms you get different condition numbers.
For the 2-norm, $\kappa = \sigma_1/\sigma_n$ (ratio of largest and smallest singular value).

• In theoretical linear algebra, a matrix is either singular, or it is not. In
practical linear algebra, there is no such thing as an exactly singular ma-
trix, unless you have a matrix of small integers. What happens is that
$\|A^{-1}\|$, and therefore κ, gets so large that you have no digits of accuracy
left, so your computations are basically meaningless.
