Kernels and Kernelized Perceptron: Instructor: Alan Ritter
Perceptron
• But what are we going to do if the dataset is just too hard?
  [Figure: 1-d data along the x axis that is not linearly separable]
• How about… mapping data to a higher-dimensional space:
  [Figure: the same data plotted against (x, x²), now linearly separable]
Feature spaces
• General idea: map to a higher dimensional space, x → φ(x)
  – if x is in Rⁿ, then φ(x) is in Rᵐ for m > n
  – Can now learn feature weights w in Rᵐ and predict:
      y = sign(w · φ(x))
  – A linear function in the higher dimensional space will be non-linear in the original space
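As a toy illustration (the feature map and weights here are hypothetical, not from the slides): a linear rule on φ(x) = [1, x, x²] implements the nonlinear rule sign(x² − 1) on the original 1-d input.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for scalar x: constant, x, and x^2."""
    return np.array([1.0, x, x * x])

# Hand-picked illustrative weights: sign(w . phi(x)) = sign(x^2 - 1),
# which labels |x| > 1 positive -- nonlinear in the original 1-d space,
# but linear in the 3-d feature space.
w = np.array([-1.0, 0.0, 1.0])
preds = [float(np.sign(w @ phi(x))) for x in (-2.0, -0.5, 0.5, 2.0)]
print(preds)  # [1.0, -1.0, -1.0, 1.0]
```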
Higher order polynomials
[Figure: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4]
• m – input features, d – degree of polynomial
• The number of monomial terms grows fast! e.g. d = 6, m = 100 gives about 1.6 billion terms
Efficient dot-product of polynomials
Polynomials of degree exactly d:

• d=1:
  φ(u)·φ(v) = [u1, u2]·[v1, v2] = u1v1 + u2v2 = u·v

• d=2:
  φ(u)·φ(v) = [u1², u1u2, u2u1, u2²]·[v1², v1v2, v2v1, v2²]
            = u1²v1² + 2u1v1u2v2 + u2²v2²
            = (u1v1 + u2v2)²
            = (u·v)²

• For any d (we will skip the proof):
  K(u,v) = φ(u)·φ(v) = (u·v)^d
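A quick numerical check of the d = 2 case, using the explicit four-term feature map (the vectors are chosen arbitrarily):

```python
import numpy as np

def phi_d2(x):
    """Explicit degree-2 feature map for 2-d input: all degree-2 monomials."""
    return np.array([x[0]**2, x[0]*x[1], x[1]*x[0], x[1]**2])

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

explicit = phi_d2(u) @ phi_d2(v)   # dot product in the 4-d feature space
kernel   = (u @ v) ** 2            # same quantity computed in the 2-d input space

print(explicit, kernel)  # both 121.0
```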
• Cool! Taking a dot product and raising it to a power gives the same result as mapping into the high-dimensional space and then taking the dot product there.
The "Kernel Trick"
• A kernel function defines a dot product in some feature space:
    K(u,v) = φ(u)·φ(v)
• Example:
  2-dimensional vectors u = [u1 u2] and v = [v1 v2]; let K(u,v) = (1 + u·v)².
  Need to show that K(u,v) = φ(u)·φ(v):
    K(u,v) = (1 + u·v)² = 1 + u1²v1² + 2u1v1u2v2 + u2²v2² + 2u1v1 + 2u2v2
           = [1, u1², √2 u1u2, u2², √2u1, √2u2] · [1, v1², √2 v1v2, v2², √2v1, √2v2]
           = φ(u)·φ(v),  where φ(x) = [1, x1², √2 x1x2, x2², √2x1, √2x2]
• Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
• But it isn't obvious yet how we will incorporate it into actual learning algorithms…
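The identity above is easy to verify numerically (the test vectors are chosen arbitrarily):

```python
import math
import numpy as np

def phi(x):
    # Feature map from the slide: [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    return np.array([1.0, x[0]**2, math.sqrt(2)*x[0]*x[1], x[1]**2,
                     math.sqrt(2)*x[0], math.sqrt(2)*x[1]])

u, v = np.array([1.0, 2.0]), np.array([3.0, 4.0])
lhs = phi(u) @ phi(v)     # explicit 6-d dot product
rhs = (1 + u @ v) ** 2    # kernel evaluated in the input space
print(lhs, rhs)           # both 144.0
```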
"Kernel trick" for the Perceptron!
• Never compute features explicitly!!!
  – Compute dot products in closed form: K(u,v) = Φ(u)·Φ(v)

• Standard Perceptron:
  • set wi = 0 for each feature i
  • For t=1..T, i=1..n:
    – ŷ = sign(w · φ(xi))
    – if ŷ ≠ yi:
      • w = w + yi φ(xi)

• Kernelized Perceptron:
  • set ai = 0 for each example i
  • For t=1..T, i=1..n:
    – ŷ = sign((Σk ak φ(xk)) · φ(xi)) = sign(Σk ak K(xk, xi))
    – if ŷ ≠ yi:
      • ai += yi

• At all times during learning:
    w = Σk ak φ(xk)
  Exactly the same computations, but we can use K(u,v) to avoid enumerating the features!!!
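The kernelized Perceptron above fits in a few lines; this is a sketch (function and variable names are mine), with sign(0) treated as -1 to match the slides' convention.

```python
import numpy as np

def kernelized_perceptron(X, y, K, T=10):
    """Kernelized Perceptron: one coefficient a_i per training example."""
    n = len(X)
    a = np.zeros(n)
    for t in range(T):
        mistakes = 0
        for i in range(n):
            s = sum(a[k] * K(X[k], X[i]) for k in range(n))
            y_hat = 1 if s > 0 else -1      # sign(0) treated as -1
            if y_hat != y[i]:
                a[i] += y[i]
                mistakes += 1
        if mistakes == 0:                   # converged: a full pass with no updates
            break
    return a

def predict(a, X, K, x):
    s = sum(a[k] * K(X[k], x) for k in range(len(X)))
    return 1 if s > 0 else -1

# The slides' dataset with the quadratic kernel K(u,v) = (u.v)^2:
X = [np.array(p) for p in [[1, 1], [-1, 1], [-1, -1], [1, -1]]]
y = [1, -1, 1, -1]
K = lambda u, v: float(np.dot(u, v)) ** 2
a = kernelized_perceptron(X, y, K)
print(a)  # converges to a = [1, 0, 0, 0], matching the hand trace
```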
Kernelized Perceptron example

Algorithm:
• set ai = 0 for each example i
• For t=1..T, i=1..n:
  – ŷ = sign(Σk ak K(xk, xi))
  – if ŷ ≠ yi:  ai += yi

Data:
  x1   x2   y
   1    1   1
  -1    1  -1
  -1   -1   1
   1   -1  -1

Kernel: K(u,v) = (u·v)²
e.g. K(x1,x2) = K([1,1],[-1,1]) = (1×(-1) + 1×1)² = 0

Gram matrix:
  K    x1  x2  x3  x4
  x1    4   0   4   0
  x2    0   4   0   4
  x3    4   0   4   0
  x4    0   4   0   4

Trace:
Initial: a = [a1, a2, a3, a4] = [0, 0, 0, 0]
t=1,i=1: Σk ak K(xk,x1) = 0×4+0×0+0×4+0×0 = 0, sign(0) = -1 ≠ y1, so a1 += y1 → new a = [1,0,0,0]
t=1,i=2: Σk ak K(xk,x2) = 1×0+0×4+0×0+0×4 = 0, sign(0) = -1 = y2
t=1,i=3: Σk ak K(xk,x3) = 1×4+0×0+0×4+0×0 = 4, sign(4) = 1 = y3
t=1,i=4: Σk ak K(xk,x4) = 1×0+0×4+0×0+0×4 = 0, sign(0) = -1 = y4
t=2,i=1: Σk ak K(xk,x1) = 1×4+0×0+0×4+0×0 = 4, sign(4) = 1 = y1
…
Converged!!! Final classifier:
  y = Σk ak K(xk, x)
    = 1×K(x1,x) + 0×K(x2,x) + 0×K(x3,x) + 0×K(x4,x)
    = K(x1,x)
    = K([1,1], x)     (because x1 = [1,1])
    = (x1 + x2)²      (because K(u,v) = (u·v)²)
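The 4×4 Gram matrix in the table can be reproduced in one line with the quadratic kernel:

```python
import numpy as np

X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]])  # x1..x4 from the slide
G = (X @ X.T) ** 2                                   # K(u,v) = (u.v)^2 for all pairs
print(G)
```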
Common kernels
• Polynomials of degree exactly d
• Polynomials of degree up to d
• Gaussian kernels
• Sigmoid
• And many others: very active area of research!
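The four kernels listed above have standard closed forms; a sketch (the hyperparameter names and default values are illustrative choices):

```python
import numpy as np

def poly_exact(u, v, d=2):
    """Polynomials of degree exactly d: K(u,v) = (u.v)^d."""
    return np.dot(u, v) ** d

def poly_up_to(u, v, d=2):
    """Polynomials of degree up to d: K(u,v) = (1 + u.v)^d."""
    return (1 + np.dot(u, v)) ** d

def gaussian(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: K(u,v) = exp(-||u-v||^2 / (2 sigma^2))."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=0.0):
    """Sigmoid kernel: K(u,v) = tanh(eta u.v + nu)."""
    return np.tanh(eta * np.dot(u, v) + nu)
```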
Overfitting?
• Huge feature space with kernels, what about overfitting???
  – Often robust to overfitting, e.g. if you don't make too many Perceptron updates
  – SVMs (which we will see next) will have a clearer story for avoiding overfitting
  – But everything overfits sometimes!!!
• Can control by:
  – Choosing a better Kernel
  – Varying parameters of the Kernel (width of Gaussian, etc.)
Kernels in logistic regression

  P(Y=0 | X=x, w, w0) = 1 / (1 + exp(w0 + w·x))

• Define weights in terms of data points:

  w = Σj αj φ(xj)

  P(Y=0 | X=x, w, w0) = 1 / (1 + exp(w0 + Σj αj φ(xj)·φ(x)))
                      = 1 / (1 + exp(w0 + Σj αj K(xj, x)))
• Derive the gradient descent rule on αj and w0
• Similar tricks work for all linear models: SVMs, etc.
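A minimal sketch of that gradient rule, under the slide's convention that P(Y=1|x) = σ(w0 + Σj αj K(xj, x)); the learning rate, iteration count, 0/1 label encoding, and function names are my illustrative choices, not from the slides.

```python
import numpy as np

def kernel_logreg_fit(X, y, K, lr=0.5, iters=1000):
    """Gradient ascent on the log-likelihood over (alpha_j, w0).
    P(Y=1|x) = sigmoid(w0 + sum_j alpha_j K(x_j, x)); labels y are 0/1."""
    n = len(X)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
    alpha, w0 = np.zeros(n), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(w0 + G @ alpha)))  # P(Y=1 | x_i) for each i
        err = y - p                                   # dL/dz_i for the logistic log-likelihood
        alpha += lr * (G @ err) / n                   # dz_i/d alpha_j = K(x_j, x_i)
        w0 += lr * err.mean()
    return alpha, w0

# The quadratic kernel separates the XOR-style data from the Perceptron example:
X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1, 0, 1, 0])                 # 1 for the two "positive" points
K = lambda u, v: np.dot(u, v) ** 2
alpha, w0 = kernel_logreg_fit(X, y, K)
G = np.array([[K(u, v) for v in X] for u in X])
p = 1.0 / (1.0 + np.exp(-(w0 + G @ alpha)))
print((p > 0.5).astype(int))               # [1 0 1 0]
```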
What you need to know
• The kernel trick
• How to derive the polynomial kernel
• Common kernels
• Kernelized perceptron