Sample complexity

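The sample complexity of a machine learning algorithm is the number of training samples that it needs in order to successfully learn a target function.

More precisely, the sample complexity is the number of training samples that must be supplied to the algorithm so that the function it returns is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.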

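There are two variants of sample complexity:

The weak variant fixes a particular input-output distribution.
The strong variant takes the worst-case sample complexity over all input-output distributions.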

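The no free lunch theorem, discussed below, proves that, in general, the strong sample complexity is infinite. That is, there is no algorithm that can learn the globally optimal target function using a finite number of training samples.

However, if we are only interested in a particular class of target functions (e.g., only linear functions), then the sample complexity is finite, and it depends linearly on the VC dimension of the class of target functions.^[1]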

Definition

Let $X$ be a space which we call the input space, and $Y$ be a space which we call the output space, and let $Z$ denote the product $X\times Y$ . For example, in the setting of binary classification, $X$ is typically a finite-dimensional vector space and $Y$ is the set $\{-1,1\}$ .

Fix a hypothesis space ${\mathcal {H}}$ of functions $h\colon X\to Y$ . A learning algorithm over ${\mathcal {H}}$ is a computable map from $Z^{*}$ to ${\mathcal {H}}$ . In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from $X$ to $Y$ . Typical learning algorithms include empirical risk minimization, with or without Tikhonov regularization.
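
As a minimal sketch of a learning algorithm in this sense, the following Python function maps a finite sequence of labelled samples to a hypothesis by empirical risk minimization. The hypothesis space (threshold classifiers on the real line), the 0-1 training error, and the names `make_threshold` and `erm_threshold` are illustrative assumptions and are not part of the definition above.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, int]           # one element (x, y) of Z = X x Y, with Y = {-1, 1}
Hypothesis = Callable[[float], int]  # a function h: X -> Y


def make_threshold(t: float) -> Hypothesis:
    """Hypothesis h_t(x) = 1 if x >= t, else -1 (assumed hypothesis space)."""
    return lambda x: 1 if x >= t else -1


def erm_threshold(samples: Sequence[Sample]) -> Tuple[float, Hypothesis]:
    """Map a finite sample sequence to a threshold with the fewest training errors."""
    xs = sorted(x for x, _ in samples)
    # Candidate thresholds: every sample point, plus one value above all of them.
    candidates = xs + [xs[-1] + 1.0] if xs else [0.0]

    def errors(t: float) -> int:
        h = make_threshold(t)
        return sum(h(x) != y for x, y in samples)

    best = min(candidates, key=errors)
    return best, make_threshold(best)


t, h = erm_threshold([(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1)])
print(t, h(0.5), h(0.7))
```

The function accepts a sample sequence of any finite length, which is what makes it a map from $Z^{*}$ to the hypothesis space in this toy setting.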

Fix a loss function ${\mathcal {L}}\colon Y\times Y\to \mathbb {R} _{\geq 0}$ , for example, the square loss ${\mathcal {L}}(y,y')=(y-y')^{2}$ , where $h(x)=y'$ . For a given distribution $\rho$ on $X\times Y$ , the expected risk of a hypothesis (a function) $h\in {\mathcal {H}}$ is

$${\mathcal {E}}(h):=\mathbb {E} _{\rho }[{\mathcal {L}}(h(x),y)]=\int _{X\times Y}{\mathcal {L}}(h(x),y)\,d\rho (x,y)$$
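
The expected risk can be approximated numerically by averaging the loss over many draws from the distribution. The sketch below assumes a toy distribution (inputs uniform on [0, 1], outputs equal to 2x plus Gaussian noise with standard deviation 0.1) and uses the square loss; the distribution and the function names are hypothetical choices for illustration only.

```python
import random
from typing import Callable, Tuple


def square_loss(y_pred: float, y: float) -> float:
    return (y_pred - y) ** 2


def sample_rho(rng: random.Random) -> Tuple[float, float]:
    """Draw (x, y) from an assumed toy distribution: x ~ U[0, 1], y = 2x + N(0, 0.1^2)."""
    x = rng.random()
    return x, 2.0 * x + rng.gauss(0.0, 0.1)


def expected_risk(h: Callable[[float], float], n_draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected risk E_rho[L(h(x), y)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x, y = sample_rho(rng)
        total += square_loss(h(x), y)
    return total / n_draws


risk_good = expected_risk(lambda x: 2.0 * x)  # matches the noiseless trend
risk_bad = expected_risk(lambda x: 0.0)       # ignores the input
print(risk_good, risk_bad)
```

Under these assumptions the first hypothesis has expected risk equal to the noise variance (0.01), while the second has expected risk 4/3 + 0.01, so the estimates should land close to those values.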

In our setting, we have $h={\mathcal {A}}(S_{n})$ , where ${\mathcal {A}}$ is a learning algorithm and $S_{n}=((x_{1},y_{1}),\ldots ,(x_{n},y_{n}))\sim \rho ^{n}$ is a sequence of vectors which are all drawn independently from $\rho$ . Define the optimal risk

$${\mathcal {E}}_{\mathcal {H}}^{*}={\underset {h\in {\mathcal {H}}}{\inf }}{\mathcal {E}}(h).$$

Set $h_{n}={\mathcal {A}}(S_{n})$ , for each $n$ . Note that $h_{n}$ is a random variable and depends on the random variable $S_{n}$ , which is drawn from the distribution $\rho ^{n}$ . The algorithm ${\mathcal {A}}$ is called consistent if ${\mathcal {E}}(h_{n})$ converges in probability to ${\mathcal {E}}_{\mathcal {H}}^{*}$ . In other words, for all $\epsilon ,\delta >0$ , there exists a positive integer $N$ , such that, for all $n\geq N$ , we have

$$\Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{*}\geq \epsilon ]<\delta .$$

The sample complexity of ${\mathcal {A}}$ is then the minimum $N$ for which this holds, as a function of $\rho ,\epsilon$ , and $\delta$ . We write the sample complexity as $N(\rho ,\epsilon ,\delta )$ to emphasize that this value of $N$ depends on $\rho ,\epsilon$ , and $\delta$ . If ${\mathcal {A}}$ is not consistent, then we set $N(\rho ,\epsilon ,\delta )=\infty$ . If there exists an algorithm for which $N(\rho ,\epsilon ,\delta )$ is finite, then we say that the hypothesis space ${\mathcal {H}}$ is learnable.
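
The definition can be explored by simulation. The following sketch estimates the failure probability for one fixed, assumed distribution (inputs uniform on [0, 1], noiseless labels given by a threshold at 0.5) and an empirical risk minimizer over threshold classifiers, then reports the smallest sample size on a coarse grid for which the estimate falls below the chosen confidence level. All names, the distribution, and the grid are hypothetical; the result is an empirical estimate, not the exact sample complexity.

```python
import random


def erm_threshold(samples):
    """ERM over thresholds h_t(x) = 1 if x >= t else -1, for noiseless labels."""
    positives = [x for x, y in samples if y == 1]
    # With noiseless labels, the smallest positive point makes zero training errors.
    return min(positives) if positives else 1.0


def excess_risk(t, t_star=0.5):
    """For x ~ U[0, 1] and 0-1 loss, the excess risk of h_t is |t - t_star| (optimal risk 0)."""
    return abs(t - t_star)


def failure_probability(n, eps, trials=2000, seed=0):
    """Estimate the probability that the excess risk of h_n is at least eps."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        samples = [(x, 1 if x >= 0.5 else -1) for x in xs]
        if excess_risk(erm_threshold(samples)) >= eps:
            failures += 1
    return failures / trials


def empirical_sample_complexity(eps, delta, grid=(5, 10, 20, 40, 80, 160, 320)):
    """Smallest n on the grid whose estimated failure probability is below delta."""
    for n in grid:
        if failure_probability(n, eps) < delta:
            return n
    return None


print(empirical_sample_complexity(eps=0.05, delta=0.05))
```

In this toy setting the learner fails exactly when no sample lands in the interval [0.5, 0.5 + eps), so the true failure probability is (1 - eps)^n. It decreases monotonically in n, which is why checking a single grid point stands in for the "for all n at least N" requirement here.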

In other words, the sample complexity $N(\rho ,\epsilon ,\delta )$ defines the rate of consistency of the algorithm: given a desired accuracy $\epsilon$ and confidence $\delta$ , one needs to sample $N(\rho ,\epsilon ,\delta )$ data points to guarantee that the risk of the output function is within $\epsilon$ of the best possible, with probability at least $1-\delta$ .^[2]

In probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is polynomial, that is, whether $N(\rho ,\epsilon ,\delta )$ is bounded by a polynomial in $1/\epsilon$ and $1/\delta$ . If $N(\rho ,\epsilon ,\delta )$ is polynomial for some learning algorithm, then one says that the hypothesis space ${\mathcal {H}}$ is PAC-learnable. Note that this is a stronger notion than being learnable.
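
As a worked illustration of a polynomial bound, consider the standard special case of a finite hypothesis space with the 0-1 loss, under the additional assumption that some hypothesis has zero risk (the realizable case). Neither assumption is made in the text above; the derivation is a sketch of why the sample complexity is polynomial in that setting, using a union bound over the hypotheses whose risk is at least $\epsilon$ .

```latex
% Assumptions (not from the text): \mathcal{H} is finite, the loss is the 0-1 loss,
% and some h^* \in \mathcal{H} has zero risk, so \mathcal{E}_{\mathcal{H}}^{*} = 0.
\begin{align*}
\Pr_{\rho^{n}}\big[\exists\, h \in \mathcal{H}:\ \mathcal{E}(h) \geq \epsilon
      \text{ and } h \text{ makes no error on } S_{n}\big]
  &\leq |\mathcal{H}|\,(1-\epsilon)^{n} \\
  &\leq |\mathcal{H}|\,e^{-\epsilon n} \\
  &\leq \delta
  \quad\text{whenever}\quad
  n \geq \frac{1}{\epsilon}\Big(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\Big).
\end{align*}
```

Under these assumptions, any algorithm that returns a hypothesis with no training errors satisfies $N(\rho ,\epsilon ,\delta )\leq \lceil (\ln |{\mathcal {H}}|+\ln(1/\delta ))/\epsilon \rceil$ , which is bounded by a polynomial in $1/\epsilon$ and $1/\delta$ because $\ln(1/\delta )\leq 1/\delta$ .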

Unrestricted hypothesis space: infinite sample complexity

One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, whether there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm ${\mathcal {A}}$ , such that, for all $\epsilon ,\delta >0$ , there exists a positive integer $N$ such that for all $n\geq N$ , we have

$$\sup _{\rho }\left(\Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{*}\geq \epsilon ]\right)<\delta ,$$

where $h_{n}={\mathcal {A}}(S_{n})$ , with $S_{n}=((x_{1},y_{1}),\ldots ,(x_{n},y_{n}))\sim \rho ^{n}$ as above. The no free lunch theorem says that, without restrictions on the hypothesis space ${\mathcal {H}}$ , this is not the case: there always exist "bad" distributions for which the sample complexity is arbitrarily large.^[1]
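
The flavour of this result can be seen in a small calculation, which is an illustration under stated assumptions and not a proof. Assume a domain of m points with the uniform distribution, labels chosen arbitrarily (so that an unseen point is misclassified with probability 1/2 on average), and a hypothetical learner that memorizes the training samples. A point is unseen after n draws with probability (1 - 1/m)^n, so the expected risk is half of that quantity, and the number of samples needed to push it below a fixed target grows without bound as m grows.

```python
def memorizer_expected_risk(m: int, n: int) -> float:
    """Expected 0-1 risk after n uniform draws from m points with arbitrary labels.

    Assumption: seen points are predicted correctly, unseen points are
    misclassified with probability 1/2.
    """
    return 0.5 * (1.0 - 1.0 / m) ** n


def samples_needed(m: int, eps: float) -> int:
    """Smallest n for which the expected risk drops below eps."""
    n = 0
    while memorizer_expected_risk(m, n) >= eps:
        n += 1
    return n


for m in (10, 100, 1000):
    print(m, samples_needed(m, eps=0.1))
```

Because the required number of samples increases roughly in proportion to m, no single finite sample size works for every distribution in this family.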

Thus, in order to make statements about the rate of convergence of the quantity

$$\sup _{\rho }\left(\Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{*}\geq \epsilon ]\right),$$

one must either

constrain the space of probability distributions $\rho$ , e.g. via a parametric approach, or
constrain the space of hypotheses ${\mathcal {H}}$ , as in distribution-free approaches.

Restricted hypothesis space: finite sample complexity

The latter approach leads to concepts such as VC dimension and Rademacher complexity, which control the complexity of the space ${\mathcal {H}}$ . A smaller hypothesis space introduces more bias into the inference process, meaning that ${\mathcal {E}}_{\mathcal {H}}^{*}$ may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of regularization.^[2]

It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space ${\mathcal {H}}$ :

${\mathcal {H}}$ is PAC-learnable.
The VC dimension of ${\mathcal {H}}$ is finite.
${\mathcal {H}}$ is a uniform Glivenko-Cantelli class.

This gives a way to prove that certain hypothesis spaces are PAC-learnable, and by extension, learnable.

An example of a PAC-learnable hypothesis space

Let $X=\mathbb {R} ^{d},Y=\{-1,1\}$ , and let ${\mathcal {H}}$ be the space of affine functions on $X$ , that is, functions of the form $x\mapsto \langle w,x\rangle +b$ for some $w\in \mathbb {R} ^{d},b\in \mathbb {R}$ . This is the linear classification with offset learning problem. Now, note that four coplanar points at the vertices of a square cannot be shattered by any affine function, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. Thus, the VC dimension of ${\mathcal {H}}$ is $d+1$ , so it is finite. It follows by the above characterization of PAC-learnable classes that ${\mathcal {H}}$ is PAC-learnable, and by extension, learnable.
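
The shattering argument can be checked numerically in the plane. The sketch below searches over a grid of directions for an affine function whose sign matches a given labelling; the grid search and the function names are assumptions made for illustration. Any separation it reports is genuine, but a very thin margin could in principle be missed at the default resolution, so a negative answer is only as reliable as the grid is fine.

```python
import math
from itertools import product


def separable(points, labels, n_angles=720):
    """True if some affine function x -> <w, x> + b has sign labels[i] at points[i]."""
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        w = (math.cos(theta), math.sin(theta))
        proj = [w[0] * x + w[1] * y for x, y in points]
        pos = [p for p, lab in zip(proj, labels) if lab == 1]
        neg = [p for p, lab in zip(proj, labels) if lab == -1]
        if not pos or not neg:
            return True  # constant labelling: take w = 0 and b = +1 or -1
        if min(pos) > max(neg):
            return True  # any offset b with -min(pos) < b < -max(neg) works
    return False


def shattered(points):
    """True if every labelling of the points is realized by some affine function."""
    return all(
        separable(points, labels)
        for labels in product((-1, 1), repeat=len(points))
    )


triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(shattered(triangle), shattered(square))
```

For the square, the labelling that fails is the one that puts opposite corners in the same class, matching the argument in the text for the case of the plane.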

Sample-complexity bounds

Suppose ${\mathcal {H}}$ is a class of binary functions (functions to $\{0,1\}$ ). Then, ${\mathcal {H}}$ is $(\epsilon ,\delta )$ -PAC-learnable with a sample of size:

$$N=O{\bigg (}{\frac {VC({\mathcal {H}})+\ln {1 \over \delta }}{\epsilon }}{\bigg )}$$

where $VC({\mathcal {H}})$ is the VC dimension of ${\mathcal {H}}$ . Moreover, any $(\epsilon ,\delta )$ -PAC-learning algorithm for ${\mathcal {H}}$ must have sample-complexity:^[4]

$$N=\Omega {\bigg (}{\frac {VC({\mathcal {H}})+\ln {1 \over \delta }}{\epsilon }}{\bigg )}$$

Thus, the sample-complexity is a linear function of the VC dimension of the hypothesis space.
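
To get a feel for how this rate scales, the expression inside the bound can be evaluated directly. The asymptotic notation hides a constant factor, so the sketch below takes that constant as a parameter `c` with a placeholder default of 1.0; this default and the function name are assumptions, and the printed numbers indicate scaling only, not guaranteed sample sizes.

```python
import math


def pac_sample_size(vc_dim: int, eps: float, delta: float, c: float = 1.0) -> int:
    """Evaluate c * (VC(H) + ln(1/delta)) / eps, rounded up.

    The constant c is hidden by the asymptotic notation; c = 1.0 is a
    placeholder assumption, not a value given by the theory.
    """
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    return math.ceil(c * (vc_dim + math.log(1.0 / delta)) / eps)


# Affine classifiers on R^d have VC dimension d + 1.
for d in (2, 10, 100):
    print(d, pac_sample_size(vc_dim=d + 1, eps=0.1, delta=0.05))
```

The output grows linearly with the dimension and inversely with the accuracy parameter, while the dependence on the confidence parameter is only logarithmic.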

Suppose ${\mathcal {H}}$ is a class of real-valued functions with range in $[0,T]$ . Then, ${\mathcal {H}}$ is $(\epsilon ,\delta )$ -PAC-learnable with a sample of size: ^[5]^[6]

$$N=O{\bigg (}T^{2}{\frac {PD({\mathcal {H}})\ln {T \over \epsilon }+\ln {1 \over \delta }}{\epsilon ^{2}}}{\bigg )}$$

where $PD({\mathcal {H}})$ is Pollard's pseudo-dimension of ${\mathcal {H}}$ .
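
The real-valued bound depends on the accuracy through its inverse square, so tightening the accuracy is costlier here than in the binary case. The following sketch evaluates the expression inside the bound; the hidden constant is exposed as a parameter `c` with a placeholder default of 1.0, which is an assumption, as is the function name.

```python
import math


def real_valued_sample_size(
    pdim: int, t_range: float, eps: float, delta: float, c: float = 1.0
) -> int:
    """Evaluate c * T^2 * (PD(H) * ln(T/eps) + ln(1/delta)) / eps^2, rounded up.

    c = 1.0 is a placeholder for the constant hidden by the O(.) notation.
    """
    if not (0 < eps < t_range and 0 < delta < 1):
        raise ValueError("need 0 < eps < T and 0 < delta < 1")
    numerator = pdim * math.log(t_range / eps) + math.log(1.0 / delta)
    return math.ceil(c * t_range ** 2 * numerator / eps ** 2)


coarse = real_valued_sample_size(pdim=3, t_range=1.0, eps=0.1, delta=0.05)
fine = real_valued_sample_size(pdim=3, t_range=1.0, eps=0.05, delta=0.05)
print(coarse, fine, fine / coarse)
```

Halving the accuracy parameter multiplies the value by slightly more than four, reflecting the inverse-square term together with the logarithmic factor.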

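Other settings

In addition to the supervised learning setting, sample complexity is relevant to semi-supervised learning problems including active learning,^[7] where the algorithm can ask for labels on specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in reinforcement learning,^[8] online learning, and unsupervised algorithms, e.g. for dictionary learning.^[9]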

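Efficiency in robotics

A high sample complexity means that many calculations are needed for running a Monte Carlo tree search.^[10] This amounts to a model-free brute-force search in the state space. In contrast, a high-efficiency algorithm has a low sample complexity.^[11] Possible techniques for reducing the sample complexity are metric learning^[12] and model-based reinforcement learning.^[13]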

References

[1] Vapnik, Vladimir (1998). Statistical Learning Theory. New York: Wiley.
[2] Rosasco, Lorenzo (2014). Consistency, Learnability, and Regularization. Lecture Notes for MIT Course 9.520.
[3] Hanneke, Steve (2016). "The optimal sample complexity of PAC learning". J. Mach. Learn. Res. 17 (1): 1319–1333.
[4] Ehrenfeucht, Andrzej; Haussler, David; Kearns, Michael; Valiant, Leslie (1989). "A general lower bound on the number of examples needed for learning". Information and Computation. 82 (3): 247. doi:10.1016/0890-5401(89)90002-3.
[5] Anthony, Martin; Bartlett, Peter L. (2009). Neural Network Learning: Theoretical Foundations. ISBN 9780521118620.
[6] Morgenstern, Jamie; Roughgarden, Tim (2015). "On the Pseudo-Dimension of Nearly Optimal Auctions". NIPS. Curran Associates. pp. 136–144. arXiv:1506.03684.
[7] Balcan, Maria-Florina; Hanneke, Steve; Wortman Vaughan, Jennifer (2010). "The true sample complexity of active learning". Machine Learning. 80 (2–3): 111–139. doi:10.1007/s10994-010-5174-y.
[8] Kakade, Sham (2003). On the Sample Complexity of Reinforcement Learning (PDF). PhD Thesis, University College London: Gatsby Computational Neuroscience Unit.
[9] Vainsencher, Daniel; Mannor, Shie; Bruckstein, Alfred (2011). "The Sample Complexity of Dictionary Learning" (PDF). Journal of Machine Learning Research. 12: 3259–3281.
[10] Kaufmann, Emilie; Koolen, Wouter M. (2017). "Monte-Carlo tree search by best arm identification". Advances in Neural Information Processing Systems. pp. 4897–4906.
[11] Fidelman, Peggy; Stone, Peter (2006). "The chin pinch: A case study in skill learning on a legged robot". Robot Soccer World Cup. Springer. pp. 59–71.
[12] Verma, Nakul; Branson, Kristin (2015). "Sample complexity of learning Mahalanobis distance metrics". Advances in Neural Information Processing Systems. pp. 2584–2592.
[13] Kurutach, Thanard; Clavera, Ignasi; Duan, Yan; Tamar, Aviv; Abbeel, Pieter (2018). "Model-ensemble trust-region policy optimization". arXiv:1802.10592 [cs.LG].
