% Use the following line only if you're still using LaTeX 2.09.
%\documentstyle[icml2014,epsf,natbib]{article}
% If you rely on Latex2e packages, like most modern people, use this:
\documentclass{article}
% use Times
\usepackage{times}
% For figures
\usepackage{graphicx} % more modern
%\usepackage{epsfig} % less modern
\usepackage{subfigure}
\usepackage{footnote}
\usepackage{amsfonts}
\usepackage{mdframed}
% For citations
\usepackage{natbib}
\usepackage{comment}
\usepackage{multirow}
\usepackage{float}
\floatstyle{plain} % optionally change the style of the new float
\newfloat{code}{H}{myc}
% For algorithms
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsmath}
\usepackage{listings}
\lstset{language=Python,
basicstyle=\ttfamily\footnotesize,
keywordstyle=\color{blue}\ttfamily,
stringstyle=\color{red}\ttfamily,
commentstyle=\color{green}\ttfamily,
aboveskip=0pt,
belowskip=0pt,
breaklines=true
}
\setlength{\abovedisplayskip}{0cm}
\setlength{\belowdisplayskip}{0cm}
\usepackage{mathtools}
\newcommand\myeq{\stackrel{\mathclap{\normalfont\mbox{{\tiny def}}}}{=}}
\usepackage[compact]{titlesec}
\titlespacing{\section}{0pt}{0.5ex}{0.3ex}
\titlespacing{\subsection}{0pt}{0.2ex}{0ex}
\titlespacing{\subsubsection}{0pt}{0.1ex}{0ex}
\newcommand{\startcompact}[1]{\par\vspace{-0.75em}\begin{#1}%
\allowdisplaybreaks\ignorespaces}
\newcommand{\stopcompact}[1]{\end{#1}\ignorespaces}
\usepackage{paralist}
\makeatletter
\ifcase \@ptsize \relax% 10pt
\newcommand{\miniscule}{\@setfontsize\miniscule{4}{5}}% \tiny: 5/6
\or% 11pt
\newcommand{\miniscule}{\@setfontsize\miniscule{5}{6}}% \tiny: 6/7
\or% 12pt
\newcommand{\miniscule}{\@setfontsize\miniscule{5}{6}}% \tiny: 6/7
\fi
\makeatother
\newcommand {\aplt} {\ {\raise-.5ex\hbox{$\buildrel<\over\sim$}}\ }
\newcommand{\eqn}[1]{Eqn.~\ref{eqn:#1}}
\newcommand{\fig}[1]{Fig.~\ref{fig:#1}}
\newcommand{\tab}[1]{Table~\ref{tab:#1}}
\newcommand{\secc}[1]{Section~\ref{sec:#1}}
\def\etal{{\textit{et~al.~}}}
\newcommand{\BigO}[1]{\ensuremath{\operatorname{O}\left(#1\right)}}
\usepackage[symbol*]{footmisc}
\DefineFNsymbolsTM{myfnsymbols}{% def. from footmisc.sty "bringhurst" symbols
\textasteriskcentered *
\textdagger \dagger
\textdaggerdbl \ddagger
\textsection \mathsection
\textbardbl \|%
\textparagraph \mathparagraph
}%
% As of 2011, we use the hyperref package to produce hyperlinks in the
% resulting PDF. If this breaks your system, please comment out the
% following usepackage line and replace \usepackage{icml2014} with
% \usepackage[nohyperref]{icml2014} above.
\usepackage{hyperref}
% Packages hyperref and algorithmic misbehave sometimes. We can fix
% this with the following command.
\newcommand{\theHalgorithm}{\arabic{algorithm}}
% Employ the following version of the ``usepackage'' statement for
% submitting the draft version of the paper for review. This will set
% the note in the first column to ``Under review. Do not distribute.''
%\usepackage{icml2014}
% Employ this version of the ``usepackage'' statement after the paper has
% been accepted, when creating the final version. This will set the
% note in the first column to ``Proceedings of the...''
\usepackage[accepted]{icml2014}
\begin{document}
\twocolumn[
\icmltitle{Learning to Reason\\{\small PhD Thesis Proposal}}
\icmlauthor{Wojciech Zaremba}{[email protected]}
\vskip +0.1in
\icmlauthor{Advisors:}{}
\vskip +0.03in
\icmlauthor{Rob Fergus}{[email protected]}
\vskip +0.03in
\icmlauthor{Yann LeCun}{[email protected]}
\icmladdress{New York University}
\icmlkeywords{computer vision, convolutional neural networks, recursive neural networks, natural language processing, recurrent neural networks, language model, LSTMs, program understanding, artificial intelligence}
\vskip 0.3in
]
\begin{abstract}
Neural networks have proven to be very powerful models for object
recognition \cite{krizhevsky2012imagenet}, natural language
processing \cite{mikolov2012statistical}, speech recognition
\cite{graves2013speech}, and other tasks
\cite{sutskever2014sequence}. However, there is still a huge gap
between them and a truly intelligent system. I identify several
qualities that an intelligent system should possess, namely: (i)
reasoning ability, (ii) the capability to integrate with external
interfaces and (iii) small sample complexity. My thesis will focus on
tackling these problems.
\end{abstract}
\section{Introduction}
It is clear that improving the performance of current statistical learning systems
will not make them intelligent, even if they achieve
$0\%$ test error. This proposal explores what skills are necessary for a statistical
learning system to become ``intelligent'' and outlines attempts to
provide these skills.
While many qualities make up true intelligence, we focus on three
skills that are notably lacking from current approaches: (i) the
ability to reason, (ii) the capability to integrate with external
interfaces and (iii) small sample complexity. The goal is to provide
all of these skills within a single unified framework. This will be
approached via a wide range of tasks in different domains, to empirically
verify its robustness.
For some of these tasks, exploratory work has already been performed
giving good intuitions for the proposed work.
\section{Reasoning abilities}
Reasoning is the process of forming conclusions, judgements, or
inferences from facts or premises. An artificial system that can
reason should be able to understand high-level concepts such as:
\begin{itemize}
\item Scope
\item Conditioning
\item Nesting
\item Repetitions
\item Memorization
\end{itemize}
While each of these skills could be addressed separately in a particular domain, the
resulting system would be of little use and would require
strong assumptions about possible scenarios. Moreover, an intelligent
reasoning system cannot be based only on predefined rules. Instead, it
should be based on pattern matching and the application of learned heuristic algorithms.
There are many domains where we can test the reasoning ability of our
system to see if it is able to learn postulated concepts. These
include (a) computer programs and (b) the derivation of mathematical
theorems. Both of these are rich in scoping, branching, nesting,
pattern repetition, and so on. Drawing high-level conclusions in such
domains requires sophisticated reasoning. Eventually, the framework
should be able to handle many different domains simultaneously.
\subsection{Reasoning in computer programs}
The first reasoning task that I intend to tackle is to teach
statistical models the meaning of computer programs.
Consider one possible instantiation of this, where the task is to take
a program as an input (e.g.~character-by-character), and predict the
program output. This requires the statistical model to understand
every single operation in a program. For instance, in the case of
addition it involves carrying and the memorization of operations on
digits. Moreover, programs can contain variable assignments, if-statements, and so on.
We were able to partially address this problem \cite{zaremba2014execute}.
Figure \ref{fig:prog} shows an example program and its output, along
with a prediction of the output obtained with a recurrent neural network.
\begin{figure}
\begin{code}
\begin{mdframed}
{\bf Input:}
\begin{lstlisting}
f=(8794 if 8887<9713 else (3*8334))
print((f+574))
\end{lstlisting}
{\bf Target:} 9368. \\
{\bf Model prediction:} 9368.
\end{mdframed}
\end{code}
\caption{An example program that we take as an input. The task is
to predict the evaluation of the program. We have
achieved a high level of performance on this task \cite{zaremba2014execute}.}
\label{fig:prog}
\end{figure}
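As a concrete illustration, the following is a hypothetical sketch of how (program, output) training pairs of the kind shown in \fig{prog} could be generated; the snippet template and value ranges are illustrative assumptions, not the exact generator used in \cite{zaremba2014execute}.
\begin{lstlisting}
import contextlib, io, random

# Hypothetical generator of (source, target) pairs for the
# program-evaluation task; mirrors the example in Fig. 1.
def sample_program(max_val=10000):
    a, b, c, d, e = (random.randint(0, max_val) for _ in range(5))
    return (f"f=({a} if {b}<{c} else (3*{d}))\n"
            f"print((f+{e}))")

def make_example():
    src = sample_program()
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(src, {})           # run the snippet with the interpreter
    return src, buf.getvalue().strip()

# `source` is read character by character; `target` is the string to predict.
source, target = make_example()
\end{lstlisting}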
Although current results are promising, they are very limited. We are able to deal with programs
that can be evaluated by reading them once from left-to-right, but
generic programs are far more complex. Our reasoning system has to be
able to compute for an arbitrarily long time if the
task requires it. Moreover, it shouldn't be limited by a fixed memory
size. Instead, memory should be available
as an interface (see Section \ref{sec:interface}).
\subsection{Reasoning in mathematics}
It is known that theorem proving is computationally intractable in general.
However, humans are able to prove theorems. They do this by employing prior
knowledge, acquired from experience of solving other, related
problems, and by fitting known mathematical ``tricks'' to the new problem.
More formally, we can think of a proof as a tree of axiom applications
rooted at the
hypothesis that we want to prove. Thus mathematical skill can be perceived as
learning a prior over trees of axiom applications. This prior ``suggests'' which sequence of axioms is more
likely to lead to a proof of the theorem.
My interest lies in training statistical machine learning systems to learn such priors.
We have started with the very constrained mathematical domain of
polynomials and demonstrated that, in this setting, we can learn a prior over mathematical identities.
Our recent work \cite{zaremba2014learning} concerns identities
over matrix polynomials. Figure \ref{fig:ident} shows an example of a simple identity
over matrix polynomials. It states that $\sum_{i,j} \sum_k A_{i, k}B_{k, j} =
\sum_{j} \sum_k \left(\sum_i A_{i, k}\right)B_{k, j}$. This is a toy identity; our system is able
to derive much more sophisticated ones.
\begin{figure}
\centering
\includegraphics[width=0.48\textwidth]{imgs/example1_brute.png}\\
{\bf Is equivalent to:}\\
\includegraphics[width=0.48\textwidth]{imgs/example1_opt.png}\\
\caption{An example of two equivalent mathematical expressions. It states that
$\sum_{i,j} \sum_k A_{i, k}B_{k, j} = \sum_{j} \sum_k \left(\sum_i A_{i,
k}\right)B_{k, j}$. \cite{zaremba2014learning} contains much more sophisticated examples.}
\label{fig:ident}
\end{figure}
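The toy identity above can be checked numerically; the following is a minimal sanity check (using NumPy, purely for illustration), in which both sides reduce to the sum of all entries of $AB$:
\begin{lstlisting}
import numpy as np

# Minimal numerical check of the toy identity from Fig. 2:
# sum_{i,j} sum_k A[i,k] B[k,j]  ==  sum_j sum_k (sum_i A[i,k]) B[k,j]
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))

lhs = (A @ B).sum()              # evaluate the full matrix product first
rhs = (A.sum(axis=0) @ B).sum()  # collapse the i index before multiplying
assert np.isclose(lhs, rhs)
\end{lstlisting}
The right-hand side avoids forming the full product, which is exactly the kind of cheaper but equivalent computation the learned prior is meant to discover.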
\section{Interface learning}
\label{sec:interface}
Neural networks are very powerful statistical models; however, some
concepts appear to be very difficult for them to express. For instance,
nesting is a very common operation that allows one to deal
with (1) object hierarchies, (2) natural language parsing, and (3)
compositionality. I think that the ability to nest concepts could be provided by
adding a stack as an external interface to the neural network, with
the neural network learning how to use such an interface.
More broadly, external interfaces could enhance the capabilities of neural networks, and also provide
access to external resources and datasets. Potentially useful
interfaces include:
\begin{itemize}
\item stack
\item FIFO
\item external memory \cite{weston2014memory, graves2014neural}
\item search engine
\item database
\item execution environment such as a Python interpreter
\end{itemize}
The state space of many interfaces is massive (e.g.\ all possible search queries).
Moreover, external interfaces are non-differentiable.
These obstacles could potentially be addressed by learning a differentiable
model (e.g.~a neural network) that describes them. This network would simulate the
external interface, providing a differentiable stand-in through which gradients can flow, as sketched below.
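A minimal sketch of this idea follows, assuming a generic black-box \texttt{interface} function and a small feed-forward surrogate; the architecture and training details are illustrative assumptions only.
\begin{lstlisting}
import torch
import torch.nn as nn

# Hypothetical sketch: fit a differentiable surrogate to a
# non-differentiable black-box interface from (query, response) pairs.
def interface(q):
    return torch.round(q * 10.0) / 10.0   # stand-in for any black box

surrogate = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 8))
opt = torch.optim.SGD(surrogate.parameters(), lr=0.1)

for step in range(1000):
    q = torch.randn(32, 8)                # sampled queries
    loss = nn.functional.mse_loss(surrogate(q), interface(q))
    opt.zero_grad()
    loss.backward()
    opt.step()
# The surrogate can now replace the real interface inside a larger
# network, so that gradients can flow through it during training.
\end{lstlisting}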
\section{Small sample complexity}
Current deep learning systems suffer from large sample
complexity. This hinders the potential use of these systems for online
learning (e.g.~robots). It is expected that during the initial phase of
learning any system without prior knowledge would need to consume a large number of samples.
However, we hope that as training progresses, sample complexity would drop, but this is
not observed in current systems. I propose several approaches that would
allow sample complexity to be decreased.
The meta-optimizer (see Subsection \ref{subsec:meta}) is the desired
solution, but I also propose some intermediate approaches that can
decrease sample complexity.
\subsection{Automatic derivation of better models}
Neural networks are designed in an ad-hoc fashion, using a researcher's
intuition about which architectures will result in high prediction
performance. One way of addressing this issue would be to
fully understand why and how the models work. However, mathematical
analyses give little insight, and it is unclear whether such an understanding is even
attainable. Instead, I would like to automatically
explore a large space of models and evaluate them on various tasks.
This involves trying different connectivity patterns, activation
functions, and hyper-parameter settings. To search over this discrete
space, I intend to use genetic programming, as sketched below.
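The sketch below illustrates a simple genetic-algorithm-style search (mutation only, no crossover) over a discrete architecture space; the search space, mutation operator, and \texttt{evaluate} function (assumed to train a candidate briefly and return validation accuracy) are hypothetical placeholders.
\begin{lstlisting}
import random

# Hypothetical genetic search over a discrete space of architectures.
SEARCH_SPACE = {
    "layers": [1, 2, 3, 4],
    "hidden": [64, 128, 256, 512],
    "activation": ["relu", "tanh", "sigmoid"],
    "lr": [0.3, 0.1, 0.03, 0.01],
}

def random_model():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(model):
    child = dict(model)
    key = random.choice(list(SEARCH_SPACE))
    child[key] = random.choice(SEARCH_SPACE[key])
    return child

def search(evaluate, population=20, generations=10):
    pool = [random_model() for _ in range(population)]
    for _ in range(generations):
        ranked = sorted(pool, key=evaluate, reverse=True)
        parents = ranked[: population // 4]   # keep the fittest quarter
        children = [mutate(random.choice(parents))
                    for _ in range(population - len(parents))]
        pool = parents + children
    return max(pool, key=evaluate)
\end{lstlisting}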
\subsection{Augmentation marginalization}
For simplicity, we focus on computer vision tasks.
We often train neural networks for multiple epochs, during which the same
samples are presented multiple times. The best performance is
achieved when samples are slightly modified over the course of training.
In the case of computer vision, this is done by jittering images, rotating
them, and so on. This is known as data augmentation.
Theoretically, a full image contains information about all updates obtained
with all possible augmented samples. More formally, let $x$ be an image, $y$ its label, and
$x_1, x_2, \dots, x_k$ its augmented copies. $L$ is a loss function, and $\theta_i$ are the parameters at step $i$, so $\theta_1$ is the initial set of parameters and $\theta_{k+1}$ is the set of parameters after $k$ updates.
I would like to
predict, for a given $x$, the sum of the derivatives over all data augmentations:
\begin{align}
f(x, \theta_1) \myeq \sum_{i=1}^k \partial_\theta L(x_i, y, \theta_i), \qquad \theta_{k+1} - \theta_1 = -\eta\, f(x, \theta_1),
\end{align}
where $\eta$ is the learning rate, assuming plain SGD updates $\theta_{i+1} = \theta_i - \eta\, \partial_\theta L(x_i, y, \theta_i)$.
Instead of following the gradient $\partial_\theta L(x, y, \theta_1)$, we could
follow $f(x, \theta_1)$.
This could improve learning speed, as a single sample would provide
information about all its augmented copies.
The function $f$ might have its own parameters, and I would like to model it
as a neural network which would be trained in a supervised fashion.
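A hypothetical sketch of how the regression target for such a network could be computed is given below: run $k$ plain SGD steps on augmented copies of a single image and record the resulting parameter displacement. The model, loss, and \texttt{augment} function are assumed to be given.
\begin{lstlisting}
import torch
import torch.nn as nn

# Hypothetical target for f(x, theta_1): k SGD steps on augmented
# copies of one image; the displacement equals -lr * f under plain SGD.
def augmentation_target(model, x, y, augment, k=8, lr=0.1):
    theta_1 = [p.detach().clone() for p in model.parameters()]
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(k):
        opt.zero_grad()
        loss_fn(model(augment(x)), y).backward()
        opt.step()
    return [p.detach() - p0 for p, p0 in zip(model.parameters(), theta_1)]
\end{lstlisting}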
\subsection{One-shot learning objective}
Humans are able to recognize new objects after seeing them just once.
Contemporary deep learning systems are far from being able to do this,
requiring hundreds or even thousands of examples. We propose a new
formulation of neural network training that focuses on training for the
one-shot learning objective.
The main idea is to train one neural network to predict parameters of another neural network.
More formally, let $\phi(x, W_{in}, \theta) = (y, W_{y})$ be a neural network parametrized
by $\theta$. Weights $W_{in} \in \mathbb{R}^{f \times l}$ are top layer weights, where
$f$ is the top layer feature size, and $l$ is the number of labels. $W_{y} \in \mathbb{R}^f$ is a
single weight vector used to predict class $y$.
For a given example $x'$ of a new class $y'$, we compute $W_{y'} = \phi(x', None, \theta)$ to produce the weights $W_{y'}$.
We use these weights to classify
other objects belonging to the class $y'$.
This idea is currently being explored with my collaborators Elias Bingham and Brenden Lake.
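The following is a hypothetical sketch of this setup: an encoder produces top-layer features, and a small weight generator maps the features of one example of an unseen class to its classifier weight vector $W_{y'}$. All layer sizes are illustrative assumptions.
\begin{lstlisting}
import torch
import torch.nn as nn

# Hypothetical sketch of the one-shot objective: phi maps a single
# example of a new class to a classifier weight vector W_{y'}.
f = 64                                     # top-layer feature size
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, f))
weight_generator = nn.Linear(f, f)

def classify(x, W_in):
    # W_in in R^{f x l}: one f-dimensional weight vector per label.
    return encoder(x) @ W_in               # logits over the l labels

def add_new_class(x_new):
    # One example of an unseen class -> predicted weight vector W_{y'}.
    return weight_generator(encoder(x_new)).detach()
\end{lstlisting}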
\subsection{Meta-optimizer}
\label{subsec:meta}
I would like to build a meta-optimizer that would be used to train a
neural network. Such an optimizer would take as input the gradients
of a neural network, and would decide on the next update step. The
optimizer itself could be parameterized with another neural network,
which would allow it to simulate any first-order, gradient-based
learning algorithm such as SGD, momentum, or L-BFGS. This implies that
such a meta-optimizer should subsume all first-order, gradient-based
optimization techniques. This work is in progress,
together with my collaborators Rahul Krishnan and Anna Choromanska.
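A minimal coordinate-wise sketch of this idea is shown below: a small network maps each parameter's current gradient to its update step. The architecture is an illustrative assumption, and training the meta-optimizer itself (e.g.\ by unrolling the inner optimization) is omitted.
\begin{lstlisting}
import torch
import torch.nn as nn

# Hypothetical coordinate-wise meta-optimizer: a small network maps the
# current gradient of each parameter to its update step.
meta_opt = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))

def meta_step(params):
    # Apply the learned update rule after gradients have been computed.
    with torch.no_grad():
        for p in params:
            g = p.grad.reshape(-1, 1)          # one coordinate per row
            p -= meta_opt(g).reshape(p.shape)  # learned update direction
# Training meta_opt (e.g. by unrolling several meta_steps and
# back-propagating the trainee's loss) is omitted in this sketch.
\end{lstlisting}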
\section{Discussion}
Tackling the aforementioned problems would take us much closer to
truly intelligent systems, and they define three core pillars
of Artificial Intelligence. However, there are many other problems which
need to be solved and integrated to achieve a fully
intelligent system, e.g.~navigation, learning by imitation, cooperation, and many others.
%However, I think that all that skills can be integrated by means of external interface, and
%don't have to be modeled in any special way. For instance, navigation skill could emerge
%as an use two interfaces (1) GPS location interface, and (2) an external memory.
%\section{Disclaimer}
%This is my personal opinion, and it shouldn't be judged in a scientific way.
%I strongly believe that creation of artificial intelligence is potentially
%dangerous. However, I think, that more dangerous is avoiding to create it.
%We exhaust resources of our planet in rapid fashion, and lack of resources
%leads to wars. The only way to prevent it is to have abundance of resources.
%Artificial intelligence could provide abundance of all resources.
\bibliography{bibliography}
\bibliographystyle{icml2014}
\end{document}