Critical points of an autoencoder can provably recover sparsely used overcomplete dictionaries
A Rangamani, A Mukherjee, A Arora… - arXiv preprint arXiv:1708.03735, 2017 - researchgate.net
Abstract
In Dictionary Learning one is trying to recover incoherent matrices $A^* \in \mathbb{R}^{n \times h}$ (typically overcomplete and whose columns are assumed to be normalized) and sparse vectors $x^* \in \mathbb{R}^h$ with a small support of size $h^p$ for some $0 < p < 1$, while being given access to observations $y \in \mathbb{R}^n$ where $y = A^* x^*$. In this work we undertake a rigorous analysis of the possibility that dictionary learning could be performed by gradient descent on autoencoders, which are $\mathbb{R}^n \to \mathbb{R}^n$ neural networks with a single ReLU activation layer of size $h$.
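To make the setup concrete, the following is a minimal numpy sketch of the generative model $y = A^* x^*$ and of one common instantiation of such an autoencoder: a weight-tied encoder-decoder with a single ReLU layer of size $h$. The support distribution, the bias value `b`, and the weight tying are illustrative assumptions here, not necessarily the paper's exact choices.

```python
import numpy as np

def sample_dictionary(n, h, rng):
    """A random dictionary with unit-norm columns, standing in for the
    incoherent A* of the abstract (incoherence is not enforced here)."""
    A = rng.standard_normal((n, h))
    return A / np.linalg.norm(A, axis=0, keepdims=True)

def sample_sparse_code(h, p, rng):
    """A sparse code x* in R^h supported on about h**p coordinates, with
    uniform values on the support -- one plausible reading of the
    'natural distributional assumptions', not the paper's exact law."""
    k = max(1, int(round(h ** p)))
    x = np.zeros(h)
    support = rng.choice(h, size=k, replace=False)
    x[support] = rng.uniform(0.5, 1.5, size=k)
    return x

def relu_autoencoder(W, b, y):
    """A weight-tied R^n -> R^n autoencoder: W has shape (h, n), the
    hidden layer is a single ReLU of size h, the decoder reuses W^T."""
    code = np.maximum(W @ y - b, 0.0)  # hidden ReLU activations in R^h
    return W.T @ code                  # reconstruction of y in R^n

rng = np.random.default_rng(0)
n, h, p = 32, 256, 0.5
A_star = sample_dictionary(n, h, rng)
x_star = sample_sparse_code(h, p, rng)
y = A_star @ x_star                         # observation y = A* x*
y_hat = relu_autoencoder(A_star.T, 0.2, y)  # autoencoder evaluated at A*
```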
Towards the above objective we propose a new autoencoder loss function which modifies the squared reconstruction-error term and also adds new regularization terms. We construct a proxy for the expected gradient of this loss function, which we motivate with high-probability arguments under natural distributional assumptions on the sparse code $x^*$. Under the same distributional assumptions on $x^*$, we show that, in the limit of large enough sparse-code dimension, any zero point of our proxy for the expected gradient of the loss function within a certain radius of $A^*$ corresponds to dictionaries whose action on the sparse vectors is indistinguishable from that of $A^*$. We also report simulations on synthetic data in support of our theory.
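For reference, this is the unmodified squared reconstruction error that such an objective starts from, reusing the sampler and autoencoder sketched above; the paper's modifications to this term and its added regularizers are not reproduced in this sketch.

```python
def reconstruction_loss(W, b, Y):
    """Plain batched squared reconstruction error, 0.5 * mean ||y_hat - y||^2.
    The paper's objective modifies this term and adds regularizers,
    which are omitted from this sketch."""
    code = np.maximum(Y @ W.T - b, 0.0)  # ReLU codes, shape (m, h)
    Y_hat = code @ W                     # decoded batch, shape (m, n)
    return 0.5 * np.mean(np.sum((Y_hat - Y) ** 2, axis=1))

# Evaluate the loss at the ground-truth dictionary on a synthetic batch;
# the theory concerns zeros of (a proxy for) the expected gradient of the
# full modified objective near A*, not of this plain loss.
Y = np.stack([A_star @ sample_sparse_code(h, p, rng) for _ in range(64)])
print(reconstruction_loss(A_star.T, 0.2, Y))
```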