Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
1 - 15 of 291 publications
High-Precision Automated Reconstruction of Neurons with Flood-Filling Networks
Jörgen Kornfeld
Larry Lindsey
Winfried Denk
Nature Methods (2018)
Preview abstract
Reconstruction of neural circuits from volume electron microscopy data requires the tracing of cells in their entirety, including all their neurites. Automated approaches have been developed for tracing, but their error rates are too high to generate reliable circuit diagrams without extensive human proofreading. We present flood-filling networks, a method for automated segmentation that, similar to most previous efforts, uses convolutional neural networks, but contains in addition a recurrent pathway that allows the iterative optimization and extension of individual neuronal processes. We used flood-filling networks to trace neurons in a dataset obtained by serial block-face electron microscopy of a zebra finch brain. Using our method, we achieved a mean error-free neurite path length of 1.1 mm, and we observed only four mergers in a test set with a path length of 97 mm. The performance of flood-filling networks was an order of magnitude better than that of previous approaches applied to this dataset, although with substantially increased computational costs.
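To make the iterative, recurrent flood-filling idea concrete, here is a minimal sketch (not the published implementation) of growing a single object from a seed by repeatedly re-running a local model on a moving field of view; the `predict_fov` callable, step size, and thresholds are placeholder assumptions.

```python
import numpy as np

def flood_fill_segment(volume, seed, predict_fov, fov=33, step=8,
                       threshold=0.9, max_steps=100000):
    """Grow one object mask from a seed voxel by repeatedly evaluating a local
    model on fields of view (FOVs) centred on a frontier of positions.
    `predict_fov(image_patch, mask_patch) -> prob_patch` stands in for the
    trained network; feeding the current mask back in is the recurrent step."""
    pom = np.full(volume.shape, 0.05, dtype=np.float32)   # probability-of-object map
    pom[seed] = 0.95                                       # seed the object
    r = fov // 2
    frontier, visited = [seed], set()
    for _ in range(max_steps):
        if not frontier:
            break
        z, y, x = frontier.pop(0)
        if (z, y, x) in visited:
            continue
        visited.add((z, y, x))
        sl = np.s_[z - r:z + r + 1, y - r:y + r + 1, x - r:x + r + 1]
        img, mask = volume[sl], pom[sl]
        if img.shape != (fov, fov, fov):
            continue                                       # FOV falls outside the volume
        pom[sl] = predict_fov(img, mask)                   # locally refine/extend the object
        # move the FOV a fixed step in each direction the object confidently reaches
        for dz, dy, dx in [(step, 0, 0), (-step, 0, 0), (0, step, 0),
                           (0, -step, 0), (0, 0, step), (0, 0, -step)]:
            nz, ny, nx = z + dz, y + dy, x + dx
            inside = (0 <= nz < pom.shape[0] and 0 <= ny < pom.shape[1]
                      and 0 <= nx < pom.shape[2])
            if inside and pom[nz, ny, nx] > threshold:
                frontier.append((nz, ny, nx))
    return pom > threshold
```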
View details
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Ying Xiao
Yuxuan Wang
Joel Shor
International Conference on Machine Learning (2018)
Preview abstract
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from a single-speaker and 44-speaker Tacotron model on a prosody transfer task.
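As a rough mechanical picture of "conditioning Tacotron on this learned embedding space", the sketch below broadcast-concatenates a fixed-length prosody embedding onto every text-encoder timestep. The shapes, names, and the choice of concatenation over addition are illustrative assumptions rather than the paper's exact architecture.

```python
import numpy as np

def condition_on_prosody(text_encodings, prosody_embedding):
    """Broadcast a fixed-length prosody embedding across the text-encoder
    timesteps and concatenate it (array-shape sketch only; the real system
    learns the reference encoder and decoder end to end).

    text_encodings:    [T_text, D_text]  outputs of the text encoder
    prosody_embedding: [D_prosody]       summary of the reference audio
    returns:           [T_text, D_text + D_prosody]
    """
    tiled = np.tile(prosody_embedding, (text_encodings.shape[0], 1))
    return np.concatenate([text_encodings, tiled], axis=-1)

# toy usage
text = np.random.randn(50, 256)    # 50 phoneme/character steps
prosody = np.random.randn(128)     # reference embedding, e.g. from a recurrent reference encoder
conditioned = condition_on_prosody(text, prosody)
assert conditioned.shape == (50, 384)
```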
View details
Burst Denoising with Kernel Prediction Networks
Ben Mildenhall
Jiawen Chen
Dillon Sharlet
Ren Ng
Rob Carroll
CVPR (2018) (to appear)
Preview abstract
We present a technique for jointly denoising bursts of images taken from a handheld camera. In particular, we propose a convolutional neural network architecture for predicting spatially varying kernels that can both align and denoise frames, a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima. Our model matches or outperforms the state-of-the-art across a wide range of noise levels on both real and synthetic data.
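The merge stage of a kernel prediction network can be sketched as follows: a CNN (not shown here) predicts one small kernel per frame and per output pixel, and the denoised value is the average of the per-frame filtered results. This is a slow reference loop for intuition only; shapes and details are simplified assumptions.

```python
import numpy as np

def merge_burst_with_kernels(burst, kernels):
    """Apply spatially varying per-frame kernels and average over the burst.

    burst:   [N, H, W]        N roughly aligned grayscale frames
    kernels: [N, H, W, K, K]  one KxK kernel per frame and output pixel,
                              as predicted by the (omitted) network
    returns: [H, W]           merged, denoised image
    """
    n, h, w = burst.shape
    k = kernels.shape[-1]
    r = k // 2
    padded = np.pad(burst, ((0, 0), (r, r), (r, r)), mode="reflect")
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(n):                       # frames
        for y in range(h):                   # output rows
            for x in range(w):               # output columns
                patch = padded[i, y:y + k, x:x + k]
                out[y, x] += np.sum(patch * kernels[i, y, x])
    return out / n                           # average the per-frame filtered values
```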
View details
Preview abstract
Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While much classification and detection work focuses on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow us to explain important aspects of an image, including (1) scene type; (2) which thing classes are likely to be present and their location (through contextual reasoning); and (3) physical attributes, material types and geometric properties of the scene. To understand stuff and things in context we introduce COCO-Stuff, which augments 120,000 images of the COCO dataset with pixel-wise annotations for 91 stuff classes. We introduce an efficient stuff annotation protocol based on superpixels which leverages the original thing annotations. We quantify the speed versus quality trade-off of our protocol and explore the relation between annotation time and boundary complexity. Furthermore, we use COCO-Stuff to analyze: (a) the importance of stuff and thing classes in terms of their surface cover and how frequently they are mentioned in image captions; (b) the spatial relations between stuff and things, highlighting the rich contextual relations that make our dataset unique; and (c) the performance of a modern semantic segmentation method on stuff and thing classes, and whether stuff is easier to segment than things.
View details
Preview abstract
Transferring the knowledge learned from large scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and is thus difficult to scale. In this work, we first tackle a problem in large scale FGVC: our method won first place in the iNaturalist 2017 large scale species classification challenge. Central to the success of our approach is a training scheme that uses higher image resolution and deals with the long-tailed distribution of training data. Next, we study transfer learning via fine-tuning from large scale datasets to small scale, domain-specific FGVC datasets. We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure. Our proposed transfer learning outperforms ImageNet pre-training and obtains state-of-the-art results on multiple commonly used FGVC datasets.
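A sketch of how such an Earth Mover's Distance based domain similarity could be computed, assuming per-class mean features have already been extracted with a pretrained network. It relies on the third-party POT package (`ot.emd2`); the weighting by class counts and the exponential mapping to a similarity are written from the abstract's description, so treat the details as assumptions.

```python
import numpy as np
import ot  # "Python Optimal Transport" package, assumed installed (pip install POT)

def domain_similarity(source_feats, source_counts, target_feats, target_counts,
                      gamma=0.01):
    """Earth Mover's Distance between two domains, each represented by its
    per-class mean feature vectors weighted by class frequency, mapped to a
    similarity with exp(-gamma * EMD).

    source_feats:  [S, D] mean feature per source class
    source_counts: [S]    images per source class
    target_feats:  [T, D] mean feature per target class
    target_counts: [T]    images per target class
    """
    a = np.asarray(source_counts, dtype=np.float64)
    b = np.asarray(target_counts, dtype=np.float64)
    a, b = a / a.sum(), b / b.sum()                   # EMD needs normalised weights
    cost = ot.dist(source_feats, target_feats, metric="euclidean")  # ground distances
    emd = ot.emd2(a, b, cost)                         # exact optimal-transport cost
    return float(np.exp(-gamma * emd))
```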
View details
Stereo Magnification: Learning view synthesis using multiplane images
Tinghui Zhou
John Flynn
Graham Fyffe
ACM Trans. Graph. (Proc. SIGGRAPH), 37 (2018)
Preview abstract
The view synthesis problem—generating novel views of a scene from known imagery—has garnered recent attention due in part to compelling applications in virtual and augmented reality. In this paper, we explore an intriguing scenario for view synthesis: extrapolating views from imagery captured by narrow-baseline stereo cameras, including dual-lens camera phones and VR cameras. We call this problem stereo magnification, and propose a new learning framework that leverages a new layered representation that we call multiplane images (MPIs), as well as a massive new data source for learning view extrapolation: online videos on YouTube. Using data mined from such videos, we train a deep network that predicts an MPI from an input stereo image pair. This inferred MPI can then be used to synthesize a range of novel views of the scene, including views that extrapolate significantly beyond the input baseline. We show that our method compares favorably with several recent view synthesis methods, and demonstrate applications in magnifying narrow-baseline stereo images.
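For intuition, rendering an MPI at the reference view reduces to back-to-front alpha ("over") compositing of its RGBA planes; the sketch below shows only that step and omits the per-plane homography warp needed for genuinely novel viewpoints. Array shapes and names are illustrative.

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Back-to-front "over" compositing of multiplane-image layers.

    colors: [D, H, W, 3]  RGB for each of D fronto-parallel planes (index 0 = farthest)
    alphas: [D, H, W, 1]  per-plane alpha in [0, 1]
    returns: [H, W, 3]    rendered image
    """
    out = np.zeros(colors.shape[1:], dtype=np.float64)
    for rgb, a in zip(colors, alphas):        # iterate far to near
        out = rgb * a + out * (1.0 - a)       # standard alpha "over" operator
    return out
```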
View details
Preview abstract
Accurate identification and localization of abnormalities from radiology images play an integral part in clinical diagnosis and treatment planning. Building a highly accurate prediction model for these tasks usually requires a large number of images manually annotated with labels and finding sites of abnormalities. In reality, however, such annotated data are expensive to acquire, especially the ones with location annotations. We need methods that can work well with only a small amount of location annotations. To address this challenge, we present a unified approach that simultaneously performs disease identification and localization through the same underlying model for all images. We demonstrate that our approach can effectively leverage both class information as well as limited location annotation, and significantly outperforms the comparative reference baseline in both classification and localization tasks.
View details
BLADE: Filter Learning for General Purpose Image Processing
John Isidoro
Frank Ong
International Conference on Computational Photography (2018)
Preview abstract
The Rapid and Accurate Image Super Resolution (RAISR) method of Romano, Isidoro, and Milanfar is a computationally efficient image upscaling method using a trained set of filters. We describe a generalization of RAISR, which we name Best Linear Adaptive Enhancement (BLADE). This approach is a trainable edge-adaptive filtering framework that is general, simple, computationally efficient, and useful for a wide range of image processing problems. We show applications to denoising, compression artifact removal, demosaicing, and approximation of anisotropic diffusion equations.
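A minimal sketch of BLADE-style inference: quantize each pixel's local structure (orientation, strength, and coherence from a smoothed structure tensor) into a bucket, then apply that bucket's learned linear filter. The filter bank, bucket counts, and thresholds are hypothetical placeholders, and filter training is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blade_filter(image, filters, n_orient=16, patch=5):
    """Edge-adaptive filtering: pick one learned linear filter per pixel from
    structure-tensor features, then apply it.

    filters: [n_orient, 4, 4, patch, patch]  hypothetical trained filter bank,
             indexed by quantised orientation, strength, and coherence
    """
    img = image.astype(np.float64)
    gy, gx = np.gradient(img)
    jxx = gaussian_filter(gx * gx, 2.0)                 # smoothed 2x2 structure tensor
    jxy = gaussian_filter(gx * gy, 2.0)
    jyy = gaussian_filter(gy * gy, 2.0)
    tmp = np.sqrt((jxx - jyy) ** 2 + 4.0 * jxy ** 2)
    lam1 = np.maximum((jxx + jyy + tmp) / 2.0, 0.0)     # larger eigenvalue
    lam2 = np.maximum((jxx + jyy - tmp) / 2.0, 0.0)     # smaller eigenvalue
    orient = np.arctan2(2.0 * jxy, jxx - jyy) / 2.0     # dominant edge orientation
    strength = np.sqrt(lam1)
    coherence = (np.sqrt(lam1) - np.sqrt(lam2)) / (np.sqrt(lam1) + np.sqrt(lam2) + 1e-8)

    # quantise the three features into a filter index (thresholds are placeholders)
    o = ((orient + np.pi / 2) / np.pi * n_orient).astype(int).clip(0, n_orient - 1)
    s = np.digitize(strength, [1.0, 4.0, 16.0])
    c = np.digitize(coherence, [0.25, 0.5, 0.75])

    r = patch // 2
    padded = np.pad(img, r, mode="reflect")
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            f = filters[o[y, x], s[y, x], c[y, x]]
            out[y, x] = np.sum(padded[y:y + patch, x:x + patch] * f)
    return out
```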
View details
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Carl Martin Vondrick
Jitendra Malik
CVPR (2018)
Preview abstract
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods and demonstrates better performance on JHMDB and UCF101-24 categories. While this approach sets a new state of the art on existing datasets, its overall performance on AVA is low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
View details
Preview abstract
Human vision is able to immediately recognize novel visual categories after seeing just one or a few training examples. We describe how to add a similar capability to ConvNet classifiers by directly setting the final layer weights from novel training examples during low-shot learning. We call this process weight imprinting as it directly sets penultimate layer weights based on an appropriately scaled copy of their activations for that training example. The imprinting process provides a valuable complement to training with stochastic gradient descent, as it provides immediate good classification performance and an initialization for any further fine tuning in the future. We show how this imprinting process is related to proxy-based embeddings. However, it differs in that only a single imprinted weight vector is learned for each novel category, rather than relying on a nearest-neighbor distance to training instances as typically used with embedding methods. Our experiments show that using averaging of imprinted weights provides better generalization than using nearest-neighbor instance embeddings. A key change to traditional ConvNet classifiers is the introduction of a scaled normalization layer that allows activations to be directly imprinted as weights.
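A minimal sketch of the imprinting step itself, assuming a classifier whose weight rows are compared to L2-normalized penultimate activations by dot product. The averaging and re-normalization follow the abstract; everything else (names, shapes) is an assumption.

```python
import numpy as np

def imprint_weights(weights, embeddings, labels):
    """Set the classifier weight row of each novel class to the re-normalised
    average of the L2-normalised penultimate activations of its examples.

    weights:    [C_total, D] weight matrix with rows reserved for novel classes
    embeddings: [N, D]       penultimate-layer activations of novel examples
    labels:     [N]          class indices of those examples (rows of `weights`)
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for c in np.unique(labels):
        mean = emb[labels == c].mean(axis=0)           # average the imprinted examples
        weights[c] = mean / np.linalg.norm(mean)       # keep weights on the unit sphere
    return weights

# classification then reduces to cosine similarity against the weight rows:
# scores = normalised_features @ weights.T
```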
View details
In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images
Eric Christiansen
Mike Ando
Ashkan Javaherian
Gaia Skibinski
Scott Lipnick
Elliot Mount
Alison O'Neil
Kevan Shah
Alicia K. Lee
Piyush Goyal
Liam Fedus
Andre Esteva
Lee Rubin
Steven Finkbeiner
Cell (2018)
Preview abstract
Imaging is a central method in life sciences, and the drive to extract information from microscopy approaches has led to methods to fluorescently label specific cellular constituents. However, the specificity of fluorescent labels varies, labeling can confound biological measurements, and spectral overlap limits the number of labels to a few that can be resolved simultaneously. Here, we developed a deep learning computational approach called "in silico labeling (ISL)" that reliably infers information from unlabeled biological samples that would normally require invasive labeling. ISL predicts different labels in multiple cell types from independent laboratories. It makes cell type predictions by integrating in silico labels, and is not limited by spectral overlap. The network learned generalized features, enabling it to solve new problems with small training datasets. Thus, for negligible additional cost, ISL provides biological insights from images of unlabeled samples, recovering information that would be undesirable or impossible to measure directly.
View details
Preview abstract
We introduce Intelligent Annotation Dialogs for bounding box annotation. We train an agent to automatically choose a sequence of actions for a human annotator to produce a bounding box in a minimal amount of time. Specifically, we consider two actions: box verification [34], where the annotator verifies a box generated by an object detector, and manual box drawing. We explore two kinds of agents, one based on predicting the probability that a box will be positively verified, and the other based on reinforcement learning. We demonstrate that (1) our agents are able to learn efficient annotation strategies in several scenarios, automatically adapting to the difficulty of an input image, the desired quality of the boxes, the strength of the detector, and other factors; (2) in all scenarios the resulting annotation dialogs speed up annotation compared to manual box drawing alone and box verification alone, while also outperforming any fixed combination of verification and drawing in most scenarios; (3) in a realistic scenario where the detector is iteratively re-trained, our agents evolve a series of strategies that reflect the shifting trade-off between verification and drawing as the detector grows stronger.
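The probability-based agent can be caricatured as a one-step expected-time decision. The sketch below uses made-up time constants purely to illustrate the trade-off described above (verification is cheap but risks falling back to drawing); it is not the paper's agent.

```python
def choose_next_action(p_accept, t_verify=1.8, t_draw=7.0):
    """Pick the cheaper action in expected annotation time. `p_accept` is the
    predicted probability that the detector's box passes verification; the
    time constants (seconds) are illustrative, not the paper's values."""
    expected_verify = t_verify + (1.0 - p_accept) * t_draw  # failed verification falls back to drawing
    return "verify" if expected_verify < t_draw else "draw"
```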
View details
Unsupervised Learning of Semantic Audio Representations
Ratheet Pandya
Jiayang Liu
Proceedings of ICASSP 2018 (to appear)
Preview abstract
Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
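One way to picture how these constraints become triplet-loss training data is sketched below: a noised copy of the anchor and a mixture containing the anchor serve as positives, while an unrelated clip is the negative. The log-mel mixing shortcut, the `embed` callable, and all constants are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard hinge triplet loss on embedding vectors."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())

def unsupervised_triplet_step(clips, embed, rng):
    """Build triplets from unlabeled log-mel excerpts using class-agnostic
    constraints: (i) a noised copy of the anchor is a positive; (ii) a mixture
    containing the anchor is a positive; an unrelated clip is the negative.
    `embed` is the embedding network being trained."""
    i, j, k = rng.choice(len(clips), size=3, replace=False)
    anchor, other, unrelated = clips[i], clips[j], clips[k]
    noisy = anchor + 0.1 * rng.standard_normal(anchor.shape)      # constraint (i)
    mixture = np.logaddexp(anchor, other)                         # constraint (ii), crude log-mel mix
    a = embed(anchor)
    return (triplet_loss(a, embed(noisy), embed(unrelated)) +
            triplet_loss(a, embed(mixture), embed(unrelated)))
```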
View details
Fast, Trainable, Multiscale Denoising
John Isidoro
IEEE International Conference on Image Processing (ICIP) (2018) (to appear)
Preview abstract
Denoising is a fundamental imaging application. Mobile camera systems demand filtering that is both versatile and fast. We present an approach to multiscale filtering which allows real-time applications on low-powered devices. The key idea is to learn a set of kernels that upscales, filters, and blends patches of different scales guided by local structure analysis. This approach is trainable, so the learned filters can handle diverse noise patterns and artifacts. Experimental results show that the presented approach produces results comparable to state-of-the-art algorithms while running orders of magnitude faster.
View details
Preview abstract
We present a novel method to train machine learning algorithms to estimate scene depths from a single image, by using the information provided by a camera's aperture as supervision. Prior works use a depth sensor's outputs or images of the same scene from alternate viewpoints as supervision, while our method instead uses images from the same viewpoint taken with a varying camera aperture. To enable learning algorithms to use aperture effects as supervision, we introduce two differentiable aperture rendering functions that use the input image and predicted depths to simulate the depth-of-field effects caused by real camera apertures. We train a monocular depth estimation network end-to-end to predict the scene depths that best explain these finite aperture images as defocus-blurred renderings of the input all-in-focus image.
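As a rough picture of what an aperture rendering function does, the sketch below blurs an all-in-focus image by an amount that grows with each pixel's deviation from the focal plane, blending a small stack of pre-blurred copies so the output varies smoothly with the predicted depth. It is a simplified stand-in, not the paper's differentiable rendering functions; the defocus formula and constants are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_defocus(image, depth, focal_depth, aperture, max_sigma=8.0, n_levels=8):
    """Blur an all-in-focus image by a per-pixel amount that grows with the
    pixel's distance (in inverse depth) from the focal plane, by softly
    blending a small stack of pre-blurred copies. `depth` and `focal_depth`
    are assumed strictly positive and in the same units."""
    img = image.astype(np.float64)
    sigma = aperture * np.abs(1.0 / depth - 1.0 / focal_depth)  # per-pixel defocus amount
    sigma = np.clip(sigma, 0.0, max_sigma)

    levels = np.linspace(0.0, max_sigma, n_levels)
    stack = np.stack([img if s == 0.0 else gaussian_filter(img, s) for s in levels])

    # soft assignment to the two nearest blur levels keeps the output a smooth
    # (sub)differentiable function of the predicted depth
    idx = sigma / max_sigma * (n_levels - 1)
    lo = np.floor(idx).astype(int)
    hi = np.clip(lo + 1, 0, n_levels - 1)
    w = idx - lo
    rows, cols = np.indices(img.shape)
    return (1.0 - w) * stack[lo, rows, cols] + w * stack[hi, rows, cols]
```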
View details