Authors:
Kuangdai Leng¹; Robert Atwood²; Winfried Kockelmann³; Deniza Chekrygina¹ and Jeyan Thiyagalingam¹
Affiliations:
¹ Scientific Computing Department, Science and Technology Facilities Council, Rutherford Appleton Laboratory, Didcot, U.K.; ² Diamond Light Source, Rutherford Appleton Laboratory, Didcot, U.K.; ³ ISIS Neutron and Muon Source, Science and Technology Facilities Council, Rutherford Appleton Laboratory, Didcot, U.K.
Keyword(s):
Unsupervised Learning, Image and Video Segmentation, Representation Learning, Regional Adjacency Graph.
Abstract:
Fully unsupervised semantic segmentation of images has been a challenging problem in computer vision. Many deep learning models have been developed for this task, most of which use representation learning guided by unsupervised or self-supervised loss functions to drive segmentation. In this paper, we conduct dense (pixel-level) representation learning using a fully-convolutional autoencoder; the learned dense features are then reduced onto a sparse graph, where segmentation is encouraged from three aspects: normalised cut, similarity and continuity. Our method is one- or few-shot, requiring as little as a single image (i.e., the target image). To mitigate the overfitting caused by few-shot learning, we compute the reconstruction loss using augmented, size-varying patches sampled from the image(s). We also propose a new adjacency-based loss function for continuity, which allows the number of superpixels to be arbitrarily large so that the creation of the sparse graph can remain fully unsupervised. We conduct quantitative and qualitative experiments on computer vision images and videos, which show that segmentation becomes more accurate and robust with our sparse loss functions and patch-based reconstruction. As a more comprehensive application, we use our method to analyse 3D images acquired from X-ray and neutron tomography. These experiments and applications show that a model trained with one or a few images can be highly robust when predicting many unseen images with similar semantic content; our method is therefore useful for segmenting videos and 3D images of this kind with lightweight model training in 2D.
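To make the patch-based reconstruction loss concrete, the following is a minimal PyTorch sketch under stated assumptions: the autoencoder is any fully-convolutional encoder-decoder, and the patch-size range, fixed resize resolution, number of patches and flip augmentation are illustrative placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def patch_reconstruction_loss(image, autoencoder, n_patches=8,
                              min_size=64, max_size=192, out_size=128):
    """Reconstruction loss over augmented, size-varying patches (a sketch).

    image: (C, H, W) tensor; autoencoder: any fully-convolutional
    encoder-decoder returning a reconstruction of its input.
    """
    _, H, W = image.shape
    losses = []
    for _ in range(n_patches):
        # Sample a square patch of random size and position.
        s = int(torch.randint(min_size, max_size + 1, (1,)))
        y = int(torch.randint(0, H - s + 1, (1,)))
        x = int(torch.randint(0, W - s + 1, (1,)))
        patch = image[:, y:y + s, x:x + s].unsqueeze(0)
        # Resize to a fixed resolution so patches of different scales
        # pass through the same network.
        patch = F.interpolate(patch, size=(out_size, out_size),
                              mode='bilinear', align_corners=False)
        # Simple augmentation: random horizontal flip.
        if torch.rand(1) < 0.5:
            patch = torch.flip(patch, dims=[-1])
        recon = autoencoder(patch)
        losses.append(F.mse_loss(recon, patch))
    return torch.stack(losses).mean()
```

Sampling many small, randomly augmented views of one image is what lets a single target image act as an effective training set in the few-shot regime.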
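Likewise, an adjacency-based continuity term over the region adjacency graph could take a form like the sketch below, which penalises disagreement between the soft segmentation probabilities of adjacent superpixels; this is a guessed formulation for illustration, not necessarily the paper's exact loss.

```python
import torch

def adjacency_continuity_loss(probs, edges):
    """Continuity penalty on a region adjacency graph (illustrative form).

    probs: (N, K) soft segmentation probabilities per superpixel.
    edges: (E, 2) long tensor of adjacent superpixel index pairs.
    """
    p_i = probs[edges[:, 0]]  # (E, K) probabilities of one endpoint
    p_j = probs[edges[:, 1]]  # (E, K) probabilities of the other
    # Encourage adjacent superpixels to agree; averaging over edges keeps
    # the loss well scaled as the number of superpixels grows arbitrarily.
    return ((p_i - p_j) ** 2).sum(dim=1).mean()
```

Because the penalty is defined per edge rather than per pixel, the superpixel count can be increased freely without rebalancing the loss, which is what allows the sparse graph to be created without supervision.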