1. Introduction
Deep learning, which has revolutionized many fields over the past decade, has recently faced the problem of being data-hungry, driven by the rapid growth of hardware and computational resources [1,2,3]. Self-supervised learning, which learns meaningful data representations from unlabeled data [4], has emerged as an alternative to supervised learning, since labeling is inefficient in terms of time and labor [5,6,7].
Masked autoencoding [8] is a method that learns representations by removing part of the input and predicting the masked part. Masked autoencoding uses an autoencoder [9,10] architecture, which compresses high-dimensional data into a latent representation with an encoder and reconstructs the original data with a decoder, as shown in Figure 1. It has been successful in NLP as a method of self-supervised pre-training. The approach of learning representations by reconstructing images from corrupted images is not new; the idea was already proposed before 2017 [11,12]. It was set aside after the emergence of contrastive learning, which showed promising results on downstream tasks [13,14,15]. Witnessing the success of masked autoencoding in NLP [16,17,18], many works tried to apply it to vision, but lagged behind for the following reasons: 1) in vision, convolutional network architectures were dominant [19], in which indicators such as mask tokens [17] or positional embeddings [20] are inapplicable; 2) missing parts of an image can be successfully predicted from only a few neighboring pixels, without a deep understanding of the image [21], whereas predicting a missing word/token requires complex language understanding. In other words, masked autoencoding in vision might not demand a full understanding of the image, which results in capturing less useful features. Due to these differences between the two modalities, masked autoencoding saw only limited application in vision until the Vision Transformer (ViT) [22].
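To make the masked-autoencoding pipeline concrete, the following is a minimal PyTorch-style sketch of masking patches, encoding the visible ones, and reconstructing the masked ones. The function and variable names are illustrative assumptions, not the exact implementation used in this work, and real encoders/decoders would additionally use positional embeddings and mask tokens.

```python
import torch

def masked_autoencoding_step(patches, encoder, decoder, mask_ratio=0.75):
    """Illustrative masked-autoencoding step.

    patches: (B, N, D) tensor of flattened image patches.
    encoder/decoder: callables on (B, *, D) tensors; in this sketch the
    decoder is assumed to output predictions for the masked patches only.
    """
    B, N, D = patches.shape
    num_visible = int(N * (1 - mask_ratio))

    # Randomly choose which patches stay visible for each image.
    noise = torch.rand(B, N)
    ids_shuffle = noise.argsort(dim=1)
    ids_visible = ids_shuffle[:, :num_visible]
    ids_masked = ids_shuffle[:, num_visible:]

    # Encode only the visible patches into a latent representation.
    visible = torch.gather(patches, 1, ids_visible.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)

    # Reconstruct and compute the loss on the masked patches only.
    pred = decoder(latent)
    target = torch.gather(patches, 1, ids_masked.unsqueeze(-1).expand(-1, -1, D))
    loss = ((pred - target) ** 2).mean()
    return loss
```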
Motivated by the success of masked language modeling (MLM) in language understanding, masked image modeling (MIM) follows the idea of MLM and learns rich and holistic representations by reconstructing masked original information (e.g., pixels, representations) from the unmasked parts. MIM has recently gained much attention, showing state-of-the-art performance [2,23,24] not only on ImageNet classification but also on other downstream tasks such as object detection and semantic segmentation.
Before MIM, contrastive learning (CL), which learns meaningful representations by using similarities and differences between image representations, was the dominant method in self-supervised learning [4]. By learning an embedding space in which positive samples are pulled close together and negative samples are pushed far apart, CL learns to discriminate instances using features of the entire image [25]. In contrast to CL, MIM does not learn instance discriminativeness, since it only considers relationships between patches or pixels through the image reconstruction task [26]. Therefore, although MIM methods exceed the performance of CL methods in fine-tuning, they are shown to be less effective in linear separability.
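As a point of reference, a common instantiation of this contrastive idea is the InfoNCE loss. The sketch below is a minimal, illustrative version (the temperature value and function name are assumptions), showing how matching views of the same image attract while all other pairs repel:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Illustrative InfoNCE loss over two batches of view embeddings.

    z1, z2: (B, D) embeddings of two augmented views of the same images;
    z1[i] and z2[i] form the positive pair, all other pairs are negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarity logits
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```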
In this work, we propose a simple yet effective framework that adopts a multi-view autoencoder architecture and incorporates contrastive learning into MIM to bridge the gap between CL and MIM: CL performs better on linear probing, while MIM performs better in the fine-tuning setting. We note that a contrastive-learning-based MIM method can learn information common to two differently augmented views, moving beyond existing pixel-level approaches that learn only local representations of images. We demonstrate that, by doing so, local representations that also carry instance-discriminative information can be learned.
In more detail, we adopt an asymmetric encoder-decoder architecture built from ViT [22] blocks. ViT allows the model to focus on the important features of an instance; we visualize the attention maps of our pre-trained ViT encoder in Figure 2, averaging over the ViT heads following [27]. CL is used to capture global information and learn discriminative representations by pushing negative samples apart while pulling positive samples together. We generate two augmented views by masking and compress them with the encoder into latent representations, which are used for the contrastive loss. While the contrastive loss conveys holistic information, the reconstruction loss helps the decoder learn local representations by predicting patches of the masked image.
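Putting the two objectives together, the overall training signal can be sketched as a weighted sum of a contrastive term on the two encoded views and a reconstruction term on the masked patches. The sketch below reuses the illustrative `masked_autoencoding_step` and `info_nce` functions from above; the pooling, the `projector` head, and the balancing weight `lambda_contrast` are hypothetical choices, not the exact formulation of our method.

```python
def pretrain_step(view1_patches, view2_patches, encoder, decoder, projector,
                  lambda_contrast=1.0):
    # Reconstruction branch: predict the masked patches of each augmented view.
    loss_rec = masked_autoencoding_step(view1_patches, encoder, decoder) \
             + masked_autoencoding_step(view2_patches, encoder, decoder)

    # Contrastive branch: compare pooled encoder outputs of the two views
    # (in practice these can be the same latents as in the reconstruction branch).
    z1 = projector(encoder(view1_patches).mean(dim=1))
    z2 = projector(encoder(view2_patches).mean(dim=1))
    loss_con = info_nce(z1, z2)

    return loss_rec + lambda_contrast * loss_con
```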
We conducted experiments to validate the effectiveness of our approach. Our method enables the ViT encoder to surpass previous work, achieving 84.3% top-1 accuracy on ImageNet-1K classification. In linear probing, our method is slightly below yet comparable to CL-based methods, while showing an impressive performance gain over MIM methods with 76.7% accuracy. We also evaluate transfer learning on object detection and segmentation, recording 51.3% box AP and 45.6% mask AP on COCO and 50.2% mIoU on ADE20K, the best or second-best performance compared to previous studies. Through ablation studies, we demonstrate that incorporating CL into MIM helps the model learn better representations.
Our contributions are summarized as follows:
We propose a simple framework that incorporates contrastive learning into MIM to learn rich and holistic representations. The model learns discriminative representations by contrasting two augmented views while reconstructing the original signals from the corrupted ones.
A high masking ratio acts as a strong augmentation. Without additional augmentations such as color distortion or blur, our model outperforms previous CL-based methods using only masking and random cropping.
Experimental results demonstrate the effectiveness of our approach, which outperforms previous MIM methods on ImageNet-1K classification, linear probing, and other downstream tasks such as object detection and instance segmentation.
The rest of this paper is structured as follows. Section 2 introduces related work. In Section 3, we give an overview and the details of our framework. We then present experimental results and analysis in Section 4. Finally, Section 5 concludes the paper.
Author Contributions
Conceptualization, S.J. and J.R.; methodology S.J.; software, S.J.; validation, S.J., S.H.; investigation, S.J.; writing—original draft preparation, S.J., S.H.; writing—review and editing, S.J., S.H.; visualization, S.J. and S.H.; supervision, J.R.; project administration, J.R.; funding acquisition, J.R. All authors have read and agreed to the published version of the manuscript.