Low-Res Leads the Way:
Improving Generalization for Super-Resolution by Self-Supervised Learning

Haoyu Chen$^{1}$, Wenbo Li$^{2}$, Jinjin Gu$^{3}$, Jingjing Ren$^{1}$, Haoze Sun$^{4}$, Xueyi Zou$^{2}$, Zhensong Zhang$^{2}$, Youliang Yan$^{2}$, Lei Zhu$^{1,5}$

$^{1}$The Hong Kong University of Science and Technology (Guangzhou)
$^{2}$Huawei Noah’s Ark Lab
$^{3}$The University of Sydney
$^{4}$Tsinghua University
$^{5}$The Hong Kong University of Science and Technology

Project page: https://haoyuchen.com/LWay
Lei Zhu ([email protected]) is the corresponding author.

Abstract

For image super-resolution (SR), bridging the gap between the performance on synthetic datasets and real-world degradation scenarios remains a challenge. This work introduces a novel “Low-Res Leads the Way” (LWay) training framework, merging Supervised Pre-training with Self-supervised Learning to enhance the adaptability of SR models to real-world images. Our approach utilizes a low-resolution (LR) reconstruction network to extract degradation embeddings from LR images, merging them with super-resolved outputs for LR reconstruction. Leveraging unseen LR images for self-supervised learning guides the model to adapt its modeling space to the target domain, facilitating fine-tuning of SR models without requiring paired high-resolution (HR) images. The integration of Discrete Wavelet Transform (DWT) further refines the focus on high-frequency details. Extensive evaluations show that our method significantly improves the generalization and detail restoration capabilities of SR models on unseen real-world datasets, outperforming existing methods. Our training regime is universally compatible, requiring no network architecture modifications, making it a practical solution for real-world SR applications.

1 Introduction

Figure 1: Our proposed training method combines the benefits of supervised learning (SL) on synthetic data and self-supervised learning (SSL) on unseen test images, achieving high-quality and high-fidelity SR results.

Image super-resolution (SR) aims to restore high-resolution (HR) images from their low-resolution (LR) or degraded counterparts. The inception of the deep-learning-based SR model can be traced back to SRCNN [14]. Recently, advancements in deep learning models have substantially enhanced SR performance [29, 27, 8, 6, 54, 1, 12, 55, 42, 59, 9, 10, 28, 57], particularly in addressing specific degradation types like bicubic downsampling. Nevertheless, the efficacy of SR models is generally restricted by the degradation strategies employed during the training phase, posing great challenges in complex real-world applications.

In the realm of real-world SR, as shown in Figure 2, training approaches fall into three main paradigms.

(a) Unsupervised Learning with Unpaired Data: Methods within this paradigm [43, 49, 58, 2, 41, 48, 15, 3] commonly use a Generative Adversarial Network (GAN) architecture to learn target distributions without paired data. Using one or multiple discriminators, they distinguish between generated images and actual samples, guiding the generator toward the target distribution. However, because this approach relies heavily on external data, it struggles when target-domain data is scarce, as is common in real-world scenarios. The GAN framework for unsupervised learning has two further drawbacks. First, training is inherently unstable, leading to noticeable artifacts in SR outputs. Second, it is difficult for the single 0/1 decision plane modelled by a discriminator to accurately separate the target domain [33], which can result in imprecise distribution learning.

(b) Supervised Learning with Paired Synthetic Data: BSRGAN [50] and Real-ESRGAN [45] have largely enhanced the generalization ability of SR models by simulating more realistic degradation. However, synthetic data, despite mimicking certain real-world conditions, inadequately captures the complex and variable nature of real scenarios, so the gap between synthetic and real degradation persists. Consequently, the limited degradation patterns in synthetic data may lead to over-smoothing, sacrificing crucial details and textures. Adapting effectively to complex, variable, or unknown degradations thus remains a formidable challenge.

(c) Self-supervised Learning with a Single Image: Techniques in this category [40, 37, 11] leverage the intrinsic statistical characteristics of natural images, eliminating the need for external datasets. Generally, these methods enable self-supervised learning directly from the input LR image. Despite its inherent flexibility, this approach may be less effective on images lacking repetitive patterns. As a result, in real-world scenarios where the necessary recurring structures are absent, these techniques tend to underperform compared to supervised learning methods that employ paired synthetic data.

It is notable that real LR/HR image pairs in the target domain are often prohibitively expensive or unavailable. Furthermore, a significant gap persists between synthesized data and real-world data. Given the intrinsic limitations of current methodologies, a critical question arises: Is there an approach that combines the strengths of these diverse strategies? In addressing this, we propose the novel “Low-Res Leads the Way” (LWay) training framework, which merges supervised learning (SL) pre-training with self-supervised learning (SSL) (see Figure 2 (d)). This approach aims to narrow the disparity between synthetic training data and real test images, as depicted in Figure 1. By integrating supervised learning’s predictive capabilities with the ability to swiftly adapt to unique characteristics present in test LR images, this framework effectively produces high-quality results for unseen real-world images.

The initial step involves training an LR reconstruction network specifically designed to extract a degradation embedding from the LR image. This degradation embedding is then applied to the HR image, facilitating the re-generation of LR content. Upon encountering a test image, we derive its super-resolved result from an off-the-shelf SR model pre-trained on synthetic data. This output is fed into the fixed LR reconstruction network to produce the corresponding degraded counterpart. Subsequently, a self-supervised loss is computed by comparing this degraded counterpart to the original LR image, and this loss is used to update specific parameters within the SR model. Given our observation that pre-trained SR models adeptly handle low-frequency regions but falter in high-frequency areas, we incorporate the Discrete Wavelet Transform (DWT) to isolate high-frequency elements from the LR image. This component shifts the model’s focus to the recovery of high-frequency details and avoids negative impacts on low-frequency areas.

With this innovative framework, our approach eliminates the need for paired LR/HR target domain images, significantly enhancing the performance of SL pre-trained models on unseen real-world data. Our method not only retains the essential content of LR images but also adds high-definition characteristics, ensuring a balance between fidelity and quality. Moreover, this training regime requires no modifications to the network architecture, offering broad compatibility across all SR models. Through extensive evaluations on real-world datasets, we have demonstrated our method’s substantial improvements in generalization performance.

2 Related Work

2.1 Supervised Learning for Real-World SR

While recent years have witnessed significant advancements in the field of super-resolution (SR), conventional SR models such as SRCNN [14], VDSR [20], EDSR [31], RCAN [53], among others [54, 1, 12, 6, 55, 25, 26, 29, 34, 21, 27, 9, 10, 28, 57], have predominantly relied upon predefined degradation processes, such as bicubic downsampling. This simplification, while contributing to the theoretical understanding of SR, often falls short in capturing the intricate and diverse degradations inherent in real-world imaging scenarios, limiting practical adaptability across applications. Consequently, there is a pressing need to explore more sophisticated and realistic degradation models.

To this end, recent efforts have been directed toward methods capturing paired low-resolution (LR) and high-resolution (HR) images from real-world environments, as demonstrated by datasets like RealSR [4] and DRealSR [47]. However, these methods face challenges, including precise image alignment, complex hardware setups, and specific degradation characteristics (e.g., Canon 5D3 and Nikon D810 cameras in RealSR), posing obstacles to practicality and scalability. Recent techniques, including Real-ESRGAN [45] and BSRGAN [50], have attempted to address these shortcomings by synthesizing LR images with more realistic degradation. Despite these advancements, a notable disparity persists between synthesized and authentic degradation. This often results in over-smoothed images that sacrifice fine textural details, as illustrated by [52]. Certain studies [7] have endeavored to enhance the generalizability using limited degradation data; however, the practical application scenarios remain restricted.

As a result, there is a growing demand for innovative approaches that are capable of adapting to the intricate and mixed degradation patterns that typify real-world applications. The SR results should not only exhibit high resolution but also encompass rich detail, ensuring fidelity.

2.2 Unsupervised Learning for Real-world SR

Unsupervised super-resolution [43, 49, 58, 2, 41, 48, 15, 3] serves as a technique to mitigate generation bias inherent in synthetic datasets. These approaches deviate from the conventional reliance on extensive paired data by harnessing the data-generating capabilities inherent in convolutional neural networks (CNNs). Ulyanov et al. [43] posited CNNs as implicit priors for capturing natural image statistics, a concept further explored by the Zero-Shot Super-Resolution (ZSSR) [40] model, which uniquely tailors SR algorithms to the repeating patterns within the input image itself. Generative Adversarial Networks (GANs) have significantly propelled the field forward. KernelGAN [2], for instance, aligns the statistical distribution of downscaled images with their original versions, enhancing the refinement of SR methods’ outputs. CinCGAN [49] marks an early exploration into utilizing unpaired data for implicit degradation modeling. It employs a strategy that transforms LR images into noise-free ‘clean’ states through bicubic downsampling. This approach, backed by a dual CycleGAN architecture [58], fosters a cycle-consistent adaptation that eliminates the need for paired datasets. The unsupervised approach utilizing GANs also encompasses methods such as Degradation GAN [3], FSSR [15], DASR [48] and pseudo-supervision [36], which all employ discriminators to learn the distributions of HR or LR images, or even clean LR images. These methods are instrumental in constraining the network to transform the generated images to align with the corresponding distributions.

Despite considerable advancements in unsupervised methods, they still exhibit certain limitations. For instance, ZSSR and similar methods typically assume that images contain repetitive patterns. GAN-based approaches, in particular, require substantial data to fit specific degradation types effectively. They also face stability challenges during training, which often result in artifacts in SR outputs. Furthermore, a discriminator models only a binary (0/1) decision plane, which makes it difficult to delineate the target domain accurately and can lead to imprecise learning of distributions. These constraints limit the practical utility of these methods in real-world scenarios, making more generalized and flexible approaches imperative.

3 Method

In the pursuit of practical applications for image SR, we introduce an unprecedented training methodology. This novel strategy marks a departure from established paradigms, fusing the precision of supervised pre-training with the innovation of self-supervised learning to address the complexities of real-world image degradation. Our proposed framework is detailed in Figure 3.

3.1 LR Reconstruction Pre-training

We introduce an LR reconstruction branch that plays a pivotal role in fine-tuning our SR model $\mathcal{S}$ on test images from real-world environments. Central to this process is the Degradation Encoder $\mathcal{E}$, designed to distill the degradation signature of an LR image $I_{\text{LR}}$ into a compact degradation embedding $\mathbf{e}=\mathcal{E}(I_{\text{LR}})$ of dimension 512. Subsequently, the Reconstructor $\mathcal{R}$ takes $\mathbf{e}$ and a high-resolution image $I_{\text{HR}}$ and synthesizes an estimated LR image $\hat{I}_{\text{LR}}=\mathcal{R}(I_{\text{HR}},\mathbf{e})$. To ensure that $\mathbf{e}$ captures the degradation faithfully, we train with a two-component loss $\mathcal{L}$ that combines an L1 term with the Learned Perceptual Image Patch Similarity (LPIPS) metric: $\mathcal{L}(I_{\text{LR}},\hat{I}_{\text{LR}})=\mathcal{L}_{1}+\mathcal{L}_{\text{LPIPS}}$. Notably, the LR reconstruction branch is highly robust and requires only a small amount of training data, which is precisely why we advocate its inclusion. This ensures that even when faced with new forms of degradation, its support for fine-tuning the SR model remains uncompromised. The efficiency and robustness of this approach, pivotal in our methodology, are validated in the following sections.
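For reference, the pre-training stage described above can be written as a single objective. This is a sketch under stated assumptions: the parameter symbols $\phi$ (encoder) and $\psi$ (reconstructor) are introduced here for illustration, and joint optimization of both modules over the paired training set is assumed rather than stated in the text.

```latex
\begin{aligned}
\mathbf{e} &= \mathcal{E}_{\phi}(I_{\text{LR}}), \qquad
\hat{I}_{\text{LR}} = \mathcal{R}_{\psi}(I_{\text{HR}}, \mathbf{e}), \\
(\phi^{*}, \psi^{*}) &= \arg\min_{\phi,\psi}\;
\mathbb{E}_{(I_{\text{LR}},\, I_{\text{HR}})}
\Big[\mathcal{L}_{1}(I_{\text{LR}}, \hat{I}_{\text{LR}})
+ \mathcal{L}_{\text{LPIPS}}(I_{\text{LR}}, \hat{I}_{\text{LR}})\Big].
\end{aligned}
```

After this stage, $\phi^{*}$ and $\psi^{*}$ are kept frozen while the SR model is fine-tuned.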

3.2 Self-supervised Learning on Test Images

Our approach fine-tunes a subset of parameters in an SR network, specifically tailored for processing previously unseen real-world images. This method refines the SR network to handle the complexities of actual degradation patterns. For a real-world LR test image $I_{\text{LR}}^{\text{test}}$, the SR network $\mathcal{S}$ initially produces a super-resolved image $I_{\text{SR}}^{\text{init}}$. The pre-trained LR reconstruction branch, with its parameters frozen, extracts a degradation embedding $\mathbf{e}^{\text{test}}$ from $I_{\text{LR}}^{\text{test}}$, expressed as $\mathbf{e}^{\text{test}}=\mathcal{E}(I_{\text{LR}}^{\text{test}})$. The self-supervised fine-tuning then commences, leveraging $I_{\text{SR}}^{\text{init}}$ and $\mathbf{e}^{\text{test}}$ to adjust a specific subset of the SR network’s parameters $\theta_{\text{ft}}$. This fine-tuning is formulated as an optimization problem:

\theta_{\text{ft}}^{*}=\arg\min_{\theta_{\text{ft}}}\mathcal{L}(\mathcal{R}(\mathcal{S}_{\theta}(I_{\text{LR}}^{\text{test}}),\mathbf{e}^{\text{test}}),I_{\text{LR}}^{\text{test}})\,,

where $\theta_{\text{ft}}^{*}$ denotes the optimized values of the fine-tuned subset $\theta_{\text{ft}}$ of the full model parameters $\theta$.
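The structure of this optimization can be illustrated with a deliberately tiny 1-D toy. Everything in it is a hypothetical stand-in rather than the paper's implementation: `sr_model` is a one-parameter "SR network", `reconstructor` is a fixed pooling-plus-attenuation operator playing the role of the frozen $\mathcal{R}$, the scalar `e_test` plays the role of the degradation embedding, and a mean squared error replaces the L1 + LPIPS loss so that the gradient can be written by hand.

```python
# Toy 1-D illustration of self-supervised fine-tuning; all components are
# hypothetical stand-ins, not the networks used in the paper.

def sr_model(lr, theta):
    """Stand-in for S_theta: 2x nearest-neighbour upsampling scaled by a gain theta."""
    return [theta * v for v in lr for _ in range(2)]

def reconstructor(sr, e):
    """Stand-in for the frozen R: 2x average pooling followed by attenuation e."""
    return [e * (sr[i] + sr[i + 1]) / 2.0 for i in range(0, len(sr), 2)]

def self_supervised_loss(lr, theta, e):
    """Mean squared error between the re-degraded SR output and the LR input."""
    rec = reconstructor(sr_model(lr, theta), e)
    return sum((r - v) ** 2 for r, v in zip(rec, lr)) / len(lr)

def fine_tune(lr, e, theta=1.0, step=1.0, iters=200):
    """Gradient descent on theta only; e and the reconstructor stay fixed."""
    for _ in range(iters):
        rec = reconstructor(sr_model(lr, theta), e)
        # rec_i = e * theta * lr_i, hence d(rec_i)/d(theta) = e * lr_i
        grad = sum(2.0 * (r - v) * e * v for r, v in zip(rec, lr)) / len(lr)
        theta -= step * grad
    return theta

lr_test = [0.2, 0.5, 0.9, 0.4]   # the only supervision signal: the LR input itself
e_test = 0.5                     # pretend output of the degradation encoder
theta_star = fine_tune(lr_test, e_test)
```

In this toy the loss is minimized when the SR gain exactly undoes the attenuation, so `theta_star` approaches `1 / e_test`. No HR target is used at any point, which is the property the formulation above relies on.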

This strategic adjustment enhances the SR network’s capability to reconstruct images with high fidelity to the LR inputs and improves its ability to generalize to real-world degradation without the need for paired data.

Focused enhancement of high-frequency details. Conventional SR methods tend to proficiently reconstruct low-frequency regions but often neglect or inadequately restore high-frequency details. In addition, the low-frequency regions do not require LR reconstruction due to the absence of detailed texture. Therefore, our approach aims to concentrate the LR reconstruction process specifically on high-frequency areas, thereby preventing the introduction of artifacts into the low-frequency areas. Specifically, we apply Discrete Wavelet Transform (DWT) to obtain the high-frequency component, and then normalize it to yield a weight map $\mathbf{W}\in[0,1]$ . This weight map is then utilized to calculate a weighted loss, ensuring the fidelity to high-frequency details:

\mathcal{L}=\mathcal{L}_{1}(\mathbf{W}\odot\hat{I}_{\text{LR}}^{\text{test}},\mathbf{W}\odot I_{\text{LR}}^{\text{test}})+\mathcal{L}_{\text{LPIPS}}(\mathbf{W}\odot\hat{I}_{\text{LR}}^{\text{test}},\mathbf{W}\odot I_{\text{LR}}^{\text{test}})\,,

where $\odot$ denotes element-wise multiplication. The combined loss effectively guides the network to restore high-frequency details with greater precision, improving the perceptual quality of the super-resolved image without compromising low-frequency content.
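A minimal sketch of the weight map and the weighted L1 term is given below. It rests on assumptions that the text does not specify: a single-level Haar wavelet, the summed magnitude of the three detail coefficients as the high-frequency component, max-normalization to $[0,1]$, and nearest-neighbour upsampling of the half-resolution map back to the LR size. The LPIPS term is omitted because it requires a pre-trained network. All function names are illustrative.

```python
def haar_detail_magnitude(img):
    """Single-level 2-D Haar transform on 2x2 blocks (even height and width assumed).
    Returns, per block, the summed magnitude of the three detail coefficients."""
    mags = []
    for i in range(0, len(img), 2):
        row = []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            d_rows = (a + b - c - d) / 2.0   # difference between the two rows
            d_cols = (a - b + c - d) / 2.0   # difference between the two columns
            d_diag = (a - b - c + d) / 2.0   # diagonal difference
            row.append(abs(d_rows) + abs(d_cols) + abs(d_diag))
        mags.append(row)
    return mags

def weight_map(img):
    """Normalize the detail magnitude to [0, 1] and upsample it 2x
    (nearest neighbour) so that it matches the input size."""
    mags = haar_detail_magnitude(img)
    peak = max(max(r) for r in mags)
    norm = [[(m / peak if peak > 0 else 0.0) for m in r] for r in mags]
    up = []
    for r in norm:
        wide = [m for m in r for _ in range(2)]
        up.append(wide)
        up.append(list(wide))
    return up

def weighted_l1(pred, target, w):
    """Mean absolute difference between W * pred and W * target."""
    total, count = 0.0, 0
    for p_row, t_row, w_row in zip(pred, target, w):
        for p, t, wt in zip(p_row, t_row, w_row):
            total += abs(wt * p - wt * t)
            count += 1
    return total / count

# 4x4 toy LR image: flat everywhere except a checkerboard in the bottom-right block.
lr_img = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.0],
    [0.5, 0.5, 0.0, 1.0],
]
W = weight_map(lr_img)
lr_hat = [[v + 0.1 for v in row] for row in lr_img]   # pretend reconstruction
loss = weighted_l1(lr_hat, lr_img, W)
```

Here the weight map is zero on the flat blocks and one on the textured block, so the reconstruction error in smooth regions does not contribute to the loss.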

3.3 Discussion

By combining supervised learning (SL) on synthetic data with self-supervised learning (SSL) on test images with unknown degradation, we dynamically adjust the modeling space based on the intrinsic features of test images, steering the SL space towards a more precise SSL space. Figure 4 shows the effectiveness of our method during the fine-tuning process. Our method achieves high-quality and high-fidelity SR while maintaining general compatibility across all models. The primary advantages of our approach over other methods are as follows:

General Degradation Modeling. The transformation from LR to HR images is recognized as a challenging task, while the reverse HR to LR transformation is comparatively simpler and more robust. Our method capitalizes on this observation, avoiding excessive reliance on extensive paired datasets. Instead, we opt to pre-train a universal degradation embedding extraction and LR reconstruction model. Because the degradation is extracted per image, our approach is not bound by the assumption of uniform degradation across an image dataset. During the training of the SR model, the parameters of this model remain fixed, allowing the SR model to adapt flexibly to unknown distributions in real-world scenarios. In contrast, CycleGAN-based methods simultaneously learn the mappings from LR to HR and from HR to LR, a process that relies heavily on a substantial amount of data. Furthermore, because CycleGAN implicitly learns the HR to LR mapping without an explicit degradation extraction process, its underlying assumption is that the degradation across the entire dataset is consistent. Consequently, it can only fit certain degradation patterns, which largely limits its performance in real-world scenarios with limited data availability.

Dense pixelwise self-supervision. Through self-supervised learning, our method operates independently of external labels, leveraging dense LR pixel-level signals for supervision. This allows the model to learn richer texture features from the intrinsic image structure. This stands in contrast to traditional supervised approaches that rely on discriminators, which may learn inaccurate features due to the sparsity of supervision signals, leading to suboptimal results.

Robust regularization. Our approach can be viewed as a form of regularization constraint. By integrating degradation embedding extraction and decoupling it from the LR image reconstruction, our method maintains effectiveness in guiding the reconstruction process even when faced with imperfect degradation prediction. This substantially boosts the robustness of our approach, enabling it to learn rich and accurate texture information from the test images.

4 Experiments

			CNN-based						Transformer-based			VQ-based			Diffusion-based
Dataset	Sensor	Metric	Real-ESRGAN+	+ LWay	Gain	BSRGAN	+ LWay	Gain	SwinIR-GAN	+ LWay	Gain	FeMaSR	+ LWay	Gain	StableSR	+ LWay	Gain
RealSR	Canon	PSNR $\uparrow$	27.51	29.18	+1.67	28.81	28.85	+0.04	28.12	28.96	+0.84	25.72	28.16	+2.44	25.50	27.22	+1.72
RealSR	Canon	SSIM $\uparrow$	0.8348	0.8688	+0.034	0.8473	0.8496	+0.0023	0.8486	0.8579	+0.0093	0.7811	0.8383	+0.0572	0.7684	0.8043	+0.0359
RealSR	Canon	LPIPS $\downarrow$	0.1947	0.1479	-0.0468	0.1988	0.1572	-0.0416	0.1850	0.1469	-0.0381	0.2543	0.1747	-0.0796	0.2636	0.2019	-0.0617
RealSR	Canon	MAD $\downarrow$	133.96	111.91	-22.05	119.08	116.77	-2.31	125.17	111.71	-13.46	143.38	117.48	-25.90	145.36	124.15	-21.21
RealSR	Canon	NLPD $\downarrow$	0.2807	0.2437	-0.037	0.2594	0.2569	-0.0025	0.2670	0.2541	-0.0129	0.3239	0.2778	-0.0461	0.3426	0.3074	-0.0352
RealSR	Canon	DISTS $\downarrow$	0.1621	0.1444	-0.0177	0.1794	0.1558	-0.0236	0.1557	0.1352	-0.0205	0.2116	0.1808	-0.0308	0.1897	0.1596	-0.0301
RealSR	Nikon	PSNR $\uparrow$	26.81	28.58	+1.77	28.13	28.65	+0.52	27.54	28.55	+1.01	25.41	27.87	+2.46	25.54	26.92	+1.38
RealSR	Nikon	SSIM $\uparrow$	0.7861	0.8249	+0.0388	0.8012	0.8057	+0.0045	0.8043	0.813	+0.0087	0.7314	0.7936	+0.0622	0.7370	0.7686	+0.0316
RealSR	Nikon	LPIPS $\downarrow$	0.2300	0.1769	-0.0531	0.2302	0.1750	-0.0552	0.2154	0.176	-0.0394	0.2738	0.2028	-0.071	0.2711	0.2156	-0.0555
RealSR	Nikon	MAD $\downarrow$	131.62	108.18	-23.44	118.48	105.64	-12.84	122.65	106.73	-15.92	137.54	110.79	-26.75	139.26	119.29	-19.97
RealSR	Nikon	NLPD $\downarrow$	0.3061	0.2667	-0.0394	0.2805	0.2758	-0.0047	0.2844	0.272	-0.0124	0.3419	0.297	-0.0449	0.3513	0.3215	-0.0298
RealSR	Nikon	DISTS $\downarrow$	0.1950	0.1714	-0.0236	0.2102	0.1791	-0.0311	0.1842	0.1639	-0.0203	0.2340	0.2042	-0.0298	0.2131	0.1837	-0.0294
DRealSR	Sony	PSNR $\uparrow$	30.16	31.4	+1.24	30.47	31.23	+0.76	29.92	30.77	+0.85	27.51	29.75	+2.24	28.63	29.28	+0.65
DRealSR	Sony	SSIM $\uparrow$	0.8326	0.8597	+0.0271	0.8260	0.8442	+0.0182	0.8213	0.8398	+0.0185	0.7725	0.8096	+0.0371	0.7648	0.7785	+0.0137
DRealSR	Sony	LPIPS $\downarrow$	0.2488	0.2341	-0.0147	0.2685	0.2469	-0.0216	0.2565	0.2383	-0.0182	0.3228	0.2931	-0.0297	0.3331	0.3017	-0.0314
DRealSR	Sony	MAD $\downarrow$	125.20	112.1	-13.10	123.22	115.14	-8.08	124.85	114.09	-10.76	140.50	125.52	-14.98	141.13	130.01	-11.12
DRealSR	Sony	NLPD $\downarrow$	0.3032	0.2751	-0.0281	0.3034	0.2857	-0.0177	0.3105	0.2895	-0.021	0.3502	0.3152	-0.035	0.3503	0.3402	-0.0101
DRealSR	Sony	DISTS $\downarrow$	0.1859	0.1765	-0.0094	0.2115	0.1934	-0.0181	0.1883	0.1783	-0.01	0.2314	0.2168	-0.0146	0.2296	0.2176	-0.012
DRealSR	Olympus	PSNR $\uparrow$	29.53	29.88	+0.35	29.16	29.4	+0.24	28.94	29.57	+0.63	26.42	28.26	+1.84	28.69	29.05	+0.36
DRealSR	Olympus	SSIM $\uparrow$	0.8050	0.8206	+0.0156	0.7931	0.7944	+0.0013	0.8002	0.8071	+0.0069	0.6976	0.7557	+0.0581	0.7460	0.7487	+0.0027
DRealSR	Olympus	LPIPS $\downarrow$	0.3107	0.308	-0.0027	0.3275	0.2926	-0.0349	0.3184	0.3093	-0.0091	0.4129	0.3762	-0.0367	0.3853	0.3800	-0.0053
DRealSR	Olympus	MAD $\downarrow$	127.91	125.04	-2.87	130.94	126.87	-4.07	131.73	126.04	-5.69	151.35	138.85	-12.50	137.60	132.71	-4.89
DRealSR	Olympus	NLPD $\downarrow$	0.3016	0.2899	-0.0117	0.3157	0.3129	-0.0028	0.3093	0.3005	-0.0088	0.3897	0.3425	-0.0472	0.3410	0.3353	-0.0057
DRealSR	Olympus	DISTS $\downarrow$	0.2130	0.2118	-0.0012	0.2276	0.2145	-0.0131	0.2181	0.2109	-0.0072	0.2552	0.2406	-0.0146	0.2412	0.2371	-0.0041
DRealSR	Panasonic	PSNR $\uparrow$	29.81	30.83	+1.02	29.98	31.05	+1.07	29.11	30.94	+1.83	27.83	29.44	+1.61	29.13	29.88	+0.75
DRealSR	Panasonic	SSIM $\uparrow$	0.8094	0.8283	+0.0189	0.7987	0.8236	+0.0249	0.7918	0.8193	+0.0275	0.7413	0.7798	+0.0385	0.7428	0.7554	+0.0126
DRealSR	Panasonic	LPIPS $\downarrow$	0.2592	0.2581	-0.0011	0.2738	0.2624	-0.0114	0.2688	0.2517	-0.0171	0.3144	0.2973	-0.0171	0.3143	0.3021	-0.0122
DRealSR	Panasonic	MAD $\downarrow$	124.51	116.18	-8.33	124.38	114.04	-10.34	126.61	112.79	-13.82	137.50	124.81	-12.69	132.36	122.85	-9.51
DRealSR	Panasonic	NLPD $\downarrow$	0.304	0.2825	-0.0215	0.3109	0.2852	-0.0257	0.3184	0.2869	-0.0315	0.3604	0.3215	-0.0389	0.3444	0.3312	-0.0132
DRealSR	Panasonic	DISTS $\downarrow$	0.2000	0.1974	-0.0026	0.2130	0.2021	-0.0109	0.2046	0.1948	-0.0098	0.2243	0.2121	-0.0122	0.2255	0.2196	-0.0059

Table 1: The performance improvements across various model types utilizing our proposed training methodology.

# of Fine-tuning Images Per Model	Description	LPIPS $\downarrow$	DISTS $\downarrow$	MAD $\downarrow$
	baseline, without fine-tuning	0.3136	0.2353	117.71
	fine-tuning on every single image	0.2351	0.1919	111.46
	fine-tuning on the entire test set	0.2536	0.2044	111.63
	fine-tuning with 40 additional images from the same sensors	0.2571	0.2037	108.62

Table 2: The impact of the number of images used in a single fine-tuning run. Our method can be fine-tuned either on individual images or on the entire test set; the latter greatly reduces cost.

4.1 Experimental Settings

Testing methods. Our proposed method serves as a universally applicable self-supervised learning strategy for various cutting-edge blind SR models, eliminating the necessity for architectural modifications. We conduct evaluations on a diverse range of advanced SR methods, including BSRGAN [50] and Real-ESRGAN+ [45] employing conventional CNN frameworks, SwinIR-GAN [29] integrating Transformer structures, FeMaSR [5] utilizing VQGAN, and StableSR [44] based on pre-trained diffusion. We use officially released SR models as baselines and conduct self-supervised fine-tuning on targeted test datasets. While fine-tuning on a single image can lead to superior performance, for improved training efficiency we opt to fine-tune on the entire test dataset collectively. All experiments are conducted under this configuration unless otherwise specified.

Implementation details. We adopt the Adam [22] optimizer. For StableSR, we set the learning rate to 5e-5 and the batch size to 1. For the remaining models, a learning rate of 2e-6 and a batch size of 6 are used. Each model undergoes rapid fine-tuning on a single V100 GPU. The duration of training varies among models and images, typically spanning 150 to 500 iterations. More details are provided in the supplementary materials.
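The hyperparameters stated above can be collected in a small configuration sketch. The class and function names are hypothetical and chosen for illustration; only the optimizer, learning rates, batch sizes, and the typical iteration range come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    """Illustrative container for the fine-tuning settings reported in the text."""
    optimizer: str
    learning_rate: float
    batch_size: int
    min_iterations: int = 150   # typical lower bound reported
    max_iterations: int = 500   # typical upper bound reported

def config_for(model_name: str) -> FineTuneConfig:
    """Return the reported settings: StableSR uses its own learning rate and batch size."""
    if model_name == "StableSR":
        return FineTuneConfig("Adam", 5e-5, 1)
    return FineTuneConfig("Adam", 2e-6, 6)

stable_cfg = config_for("StableSR")
bsrgan_cfg = config_for("BSRGAN")
```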

Training datasets. Our self-supervised fine-tuning approach is directly applied to the test set, without the need for a separate training set. The only prerequisite training is that of the LR reconstruction network, which is trained using 6,000 real paired images collected in-house. Crucially, these data were never seen by the SR network.

Testing datasets. Our method is evaluated on real-world paired datasets, including RealSR [4] and DRealSR [47]. These datasets are meticulously curated from diverse device sensors to reflect various degradation characteristics. To ensure a fair comparison with other methods, we follow the standard setting of cropping each image into multiple patches for a 4 $\times$ SR. The LR image patch size is 128 $\times$ 128, while the corresponding HR size is 512 $\times$ 512.
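The relation between LR and HR patches at 4 $\times$ can be made concrete with a short sketch. It assumes non-overlapping tiling and discards incomplete border patches; the text does not specify the cropping stride, so both choices, like the function name, are illustrative.

```python
def paired_patch_boxes(lr_height, lr_width, lr_patch=128, scale=4):
    """Yield (lr_box, hr_box) pairs as (top, left, bottom, right).
    Non-overlapping tiling is an assumption made for this sketch."""
    for top in range(0, lr_height - lr_patch + 1, lr_patch):
        for left in range(0, lr_width - lr_patch + 1, lr_patch):
            lr_box = (top, left, top + lr_patch, left + lr_patch)
            hr_box = tuple(scale * v for v in lr_box)   # same region at HR scale
            yield lr_box, hr_box

# A hypothetical 256 x 384 LR image gives a 2 x 3 grid of patch pairs.
boxes = list(paired_patch_boxes(256, 384))
```

Each 128 $\times$ 128 LR box maps to a 512 $\times$ 512 HR box covering the same image region.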

Evaluation metrics. We employ LPIPS [51], DISTS [13], and NLPD [17], metrics that closely align with human perception [18, 16]. Additionally, traditional metrics such as PSNR, SSIM [46], and MAD [23] are included, giving six metrics in total for a comprehensive assessment.
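As a reminder of how the simplest of these metrics is defined, the sketch below computes PSNR from the mean squared error. It assumes 8-bit pixel values (peak 255) given as flat sequences; whether the reported numbers are computed on RGB or on the luminance channel is not stated in the text, so this is only the generic definition.

```python
import math

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf   # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)

coarse = psnr([100, 100], [110, 90])   # error amplitude 10
fine = psnr([100, 100], [105, 95])     # error amplitude 5
```

Halving the error amplitude raises PSNR by about 6 dB, which gives a sense of scale for gains of 1 to 2 dB.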

4.2 Improvements on Existing Methods

The results in Table 1 demonstrate our method’s effectiveness in advancing SR quality. Improvements are observed consistently across all models, datasets, and metrics, underscoring the broad applicability of our approach. For CNN-based models like Real-ESRGAN+, our method achieves a notable enhancement on the Nikon dataset, delivering a 1.77 dB improvement in PSNR and a 0.0388 increase in SSIM, which indicates more faithful reconstruction. The gain in perceptual quality is confirmed by an LPIPS reduction of 0.0531. When applied to Transformer models such as SwinIR-GAN, our method also yields considerable improvements: on the Olympus dataset, we observe a 0.63 dB increase in PSNR and a decrease in MAD of 5.69, highlighting the framework’s capacity to enhance fidelity and sharpness.
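The gains quoted in this paragraph can be recomputed directly from the before/after entries of Table 1. The snippet below is a small arithmetic check; the dictionary and function names are illustrative, and the numbers are copied from the Nikon rows for Real-ESRGAN+ and the Olympus rows for SwinIR-GAN.

```python
# (before, after) pairs copied from Table 1.
nikon_real_esrgan = {
    "PSNR": (26.81, 28.58),
    "SSIM": (0.7861, 0.8249),
    "LPIPS": (0.2300, 0.1769),
}
olympus_swinir = {
    "PSNR": (28.94, 29.57),
    "MAD": (131.73, 126.04),
}

def gains(rows, digits=4):
    """Gain = value after LWay fine-tuning minus baseline value, rounded for display."""
    return {metric: round(after - before, digits)
            for metric, (before, after) in rows.items()}

nikon_gains = gains(nikon_real_esrgan)
olympus_gains = gains(olympus_swinir)
```

Positive gains are improvements for PSNR and SSIM, while negative gains are improvements for LPIPS and MAD, where lower is better.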

As depicted in Figure 5, in the first example, all SR models fail to preserve the original textures present in the input images, resulting in excessively smoothed fabric patterns. After applying our self-supervised fine-tuning method, however, all approaches improve significantly and reconstruct clear fabric textures. A similar improvement is evident in the second example of oil paintings. The existing SR models struggle to capture the intricate details of the paintings, whereas our method effectively restores the artistic effects, with a particularly notable enhancement for StableSR. The remaining examples show the same trend: our method significantly improves high-frequency detail recovery, yielding results that are both sharp and rich in detail.

4.3 Application on Real-world Scenes

Old films often exhibit issues like graininess, color fading, and lower resolution, making them an ideal testbed for evaluating the practical capabilities of SR models. To conduct a comprehensive comparison, we curate a selection of state-of-the-art real-world SR models. These encompass various methodologies: ZSSR [40], a self-supervised learning model; DASR [30], a degradation-adaptive approach; large diffusion models such as LDM [39], DiffBIR [32], and StableSR [44]; DARSR [56], which leverages unsupervised techniques for enhanced model performance; and CAL_GAN [38], a photo-realistic SR model. We employ StableSR as the base model and implement the proposed self-supervised learning strategy. The first case in Figure 6 involves a 480p low-resolution film, namely “My Fair Lady”. Among the assessed models, ZSSR, DASR, and DARSR exhibit minimal improvements, while DiffBIR introduces unpleasing artifacts. Other models achieve slightly smoother results. Notably, our model not only accurately reproduces the hat with clear fabric textures but also effectively restores facial features, including wrinkles and contours. In contrast to some methods that may introduce unnatural effects or overly smooth distortions, our model adeptly balances the restoration of fine textures with preserving overall image clarity.

User study. We conducted a user study with the participation of 24 experienced researchers. Each participant was tasked with assigning a visual perceptual quality score ranging from 0 to 10 to every image. The results, depicted in Figure 9, reveal a significant lead of our proposed method over alternative approaches, surpassing the second-best method by more than 2 points. Notably, the scores for DASR, DiffBIR, and DARSR were even lower than those for the LR images, indicating the limited effectiveness of these methods in handling real-world images.

Training Type	Number of Sensors	Number of Images	LPIPS $\downarrow$	DISTS $\downarrow$
- (baseline)			0.2302	0.2102
Synthetic Data			0.1836	0.1885
Synthetic Data			0.1816	0.1873
Real-world Data		0.6K	0.2003	0.1970
Real-world Data			0.1785	0.1793
Real-world Data			0.1722	0.1772
Real-world Data			0.1800	0.1830

Table 3: Ablation on training data of LR reconstruction.

	baseline	128	256	512	1024	2048	4096
PSNR $\uparrow$	28.13	28.92	28.54	28.85	29.10	29.56	29.20
LPIPS $\downarrow$	0.2302	0.1804	0.1776	0.1722	0.1736	0.1629	0.1669
DISTS $\downarrow$	0.2192	0.1792	0.1818	0.1772	0.1749	0.1630	0.1656

Table 4: Ablation on dimensions of degradation embedding.

Method	LPIPS $\downarrow$	DISTS $\downarrow$
baseline	0.2302	0.2102
baseline + real data	0.2268	0.1989
LWay (ours)	0.1722	0.1772

Table 5: Our method versus supervised real data fine-tuning.

HF Loss	LPIPS $\downarrow$	DISTS $\downarrow$
✗	0.1858	0.1879
✓	0.1722	0.1772

Table 6: Ablation study on high-frequency (HF) loss.

4.4 Ablation Study

We conducted an ablation study on the RealSR Nikon test set using BSRGAN. We trained 65% of the model parameters to achieve the lowest LPIPS score on this test set.

Training data of LR reconstruction. In this section, we demonstrate the robustness of the LR reconstruction network trained with limited data, which forms the cornerstone of our design. As shown in Table 3, we consider two types of training data. The first is synthetic data created using BSRGAN degradation, while the second consists of real paired images collected for training. Both settings improve performance. Specifically, synthetic data lowers LPIPS by 0.0486 relative to the baseline, only 600 real images lower it by 0.0299, and 4000 real images lower it by 0.058. Adding more images beyond this point did not yield any further improvement. We attribute this to the HR-to-LR mapping being inherently easier to learn than the reverse LR-to-HR mapping, which reduces the need for extensive training data. This is further supported by Figure 9, where a t-SNE visualization clearly separates different degradations, even for unseen degradation types.

Degradation embedding dimensions. Table 4 tests different embedding dimensions and shows that all variants significantly improve over the baseline. While a dimension of 512 (default) is effective, a higher dimension (2048) further improves results, lowering LPIPS from 0.1722 to 0.1629.

Our method versus supervised fine-tuning. To further illustrate the efficacy of our method, we additionally fine-tune the baseline model in a supervised manner using the gathered real paired data. As shown in Table 5, this yields only marginal improvements (LPIPS drops from 0.2302 to 0.2268, versus 0.1722 for LWay). This aligns with our contention that the LR-to-HR mapping is inherently difficult. Training with data from one sensor type showed negligible benefits for another, suggesting a significant gap in degradation patterns. This is further corroborated by Figure 7, where the supervised fine-tuned model generates over-smoothed outputs. Conversely, our method is robust and substantially enhances the final SR quality, indicating that our proposed training strategy is more effective.

Number of images used in fine-tuning. We employ self-supervised LR reconstruction fine-tuning on test images to optimize the SR model. This section investigates the impact of the number of fine-tuning images on the final performance. As indicated in Table 2, we establish a baseline without fine-tuning using ten real-world images. Conducting single-shot fine-tuning on individual images yields the most favorable results, allowing models to best adapt to the distribution of input images. Next, we conduct experiments involving collective fine-tuning on ten images. Results show significant improvements compared to the baseline but are not as effective as fine-tuning on individual images separately. Furthermore, we extend our study by fine-tuning the model using an additional forty images to investigate whether acquiring more images from the same sensor would refine the model further. Our findings indicate that, compared to training on ten images, there is a decline in LPIPS, while DISTS and MAD exhibit slight improvements. This suggests a trade-off between fine-tuning performance and efficiency.

High-frequency loss. Table 6 illustrates the impact of the introduced high-frequency loss. The integration of the high-frequency loss results in a notable improvement, affirming the efficacy of our design. Importantly, it enhances high-frequency recovery and avoids the negative impact of our training method on low-frequency areas.
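To make the idea of a DWT-based high-frequency weighting concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions rather than the exact loss used in our experiments: the single-level Haar transform, the way the three detail bands are combined into a weight map, and the function names `haar_detail_bands`, `hf_weight_map`, and `hf_weighted_l1` are all hypothetical choices made for this example.

```python
import numpy as np

def haar_detail_bands(img):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns the three detail (high-frequency) bands at half resolution.
    """
    a = img[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    row_detail = (a + b - c - d) / 2.0   # change across rows
    col_detail = (a - b + c - d) / 2.0   # change across columns
    diag_detail = (a - b - c + d) / 2.0  # diagonal change
    return row_detail, col_detail, diag_detail

def hf_weight_map(lr):
    """Per-pixel weight in [0, 1] that is large where the LR image has detail.

    Assumption: the weight is the normalized sum of absolute detail coefficients.
    """
    energy = sum(np.abs(band) for band in haar_detail_bands(lr))
    peak = energy.max()
    if peak > 0:
        energy = energy / peak
    # Repeat each coefficient over its 2x2 block to return to full resolution.
    return np.repeat(np.repeat(energy, 2, axis=0), 2, axis=1)

def hf_weighted_l1(lr, lr_reconstructed):
    """L1 reconstruction error weighted toward high-frequency regions."""
    weight = hf_weight_map(lr)
    return float(np.mean(weight * np.abs(lr - lr_reconstructed)))
```

With such a weighting, errors in flat regions contribute little to the loss, which matches the intent of limiting the influence of fine-tuning on low-frequency areas.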

Fine-tuning parameters. In our exploration of parameter fine-tuning, we observe that increasing the number of trained parameters results in higher PSNR values. However, the LPIPS score reaches its optimum when approximately 60% to 70% of the parameters are trained, as depicted in Figure 10. Considering the limitations of PSNR, we use LPIPS as our primary reference. Note that different networks and test sets may lead to different conclusions. The supplementary materials provide more details.

5 Conclusion

In conclusion, our proposed super-resolution training strategy, termed “Low-Res Leads the Way”, represents an innovative approach that successfully bridges the disparity between synthetic data supervised training and real-world test image self-supervision. Demonstrating impressive performance and robustness across various SR frameworks and real-world benchmarks, our method marks a significant advancement toward achieving effective real-world applications.

References

Ahn et al. [2018] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European conference on computer vision (ECCV), pages 252–268, 2018.
Bell-Kligler et al. [2019] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-gan. Advances in Neural Information Processing Systems, 32, 2019.
Bulat et al. [2018] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To learn image super-resolution, use a gan to learn how to do image degradation first. In Proceedings of the European conference on computer vision (ECCV), pages 185–200, 2018.
Cai et al. [2019] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019.
Chen et al. [2022] Chaofeng Chen, Xinyu Shi, Yipeng Qin, Xiaoming Li, Xiaoguang Han, Tao Yang, and Shihui Guo. Real-world blind super-resolution via feature matching with implicit high-resolution priors. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1329–1338, 2022.
Chen et al. [2021] Haoyu Chen, Jinjin Gu, and Zhi Zhang. Attention in attention network for image super-resolution. arXiv preprint arXiv:2104.09497, 2021.
Chen et al. [2023a] Haoyu Chen, Jinjin Gu, Yihao Liu, Salma Abdel Magid, Chao Dong, Qiong Wang, Hanspeter Pfister, and Lei Zhu. Masked image training for generalizable deep image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1692–1703, 2023a.
Chen et al. [2023b] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In CVPR, pages 22367–22377, 2023b.
Chen et al. [2023c] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, and Fisher Yu. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12312–12321, 2023c.
Chen et al. [2024] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, and Xiaokang Yang. Recursive generalization transformer for image super-resolution. In International Conference on Learning Representations (ICLR), 2024.
Cheng et al. [2020] Xi Cheng, Zhenyong Fu, and Jian Yang. Zero-shot image super-resolution with depth guided internal degradation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 265–280. Springer, 2020.
Dai et al. [2019] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11065–11074, 2019.
Ding et al. [2020] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, 44(5):2567–2581, 2020.
Dong et al. [2015] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
Fritsche et al. [2019] Manuel Fritsche, Shuhang Gu, and Radu Timofte. Frequency separation for real-world super-resolution. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3599–3608. IEEE, 2019.
Gu et al. [2020] Jinjin Gu, Haoming Cai, Haoyu Chen, Xiaoxing Ye, Jimmy Ren, and Chao Dong. Image quality assessment for perceptual image restoration: A new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002, 2020.
Hepburn et al. [2019] Alexander Hepburn, Valero Laparra, Ryan McConville, and Raul Santos-Rodriguez. Enforcing perceptual consistency on generative adversarial networks by using the normalised laplacian pyramid distance. arXiv preprint arXiv:1908.04347, 2019.
Jinjin et al. [2020] Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S Ren, and Dong Chao. Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 633–651. Springer, 2020.
Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
Kim et al. [2016a] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016a.
Kim et al. [2016b] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016b.
Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
Larson and Chandler [2010] Eric C Larson and Damon M Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of electronic imaging, 19(1):011006–011006, 2010.
Ledig et al. [2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
Li et al. [2020] Wenbo Li, Kun Zhou, Lu Qi, Nianjuan Jiang, Jiangbo Lu, and Jiaya Jia. Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. NeurIPS, 33:20343–20355, 2020.
Li et al. [2022a] Wenbo Li, Kun Zhou, Lu Qi, Liying Lu, and Jiangbo Lu. Best-buddy gans for highly detailed image super-resolution. In AAAI, pages 1412–1420, 2022a.
Li et al. [2023] Wenbo Li, Xin Lu, Shengju Qian, and Jiangbo Lu. On efficient transformer-based image pre-training for low-level vision. In IJCAI-23, pages 1089–1097. International Joint Conferences on Artificial Intelligence Organization, 2023. Main Track.
Li et al. [2022b] Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Jinjin Gu, Yu Qiao, and Chao Dong. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–843, 2022b.
Liang et al. [2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.
Liang et al. [2022] Jie Liang, Hui Zeng, and Lei Zhang. Efficient and degradation-adaptive network for real-world image super-resolution. In European Conference on Computer Vision, pages 574–591. Springer, 2022.
Lim et al. [2017] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
Lin et al. [2023] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Ben Fei, Bo Dai, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
Liu et al. [2022] Anran Liu, Yihao Liu, Jinjin Gu, Yu Qiao, and Chao Dong. Blind image super-resolution: A survey and beyond. IEEE transactions on pattern analysis and machine intelligence, 45(5):5461–5480, 2022.
Liu et al. [2020] Jie Liu, Wenjie Zhang, Yuting Tang, Jie Tang, and Gangshan Wu. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2359–2368, 2020.
Liu et al. [2023] Yihao Liu, Jingwen He, Jinjin Gu, Xiangtao Kong, Yu Qiao, and Chao Dong. Degae: A new pretraining paradigm for low-level vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23292–23303, 2023.
Maeda [2020] Shunta Maeda. Unpaired image super-resolution using pseudo-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 291–300, 2020.
Neshatavar et al. [2023] Reyhaneh Neshatavar, Mohsen Yavartanoo, Sanghyun Son, and Kyoung Mu Lee. Icf-srsr: Invertible scale-conditional function for self-supervised real-world single image super-resolution. arXiv preprint arXiv:2307.12751, 2023.
Park et al. [2023] JoonKyu Park, Sanghyun Son, and Kyoung Mu Lee. Content-aware local gan for photo-realistic super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10585–10594, 2023.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Shocher et al. [2018] Assaf Shocher, Nadav Cohen, and Michal Irani. “zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3118–3126, 2018.
Soh et al. [2020] Jae Woong Soh, Sunwoo Cho, and Nam Ik Cho. Meta-transfer learning for zero-shot super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3516–3525, 2020.
Sun et al. [2023] Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, and Yujiu Yang. Coser: Bridging image and language for cognitive super-resolution. arXiv preprint arXiv:2311.16512, 2023.
Ulyanov et al. [2018] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018.
Wang et al. [2023] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023.
Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021.
Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
Wei et al. [2020] Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixiang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pages 101–117. Springer, 2020.
Wei et al. [2021] Yunxuan Wei, Shuhang Gu, Yawei Li, Radu Timofte, Longcun Jin, and Hengjie Song. Unsupervised real-world image super resolution via domain-distance aware training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13385–13394, 2021.
Yuan et al. [2018] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang, Chao Dong, and Liang Lin. Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 701–710, 2018.
Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021.
Zhang et al. [2018a] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018a.
Zhang et al. [2023] Ruofan Zhang, Jinjin Gu, Haoyu Chen, Chao Dong, Yulun Zhang, and Wenming Yang. Crafting training degradation distribution for the accuracy-generalization trade-off in real-world super-resolution. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org, 2023.
Zhang et al. [2018b] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018b.
Zhang et al. [2018c] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018c.
Zhang et al. [2019] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.
Zhou et al. [2023] Hongyang Zhou, Xiaobin Zhu, Jianqing Zhu, Zheng Han, Shi-Xue Zhang, Jingyan Qin, and Xu-Cheng Yin. Learning correction filter via degradation-adaptive regression for blind single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12365–12375, 2023.
Zhou et al. [2022] Lin Zhou, Haoming Cai, Jinjin Gu, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Yu Qiao, and Chao Dong. Efficient image super-resolution using vast-receptive-field attention. In European Conference on Computer Vision, pages 256–272. Springer, 2022.
Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
Zou et al. [2022] Wenbin Zou, Tian Ye, Weixin Zheng, Yunchen Zhang, Liang Chen, and Yi Wu. Self-calibrated efficient transformer for lightweight super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 930–939, 2022.


Supplementary Material

A Experimental Details

A.1 Fine-tuning Details

Because the evaluated models differ in network architecture, we train a different subset of parameters for each of them. The choice of parameters is supported by the empirical experiments detailed in Section C.1.

• For BSRGAN, Real-ESRGAN+, and SwinIR-GAN, we freeze the initial layers to concentrate training on the deeper parameters.

• For FeMaSR, which is based on the VQGAN (Vector Quantized Generative Adversarial Network) structure, we focus on the parameters of the VQGAN encoder.

• StableSR, which utilizes a pre-trained diffusion model, applies a controllable feature wrapping (CFW) module with an adjustable coefficient to refine the outputs of the diffusion model during the decoding process of the autoencoder. We fine-tune the CFW module and part of the encoder.
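The selective freezing described above can be sketched as follows. This is a framework-agnostic illustration rather than our training code: `select_trainable` and the per-layer parameter counts are hypothetical, and the rule of unfreezing layers from the deepest one backwards until a target fraction is reached is an assumption made for the example.

```python
def select_trainable(layer_sizes, target_fraction):
    """Choose which layers to train when freezing the initial layers.

    layer_sizes: parameter count of each layer, ordered from input to output.
    target_fraction: desired share of trainable parameters, between 0 and 1.

    Returns (first_trainable_index, achieved_fraction). Layers with an index
    below first_trainable_index stay frozen.
    """
    total = sum(layer_sizes)
    trainable = 0
    first_trainable = len(layer_sizes)  # nothing is trainable yet
    # Walk from the deepest layer towards the input, unfreezing as we go.
    for idx in range(len(layer_sizes) - 1, -1, -1):
        if trainable >= target_fraction * total:
            break
        trainable += layer_sizes[idx]
        first_trainable = idx
    return first_trainable, trainable / total


# Hypothetical four-layer network, aiming for about 65% trainable parameters.
first, fraction = select_trainable([100, 200, 300, 400], 0.65)
print(first, fraction)  # layers 2 and 3 are trained, 70% of the parameters
```

In a deep learning framework, the returned index would then be used to disable gradient computation for the frozen layers.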

It usually takes 150 to 500 iterations to train. The time depends on the baseline network size, ranging from seconds to a few minutes. Our method can be fine-tuned either on individual images or on the entire test set assuming consistent degradation across the test set, which greatly reduces computational cost. Table 7 shows that our method takes only 8 minutes to fine-tune on the whole test set, much faster than others. Individual fine-tuning can improve the results if needed.

A.2 Testing Datasets

The validation of the effectiveness of our training method in real-world scenarios is conducted using real-world paired datasets, RealSR [4], and DRealSR [47]. These datasets are meticulously curated from various sensors to reflect different degradation characteristics inherent in each device. Furthermore, the datasets are segmented based on the capturing equipment. For RealSR, a 2 $\times$ scale factor is employed, with separate subsets for Canon and Nikon. In the case of DRealSR, a 4 $\times$ scale is applied across three subsets corresponding to Sony, Panasonic, and Olympus. To ensure a fair comparison with other models, we follow common settings employed by most methods. Each image is segmented into multiple smaller patches for performing 4 $\times$ super-resolution, with the patch size for LR images being 128 $\times$ 128 and for HR images being 512 $\times$ 512.
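The patch extraction can be illustrated with a short NumPy sketch. It assumes a non-overlapping grid and drops border remainders that do not fill a whole patch; both choices, as well as the helper name `paired_patches`, are assumptions for this example and not a description of the exact preprocessing scripts.

```python
import numpy as np

def paired_patches(lr, hr, lr_patch=128, scale=4):
    """Yield spatially aligned (LR, HR) patch pairs on a non-overlapping grid.

    lr: array of shape (H, W, C); hr: array of shape (H*scale, W*scale, C).
    Borders that do not fill a complete LR patch are skipped.
    """
    h, w = lr.shape[:2]
    if hr.shape[0] != h * scale or hr.shape[1] != w * scale:
        raise ValueError("HR size must equal LR size times the scale factor")
    for top in range(0, h - lr_patch + 1, lr_patch):
        for left in range(0, w - lr_patch + 1, lr_patch):
            lr_crop = lr[top:top + lr_patch, left:left + lr_patch]
            hr_crop = hr[top * scale:(top + lr_patch) * scale,
                         left * scale:(left + lr_patch) * scale]
            yield lr_crop, hr_crop
```

With the default arguments, every 128 $\times$ 128 LR patch is paired with the 512 $\times$ 512 HR patch covering the same image region.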

Method	LPIPS $\downarrow$	DISTS $\downarrow$	PSNR $\uparrow$	Fine-tuning time $\downarrow$
Ours ( $d=2048$ )	0.1629	0.1630	29.56	8 min
ZSSR	0.2424	0.1889	29.14	19 hr
KernelGAN+ZSSR	0.3315	0.2774	23.52	72 hr
deep plug-and-play	0.2604	0.2524	29.44	1.4 hr
deep image prior	0.2091	0.2054	29.32	28 hr

Table 7: Our method can be fine-tuned on the entire test set assuming consistent degradation across the test set, which greatly reduces computational cost. Other methods need to be trained on each individual image.

A.3 Evaluation Metrics

Following prior research [18, 16], our study adopts a carefully curated set of perceptual metrics, ones that have shown a higher correlation with human perception, including Learned Perceptual Image Patch Similarity (LPIPS) [51], Deep Image Structure and Texture Similarity (DISTS) [13], and Normalized Laplacian Pyramid Distance (NLPD) [17]. LPIPS and DISTS have been empirically validated in [18, 16] as more closely aligned with human visual assessment than other metrics. We also include traditional metrics, such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [46], and most apparent distortion (MAD) [23].

Previous studies [18, 16] investigated the correlation between human perception of image quality and various Image Quality Assessment (IQA) metrics, with the findings summarized in Table 8. The results reveal that MAD, LPIPS, and DISTS outperform the traditional PSNR and SSIM in the context of super-resolution evaluation. Specifically, MAD is the most accurate in assessing traditional SR methods, whereas LPIPS and DISTS are more accurate when evaluating GAN-based SR methods. In the overall comparison, DISTS emerges as the most effective metric for super-resolution assessment. These findings underscore the limitations of relying solely on conventional metrics such as PSNR and SSIM, and the importance of incorporating newer metrics like MAD, LPIPS, and DISTS for a more comprehensive and accurate evaluation of super-resolution techniques.

Metric	SR Full	Traditional SR	PSNR-oriented SR	GAN-based SR
PSNR	0.4099	0.4782	0.5462	0.2839
SSIM	0.5209	0.5856	0.6897	0.3388
MAD	0.5424	0.6720	0.7575	0.3494
LPIPS	0.5614	0.5487	0.6782	0.4882
DISTS	0.6544	0.6685	0.7733	0.5527

Table 8: The Spearman rank correlation coefficient (SRCC) between MOS (Mean Opinion Score) and various IQA (Image Quality Assessment) metrics across different distortion sub-types.
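For reference, the SRCC reported in Table 8 is the Pearson correlation computed on ranks. The sketch below shows this computation in NumPy; the helper names `average_ranks` and `srcc` are ours, and ties are handled by average ranks, which is a common convention that we assume here rather than a detail taken from [18, 16].

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])
```

For metrics where lower is better, such as LPIPS and DISTS, the raw correlation with MOS is negative, so the magnitude is what is usually compared across metrics.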

B LR Reconstruction Network

B.1 Degradation Encoder

Following the methodology proposed by Liu et al. [35], our degradation encoder is constructed by integrating a pre-trained SR-GAN model [24] and downsampling layers. This collaborative framework aims to produce degradation embeddings, denoted as $e$ , with a dimensionality of 512. The choice of a relatively small dimension for $e$ ensures that the degradation embeddings do not encapsulate intrinsic image information but are sufficiently representative of content pertaining specifically to the degradation process. This design principle is crucial in isolating and preserving only the features relevant to degradation, avoiding contamination with the original image characteristics.

B.2 Reconstructor

In our methodology, we incorporate a modulation-demodulation-convolution strategy reminiscent of Instance Normalization as employed in StyleGAN2 [19]. This approach effectively utilizes the degradation embedding $e$ to facilitate LR reconstruction when combined with the SR network’s output $I_{SR}$ . To delve deeper into the specifics of this strategy, during modulation, a style is learned from the provided degradation embedding $e$ . The modulation operation scales each input feature map of the convolution using the acquired style, as denoted by the equation

$$w^{\prime}_{ijk}=s_{i}\cdot w_{ijk},$$

where $w$ and $w^{\prime}$ represent the original and modulated weights, respectively. The scale factor $s_{i}$ corresponds to the $i$-th input feature map, and the indices $j$ and $k$ enumerate the output feature maps and the spatial footprint of the convolution, respectively. This modulation ensures that the convolutional features are adaptively adjusted based on the characteristics captured by the degradation embedding. Following modulation, a demodulation step yields the demodulated convolution weights

$$w^{\prime\prime}_{ijk}=\frac{w^{\prime}_{ijk}}{\sqrt{\sum_{i,k}{w^{\prime}_{ijk}}^{2}+\epsilon}},$$

where $\epsilon$ is a small constant that avoids numerical issues.

The primary objective of demodulation is to restore the outputs to a unit standard deviation, providing stability and normalizing the feature representations. It is crucial to emphasize that this modulation-demodulation-convolution strategy facilitates the integration of degradation-specific information into the LR reconstruction process. The adaptability of the convolutional features based on the learned style ensures that the network can effectively reconstruct LR inputs, enhancing the overall performance of the SR framework.
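The two weight transformations can be written compactly in NumPy. The snippet below is a minimal sketch, not our implementation: it takes the per-channel scales `s` as a given input, whereas in the network they would be produced from the degradation embedding $e$ by a learned mapping, and the function name `modulate_demodulate` and the weight layout `(in_channels, out_channels, footprint)` are assumptions chosen to mirror the indices $i$, $j$, $k$.

```python
import numpy as np

def modulate_demodulate(w, s, eps=1e-8):
    """Apply modulation and demodulation to convolution weights.

    w: weights of shape (in_channels, out_channels, footprint), i.e. w[i, j, k].
    s: per-input-channel scales of shape (in_channels,).
    """
    # Modulation: scale every weight that reads input feature map i by s_i.
    w_mod = s[:, None, None] * w
    # Demodulation: normalize each output feature map j over i and k.
    norm = np.sqrt((w_mod ** 2).sum(axis=(0, 2), keepdims=True) + eps)
    return w_mod / norm
```

After demodulation, the squared weights feeding each output feature map sum to approximately one, so multiplying all scales by a common factor leaves the result essentially unchanged.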

C More Experiment Results

C.1 The Effect of Fine-tuning Parameters for Different Network Architecture.

Figure 12 and Figure 13 show how the proportion of trained parameters affects the performance of the FeMaSR and SwinIR networks. For FeMaSR, the optimal PSNR is achieved when 86% of the parameters are trained, while the optimal LPIPS is obtained at 100%. In contrast, SwinIR attains the best PSNR and LPIPS values almost simultaneously, at 100% of the parameters.

C.2 Ablation on Model Size

Table 9 reports the performance of our method across a range of LR reconstruction network sizes, showing that it remains robust regardless of the network's capacity. From a large model with 12.9 million parameters to a compact version with only 495 thousand parameters, our approach consistently outperforms the baseline.

	Params	FLOPs	PSNR $\uparrow$	MAD $\downarrow$	LPIPS $\downarrow$	DISTS $\downarrow$
baseline	-	-	28.13	118.48	0.2302	0.2102
+ LWay (Large)	12.9 M	589.4 G	28.85	104.71	0.1722	0.1772
+ LWay (Medium)	5.38 M	117.6 G	29.50	99.76	0.1798	0.1810
+ LWay (Small)	2.77 M	44.49 G	28.69	106.42	0.1837	0.1862
+ LWay (Tiny)	495 K	19.38 G	28.70	104.92	0.1808	0.1842

Table 9: The performance of different model sizes. LWay is not contingent on the parameter count of the LR reconstruction network, demonstrating effectiveness even with a small parameter volume.

C.3 Performance on Different Degradation

Table 10 demonstrates the robustness of our ’LWay’ method in handling various types of image degradations. It presents notable improvements in PSNR and reductions in LPIPS for both real-world degradation and synthetic distortions such as blurring and JPEG compression, signifying our method’s efficacy in maintaining image integrity across different degradation scenarios.

	real-world degradation		synthetic, blur 17 $\times$ 17		synthetic, blur 11 $\times$ 11		synthetic, JPEG $q=15$
	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS	PSNR	LPIPS
baseline	28.13	0.2302	27.55	0.4065	27.5	0.3922	26.60	0.4240
+LWay ( $d$ =2048)	29.56	0.1629	28.39	0.2755	29.02	0.2265	26.85	0.3122

Table 10: Performance on different degradations. LWay improves image quality under a range of degradations.

D More Visual Results

D.1 LR Reconstruction Visualization

The visual outcomes of the LR reconstruction network are illustrated in Figure 14, encompassing HR, LR, and the reconstructed LR images. Notably, our network demonstrates the capability to restore LR images that closely approximate the ground truth LR by extracting a 512-dimensional degradation embedding solely from the LR input and subsequently integrating it with the HR image. This process demonstrates the effectiveness of our LR reconstruction approach in achieving visually compelling results. The showcased robustness of our LR reconstruction network is particularly noteworthy. Given that the transition from HR to LR is generally considered easier compared to the reverse process, our method exhibits a heightened degree of resilience with limited data. Leveraging only a finite dataset, our approach achieves a robust performance, underscoring its capacity to generalize and adapt well to diverse LR input scenarios.

D.2 More Visual Comparison

Figure 15 and Figure 16 provide additional comparisons of our proposed method with other state-of-the-art (SOTA) approaches. Our method effectively restores the texture and fine details of images. In contrast, DASR, DiffBIR, and StableSR tend to produce smoother results at the expense of texture details. ZSSR exhibits limited restoration capability, resulting in less clear outputs that are less faithful to the LR input. The results generated by LDM display texture details that are inconsistent with the ground truth. DARSR is prone to failure and introduces significant color bias, and both DARSR and CAL_GAN exhibit varying degrees of artifacts in their outputs. These comparisons underscore the ability of our method to preserve intricate details and textures during super-resolution, whereas other methods tend to sacrifice fine details for smoother results, introduce artifacts, or misrepresent texture details.

E Discussion

Differences to optimization-based methods. Optimization-based methods, relying on pre-defined degradation models or downsampling operators, have limited capabilities in handling complex degradation. They are also time-consuming and difficult to scale to large datasets. In contrast, our approach incorporates more general and robust degradation modeling. Moreover, our method combines the benefits of supervised and unsupervised training, outperforming optimization-based methods that only use test images.

Differences to KernelGAN. KernelGAN's discriminator only makes binary judgments (0/1), while LWay uses pixel-level regression to better capture the distribution. Moreover, KernelGAN's locally estimated kernels carry limited information and are less robust in real-world scenarios, whereas our embedding draws on richer external priors rather than relying solely on learning from test images, and is robust as demonstrated by our validation.

F Limitation

The proposed architecture excels in extracting and restoring information from low-resolution (LR) images, especially when they contain discernible texture details. It is within these conditions that our method showcases its maximum effectiveness. However, a limitation might arise when the LR images themselves lack texture details, impeding the model’s capability to execute effective restoration.
