License: arXiv.org perpetual non-exclusive license
arXiv:2403.02601v1 [eess.IV] 05 Mar 2024

Low-Res Leads the Way:
Improving Generalization for Super-Resolution by Self-Supervised Learning

Haoyu Chen¹, Wenbo Li², Jinjin Gu³, Jingjing Ren¹, Haoze Sun⁴,
Xueyi Zou², Zhensong Zhang², Youliang Yan², Lei Zhu¹,⁵
¹The Hong Kong University of Science and Technology (Guangzhou) ²Huawei Noah’s Ark Lab ³The University of Sydney
⁴Tsinghua University ⁵The Hong Kong University of Science and Technology
Project page: https://haoyuchen.com/LWay
Lei Zhu ([email protected]) is the corresponding author.
Abstract

For image super-resolution (SR), bridging the gap between the performance on synthetic datasets and real-world degradation scenarios remains a challenge. This work introduces a novel “Low-Res Leads the Way” (LWay) training framework, merging Supervised Pre-training with Self-supervised Learning to enhance the adaptability of SR models to real-world images. Our approach utilizes a low-resolution (LR) reconstruction network to extract degradation embeddings from LR images, merging them with super-resolved outputs for LR reconstruction. Leveraging unseen LR images for self-supervised learning guides the model to adapt its modeling space to the target domain, facilitating fine-tuning of SR models without requiring paired high-resolution (HR) images. The integration of Discrete Wavelet Transform (DWT) further refines the focus on high-frequency details. Extensive evaluations show that our method significantly improves the generalization and detail restoration capabilities of SR models on unseen real-world datasets, outperforming existing methods. Our training regime is universally compatible, requiring no network architecture modifications, making it a practical solution for real-world SR applications.

1 Introduction

Figure 1: Our proposed training method combines the benefits of supervised learning (SL) on synthetic data and self-supervised learning (SSL) on unseen test images, achieving high-quality and high-fidelity SR results.

Image super-resolution (SR) aims to restore high-resolution (HR) images from their low-resolution (LR) or degraded counterparts. The inception of the deep-learning-based SR model can be traced back to SRCNN [14]. Recently, advancements in deep learning models have substantially enhanced SR performance [29, 27, 8, 6, 54, 1, 12, 55, 42, 59, 9, 10, 28, 57], particularly in addressing specific degradation types like bicubic downsampling. Nevertheless, the efficacy of SR models is generally restricted by the degradation strategies employed during the training phase, posing great challenges in complex real-world applications.

In the realm of real-world SR, as shown in Figure 2, training approaches can be categorized into three main paradigms.

(a) Unsupervised Learning with Unpaired Data: Methods within this paradigm [43, 49, 58, 2, 41, 48, 15, 3] commonly use a Generative Adversarial Network (GAN) architecture to learn target distributions without paired data. Using one or multiple discriminators, they distinguish between generated images and actual samples, guiding the generator to model the target distribution accurately. However, because this approach relies heavily on external data, it encounters significant challenges when target-domain data is scarce, as is common in real-world scenarios. The GAN framework for unsupervised learning has further drawbacks. First, it inherently struggles with stability during training, leading to noticeable artifacts in SR outputs. Second, it is difficult for the single 0/1 decision plane modelled by a discriminator to accurately separate the target domain [33], which can result in imprecise distribution learning.

(b) Supervised Learning with Paired Synthetic Data: BSRGAN [50] and Real-ESRGAN [45] have largely enhanced the generalization ability of SR models by simulating more realistic degradation. However, synthetic data, despite mimicking certain real-world conditions, inadequately captures the complex and variable nature of real scenarios, so the gap between synthetic and real degradation persists. Consequently, the limited degradation patterns in synthetic data may lead to over-smoothing, sacrificing crucial details and textures. Adapting effectively to complex, variable, or unknown degradations thus remains a formidable challenge.

(c) Self-supervised Learning with a Single Image: Techniques in this category [40, 37, 11] leverage the intrinsic statistical characteristics of natural images, eliminating the need for external datasets. Generally, these methods enable self-supervised learning directly from the input LR image. Despite its inherent flexibility, this approach may be less effective on images that lack repetitive patterns. As a result, in real-world scenarios where the necessary recurring structures are absent, these techniques tend to underperform compared to supervised learning methods that employ paired synthetic data.

It’s notable that real LR/HR image pairs in the target domain are often prohibitively expensive or unavailable. Furthermore, a significant gap persists between synthesized data and real-world data. Given the intrinsic limitations of current methodologies, a critical question arises: Is there an approach that combines the strengths of these diverse strategies? In addressing this, we propose the novel “Low-Res Leads the Way” (LWay) training framework, which merges supervised learning (SL) pre-training with self-supervised learning (SSL) (see Figure 2 (d)). This approach aims to narrow the disparity between synthetic training data and real test images, as depicted in Figure 1. By integrating supervised learning’s predictive capabilities with the ability to swiftly adapt to unique characteristics present in test LR images, this framework effectively produces high-quality results for unseen real-world images.

Figure 2: Comparison of different learning approaches for real-world image SR.

The initial step involves training an LR reconstruction network specifically designed to extract a degradation embedding from the LR image. This degradation embedding is then applied to the HR image, facilitating the re-generation of LR content. Upon encountering a test image, we derive its super-resolved result from an off-the-shelf SR model pre-trained on synthetic data. This output is fed into the fixed LR reconstruction network to produce the corresponding degraded counterpart. Subsequently, a self-supervised loss is computed by comparing this degraded counterpart to the original LR image, thereby updating specific parameters within the SR model. Given our observation that pre-trained SR models adeptly handle low-frequency domains but falter in high-frequency areas, we incorporate Discrete Wavelet Transform (DWT) to isolate high-frequency elements from the LR image. This component effectively shifts the model’s focus to the recuperation of high-frequency nuances, and avoids negative impacts on low-frequency areas.

With this innovative framework, our approach eliminates the need for paired LR/HR target domain images, significantly enhancing the performance of SL pre-trained models on unseen real-world data. Our method not only retains the essential content of LR images but also adds high-definition characteristics, ensuring a balance between fidelity and quality. Moreover, this training regime requires no modifications to the network architecture, offering broad compatibility across all SR models. Through extensive evaluations on real-world datasets, we have demonstrated our method’s substantial improvements in generalization performance.

2 Related Work

2.1 Supervised Learning for Real-World SR

While recent years have witnessed significant advancements in the field of super-resolution (SR), conventional SR models such as SRCNN [14], VDSR [20], EDSR [31], RCAN [53], among others [54, 1, 12, 6, 55, 25, 26, 29, 34, 21, 27, 9, 10, 28, 57], have predominantly relied upon predefined degradation processes, such as bicubic downsampling. This simplification, while contributing to the theoretical understanding of SR, often falls short in capturing the intricate and diverse degradations inherent in real-world imaging scenarios, limiting practical adaptability across applications. Consequently, there is a pressing need to explore more sophisticated and realistic degradation models.

To this end, recent efforts have been directed toward methods capturing paired low-resolution (LR) and high-resolution (HR) images from real-world environments, as demonstrated by datasets like RealSR [4] and DRealSR [47]. However, these methods face challenges, including precise image alignment, complex hardware setups, and specific degradation characteristics (e.g., Canon 5D3 and Nikon D810 cameras in RealSR), posing obstacles to practicality and scalability. Recent techniques, including Real-ESRGAN [45] and BSRGAN [50], have attempted to address these shortcomings by synthesizing LR images with more realistic degradation. Despite these advancements, a notable disparity persists between synthesized and authentic degradation. This often results in over-smoothed images that sacrifice fine textural details, as illustrated by [52]. Certain studies [7] have endeavored to enhance the generalizability using limited degradation data; however, the practical application scenarios remain restricted.

As a result, there is a growing demand for innovative approaches that are capable of adapting to the intricate and mixed degradation patterns that typify real-world applications. The SR results should not only exhibit high resolution but also encompass rich detail, ensuring fidelity.

Figure 3: The proposed training pipeline (LWay) consists of two steps. In Step 1, we pre-train an LR reconstruction network to capture a degradation embedding from LR images. This embedding is then applied to HR images, regenerating LR content. In Step 2, for test images, a pre-trained SR model generates SR outputs, which are then degraded by the fixed LR reconstruction network. We iteratively update the SR model using a self-supervised learning loss applied to LR images, with a focus on high-frequency details through a weighted loss. This refinement process enhances the SR model’s generalization performance on previously unseen images.

2.2 Unsupervised Learning for Real-world SR

Unsupervised super-resolution [43, 49, 58, 2, 41, 48, 15, 3] serves as a technique to mitigate generation bias inherent in synthetic datasets. These approaches deviate from the conventional reliance on extensive paired data by harnessing the data-generating capabilities inherent in convolutional neural networks (CNNs). Ulyanov et al. [43] posited CNNs as implicit priors for capturing natural image statistics, a concept further explored by the Zero-Shot Super-Resolution (ZSSR) [40] model, which uniquely tailors SR algorithms to the repeating patterns within the input image itself. Generative Adversarial Networks (GANs) have significantly propelled the field forward. KernelGAN [2], for instance, aligns the statistical distribution of downscaled images with their original versions, enhancing the refinement of SR methods’ outputs. CinCGAN [49] marks an early exploration into utilizing unpaired data for implicit degradation modeling. It employs a strategy that transforms LR images into noise-free ‘clean’ states through bicubic downsampling. This approach, backed by a dual CycleGAN architecture [58], fosters a cycle-consistent adaptation that eliminates the need for paired datasets. The unsupervised approach utilizing GANs also encompasses methods such as Degradation GAN [3], FSSR [15], DASR [48] and pseudo-supervision [36], which all employ discriminators to learn the distributions of HR or LR images, or even clean LR images. These methods are instrumental in constraining the network to transform the generated images to align with the corresponding distributions.

Despite considerable advancements in unsupervised methods, they still exhibit certain limitations. For instance, ZSSR and similar methods typically assume that images contain repetitive patterns. GAN-based approaches require substantial data to fit specific degradation types effectively. They also face stability challenges during training, which often result in artifacts in SR outputs. Furthermore, a discriminator’s binary (0/1) decision struggles to delineate the target domain accurately, which can lead to imprecise learning of distributions. These constraints limit the practical utility of these methods in real-world scenarios, making the exploration of more generalized and flexible approaches imperative.

3 Method

In the pursuit of practical applications for image SR, we introduce an unprecedented training methodology. This novel strategy marks a departure from established paradigms, fusing the precision of supervised pre-training with the innovation of self-supervised learning to address the complexities of real-world image degradation. Our proposed framework is detailed in Figure 3.

3.1 LR Reconstruction Pre-training

We introduce an LR reconstruction branch that plays a pivotal role in fine-tuning our SR model $\mathcal{S}$ on test images drawn from real-world environments. Central to this process is the Degradation Encoder $\mathcal{E}$, which distills the degradation signature of an LR image $I_{\text{LR}}$ into a compact 512-dimensional degradation embedding $\mathbf{e} = \mathcal{E}(I_{\text{LR}})$. The Reconstructor $\mathcal{R}$ then uses $\mathbf{e}$ and a high-resolution image $I_{\text{HR}}$ to synthesize an estimated LR image $\hat{I}_{\text{LR}} = \mathcal{R}(I_{\text{HR}}, \mathbf{e})$. To ensure the integrity of $\mathbf{e}$, we use a two-term loss function $\mathcal{L}$ that combines an L1 norm with the Learned Perceptual Image Patch Similarity (LPIPS) metric: $\mathcal{L}(I_{\text{LR}}, \hat{I}_{\text{LR}}) = \mathcal{L}_{1} + \mathcal{L}_{\text{LPIPS}}$.
Notably, the LR reconstruction branch is robust and requires only minimal data for training, which is precisely why we advocate its inclusion. Even when faced with new forms of degradation, its support for fine-tuning the SR model remains uncompromised. The efficiency and robustness of this approach, pivotal in our methodology, will be detailed and validated in the following sections.
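
One way to read the pre-training objective described above is sketched below. The symbols $\phi_{\mathcal{E}}$ and $\phi_{\mathcal{R}}$ (the trainable weights of the encoder and the reconstructor) and the sum over training pairs are notational assumptions introduced here for illustration; they do not appear in the text.

```latex
\begin{aligned}
\hat{I}_{\text{LR}} &= \mathcal{R}_{\phi_{\mathcal{R}}}\big(I_{\text{HR}},\, \mathcal{E}_{\phi_{\mathcal{E}}}(I_{\text{LR}})\big), \\
\min_{\phi_{\mathcal{E}},\,\phi_{\mathcal{R}}} \;& \sum_{(I_{\text{LR}},\, I_{\text{HR}})}
  \Big[ \mathcal{L}_{1}(I_{\text{LR}}, \hat{I}_{\text{LR}}) + \mathcal{L}_{\text{LPIPS}}(I_{\text{LR}}, \hat{I}_{\text{LR}}) \Big].
\end{aligned}
```

Under this reading, the reconstructor already receives the image content through $I_{\text{HR}}$, so the embedding plausibly only needs to carry what separates $I_{\text{LR}}$ from $I_{\text{HR}}$, namely the degradation.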

3.2 Self-supervised Learning on Test Images

Our approach fine-tunes a subset of the parameters of an SR network, tailoring it to previously unseen real-world images. This refines the SR network so that it can handle the complexities of actual degradation patterns. For a real-world LR test image $I_{\text{LR}}^{\text{test}}$, the SR network $\mathcal{S}$ initially produces a super-resolved image $I_{\text{SR}}^{\text{init}}$. The pre-trained LR reconstruction branch, with its parameters frozen, extracts a degradation embedding $\mathbf{e}^{\text{test}}$ from $I_{\text{LR}}^{\text{test}}$, expressed as $\mathbf{e}^{\text{test}} = \mathcal{E}(I_{\text{LR}}^{\text{test}})$. The self-supervised fine-tuning then commences, leveraging $I_{\text{SR}}^{\text{init}}$ and $\mathbf{e}^{\text{test}}$ to adjust a specific subset of the SR network’s parameters $\theta_{\text{ft}}$. This fine-tuning is formulated as an optimization problem:

$$\theta_{\text{ft}}^{*} = \arg\min_{\theta_{\text{ft}}} \mathcal{L}(\mathcal{R}(\mathcal{S}_{\theta}(I_{\text{LR}}^{\text{test}}), \mathbf{e}^{\text{test}}), I_{\text{LR}}^{\text{test}}),$$

where $\theta_{\text{ft}}^{*}$ denotes the optimized values of the fine-tuned subset of the full model parameters $\theta$.

This strategic adjustment improves the SR network’s ability to reconstruct images with high fidelity to the LR inputs and to generalize to real-world degradation without the need for paired data.
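
The optimization above can be illustrated with a deliberately tiny NumPy sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: the "SR model" is nearest-neighbour 4× upsampling with a trainable gain (playing the role of $\theta_{\text{ft}}$) and a frozen bias, the "reconstructor" is 4×4 average pooling scaled by a one-element embedding (the real embedding is 512-dimensional), the loss keeps only the L1 term, and plain subgradient descent replaces Adam.

```python
import numpy as np

def upsample4(x):
    """Nearest-neighbour 4x upsampling (toy stand-in for an SR backbone)."""
    return np.repeat(np.repeat(x, 4, axis=0), 4, axis=1)

def avgpool4(y):
    """4x4 average pooling (toy stand-in for the downsampling part of R)."""
    h, w = y.shape
    return y.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

def sr_model(lr_img, gain, bias):
    """Toy S_theta: 'gain' is the fine-tuned subset, 'bias' stays frozen."""
    return gain * upsample4(lr_img) + bias

def reconstructor(sr_img, e):
    """Toy frozen R: e[0] acts as a contrast factor describing the degradation."""
    return e[0] * avgpool4(sr_img)

def finetune(lr_img, e, gain=1.0, bias=0.0, step=0.01, iters=200):
    """Minimise the L1 distance between R(S(lr_img), e) and lr_img w.r.t. gain."""
    losses = []
    for _ in range(iters):
        residual = reconstructor(sr_model(lr_img, gain, bias), e) - lr_img
        losses.append(np.abs(residual).mean())
        # avgpool4(upsample4(x)) == x, so d(residual)/d(gain) = e[0] * lr_img
        grad = np.mean(np.sign(residual) * e[0] * lr_img)
        gain -= step * grad
    return gain, losses

lr_test = np.random.default_rng(0).random((16, 16))
e_test = np.array([0.8])  # hypothetical embedding; no HR image is used anywhere
gain, losses = finetune(lr_test, e_test)
print(f"gain: 1.0 -> {gain:.3f}, L1 loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this toy setting the gain drifts from 1.0 toward 1/0.8 = 1.25, the value at which the re-degraded output matches the LR input, which mirrors how the LR image alone supervises the update.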

Focused enhancement of high-frequency details. Conventional SR methods tend to proficiently reconstruct low-frequency regions but often neglect or inadequately restore high-frequency details. In addition, the low-frequency regions do not require LR reconstruction due to the absence of detailed texture. Therefore, our approach aims to concentrate the LR reconstruction process specifically on high-frequency areas, thereby preventing the introduction of artifacts into the low-frequency areas. Specifically, we apply Discrete Wavelet Transform (DWT) to obtain the high-frequency component, and then normalize it to yield a weight map $\mathbf{W} \in [0, 1]$. This weight map is then utilized to calculate a weighted loss, ensuring the fidelity to high-frequency details:

$$\mathcal{L} = \mathcal{L}_{1}(\mathbf{W} \odot \hat{I}_{\text{LR}}^{\text{test}}, \mathbf{W} \odot I_{\text{LR}}^{\text{test}}) + \mathcal{L}_{\text{LPIPS}}(\mathbf{W} \odot \hat{I}_{\text{LR}}^{\text{test}}, \mathbf{W} \odot I_{\text{LR}}^{\text{test}}),$$

where $\odot$ denotes element-wise multiplication. The combined loss effectively guides the network to restore high-frequency details with greater precision, improving the perceptual quality of the super-resolved image without compromising low-frequency content.
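
A minimal NumPy sketch of such a weight map follows. The text does not specify the wavelet, the decomposition level, how the detail subbands are combined, or the normalization, so the choices here (single-level Haar, summed absolute detail coefficients, min-max normalization, nearest-neighbour upsampling back to the input size) are assumptions made for illustration. Only the L1 term of the weighted loss is shown.

```python
import numpy as np

def haar_highfreq_magnitude(img):
    """Single-level 2D Haar DWT of an even-sized grayscale image.

    Returns the summed magnitude of the three detail subbands (half resolution).
    """
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    detail_rows = (a + b - c - d) / 2.0  # change between the two rows of each block
    detail_cols = (a - b + c - d) / 2.0  # change between the two columns
    detail_diag = (a - b - c + d) / 2.0  # diagonal change
    return np.abs(detail_rows) + np.abs(detail_cols) + np.abs(detail_diag)

def highfreq_weight_map(img):
    """Weight map W in [0, 1] at the input resolution (assumed min-max normalization)."""
    hf = haar_highfreq_magnitude(img)
    hf = np.repeat(np.repeat(hf, 2, axis=0), 2, axis=1)
    span = hf.max() - hf.min()
    if span == 0:
        return np.zeros_like(hf)
    return (hf - hf.min()) / span

def weighted_l1(pred, target, w):
    """L1 term of the weighted loss: mean |W * pred - W * target|."""
    return np.abs(w * pred - w * target).mean()

# Flat left half, checkerboard texture on the right half.
img = np.full((8, 8), 0.5)
ii, jj = np.indices((8, 4))
img[:, 4:] = (ii + jj) % 2
W = highfreq_weight_map(img)
print(W[0])  # zeros over the flat half, ones over the textured half
```

With this map, an error confined to the flat region contributes nothing to the loss, while the same error in the textured region is penalized in full.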

Figure 4: The SR model advances through the proposed fine-tuning iterations, moving from the supervised learning (SL) space of synthetic degradation to the self-supervised learning (SSL) space learned from test images. This results in enhanced SR quality and fidelity.

3.3 Discussion

By combining supervised learning (SL) on synthetic data with self-supervised learning (SSL) on test images with unknown degradation, we dynamically adjust the modeling space based on the intrinsic features of test images, steering the SL space towards a more precise SSL space. Figure 4 shows the effectiveness of our method during the fine-tuning process. Our method achieves high-quality and high-fidelity SR while maintaining general compatibility across all models. The primary advantages of our approach compared to other methods are as follows:

General Degradation Modeling. The transformation from LR to HR images is recognized as a challenging task, while the reverse HR-to-LR transformation is comparatively simpler and more robust. Our method capitalizes on this observation, avoiding excessive reliance on extensive paired datasets. Instead, we opt to pre-train a universal degradation embedding extraction and LR reconstruction model. This characteristic ensures that our approach is not bound by assumptions of uniform degradation across image datasets. During the training of the SR model, these parameters remain fixed, allowing the SR model to adapt flexibly to unknown distributions in real-world scenarios. In contrast, CycleGAN-based methods simultaneously learn the mappings from LR to HR and from HR to LR. This process relies heavily on a substantial amount of data. Furthermore, because CycleGAN implicitly learns the HR-to-LR mapping without an explicit degradation extraction process, its underlying assumption is that the degradation across the entire dataset is consistent. Consequently, it can only fit certain degradation patterns, which largely impacts its performance in real-world scenarios with limited data availability.

Dense pixelwise self-supervision. Through self-supervised learning, our method operates independently of external labels, leveraging dense LR pixel-level signals for supervision. This allows the model to learn richer texture features from the intrinsic image structure. This stands in contrast to traditional supervised approaches that rely on discriminators, which may learn inaccurate features due to the sparsity of supervision signals, leading to suboptimal results.

Robust regularization. Our approach can be viewed as a form of regularization constraint. By integrating degradation embedding extraction and decoupling it from the LR image reconstruction, our method maintains effectiveness in guiding the reconstruction process even when faced with imperfect degradation prediction. This substantially boosts the robustness of our approach, enabling it to learn rich and accurate texture information from the test images.

4 Experiments

Dataset | Sensor | Metric | Real-ESRGAN+ (CNN-based) | + LWay | Gain | BSRGAN (CNN-based) | + LWay | Gain | SwinIR-GAN (Transformer-based) | + LWay | Gain | FeMaSR (VQ-based) | + LWay | Gain | StableSR (Diffusion-based) | + LWay | Gain
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
RealSR | Canon | PSNR ↑ | 27.51 | 29.18 | +1.67 | 28.81 | 28.85 | +0.04 | 28.12 | 28.96 | +0.84 | 25.72 | 28.16 | +2.44 | 25.50 | 27.22 | +1.72
RealSR | Canon | SSIM ↑ | 0.8348 | 0.8688 | +0.034 | 0.8473 | 0.8496 | +0.0023 | 0.8486 | 0.8579 | +0.0093 | 0.7811 | 0.8383 | +0.0572 | 0.7684 | 0.8043 | +0.0359
RealSR | Canon | LPIPS ↓ | 0.1947 | 0.1479 | -0.0468 | 0.1988 | 0.1572 | -0.0416 | 0.1850 | 0.1469 | -0.0381 | 0.2543 | 0.1747 | -0.0796 | 0.2636 | 0.2019 | -0.0617
RealSR | Canon | MAD ↓ | 133.96 | 111.91 | -22.05 | 119.08 | 116.77 | -2.31 | 125.17 | 111.71 | -13.46 | 143.38 | 117.48 | -25.90 | 145.36 | 124.15 | -21.21
RealSR | Canon | NLPD ↓ | 0.2807 | 0.2437 | -0.037 | 0.2594 | 0.2569 | -0.0025 | 0.2670 | 0.2541 | -0.0129 | 0.3239 | 0.2778 | -0.0461 | 0.3426 | 0.3074 | -0.0352
RealSR | Canon | DISTS ↓ | 0.1621 | 0.1444 | -0.0177 | 0.1794 | 0.1558 | -0.0236 | 0.1557 | 0.1352 | -0.0205 | 0.2116 | 0.1808 | -0.0308 | 0.1897 | 0.1596 | -0.0301
RealSR | Nikon | PSNR ↑ | 26.81 | 28.58 | +1.77 | 28.13 | 28.65 | +0.52 | 27.54 | 28.55 | +1.01 | 25.41 | 27.87 | +2.46 | 25.54 | 26.92 | +1.38
RealSR | Nikon | SSIM ↑ | 0.7861 | 0.8249 | +0.0388 | 0.8012 | 0.8057 | +0.0045 | 0.8043 | 0.813 | +0.0087 | 0.7314 | 0.7936 | +0.0622 | 0.7370 | 0.7686 | +0.0316
RealSR | Nikon | LPIPS ↓ | 0.2300 | 0.1769 | -0.0531 | 0.2302 | 0.1750 | -0.0552 | 0.2154 | 0.176 | -0.0394 | 0.2738 | 0.2028 | -0.071 | 0.2711 | 0.2156 | -0.0555
RealSR | Nikon | MAD ↓ | 131.62 | 108.18 | -23.44 | 118.48 | 105.64 | -12.84 | 122.65 | 106.73 | -15.92 | 137.54 | 110.79 | -26.75 | 139.26 | 119.29 | -19.97
RealSR | Nikon | NLPD ↓ | 0.3061 | 0.2667 | -0.0394 | 0.2805 | 0.2758 | -0.0047 | 0.2844 | 0.272 | -0.0124 | 0.3419 | 0.297 | -0.0449 | 0.3513 | 0.3215 | -0.0298
RealSR | Nikon | DISTS ↓ | 0.1950 | 0.1714 | -0.0236 | 0.2102 | 0.1791 | -0.0311 | 0.1842 | 0.1639 | -0.0203 | 0.2340 | 0.2042 | -0.0298 | 0.2131 | 0.1837 | -0.0294
DRealSR | Sony | PSNR ↑ | 30.16 | 31.4 | +1.24 | 30.47 | 31.23 | +0.76 | 29.92 | 30.77 | +0.85 | 27.51 | 29.75 | +2.24 | 28.63 | 29.28 | +0.65
DRealSR | Sony | SSIM ↑ | 0.8326 | 0.8597 | +0.0271 | 0.8260 | 0.8442 | +0.0182 | 0.8213 | 0.8398 | +0.0185 | 0.7725 | 0.8096 | +0.0371 | 0.7648 | 0.7785 | +0.0137
DRealSR | Sony | LPIPS ↓ | 0.2488 | 0.2341 | -0.0147 | 0.2685 | 0.2469 | -0.0216 | 0.2565 | 0.2383 | -0.0182 | 0.3228 | 0.2931 | -0.0297 | 0.3331 | 0.3017 | -0.0314
DRealSR | Sony | MAD ↓ | 125.20 | 112.1 | -13.10 | 123.22 | 115.14 | -8.08 | 124.85 | 114.09 | -10.76 | 140.50 | 125.52 | -14.98 | 141.13 | 130.01 | -11.12
DRealSR | Sony | NLPD ↓ | 0.3032 | 0.2751 | -0.0281 | 0.3034 | 0.2857 | -0.0177 | 0.3105 | 0.2895 | -0.021 | 0.3502 | 0.3152 | -0.035 | 0.3503 | 0.3402 | -0.0101
DRealSR | Sony | DISTS ↓ | 0.1859 | 0.1765 | -0.0094 | 0.2115 | 0.1934 | -0.0181 | 0.1883 | 0.1783 | -0.01 | 0.2314 | 0.2168 | -0.0146 | 0.2296 | 0.2176 | -0.012
DRealSR | Olympus | PSNR ↑ | 29.53 | 29.88 | +0.35 | 29.16 | 29.4 | +0.24 | 28.94 | 29.57 | +0.63 | 26.42 | 28.26 | +1.84 | 28.69 | 29.05 | +0.36
DRealSR | Olympus | SSIM ↑ | 0.8050 | 0.8206 | +0.0156 | 0.7931 | 0.7944 | +0.0013 | 0.8002 | 0.8071 | +0.0069 | 0.6976 | 0.7557 | +0.0581 | 0.7460 | 0.7487 | +0.0027
DRealSR | Olympus | LPIPS ↓ | 0.3107 | 0.308 | -0.0027 | 0.3275 | 0.2926 | -0.0349 | 0.3184 | 0.3093 | -0.0091 | 0.4129 | 0.3762 | -0.0367 | 0.3853 | 0.3800 | -0.0053
DRealSR | Olympus | MAD ↓ | 127.91 | 125.04 | -2.87 | 130.94 | 126.87 | -4.07 | 131.73 | 126.04 | -5.69 | 151.35 | 138.85 | -12.50 | 137.60 | 132.71 | -4.89
DRealSR | Olympus | NLPD ↓ | 0.3016 | 0.2899 | -0.0117 | 0.3157 | 0.3129 | -0.0028 | 0.3093 | 0.3005 | -0.0088 | 0.3897 | 0.3425 | -0.0472 | 0.3410 | 0.3353 | -0.0057
DRealSR | Olympus | DISTS ↓ | 0.2130 | 0.2118 | -0.0012 | 0.2276 | 0.2145 | -0.0131 | 0.2181 | 0.2109 | -0.0072 | 0.2552 | 0.2406 | -0.0146 | 0.2412 | 0.2371 | -0.0041
DRealSR | Panasonic | PSNR ↑ | 29.81 | 30.83 | +1.02 | 29.98 | 31.05 | +1.07 | 29.11 | 30.94 | +1.83 | 27.83 | 29.44 | +1.61 | 29.13 | 29.88 | +0.75
DRealSR | Panasonic | SSIM ↑ | 0.8094 | 0.8283 | +0.0189 | 0.7987 | 0.8236 | +0.0249 | 0.7918 | 0.8193 | +0.0275 | 0.7413 | 0.7798 | +0.0385 | 0.7428 | 0.7554 | +0.0126
DRealSR | Panasonic | LPIPS ↓ | 0.2592 | 0.2581 | -0.0011 | 0.2738 | 0.2624 | -0.0114 | 0.2688 | 0.2517 | -0.0171 | 0.3144 | 0.2973 | -0.0171 | 0.3143 | 0.3021 | -0.0122
DRealSR | Panasonic | MAD ↓ | 124.51 | 116.18 | -8.33 | 124.38 | 114.04 | -10.34 | 126.61 | 112.79 | -13.82 | 137.50 | 124.81 | -12.69 | 132.36 | 122.85 | -9.51
DRealSR | Panasonic | NLPD ↓ | 0.304 | 0.2825 | -0.0215 | 0.3109 | 0.2852 | -0.0257 | 0.3184 | 0.2869 | -0.0315 | 0.3604 | 0.3215 | -0.0389 | 0.3444 | 0.3312 | -0.0132
DRealSR | Panasonic | DISTS ↓ | 0.2000 | 0.1974 | -0.0026 | 0.2130 | 0.2021 | -0.0109 | 0.2046 | 0.1948 | -0.0098 | 0.2243 | 0.2121 | -0.0122 | 0.2255 | 0.2196 | -0.0059
Table 1: The performance improvements across various model types utilizing our proposed training methodology.

# of Fine-tuning Images Per Model | Description | LPIPS ↓ | DISTS ↓ | MAD ↓
--- | --- | --- | --- | ---
0 | Baseline, without fine-tuning | 0.3136 | 0.2353 | 117.71
1 | Fine-tuning on each image individually | 0.2351 | 0.1919 | 111.46
10 | Fine-tuning on the entire test set | 0.2536 | 0.2044 | 111.63
50 | Fine-tuning with 40 additional images from the same sensors | 0.2571 | 0.2037 | 108.62
Table 2: The impact of the number of images used in a single fine-tuning run. Our method can be fine-tuned either on individual images or on the entire test set, which greatly reduces cost.

4.1 Experimental Settings

Testing methods. Our proposed method serves as a generally applicable self-supervised learning strategy for various cutting-edge blind SR models, requiring no architectural modifications. We evaluate it on a diverse range of advanced SR methods, including BSRGAN [50] and Real-ESRGAN+ [45], which employ conventional CNN frameworks; SwinIR-GAN [29], which integrates Transformer structures; FeMaSR [5], which uses VQGAN; and StableSR [44], which is based on a pre-trained diffusion model. We use officially released SR models as baselines and conduct self-supervised fine-tuning on the targeted test datasets. While fine-tuning on a single image can lead to superior performance, for improved training efficiency we opt to fine-tune on the entire test dataset collectively. All experiments are conducted under this configuration unless otherwise specified.

Implementation details. We adopt the Adam [22] optimizer. For StableSR, we set the learning rate to 5e-5 and the batch size to 1. For the remaining models, a learning rate of 2e-6 and a batch size of 6 are used. Each model undergoes rapid fine-tuning on a single V100 GPU. The duration of training varies among models and images, typically spanning 150 to 500 iterations. More details are provided in the supplementary materials.
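
The hyperparameters stated above can be collected in a small lookup, sketched here in Python. The dictionary layout and the helper name `get_finetune_config` are hypothetical conveniences; only the optimizer, learning rates, batch sizes, and iteration range come from the text.

```python
# Fine-tuning hyperparameters as stated in the text (Adam optimizer throughout).
FINETUNE_CONFIGS = {
    "StableSR": {"optimizer": "Adam", "lr": 5e-5, "batch_size": 1},
    "default": {"optimizer": "Adam", "lr": 2e-6, "batch_size": 6},
}

# Typical number of fine-tuning iterations reported in the text.
ITERATION_RANGE = (150, 500)

def get_finetune_config(model_name):
    """Return a copy of the settings for a model, falling back to the shared default."""
    return dict(FINETUNE_CONFIGS.get(model_name, FINETUNE_CONFIGS["default"]))

print(get_finetune_config("StableSR"))
print(get_finetune_config("BSRGAN"))
```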

Training datasets. Our self-supervised fine-tuning approach is applied directly to the test set, without the need for a separate training set. The only prerequisite training is that of the LR reconstruction network, which is trained using 6,000 real paired images collected in-house. It is critical to note that these data were never seen by the SR network.

Testing datasets. Our method is evaluated on real-world paired datasets, including RealSR [4] and DRealSR [47]. These datasets are meticulously curated from diverse device sensors to reflect various degradation characteristics. To ensure a fair comparison with other methods, we follow the standard setting of cropping each image into multiple patches for 4× SR. The LR image patch size is 128 × 128, while the corresponding HR size is 512 × 512.
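
The patch geometry described above can be sketched as follows. The text gives only the patch sizes and the scale factor; the non-overlapping grid and the function name `paired_patches` are assumptions made for this illustration, and the actual cropping protocol may use a different stride.

```python
import numpy as np

def paired_patches(lr_img, hr_img, lr_patch=128, scale=4):
    """Yield aligned (LR, HR) patch pairs on a non-overlapping grid.

    A 128 x 128 LR patch at offset (top, left) corresponds to the 512 x 512 HR
    patch at offset (4 * top, 4 * left) for 4x SR.
    """
    h, w = lr_img.shape[:2]
    if hr_img.shape[0] != h * scale or hr_img.shape[1] != w * scale:
        raise ValueError("HR image must be exactly `scale` times the LR size")
    hr_patch = lr_patch * scale
    for top in range(0, h - lr_patch + 1, lr_patch):
        for left in range(0, w - lr_patch + 1, lr_patch):
            lr_crop = lr_img[top:top + lr_patch, left:left + lr_patch]
            hr_crop = hr_img[top * scale:top * scale + hr_patch,
                             left * scale:left * scale + hr_patch]
            yield lr_crop, hr_crop

lr_img = np.zeros((256, 384, 3))
hr_img = np.zeros((1024, 1536, 3))
pairs = list(paired_patches(lr_img, hr_img))
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```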

Evaluation metrics. We employ LPIPS [51], DISTS [13], and NLPD [17], metrics that closely align with human perception [18, 16]. Additionally, traditional metrics such as PSNR, SSIM [46], and MAD [23] are included, giving six metrics in total for a comprehensive assessment.
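
Of these metrics, PSNR has a simple closed form, PSNR = 10 log10(MAX² / MSE), which the sketch below implements with NumPy. The text does not state the value range or colour space used for evaluation, so the `max_value=1.0` default and the use of all channels are assumptions.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)  # uniform error of 0.1, so MSE = 0.01
print(round(psnr(reference, estimate), 2))
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.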

4.2 Improvements on Existing Methods

The results in Table 1 demonstrate our method’s effectiveness in advancing SR quality. Improvements are consistently observed across all models, datasets, and metrics, underscoring the broad applicability of our approach. For CNN-based models like Real-ESRGAN+, our method achieves a notable enhancement on the Nikon dataset, delivering a 1.77 dB improvement in PSNR and a 0.0388 increase in SSIM. These improvements contribute to more precise reconstruction of high-quality images. Enhanced perceptual quality is also evident from an LPIPS reduction of 0.0532. When applied to Transformer models such as SwinIR-GAN, our method likewise shows considerable improvements. On the Olympus dataset, we observe a 0.63 dB increase in PSNR and a decrease in MAD of 5.69, highlighting the framework’s capacity to enhance fidelity and sharpness.

As depicted in Figure 5, in the first example, all SR models fail to preserve the original textures present in the input images, resulting in excessively smoothed fabric patterns. After applying our self-supervised fine-tuning method, however, significant improvements are observed across all approaches, which then reconstruct clear fabric textures. A similar improvement is evident in the second example of oil paintings. The existing SR models struggle to capture the intricate details of the paintings, whereas our method effectively restores the artistic effects, with a particularly notable enhancement for StableSR. The other examples show similar results: our method significantly improves high-frequency detail recovery, yielding outputs that are both sharp and rich in detail.

Figure 5: Qualitative comparisons on real-world datasets. The content within the blue box represents a zoomed-in image.
Figure 6: Qualitative comparisons on two old films.
Figure 7: Supervised fine-tuning of a baseline model on one real dataset does not transfer well to another due to dataset gaps. A model fine-tuned with our self-supervised method on the specific test images achieves superior performance.

4.3 Application on Real-world Scenes

Old films often exhibit issues like graininess, color fading, and lower resolution, making them an ideal testbed for evaluating the practical capabilities of SR models. To conduct a comprehensive comparison, we curate a selection of state-of-the-art real-world SR models. These encompass various methodologies: ZSSR [40], a self-supervised learning model; DASR [30], a degradation-adaptive approach; large diffusion models such as LDM [39], DiffBIR [32], and StableSR [44]; DARSR [56], which leverages unsupervised techniques for enhanced model performance; and CAL_GAN [38], a photo-realistic SR model. We employ StableSR as the base model and implement the proposed self-supervised learning strategy. The first case in Figure 6 involves a 480p low-resolution film, namely “My Fair Lady”. Among the assessed models, ZSSR, DASR, and DARSR exhibit minimal improvements, while DiffBIR introduces unpleasing artifacts. Other models achieve slightly smoother results. Notably, our model not only accurately reproduces the hat with clear fabric textures but also effectively restores facial features, including wrinkles and contours. In contrast to some methods that may introduce unnatural effects or overly smooth distortions, our model adeptly balances the restoration of fine textures with preserving overall image clarity.

User study. We conducted a user study with the participation of 24 experienced researchers. Each participant was tasked with assigning a visual perceptual quality score ranging from 0 to 10 to every image. The results, depicted in Figure 8, reveal a significant lead of our proposed method over alternative approaches, surpassing the second-best method by more than 2 points. Notably, the scores for DASR, DiffBIR, and DARSR were even lower than those for the LR images, indicating the limited effectiveness of these methods in handling real-world images.

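| Training Type | Number of Sensors | Number of Images | LPIPS ↓ | DISTS ↓ |
|---|---|---|---|---|
| - (baseline) | - | - | 0.2302 | 0.2102 |
| Synthetic Data | - | 2K | 0.1836 | 0.1885 |
| Synthetic Data | - | 6K | 0.1816 | 0.1873 |
| Real-world Data | 1 | 0.6K | 0.2003 | 0.1970 |
| Real-world Data | 2 | 2K | 0.1785 | 0.1793 |
| Real-world Data | 2 | 4K | 0.1722 | 0.1772 |
| Real-world Data | 3 | 6K | 0.1800 | 0.1830 |

Table 3: Ablation on training data of LR reconstruction.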

| Metric | baseline | 128 | 256 | 512 | 1024 | 2048 | 4096 |
|---|---|---|---|---|---|---|---|
| PSNR ↑ | 28.13 | 28.92 | 28.54 | 28.85 | 29.10 | 29.56 | 29.20 |
| LPIPS ↓ | 0.2302 | 0.1804 | 0.1776 | 0.1722 | 0.1736 | 0.1629 | 0.1669 |
| DISTS ↓ | 0.2192 | 0.1792 | 0.1818 | 0.1772 | 0.1749 | 0.1630 | 0.1656 |

Table 4: Ablation on dimensions of degradation embedding.
| Method | LPIPS ↓ | DISTS ↓ |
|---|---|---|
| baseline | 0.2302 | 0.2102 |
| baseline + real data | 0.2268 | 0.1989 |
| LWay (ours) | 0.1722 | 0.1772 |

Table 5: Our method versus supervised real data fine-tuning.

| HF Loss | LPIPS ↓ | DISTS ↓ |
|---|---|---|
| ✗ | 0.1858 | 0.1879 |
| ✓ | 0.1722 | 0.1772 |

Table 6: Ablation study on high-frequency (HF) loss.

4.4 Ablation Study

We conducted an ablation study on the RealSR Nikon test set using BSRGAN. We trained 65% of the model parameters to achieve the lowest LPIPS score on this test set.

Training data of LR reconstruction. In this section, we demonstrate that the LR reconstruction network, which forms the cornerstone of our design, is robust even when trained with limited data. As shown in Table 3, we consider two types of training data: synthetic data created using BSRGAN degradation, and real paired images collected for training. Both settings improve performance. Specifically, synthetic data brings a 0.0486 improvement in LPIPS; using only 600 real images brings a 0.0299 improvement, and 4,000 real images improve LPIPS by 0.058. Adding more images beyond this point did not yield further gains. We attribute this to the HR-to-LR mapping being inherently easier to learn than the reverse LR-to-HR mapping, which reduces the need for extensive training data. This is further supported by Figure 9, where the t-SNE visualization clearly separates different degradations, even for unseen degradation types.
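The LPIPS gains quoted in this paragraph follow from Table 3 by subtracting each row from the baseline. The short Python check below reproduces them; the dictionary keys are illustrative labels, and the numbers are copied from the table.

```python
# LPIPS values copied from Table 3 (lower is better).
baseline = 0.2302
lpips = {
    "synthetic, 6K images": 0.1816,
    "real, 0.6K images": 0.2003,
    "real, 4K images": 0.1722,
    "real, 6K images": 0.1800,
}

# Improvement = baseline LPIPS minus the LPIPS of each setting, rounded to 4 decimals.
gains = {name: round(baseline - value, 4) for name, value in lpips.items()}
```

The 6K real-image setting gains less than the 4K setting, which is the saturation the paragraph refers to.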

Degradation embedding dimensions. Table 4 compares different embedding dimensions and shows that all variants significantly enhance performance over the baseline. While a dimension of 512 (the default) is effective, a higher one (2048) can further improve results.

Our method versus supervised fine-tuning. To further illustrate the efficacy of our method, we additionally fine-tune the baseline model in a supervised manner using the gathered real paired data. As shown in Table 5, this yields only marginal improvements, which aligns with our contention that the LR-to-HR mapping is inherently difficult. Training with data from one sensor type shows negligible benefits for another, suggesting a significant gap in degradation patterns. This is further corroborated by Figure 7, where the supervised fine-tuned model generates over-smoothed outputs. Conversely, our method is robust and substantially enhances the final SR quality, indicating that the proposed training strategy is more effective.

Figure 8: User study on the visual perceptual quality of results from different methods on real images.
Figure 9: t-SNE visualization of embeddings from LR degradation encoder.
Figure 10: The performance curve for fine-tuning different percentages of parameters.

Number of images used in fine-tuning. We employ self-supervised LR reconstruction fine-tuning on test images to optimize the SR model. This section investigates the impact of the number of fine-tuning images on the final performance. As indicated in Table 2, we establish a baseline without fine-tuning using ten real-world images. Conducting single-shot fine-tuning on individual images yields the most favorable results, allowing models to best adapt to the distribution of the input images. Next, we conduct experiments involving collective fine-tuning on ten images. Results show significant improvements compared to the baseline but are not as good as fine-tuning on individual images separately. Furthermore, we extend our study by fine-tuning the model using an additional forty images to investigate whether acquiring more images from the same sensor would refine the model further. Our findings indicate that compared to training on ten images, there is a decline in LPIPS, while DISTS and MAD exhibit slight improvements. This suggests a trade-off between fine-tuning performance and efficiency.

High-frequency loss. Table 6 illustrates the impact of the introduced high-frequency loss. The integration of the high-frequency loss results in a notable improvement, affirming the efficacy of our design. Importantly, it enhances high-frequency recovery and avoids the negative impact of our training method on low-frequency areas.
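To make the idea of a loss restricted to high-frequency content concrete, the following Python sketch applies a one-level 2D Haar wavelet transform and compares only the three detail sub-bands. This is an illustration under stated assumptions, not the exact loss used in the experiments: the Haar wavelet, the single decomposition level, the L1 distance, and the names `haar_dwt2` and `high_frequency_l1` are all choices made here for the example.

```python
def haar_dwt2(image):
    """One-level 2D Haar DWT of a 2D list with even height and width.

    Returns (LL, LH, HL, HH); each sub-band has half the input height and width.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # local average (low frequency)
            lh_row.append((a + b - d - e) / 2.0)  # difference between the two rows
            hl_row.append((a - b + d - e) / 2.0)  # difference between the two columns
            hh_row.append((a - b - d + e) / 2.0)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def high_frequency_l1(image_a, image_b):
    """Mean absolute difference over the three detail sub-bands (LH, HL, HH)."""
    total, count = 0.0, 0
    for band_a, band_b in zip(haar_dwt2(image_a)[1:], haar_dwt2(image_b)[1:]):
        for row_a, row_b in zip(band_a, band_b):
            for x, y in zip(row_a, row_b):
                total += abs(x - y)
                count += 1
    return total / count
```

Because the LL sub-band is excluded, a constant brightness offset between the two images contributes nothing to this loss, which is one way such a term can leave low-frequency areas untouched.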

Fine-tuning parameters. In our exploration of parameter fine-tuning, we observe that increasing the number of trained parameters results in higher PSNR values. However, the LPIPS score reaches its optimum at approximately 60% to 70% of the parameters, as depicted in Figure 10. Considering the limitations of PSNR, we use LPIPS as our primary reference. Note that different networks and test sets may lead to different conclusions. The supplementary materials provide more details.

5 Conclusion

In conclusion, our proposed super-resolution training strategy, termed “Low-Res Leads the Way”, represents an innovative approach that successfully bridges the disparity between synthetic data supervised training and real-world test image self-supervision. Demonstrating impressive performance and robustness across various SR frameworks and real-world benchmarks, our method marks a significant advancement toward achieving effective real-world applications.

References

  • Ahn et al. [2018] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European conference on computer vision (ECCV), pages 252–268, 2018.
  • Bell-Kligler et al. [2019] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-gan. Advances in Neural Information Processing Systems, 32, 2019.
  • Bulat et al. [2018] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To learn image super-resolution, use a gan to learn how to do image degradation first. In Proceedings of the European conference on computer vision (ECCV), pages 185–200, 2018.
  • Cai et al. [2019] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3086–3095, 2019.
  • Chen et al. [2022] Chaofeng Chen, Xinyu Shi, Yipeng Qin, Xiaoming Li, Xiaoguang Han, Tao Yang, and Shihui Guo. Real-world blind super-resolution via feature matching with implicit high-resolution priors. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1329–1338, 2022.
  • Chen et al. [2021] Haoyu Chen, Jinjin Gu, and Zhi Zhang. Attention in attention network for image super-resolution. arXiv preprint arXiv:2104.09497, 2021.
  • Chen et al. [2023a] Haoyu Chen, Jinjin Gu, Yihao Liu, Salma Abdel Magid, Chao Dong, Qiong Wang, Hanspeter Pfister, and Lei Zhu. Masked image training for generalizable deep image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1692–1703, 2023a.
  • Chen et al. [2023b] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In CVPR, pages 22367–22377, 2023b.
  • Chen et al. [2023c] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, and Fisher Yu. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12312–12321, 2023c.
  • Chen et al. [2024] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, and Xiaokang Yang. Recursive generalization transformer for image super-resolution. In International Conference on Learning Representations (ICLR), 2024.
  • Cheng et al. [2020] Xi Cheng, Zhenyong Fu, and Jian Yang. Zero-shot image super-resolution with depth guided internal degradation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 265–280. Springer, 2020.
  • Dai et al. [2019] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11065–11074, 2019.
  • Ding et al. [2020] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE transactions on pattern analysis and machine intelligence, 44(5):2567–2581, 2020.
  • Dong et al. [2015] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015.
  • Fritsche et al. [2019] Manuel Fritsche, Shuhang Gu, and Radu Timofte. Frequency separation for real-world super-resolution. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3599–3608. IEEE, 2019.
  • Gu et al. [2020] Jinjin Gu, Haoming Cai, Haoyu Chen, Xiaoxing Ye, Jimmy Ren, and Chao Dong. Image quality assessment for perceptual image restoration: A new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002, 2020.
  • Hepburn et al. [2019] Alexander Hepburn, Valero Laparra, Ryan McConville, and Raul Santos-Rodriguez. Enforcing perceptual consistency on generative adversarial networks by using the normalised laplacian pyramid distance. arXiv preprint arXiv:1908.04347, 2019.
  • Jinjin et al. [2020] Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S Ren, and Dong Chao. Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 633–651. Springer, 2020.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • Kim et al. [2016a] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016a.
  • Kim et al. [2016b] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016b.
  • Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  • Larson and Chandler [2010] Eric C Larson and Damon M Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of electronic imaging, 19(1):011006–011006, 2010.
  • Ledig et al. [2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • Li et al. [2020] Wenbo Li, Kun Zhou, Lu Qi, Nianjuan Jiang, Jiangbo Lu, and Jiaya Jia. Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. NeurIPS, 33:20343–20355, 2020.
  • Li et al. [2022a] Wenbo Li, Kun Zhou, Lu Qi, Liying Lu, and Jiangbo Lu. Best-buddy gans for highly detailed image super-resolution. In AAAI, pages 1412–1420, 2022a.
  • Li et al. [2023] Wenbo Li, Xin Lu, Shengju Qian, and Jiangbo Lu. On efficient transformer-based image pre-training for low-level vision. In IJCAI-23, pages 1089–1097. International Joint Conferences on Artificial Intelligence Organization, 2023. Main Track.
  • Li et al. [2022b] Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Jinjin Gu, Yu Qiao, and Chao Dong. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–843, 2022b.
  • Liang et al. [2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.
  • Liang et al. [2022] Jie Liang, Hui Zeng, and Lei Zhang. Efficient and degradation-adaptive network for real-world image super-resolution. In European Conference on Computer Vision, pages 574–591. Springer, 2022.
  • Lim et al. [2017] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
  • Lin et al. [2023] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Ben Fei, Bo Dai, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
  • Liu et al. [2022] Anran Liu, Yihao Liu, Jinjin Gu, Yu Qiao, and Chao Dong. Blind image super-resolution: A survey and beyond. IEEE transactions on pattern analysis and machine intelligence, 45(5):5461–5480, 2022.
  • Liu et al. [2020] Jie Liu, Wenjie Zhang, Yuting Tang, Jie Tang, and Gangshan Wu. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2359–2368, 2020.
  • Liu et al. [2023] Yihao Liu, Jingwen He, Jinjin Gu, Xiangtao Kong, Yu Qiao, and Chao Dong. Degae: A new pretraining paradigm for low-level vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23292–23303, 2023.
  • Maeda [2020] Shunta Maeda. Unpaired image super-resolution using pseudo-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 291–300, 2020.
  • Neshatavar et al. [2023] Reyhaneh Neshatavar, Mohsen Yavartanoo, Sanghyun Son, and Kyoung Mu Lee. Icf-srsr: Invertible scale-conditional function for self-supervised real-world single image super-resolution. arXiv preprint arXiv:2307.12751, 2023.
  • Park et al. [2023] JoonKyu Park, Sanghyun Son, and Kyoung Mu Lee. Content-aware local gan for photo-realistic super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10585–10594, 2023.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Shocher et al. [2018] Assaf Shocher, Nadav Cohen, and Michal Irani. “zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3118–3126, 2018.
  • Soh et al. [2020] Jae Woong Soh, Sunwoo Cho, and Nam Ik Cho. Meta-transfer learning for zero-shot super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3516–3525, 2020.
  • Sun et al. [2023] Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, and Yujiu Yang. Coser: Bridging image and language for cognitive super-resolution. arXiv preprint arXiv:2311.16512, 2023.
  • Ulyanov et al. [2018] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018.
  • Wang et al. [2023] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023.
  • Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wei et al. [2020] Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixiang Ye, Wangmeng Zuo, and Liang Lin. Component divide-and-conquer for real-world image super-resolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pages 101–117. Springer, 2020.
  • Wei et al. [2021] Yunxuan Wei, Shuhang Gu, Yawei Li, Radu Timofte, Longcun Jin, and Hengjie Song. Unsupervised real-world image super resolution via domain-distance aware training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13385–13394, 2021.
  • Yuan et al. [2018] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang, Chao Dong, and Liang Lin. Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 701–710, 2018.
  • Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021.
  • Zhang et al. [2018a] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018a.
  • Zhang et al. [2023] Ruofan Zhang, Jinjin Gu, Haoyu Chen, Chao Dong, Yulun Zhang, and Wenming Yang. Crafting training degradation distribution for the accuracy-generalization trade-off in real-world super-resolution. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org, 2023.
  • Zhang et al. [2018b] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018b.
  • Zhang et al. [2018c] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018c.
  • Zhang et al. [2019] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.
  • Zhou et al. [2023] Hongyang Zhou, Xiaobin Zhu, Jianqing Zhu, Zheng Han, Shi-Xue Zhang, Jingyan Qin, and Xu-Cheng Yin. Learning correction filter via degradation-adaptive regression for blind single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12365–12375, 2023.
  • Zhou et al. [2022] Lin Zhou, Haoming Cai, Jinjin Gu, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Yu Qiao, and Chao Dong. Efficient image super-resolution using vast-receptive-field attention. In European Conference on Computer Vision, pages 256–272. Springer, 2022.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • Zou et al. [2022] Wenbin Zou, Tian Ye, Weixin Zheng, Yunchen Zhang, Liang Chen, and Yi Wu. Self-calibrated efficient transformer for lightweight super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 930–939, 2022.

Supplementary Material

A Experimental Details

A.1 Fine-tuning Details

Because the models differ in network architecture, we train a different subset of parameters for each of them. The rationale behind the parameter selection is corroborated by empirical experiments detailed further in the text.

  • For the training phases specific to BSRGAN, Real-ESRGAN+, and SwinIR-GAN, selective freezing of initial layers is implemented to concentrate training on the deeper parameters.

  • In the case of FeMaSR, which is based on the VQGAN (Vector Quantized Generative Adversarial Network) structure, the focus is placed on the parameters of the VQGAN encoder.

  • StableSR, which utilizes a pre-trained diffusion model, applies a controllable feature wrapping (CFW) module with an adjustable coefficient to refine the outputs of the diffusion model during the decoding process of the autoencoder. We choose to fine-tune the CFW module and part of the encoder.
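The first strategy in the list, freezing initial layers so that training concentrates on deeper parameters, can be sketched as a selection over an ordered list of layers. The Python function below is a hypothetical helper, not the released code: `split_frozen_trainable`, its arguments, and the rule of unfreezing whole layers from the output side until a target fraction is reached are assumptions made for illustration.

```python
def split_frozen_trainable(layer_sizes, trainable_fraction):
    """Freeze the earliest layers so that at least `trainable_fraction` of parameters stay trainable.

    layer_sizes: list of (layer_name, parameter_count), ordered from input to output.
    Returns (frozen_names, trainable_names, achieved_fraction).
    """
    if not 0.0 <= trainable_fraction <= 1.0:
        raise ValueError("trainable_fraction must lie in [0, 1]")
    total = sum(count for _, count in layer_sizes)
    if total == 0:
        raise ValueError("layer_sizes must contain at least one parameter")
    budget = trainable_fraction * total
    trainable, used = [], 0
    # Walk from the deepest layer backwards, unfreezing layers until the budget is met.
    for name, count in reversed(layer_sizes):
        if used >= budget:
            break
        trainable.append(name)
        used += count
    trainable.reverse()
    frozen = [name for name, _ in layer_sizes[: len(layer_sizes) - len(trainable)]]
    return frozen, trainable, used / total
```

Since whole layers are unfrozen, the achieved fraction can exceed the target when layers are large. In a framework such as PyTorch, the frozen names would typically be applied by setting `requires_grad = False` on the corresponding parameters.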

It usually takes 150 to 500 iterations to train. The time depends on the baseline network size, ranging from seconds to a few minutes. Our method can be fine-tuned either on individual images or on the entire test set assuming consistent degradation across the test set, which greatly reduces computational cost. Table 7 shows that our method takes only 8 minutes to fine-tune on the whole test set, much faster than others. Individual fine-tuning can improve the results if needed.

A.2 Testing Datasets

The validation of the effectiveness of our training method in real-world scenarios is conducted using real-world paired datasets, RealSR [4], and DRealSR [47]. These datasets are meticulously curated from various sensors to reflect different degradation characteristics inherent in each device. Furthermore, the datasets are segmented based on the capturing equipment. For RealSR, a 2× scale factor is employed, with separate subsets for Canon and Nikon. In the case of DRealSR, a 4× scale is applied across three subsets corresponding to Sony, Panasonic, and Olympus. To ensure a fair comparison with other models, we follow common settings employed by most methods. Each image is segmented into multiple smaller patches for performing 4× super-resolution, with the patch size for LR images being 128×128 and for HR images being 512×512.

| Method | LPIPS ↓ | DISTS ↓ | PSNR ↑ | Fine-tuning time ↓ |
|---|---|---|---|---|
| Ours (d = 2048) | 0.1629 | 0.1630 | 29.56 | 8 min |
| ZSSR | 0.2424 | 0.1889 | 29.14 | 19 hr |
| KernelGAN+ZSSR | 0.3315 | 0.2774 | 23.52 | 72 hr |
| deep plug-and-play | 0.2604 | 0.2524 | 29.44 | 1.4 hr |
| deep image prior | 0.2091 | 0.2054 | 29.32 | 28 hr |

Table 7: Our method can be fine-tuned on the entire test set assuming consistent degradation across the test set, which greatly reduces computational cost. Other methods need to be trained on each individual image.

A.3 Evaluation Metrics

Following prior research [18, 16], our study adopts a carefully curated set of perceptual metrics, ones that have shown a higher correlation with human perception, including Learned Perceptual Image Patch Similarity (LPIPS) [51], Deep Image Structure and Texture Similarity (DISTS) [13], and Normalized Laplacian Pyramid Distance (NLPD) [17]. LPIPS and DISTS have been empirically validated in [18, 16] as more closely aligned with human visual assessment than other metrics. We also include traditional metrics, such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [46], and most apparent distortion (MAD) [23].

Previous studies [18, 16] investigated the correlation between human visual perception of image quality and various Image Quality Assessment (IQA) metrics, with the findings summarized in Table 8. The results reveal that MAD, LPIPS, and DISTS outperform the traditional PSNR and SSIM across various aspects of super-resolution evaluation. Specifically, MAD is the most accurate in assessing traditional SR methods, whereas LPIPS and DISTS are more accurate when evaluating GAN-based SR methods. In the overall comparison, DISTS emerges as the most effective metric for super-resolution assessment. These findings underscore the limitations of relying solely on conventional metrics such as PSNR and SSIM, and the importance of incorporating newer metrics like MAD, LPIPS, and DISTS for a more comprehensive and accurate evaluation of super-resolution techniques.

| Method | SR Full | Traditional SR | PSNR-oriented SR | GAN-based SR |
|---|---|---|---|---|
| PSNR | 0.4099 | 0.4782 | 0.5462 | 0.2839 |
| SSIM | 0.5209 | 0.5856 | 0.6897 | 0.3388 |
| MAD | 0.5424 | 0.6720 | 0.7575 | 0.3494 |
| LPIPS | 0.5614 | 0.5487 | 0.6782 | 0.4882 |
| DISTS | 0.6544 | 0.6685 | 0.7733 | 0.5527 |

Table 8: The Spearman rank correlation coefficient (SRCC) between MOS (Mean Opinion Score) and various IQA (Image Quality Assessment) metrics across different distortion sub-types.
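For readers unfamiliar with SRCC, the following Python sketch computes it as the Pearson correlation of rank vectors, with tied values receiving the mean of their ranks. The function names `average_ranks` and `spearman_srcc` are illustrative; this is a generic implementation and not the evaluation code behind Table 8.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

The function returns a signed value, so a lower-is-better metric that tracks MOS well would yield a strongly negative coefficient; only the rank order matters, not the scale of either quantity.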

B LR Reconstruction Network

B.1 Degradation Encoder

Following the methodology proposed by Liu et al. [35], our degradation encoder is constructed by integrating a pre-trained SR-GAN model [24] and downsampling layers. This collaborative framework aims to produce degradation embeddings, denoted as $e$, with a dimensionality of 512. The choice of a relatively small dimension for $e$ ensures that the degradation embeddings do not encapsulate intrinsic image information but are sufficiently representative of content pertaining specifically to the degradation process. This design principle is crucial in isolating and preserving only the features relevant to degradation, avoiding contamination with the original image characteristics.

B.2 Reconstructor

In our methodology, we incorporate a modulation-demodulation-convolution strategy reminiscent of Instance Normalization as employed in StyleGAN2 [19]. This approach effectively utilizes the degradation embedding e𝑒eitalic_e to facilitate LR reconstruction when combined with the SR network’s output ISRsubscript𝐼𝑆𝑅I_{SR}italic_I start_POSTSUBSCRIPT italic_S italic_R end_POSTSUBSCRIPT. To delve deeper into the specifics of this strategy, during modulation, a style is learned from the provided degradation embedding e𝑒eitalic_e. The modulation operation scales each input feature map of the convolution using the acquired style, as denoted by the equation

wijk=siwijk,subscript𝑤𝑖𝑗𝑘subscript𝑠𝑖subscript𝑤𝑖𝑗𝑘w_{ijk}=s_{i}\cdot w_{ijk},italic_w start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_w start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ,

the variables w𝑤witalic_w and wsuperscript𝑤w^{\prime}italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT represent the original and modulated weights, respectively. The scale factor, denoted as sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, corresponds to the i𝑖iitalic_ith input feature map. The indices j𝑗jitalic_j and k𝑘kitalic_k are used to iterate over the output feature maps and spatial footprint of the convolution, respectively. This modulation process ensures that the convolutional features are adaptively adjusted based on the characteristics embedded in the degradation embedding. Following modulation, a demodulation step is executed to obtain the demodulated convolution weights, represented as

$$w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^{2} + \epsilon}}.$$

Demodulation restores the outputs to unit standard deviation, providing stability and normalizing the feature representations. Together, modulation, demodulation, and convolution inject degradation-specific information into the LR reconstruction process: because the convolutional features adapt to the learned style, the network can effectively reconstruct LR inputs, which benefits the overall SR framework.
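The two weight updates above can be illustrated on toy data. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `modulate` and `demodulate`, the toy weights, and the style values `s` are assumptions made for illustration (in the paper the style is learned from the degradation embedding). Weights are indexed as `w[i][j][k]`, with `i` the input feature map, `j` the output feature map, and `k` the spatial tap.

```python
import math

def modulate(w, s):
    """Scale weights by the per-input-map style: w'[i][j][k] = s[i] * w[i][j][k]."""
    return [[[s_i * v for v in taps] for taps in w_i] for w_i, s_i in zip(w, s)]

def demodulate(w_mod, eps=1e-8):
    """Normalize per output map j:
    w''[i][j][k] = w'[i][j][k] / sqrt(sum over i,k of w'[i][j][k]^2 + eps)."""
    n_out = len(w_mod[0])
    norms = [
        math.sqrt(sum(v * v for w_i in w_mod for v in w_i[j]) + eps)
        for j in range(n_out)
    ]
    return [[[v / norms[j] for v in taps] for j, taps in enumerate(w_i)]
            for w_i in w_mod]

# Toy weights: 2 input maps (i), 2 output maps (j), 3 spatial taps (k).
w = [
    [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]],
    [[1.5, 0.5, -2.0], [-1.0, 2.0, 0.25]],
]
s = [2.0, 0.5]  # hypothetical style values; in the paper they are learned from e

w_demod = demodulate(modulate(w, s))

# After demodulation, the squared weights feeding each output map sum to ~1.
for j in range(2):
    energy = sum(v * v for w_i in w_demod for v in w_i[j])
    print(f"output map {j}: sum of squared weights = {energy:.6f}")
```

The check at the end reflects the reasoning from StyleGAN2: if the input activations are assumed to be i.i.d. with unit standard deviation, the output standard deviation equals the L2 norm of the weights feeding that output map, so dividing by this norm brings the outputs back to unit standard deviation.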

Figure 11: Modulation method used in our LR reconstructor.

C More Experiment Results

C.1 The Effect of Fine-tuning Parameters for Different Network Architectures

We investigate how the proportion of fine-tuned parameters affects the performance of the FeMaSR and SwinIR networks; the results are shown in Figure 12 and Figure 13. For FeMaSR, the best PSNR is achieved when 86% of the parameters are fine-tuned, while the best LPIPS is obtained at 100%. In contrast, SwinIR attains its best PSNR and LPIPS almost simultaneously, at 100% of the parameters.

Figure 12: The performance curve for fine-tuning different percentages of parameters for FeMaSR.
Figure 13: The performance curve for fine-tuning different percentages of parameters for SwinIR.

C.2 Ablation on Model Size

Table 9 reports the performance of our method across a range of model sizes, showing that it remains effective regardless of model capacity. From the large model with 12.9 million parameters down to a compact version with only 495 thousand parameters, our approach consistently outperforms the baseline.

| Method | Params | FLOPs | PSNR ↑ | MAD ↓ | LPIPS ↓ | DISTS ↓ |
|---|---|---|---|---|---|---|
| baseline | - | - | 28.13 | 118.48 | 0.2302 | 0.2102 |
| + LWay (Large) | 12.9 M | 589.4 G | 28.85 | 104.71 | 0.1722 | 0.1772 |
| + LWay (Medium) | 5.38 M | 117.6 G | 29.50 | 99.76 | 0.1798 | 0.1810 |
| + LWay (Small) | 2.77 M | 44.49 G | 28.69 | 106.42 | 0.1837 | 0.1862 |
| + LWay (Tiny) | 495 K | 19.38 G | 28.70 | 104.92 | 0.1808 | 0.1842 |
Table 9: Performance for different model sizes. LWay is not contingent on the parameter count of the LR reconstruction network, remaining effective even with a small number of parameters.
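The claim that every model size outperforms the baseline can be checked directly against the numbers in Table 9. The short script below is only an illustrative sketch: the values are transcribed from the table, while the helper names `beats_baseline` and `summarize` are hypothetical and not part of the paper.

```python
# Values transcribed from Table 9, in the order (PSNR, MAD, LPIPS, DISTS).
baseline = (28.13, 118.48, 0.2302, 0.2102)
variants = {
    "Large": (28.85, 104.71, 0.1722, 0.1772),
    "Medium": (29.50, 99.76, 0.1798, 0.1810),
    "Small": (28.69, 106.42, 0.1837, 0.1862),
    "Tiny": (28.70, 104.92, 0.1808, 0.1842),
}

def beats_baseline(row, base):
    """PSNR is higher-is-better; MAD, LPIPS, and DISTS are lower-is-better."""
    return row[0] > base[0] and all(r < b for r, b in zip(row[1:], base[1:]))

def summarize(row, base):
    """Return (PSNR gain in dB, relative LPIPS reduction in percent)."""
    psnr_gain = row[0] - base[0]
    lpips_drop = 100.0 * (base[2] - row[2]) / base[2]
    return psnr_gain, lpips_drop

for name, row in variants.items():
    gain, drop = summarize(row, baseline)
    print(f"{name:6s} beats baseline: {beats_baseline(row, baseline)}, "
          f"PSNR +{gain:.2f} dB, LPIPS -{drop:.1f}%")
```

With the table values, all four variants improve on the baseline in every metric; the Tiny variant, for example, gains 0.57 dB in PSNR despite having only 495 K parameters.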

C.3 Performance on Different Degradation

Table 10 demonstrates the robustness of our LWay method across various types of image degradation. PSNR improves and LPIPS decreases under real-world degradation as well as under synthetic distortions such as blurring and JPEG compression, with PSNR gains ranging from 0.25 dB (JPEG, $q = 15$) to 1.52 dB (blur 11×11). This indicates that our method maintains image integrity across different degradation scenarios.

| Method | Real-world degradation (PSNR / LPIPS) | Synthetic, blur 17×17 (PSNR / LPIPS) | Synthetic, blur 11×11 (PSNR / LPIPS) | Synthetic, JPEG $q = 15$ (PSNR / LPIPS) |
|---|---|---|---|---|
| baseline | 28.13 / 0.2302 | 27.55 / 0.4065 | 27.5 / 0.3922 | 26.60 / 0.4240 |
| + LWay ($d = 2048$) | 29.56 / 0.1629 | 28.39 / 0.2755 | 29.02 / 0.2265 | 26.85 / 0.3122 |
Table 10: Performance on different degradations. LWay improves image quality under a range of degradations.

D More Visual Results

D.1 LR Reconstruction Visualization

Figure 14 shows the visual results of the LR reconstruction network, including the HR, LR, and reconstructed LR images. The network reconstructs LR images that closely approximate the ground-truth LR by extracting a 512-dimensional degradation embedding solely from the LR input and integrating it with the HR image. The LR reconstruction network is also notably robust: because the transition from HR to LR is generally considered easier than the reverse process, it performs well even when trained on a limited dataset, and it generalizes to diverse LR inputs.

Figure 14: Visual results of LR reconstruction. Given an input LR image and an HR image, the degradation encoder produces a 512-dimensional degradation embedding $e$, and the reconstructor uses $e$ and the HR image to reconstruct the estimated LR image.

D.2 More Visual Comparison

Figure 15 and Figure 16 provide additional comparisons between our method and other state-of-the-art (SOTA) approaches. Our method effectively restores the texture and fine details of images. In contrast, DASR, DiffBIR, and StableSR tend to produce smoother results at the expense of texture details. ZSSR exhibits limited restoration capability, yielding less clear results that are less faithful to the LR input. The results of LDM show texture details that are inconsistent with the ground truth. DARSR is prone to failure and introduces significant color bias, and both DARSR and CAL_GAN exhibit varying degrees of artifacts. These comparisons underscore the ability of our method to preserve intricate details and textures during super-resolution, whereas the other methods sacrifice fine details for smoothness, introduce artifacts, or misrepresent textures.

Figure 15: More visual comparisons.
Figure 16: More visual comparisons.

E Discussion

Differences from optimization-based methods. Optimization-based methods rely on pre-defined degradation models or downsampling operators and therefore have limited capability in handling complex degradation. They are also time-consuming and difficult to apply to large datasets. In contrast, our approach uses a more general and robust degradation model. Moreover, our method combines the benefits of supervised and unsupervised training, outperforming optimization-based methods that use only the test images.

Differences from KernelGAN. KernelGAN's discriminator makes only binary judgments (0/1), whereas LWay uses pixel-level regression to better capture the distribution. Moreover, KernelGAN's local kernels carry limited information and lack robustness in real-world scenarios, whereas our embedding draws on richer external priors rather than learning solely from the test images, and its robustness is demonstrated by validation.

F Limitation

The proposed architecture excels in extracting and restoring information from low-resolution (LR) images, especially when they contain discernible texture details. It is within these conditions that our method showcases its maximum effectiveness. However, a limitation might arise when the LR images themselves lack texture details, impeding the model’s capability to execute effective restoration.