Introduction

Regenerative therapies (e.g. [1, 2]) are emerging as treatments for blinding retinal diseases such as Age-Related Macular Degeneration [3]. Their efficacy, however, will depend on precise injection into the sub-retinal and intra-retinal space. High-resolution cross-sectional images (B-scans) of the retina are required so that the retinal layers of interest can be visualised with quality adequate for injection guidance. Optical Coherence Tomography (OCT) captures such cross-sectional retinal images.

Intra-operative OCT (iOCT), acquired through recently introduced modified biomicroscopy systems such as the Zeiss OPMI/Lumera and Leica Proveo/Enfocus, can be delivered in real time but at the expense of image quality (low signal strength and increased speckle noise [4]) compared with pre-operative OCT. The resulting iOCT scans are often ambiguous and of limited interventional utility. While complementary research develops higher-quality iOCT systems, e.g. [5], we focus on computationally enhancing the capabilities of already deployed clinical systems.

An established approach to OCT quality enhancement is denoising. Spatially adaptive wavelets [6], Wiener filters [4], diffusion-based [7] and registration-based techniques [8] reduce speckle noise while preserving edges and image features. Unfortunately, these methods require prolonged scanning periods, suffer from alignment errors and incur high computational cost, which limits their effectiveness for real-time interventions and iOCT.

Within the deep learning domain, Generative Adversarial Networks (GANs, [9]) can achieve image quality enhancement (Footnote 1) in natural images [10,11,12]. Many of these approaches have been adapted for medical image quality enhancement [13] and cross-modality image synthesis [14]. Research has also been conducted on OCT denoising, including [15,16,17,18], but these works do not focus on intra-operatively acquired OCT images.

Despite its superior quality, pre-operative OCT is acquired under different conditions (date, patient position, device) than iOCT, implying a domain gap, in addition to deformations, that may lead to generated images with artefacts. Therefore, our paper considers iOCT information only. We propose a methodology that uses high-resolution (HR) iOCT images generated offline by registering and fusing low-resolution (LR) iOCT video frames (B-scans). The generated HR images are ranked for quality using metrics that incorporate the quality of segmented retinal layers. High-scoring HR images comprise the target domain for image-to-image translation. Several image quality metrics and a complementary qualitative survey show that our super-resolution methodology improves iOCT image quality, outperforming filter-based denoising methods and the learning-based state of the art [19].

Methods

This section presents the process of creating HR iOCT images, validating their quality and generating SR iOCT images through image-to-image translation.

Data

Our data are derived from an internal Moorfields Eye Hospital database of vitreoretinal surgery videos, including intra-operative and pre-operative OCT. We use a data-complete subset comprising 42 intra-operative retinal surgery videos acquired from 22 subjects. The data contain the surgical microscope view captured by a Zeiss OPMI LUMERA 700 with embedded LR iOCT frames (resolution of \(440\times 300\) pixels) acquired by the RESCAN 700 (see Fig. 1). These intra-operative sequences are used to generate HR iOCT images (\(\widehat{HR}\)), which form the target domain for the examined super-resolution models.

Fig. 1
figure 1

a Surgery video frame. Left: Biomicroscope view. Right: iOCT B-scans. b From top to bottom: intra-operative and pre-operative OCT images

\(\widehat{HR}\) iOCT generation

\(\widehat{HR}\) iOCT images are generated by registering iOCT video frames acquired from the same retinal position and fusing them through temporal averaging. This process is illustrated in Figs. 2 and 3.

First, for each surgery video (Fig. 1a), we identified the time intervals during which both the iOCT scan position and the retinal point positions remain constant. During such intervals, the acquired iOCT B-scans can be considered as corresponding to the same retinal location and can therefore be registered and fused to obtain an HR B-scan.

Fig. 2
figure 2

Overview of the proposed tracking methods for identifying the iOCT frames extracted from the same retinal position

Fig. 3
figure 3

Fusion of multiple registered LR iOCT images through averaging and automatic extraction of ROIs for SNR, CNR and ENL calculation

The position of the iOCT scan is obtained by detecting the white square depicted on the surgical microscope view (see Fig. 1a), which marks the iOCT scanning region. Detection starts with binary thresholding, Canny edge detection and a Hough line transform on the microscope view image. To improve the robustness of identifying the iOCT scan position, we further detected the cyan and magenta arrows inside the already detected square (see Fig. 2). Two points (one per arrow) were derived to represent the iOCT scan position.
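The detection step can be sketched as follows. This is an illustrative OpenCV implementation, not the exact code used in the paper; the threshold values, Hough parameters and HSV colour ranges are assumptions.

```python
import cv2
import numpy as np

def detect_scan_square(frame_bgr):
    """Find candidate line segments outlining the bright square overlay."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)   # bright overlay only
    edges = cv2.Canny(binary, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                            minLineLength=40, maxLineGap=5)
    return lines

def detect_arrow_points(frame_bgr, square_mask):
    """Return one representative (x, y) point per colour-coded arrow inside the square."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    cyan = cv2.inRange(hsv, (80, 100, 100), (100, 255, 255))       # placeholder HSV bounds
    magenta = cv2.inRange(hsv, (140, 100, 100), (170, 255, 255))
    points = []
    for mask in (cyan, magenta):
        mask = cv2.bitwise_and(mask, square_mask)                  # restrict to detected square
        ys, xs = np.nonzero(mask)
        if len(xs):
            points.append((int(xs.mean()), int(ys.mean())))        # centroid of the arrow pixels
    return points
```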

Due to retinal movement (patient breathing, surgical interactions), we must also verify that the retina is stationary. Therefore, we manually selected a point at the start of each video sequence corresponding to a strong feature (e.g. a vessel bifurcation) and tracked it using the Lucas-Kanade method (Footnote 2).
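A minimal sketch of the landmark tracking, assuming OpenCV's pyramidal Lucas-Kanade implementation; the window size and termination criteria are illustrative.

```python
import cv2
import numpy as np

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def track_point(prev_gray, curr_gray, point_xy):
    """Track one manually selected retinal landmark from the previous to the current frame."""
    p0 = np.array([[point_xy]], dtype=np.float32)                        # shape (1, 1, 2)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None, **lk_params)
    if status[0, 0] == 1:
        return tuple(p1[0, 0])                                           # new (x, y) position
    return None                                                          # tracking lost
```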

If the aforementioned positions remained constant for more than eight consecutive video frames (a number selected empirically), the corresponding iOCT B-scans were rigidly registered to the first B-scan and averaged to generate the corresponding \(\widehat{HR}\) iOCT frame (Fig. 3). We applied rigid registration to avoid unrealistic deformations (e.g. folding) that non-rigid registration might introduce, damaging the quality and realism of the final averaged image. The fusion process led to a total of 1966 \(\widehat{HR}\) images.
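A possible implementation of the registration-and-averaging step is sketched below, using ECC-based Euclidean (rigid) alignment as a stand-in for the rigid registration; the convergence criteria are assumptions.

```python
import cv2
import numpy as np

def fuse_bscans(bscans):
    """Rigidly align a stationary stack of float32 grayscale B-scans to the first one and average them."""
    reference = bscans[0]
    aligned = [reference]
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    for scan in bscans[1:]:
        warp = np.eye(2, 3, dtype=np.float32)                  # rotation + translation only
        try:
            _, warp = cv2.findTransformECC(reference, scan, warp,
                                           cv2.MOTION_EUCLIDEAN, criteria)
            aligned.append(cv2.warpAffine(scan, warp, reference.shape[::-1],
                                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP))
        except cv2.error:
            continue                                           # drop frames that fail to converge
    return np.mean(aligned, axis=0)                            # fused HR estimate
```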

As the videos depict actual surgical procedures, many incoming LR iOCT images have low signal strength, measured as signal-to-noise ratio (SNR) [20]. Their corresponding fused \(\widehat{HR}\) images will therefore have low SNR as well. Furthermore, imperfections in tracking the retinal points and registration errors between the LR iOCT images can lead to blurry averaged \(\widehat{HR}\) iOCT scans. These factors degrade the quality of many \(\widehat{HR}\) images and consequently weaken the estimated \(\widehat{HR}\) domain in terms of SNR and contrast.

To assess the quality of the generated images and decide which ones should be included in the \(\widehat{HR}\) dataset, we used three different metrics: SNR, Equivalent Number of Looks (ENL) and Contrast-to-Noise Ratio (CNR) [4]:

$$\begin{aligned} SNR = 10\log \left( \max \{F_{lin}^2\}/\sigma _{lin}^2\right) \end{aligned}$$
(1)
$$\begin{aligned} ENL = \frac{1}{H}\sum _{h=1}^{H}\frac{\mu _{h}^{2}}{\sigma _{h}^{2}} \end{aligned}$$
(2)
$$\begin{aligned} CNR = \frac{1}{R}\sum _{r=1}^{R}\frac{\mu _{r} - \mu _{b}}{\sqrt{\sigma _{r}^2+\sigma _{b}^2}} \end{aligned}$$
(3)

where \(F_{lin}\) is the linear-magnitude image, \(\sigma _{lin}\) the standard deviation of \(F_{lin}\) in a background noise region, and \(\mu _{b}, \mu _{h}, \mu _{r}, \sigma _{b}, \sigma _{h}, \sigma _{r}\) are the means and standard deviations of the background region (b), the homogeneous regions (h) and all regions of interest (r), respectively. In our image quality assessment, we empirically used \(H=2\) and \(R=4\) (see Fig. 3). To obtain metrics describing image quality on key anatomical landmarks, namely retinal layers, we compute retinal layer masks using a deep semantic segmentation model. Metric computation then takes place on regions of interest (ROIs) tightly cropped around retinal layers.
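Equations (1)-(3) translate directly into code. The sketch below assumes ROIs given as (y0, y1, x0, x1) index ranges on the linear-magnitude image and a base-10 logarithm in (1).

```python
import numpy as np

def snr(f_lin, background_roi):
    """Eq. (1): peak signal over background noise variance, in dB (log10 assumed)."""
    y0, y1, x0, x1 = background_roi
    sigma_lin = f_lin[y0:y1, x0:x1].std()
    return 10 * np.log10(np.max(f_lin ** 2) / sigma_lin ** 2)

def enl(f_lin, homogeneous_rois):
    """Eq. (2): mean of mu^2/sigma^2 over the H homogeneous ROIs."""
    ratios = [(f_lin[y0:y1, x0:x1].mean() ** 2) / f_lin[y0:y1, x0:x1].var()
              for (y0, y1, x0, x1) in homogeneous_rois]
    return float(np.mean(ratios))

def cnr(f_lin, rois, background_roi):
    """Eq. (3): mean contrast of the R ROIs against the background ROI."""
    y0, y1, x0, x1 = background_roi
    mu_b, sig_b = f_lin[y0:y1, x0:x1].mean(), f_lin[y0:y1, x0:x1].std()
    terms = []
    for (y0, y1, x0, x1) in rois:
        mu_r, sig_r = f_lin[y0:y1, x0:x1].mean(), f_lin[y0:y1, x0:x1].std()
        terms.append((mu_r - mu_b) / np.sqrt(sig_r ** 2 + sig_b ** 2))
    return float(np.mean(terms))
```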

Retinal layer segmentation

The segmentation model utilizes the architecture introduced in [21] and is trained with the Lovász-Softmax loss [22]. Due to the lack of large public pixel-level annotated datasets, we first pretrained the model for retinal fluid segmentation on the RETOUCH dataset (Footnote 3), which contains 3200 images (72 subjects). The model was then fine-tuned for retinal layer segmentation on the DUKE dataset (Footnote 4), which comprises 610 images (10 subjects). We qualitatively observed acceptable generalization of the segmentation model to our intra-operative OCT dataset. It is worth noting that our aim is not a perfect segmentation of retinal layers but an acceptable approximation of the background area and the pertinent retinal layers in the iOCT image, in order to extract ROIs for the calculation of SNR, CNR and ENL.
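The two-stage training can be illustrated schematically as below. This is a sketch only: torchvision's DeepLabV3 stands in for the architecture of [21], cross-entropy replaces the Lovász-Softmax loss [22] for brevity, and the class counts and data loaders are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

def train(model, loader, epochs=1, lr=1e-4):
    """Generic supervised segmentation loop; loader yields (images (B,3,H,W), masks (B,H,W) int64)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, masks in loader:
            logits = model(images)["out"]
            loss = loss_fn(logits, masks)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: pretrain on fluid segmentation (RETOUCH; 3 fluid classes + background assumed).
model = deeplabv3_resnet50(num_classes=4)
train(model, retouch_loader)          # retouch_loader: placeholder DataLoader

# Stage 2: replace the classification head and fine-tune on retinal layers (DUKE; 9 classes assumed).
model.classifier[-1] = nn.Conv2d(256, 9, kernel_size=1)
train(model, duke_loader)             # duke_loader: placeholder DataLoader
```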

Given the output label maps of the segmentation model, five ROIs are chosen (see Fig. 3): a background ROI (red rectangle), two small homogeneous ROIs on the second and the last retinal layers (blue rectangles), and two large ROIs on the first and the last retinal layers (green rectangles). The ROI centres are placed randomly within the B-scan, as long as the above location constraints, which stem from the requirements of the quality metrics themselves, are respected. Using (1)-(3), the ROIs, and empirically defined thresholds of 70.0, 3.0 and 10.0 for SNR, CNR and ENL, respectively, we identified 962 \(\widehat{HR}\) images of acceptable quality to form the \(\widehat{HR}\) dataset.
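The ROI placement and quality gating can be sketched as follows; the label indices, ROI sizes and acceptance logic beyond the stated thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_roi(label_map, label, height, width):
    """Draw an ROI centre uniformly from the pixels of `label` and crop a box around it.
    Returns (y0, y1, x0, x1), the same format used by the metric functions above."""
    ys, xs = np.nonzero(label_map == label)
    i = rng.integers(len(ys))
    y0 = int(np.clip(ys[i] - height // 2, 0, label_map.shape[0] - height))
    x0 = int(np.clip(xs[i] - width // 2, 0, label_map.shape[1] - width))
    return (y0, y0 + height, x0, x0 + width)

def accept(snr_value, cnr_value, enl_value):
    """Keep a fused image only if it clears the empirical thresholds from the text."""
    return snr_value >= 70.0 and cnr_value >= 3.0 and enl_value >= 10.0
```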

Deep learning models

To perform super-resolution (SR), we used two state-of-the-art image-to-image translation models: CycleGAN [11] and Pix2Pix [12]. These models belong to the family of GANs, which alternately train a generator G and a discriminator D in an adversarial manner. Pix2Pix requires supervision in the form of aligned image pairs to update its generator G, as it minimizes the L1 loss between images of the source (LR) and target (\(\widehat{HR}\)) domains. On the contrary, CycleGAN can be trained without paired examples, using cycle consistency to enforce the forward (\(G: LR \rightarrow \widehat{HR}\)) and backward (\(F: \widehat{HR} \rightarrow LR\)) mappings. Preliminary experiments, however, revealed that CycleGAN produced inconsistent results on unpaired images. We therefore also include supervised L1 losses when training CycleGAN.
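Schematically, the generator objective of the paired-supervised CycleGAN variant combines the adversarial, cycle-consistency and added paired L1 terms. The LSGAN-style adversarial loss mirrors the public CycleGAN code base, but the weights and exact composition below are assumptions, not the configuration reported here.

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()
mse = nn.MSELoss()   # LSGAN-style adversarial criterion

def generator_loss(G, F, D_hr, D_lr, lr_img, hr_img, lam_cyc=10.0, lam_sup=10.0):
    """G: LR -> HR, F: HR -> LR; D_hr, D_lr discriminate the two domains."""
    fake_hr = G(lr_img)
    fake_lr = F(hr_img)
    # Adversarial terms: fool both discriminators.
    adv = mse(D_hr(fake_hr), torch.ones_like(D_hr(fake_hr))) + \
          mse(D_lr(fake_lr), torch.ones_like(D_lr(fake_lr)))
    # Cycle-consistency terms.
    cyc = l1(F(fake_hr), lr_img) + l1(G(fake_lr), hr_img)
    # Added paired L1 supervision exploiting the aligned LR/HR pairs.
    sup = l1(fake_hr, hr_img) + l1(fake_lr, lr_img)
    return adv + lam_cyc * cyc + lam_sup * sup
```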

Implementation details

The dataset (962 pairs of LR and \(\widehat{HR}\) iOCT images) was split into three subsets: training set (\(70\%\)), validation set (\(10\%\)) and test set (\(20\%\)). We performed online data augmentation on the training set through rotation (\(\pm 5^\circ \)), translation (\(\pm 30\) pixels in width, \(\pm 20\) pixels in height), horizontal flip (with probability 0.5), scaling (\(1\pm 0.2\)) and the Albumentations (Footnote 5) 'ColorJitter' augmentation with brightness and contrast in [2/3, 3/2]. Our implementations of Pix2Pix and CycleGAN are based on the code available online (Footnote 6), and both networks use CycleGAN's ResNet-based generator [10] with 9 residual blocks. The networks were trained with the Adam optimizer for 200 epochs, with a batch size of 4 and an input resolution of \(440\times 300\) for both Pix2Pix and CycleGAN. Our experiments ran on an NVIDIA Quadro P6000 GPU with 24 GB of memory.
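An Albumentations pipeline matching the listed augmentations could look like the following; the composition, probabilities and the paired-image handling via additional_targets are assumptions rather than the exact setup used here.

```python
import albumentations as A

augment = A.Compose(
    [
        # Rotation +/-5 deg, scale 1 +/- 0.2, translation +/-30 px (x) and +/-20 px (y).
        A.Affine(rotate=(-5, 5), scale=(0.8, 1.2),
                 translate_px={"x": (-30, 30), "y": (-20, 20)}, p=1.0),
        A.HorizontalFlip(p=0.5),
        # Brightness and contrast jitter in [2/3, 3/2].
        A.ColorJitter(brightness=(2 / 3, 3 / 2), contrast=(2 / 3, 3 / 2),
                      saturation=0, hue=0, p=1.0),
    ],
    # Apply identical geometric/photometric transforms to the paired HR image.
    additional_targets={"image2": "image"},
)

augmented = augment(image=lr_image, image2=hr_image)   # HxWxC uint8 numpy arrays
lr_aug, hr_aug = augmented["image"], augmented["image2"]
```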

Results

This section presents the results of the quantitative and qualitative analysis that we performed to validate our SR pipeline. We also validate the merit of employing deep learning for this task by comparing our models with classical filter-based OCT denoising techniques and the learning-based state of the art.

Quantitative analysis

We quantitatively validate the quality enhancement of the SR images compared to the LR iOCT images. As our ground-truth (\(\widehat{HR}\)) images are themselves estimated by our methodology, full-reference metrics alone are not sufficient for image quality evaluation. Therefore, our analysis uses six different metrics: two full-reference metrics, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), and four no-reference metrics, the perceptual loss (\({\ell }_{feat}\)) [10], Fréchet Inception Distance (FID) [23], Global Contrast Factor (GCF) [24] and Natural Image Quality Evaluator (NIQE) [25]. The metric values were calculated on the test images for LR iOCT, SR using the state-of-the-art method of [19], SR using Pix2Pix [12] (SR-Pix) and SR using CycleGAN [11] (SR-Cyc). The evaluation metrics were computed at the original resolution (\(440\times 300\) px) for both Pix2Pix and CycleGAN outputs. The results are reported in Table 1. We assessed the statistical significance of the pairwise comparisons using paired t-tests. All p-values were \(p < 0.001\), except for the pairwise comparisons between SR-Cyc and the filter-based methods for SSIM.
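The full-reference part of the evaluation and the paired significance test can be sketched with scikit-image and SciPy; the variable names for the image sets are placeholders.

```python
import numpy as np
from scipy.stats import ttest_rel
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(pred_images, hr_images):
    """Per-image PSNR/SSIM of predicted SR images against the fused HR references (uint8)."""
    psnr = [peak_signal_noise_ratio(hr, p, data_range=255)
            for p, hr in zip(pred_images, hr_images)]
    ssim = [structural_similarity(hr, p, data_range=255)
            for p, hr in zip(pred_images, hr_images)]
    return np.array(psnr), np.array(ssim)

# Paired t-test between two methods over the same test frames (placeholder image lists).
psnr_a, _ = full_reference_scores(sr_cyc_images, hr_images)
psnr_b, _ = full_reference_scores(sr_pix_images, hr_images)
t_stat, p_value = ttest_rel(psnr_a, psnr_b)
```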

Table 1 Quantitative analysis. Arrows show whether higher/lower is better

Reference metrics (PSNR, SSIM) were calculated using the \(\widehat{HR}\) images as references. Regarding the no-reference metrics, the perceptual loss \({\ell }_{feat}\) measures the high-level perceptual similarity between two image domains by computing the distance between their feature representations extracted by an ImageNet-pretrained deep convolutional network [26]. We also used FID to capture how different two image sets are, through the distance between the distributions of features extracted by an ImageNet-pretrained Inception-v3. The perceptual loss \({\ell }_{feat}\) and FID were calculated for the whole test dataset (193 images) of each image domain (LR, SR-Pix, etc.) with respect to the \(\widehat{HR}\) domain. In addition, we trained a NIQE model on the test set of \(\widehat{HR}\) images and assigned a NIQE score to each test frame. The intuition behind these three reference-free quality criteria is that if their values for SR images are lower than the corresponding values for LR, then our SR methodology generates images that are perceptually more similar to \(\widehat{HR}\) and thus of better quality. Finally, we used GCF, a no-reference metric that measures image contrast, an essential characteristic of iOCT images.
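A perceptual distance in the spirit of \({\ell }_{feat}\) can be sketched with ImageNet-pretrained VGG-16 features; the chosen layer and input normalisation below are illustrative assumptions rather than the exact setup of [10].

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen VGG-16 feature extractor (truncated after an intermediate ReLU; layer choice assumed).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_distance(sr, hr):
    """MSE between VGG feature maps of SR and HR tensors of shape (B, 3, H, W) in [0, 1]."""
    with torch.no_grad():
        return nn.functional.mse_loss(vgg(sr), vgg(hr)).item()
```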

Fig. 4
figure 4

From left to right: LR, SR-Pix, SR-Cyc, \(\widehat{HR}\)

As shown in Table 1, SR-Cyc ranks first in terms of PSNR, SSIM, \({\ell }_{feat}\) and FID, which shows that the image quality has been improved and is perceptually more similar to \(\widehat{HR}\) (see also Fig. 4). Regarding GCF, the noisier images (LR and the SR output of [19]) exhibit higher values, probably due to the presence of high-frequency information (speckle noise). Finally, for frames of size \(440\times 300\), SR-Cyc runs at 18.17 frames per second (FPS) and SR-Pix at 17.51 FPS, both of which are appropriate for iOCT real-time requirements.

Qualitative analysis

To further validate our super-resolution pipeline, we performed a qualitative analysis. Our survey included 20 pairs of LR and SR-Cyc images, randomly selected from the test set. We asked 8 retinal doctors/surgeons to evaluate these image pairs by assigning a score between 1 (strongly disagree) and 5 (strongly agree) to the following questions:

  • Q1: Can you notice an improvement in the delineation of RPE/Bruch's vs. IS/OS junction in the generated image? (A1: 3.8±0.3)

  • Q2: Can you notice a reduction of artefacts in the generated image? (A2: 3.9±0.1)

  • Q3: Can you notice an improvement in the delineation of the ILM vs. RNFL in the generated image? (A3: 3.7±0.3)

Their answers, A1, A2 and A3 (mean ± standard deviation), indicate that SR-Cyc images provide improved delineation of the RPE vs. the IS/OS junction (Q1), a reduction of artefacts (Q2) and improved delineation of the ILM vs. the RNFL (Q3). Visual results are shown in Fig. 4, confirming the findings of our survey.

Denoising results

To demonstrate the denoising effect of our work, as part of the broader aim of image quality enhancement, we compare our best-performing (according to the metrics) network (SR-Cyc) with conventional denoising filters. We selected three state-of-the-art speckle-reduction methods for OCT images: the Symmetric Nearest Neighbour (SNN) filter [27], the adaptive Wiener filter [28] and BM3D [29], whose denoising ability has been assessed in several works [4, 18].
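Two of these baselines are readily available in common Python packages; the sketch below uses SciPy's adaptive Wiener filter and the third-party bm3d package, with assumed window size and noise level (SNN would require a custom implementation).

```python
import numpy as np
from scipy.signal import wiener
import bm3d

def denoise_wiener(img, window=5):
    """Adaptive Wiener filtering with an assumed local window size."""
    return wiener(img.astype(np.float64), mysize=window)

def denoise_bm3d(img, sigma=25.0):
    """BM3D with an assumed noise standard deviation in the image's intensity scale."""
    return bm3d.bm3d(img.astype(np.float64), sigma_psd=sigma)
```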

All the filter-based methods demonstrated considerable denoising capability, as shown in Fig. 5. We can, however, observe that these filters blurred the images (b, c, d) and that retinal layers cannot be distinguished easily, especially when compared to the outputs of SR-Pix and SR-Cyc. The SR-Cyc images, in particular, are visually more similar to \(\widehat{HR}\).

Quantitative analysis using the aforementioned metrics (see Table 1) shows that SR-Cyc achieved the best performance according to all metrics compared to the Wiener, BM3D and SNN filters. Among the filter-based techniques, SNN performs best according to PSNR, SSIM, \({\ell }_{feat}\) and FID.

Fig. 5
figure 5

Visual results of different denoising methods

Discussion and conclusions

This paper addresses the challenge of super-resolution in iOCT images. We overcome the absence of ground-truth HR images with a novel pipeline that leverages the spatiotemporal consistency of incoming iOCT B-scans to estimate \(\widehat{HR}\) images. Furthermore, we automatically assess the quality of the \(\widehat{HR}\) images and accept only the high-scoring ones as the target domain for super-resolution. Our quantitative and qualitative analysis demonstrated that the proposed super-resolution pipeline achieves convincing results for iOCT image quality enhancement and outperforms filter-based denoising methods with statistical significance. Future work will increase the sharpness of retinal layer delineations to produce iOCT images of quality even closer to pre-operative OCT scans.