arXiv:2003.11596v1 [eess.IV] 25 Mar 2020


Learning to Correct Overexposed and
Underexposed Photos

Mahmoud Afifi1,2*, Konstantinos G. Derpanis1,
Björn Ommer3, and Michael S. Brown1
1 Samsung AI Centre (SAIC), Toronto, Canada
2 York University, Canada
3 Heidelberg University, Germany

Abstract. Capturing photographs with wrong exposures remains a major
source of errors in camera-based imaging. Exposure problems are
categorized as either: (i) overexposed, where the camera exposure was
too long, resulting in bright and washed-out image regions, or (ii) un-
derexposed, where the exposure was too short, resulting in dark regions.
Both under- and overexposure greatly reduce the contrast and visual
appeal of an image. Prior work mainly focuses on underexposed images
or general image enhancement. In contrast, our proposed method tar-
gets both over- and underexposure errors in photographs. We formulate
the exposure correction problem as two main sub-problems: (i) color
enhancement and (ii) detail enhancement. Accordingly, we propose a
coarse-to-fine deep neural network (DNN) model, trainable in an end-to-
end manner, that addresses each sub-problem separately. A key aspect
of our solution is a new dataset of over 24,000 images exhibiting a range
of exposure values with a corresponding properly exposed image. Our
method achieves results on par with existing state-of-the-art methods
on underexposed images and yields significant improvements for images
suffering from overexposure errors.

Keywords: Deep learning, camera-based imaging, exposure correction, datasets

1 Introduction
The exposure used at capture time directly affects the overall brightness of the
final rendered photograph. Digital cameras control exposure using three main
factors: (i) capture shutter speed, (ii) f-number, which is the ratio of the focal
length to the camera aperture diameter, and (iii) the ISO value to control the
amplification factor of the received pixel signals. In photography, exposure set-
tings are represented by exposure values (EVs), where each EV refers to different
combinations of camera shutter speeds and f-numbers that result in the same
exposure effect—also referred to as ‘equivalent exposures’ in photography.
Digital cameras can adjust the exposure value of captured images for the pur-
pose of varying the brightness levels. This adjustment can be controlled manually
by users or performed automatically in an auto-exposure (AE) mode. When AE
is used, cameras adjust the EV to compensate for low/high levels of brightness
in the captured scene using through-the-lens (TTL) metering that measures the
amount of light received from the scene [1].

* This work was done while Mahmoud Afifi was an intern at the SAIC.

Fig. 1: The first column shows photographs with over- and underexposure errors:
an overexposed example (by Roland Tanglao, Flickr: CC BY 2.0) and an
underexposed example (from the MIT-Adobe FiveK dataset [38]). Results from
current commercial software (Google Photo Enhancer, Photoshop HDR, iPhone
Photo Enhancer) and the proposed method are shown.

Exposure errors can occur due to several factors, such as errors in measure-
ments of TTL metering, hard lighting conditions (e.g., very low lighting and
backlighting), dramatic changes in the brightness level of the scene, or errors
made by users in the manual mode. Such exposure errors are introduced early in
the capture process and are thus hard to correct after rendering the final 8-bit
image. This is due to the highly nonlinear operations applied by the camera
image signal processor (ISP) afterwards to render the final 8-bit standard RGB
(sRGB) image [2].
Fig. 1 shows typical examples of images with exposure errors. In Fig. 1, ex-
posure errors result in either very bright image regions, due to overexposure, or
very dark regions, caused by underexposure errors, in the final rendered images.
Correcting images with such errors is a challenging task even for well-established
image enhancement software packages as shown in Fig. 1. Although both over-
and underexposure errors are common in photography, most prior work is mainly
focused on correcting underexposure errors [3–7] or generic image quality en-
hancement [8, 9].
Contributions We propose a coarse-to-fine deep learning method for exposure
error correction of both over- and underexposed sRGB images. Our approach
formulates the exposure correction problem as two main sub-problems: (i) color
and (ii) detail enhancement. We propose a coarse-to-fine deep neural network
(DNN) model, trainable in an end-to-end manner, that begins by correcting the
global color information and subsequently refines the image details. In addition
to our DNN model, a key contribution to the exposure correction problem is
a new dataset containing over 24,000 images rendered from raw-RGB to sRGB
with different exposure settings. Each image in our dataset is provided with a
corresponding properly exposed reference image. Lastly, we present an extensive
set of evaluations and ablations of our proposed method with comparisons to
the state of the art. We demonstrate that our method achieves results on par
with previous methods dedicated to underexposed images and yields significant
improvements on overexposed images.

2 Related Work

The focus of our paper is on correcting exposure errors in camera-rendered
8-bit sRGB images. We refer the reader to [10–13] for representative examples
of rendering linear raw-RGB images captured with low light or exposure errors.
The related work is organized into three categories: (i) exposure error correction,
(ii) high dynamic range (HDR) restoration and generic image enhancement, and
(iii) related paired datasets.

Exposure Correction Traditional methods for exposure correction rely on image
histograms to re-balance image intensity values aiming to correct exposure
errors and enhance image contrast [14–18]. Alternatively, tone curve adjustment
is used to correct images with exposure errors. This process is performed by
relying either solely on input image information [19] or trained deep learning
models [20, 21]. The majority of prior work adopts the Retinex theory [22] by
assuming that improperly exposed images can be formulated as a pixel-wise
multiplication of target images, captured with correct exposure settings, by illu-
mination maps. Thus, the goal of these methods is to predict illumination maps
to recover the well-exposed target images. Representative Retinex-based meth-
ods include [3, 4, 22–26] and the most recent deep learning ones [5–7]. Most of
these methods, however, are restricted to correcting underexposure errors [3–7].
In contrast to the majority of prior work, our work is the first deep learning
method to explicitly correct both overexposed and underexposed photographs.

HDR Restoration and Image Enhancement HDR restoration is the process
of reconstructing scene radiance HDR values from one or more low dynamic
range (LDR) input images. Prior work either requires access to multiple LDR im-
ages [27–29] or uses a single LDR input image, which is converted to an HDR im-
age by hallucinating missing information [30,31]. Ultimately, these reconstructed
HDR images are mapped back to LDR for perceptual visualization. This map-
ping can be directly performed from the input multi-LDR images [32, 33], the
reconstructed HDR image [34], or directly from the single input LDR image
without the need for radiance HDR reconstruction [8,9]. There are also methods
that focus on general image enhancement that can be applied to enhancing im-
ages with poor exposure. In particular, work by [35, 36] was developed primarily
to enhance images captured on smartphone cameras by mapping captured im-
ages to appear as high-quality images captured by a DSLR. Our work does not
seek to reconstruct HDR images or general enhancement, but instead is trained
to explicitly address exposure errors.

Fig. 2: Dataset overview. Our dataset contains images with different exposure
error types and their corresponding properly exposed reference images. Shown
is a t-SNE visualization [37] of all images in our dataset and the low-light (LOL)
paired dataset (outlined in red) [5]. Notice that LOL covers a relatively small
fraction of the possible exposure levels, as compared to our introduced dataset.
Our dataset was rendered from linear raw-RGB images taken from the
MIT-Adobe FiveK dataset [38]. Each image was rendered with different relative
exposure values (EVs) by an accurate emulation of the camera ISP processes.

Paired Dataset Paired datasets are crucial for supervised learning for image
enhancement tasks. Existing paired datasets for exposure correction focus only
on low-light underexposed images. Representative examples include Wang et al.’s
dataset [7] and the low-light (LOL) paired dataset [5]. Unlike existing datasets
for exposure correction, we introduce a large image dataset rendered with a wide
range of exposure errors. Fig. 2 shows a comparison between our dataset and
the LOL dataset in terms of the number of images and the variety of exposure
errors in each dataset. The LOL dataset covers a relatively small fraction of the
possible exposure levels, as compared to our introduced dataset. Our dataset
is based on the Adobe-MIT FiveK dataset [38] and is accurately rendered by
adjusting the high tonal values provided in camera sensor raw-RGB images to
realistically emulate camera exposure errors. An alternative worth noting is to
use a large HDR dataset to produce training data—for example, the Google
HDR+ dataset [12]. One drawback, however, is that this dataset is a composite
of a varying number of smartphone captured raw-RGB images that were first
aligned to a composite raw-RGB image. The target ground truth image is based
on an HDR-to-LDR algorithm applied to this composite raw-RGB image [8,
12]. We opt instead to use the FiveK dataset as it starts with a single high-
quality raw-RGB image and the ground truth result is generated by an expert
photographer.

3 Our Dataset
To train our model, we need a large number of training images rendered with
realistic over- and underexposure errors and corresponding properly exposed
ground truth images. As discussed in Sec. 2, such datasets are currently not
publicly available. For this reason, our first task was to create a new dataset.
Our dataset is rendered from the MIT-Adobe FiveK dataset [38], which has
5,000 raw-RGB images and corresponding sRGB images rendered manually by
five expert photographers [38].

Fig. 3: Motivation behind our coarse-to-fine exposure correction approach. An
example of an overexposed image and its corresponding properly exposed image
is shown in (A) and (B), respectively. The Laplacian pyramid decomposition
allows us to enhance the color and detail information sequentially, as shown in
(C) and (D), respectively. Panels: (A) input image and its Laplacian pyramid;
(B) properly exposed reference image and its Laplacian pyramid; (C) image
reconstructed using the pyramid in (A) after swapping its last level with the
corresponding level in (B); (D) image reconstructed using the pyramid in (A)
after swapping its last two levels with the corresponding levels in (B).

For each raw-RGB image, we use the Adobe Camera Raw SDK [39] to em-
ulate different EVs as would be applied by a camera [40]. Adobe Camera Raw
accurately emulates the nonlinear camera rendering procedures using metadata
embedded in each DNG raw file [40, 41]. We render each raw-RGB image with
different digital EVs to mimic real exposure errors. Specifically, we use the rel-
ative EVs −1.5, −1, +0, +1, and +1.5 to render images with underexposure
errors, a zero gain of the original EV, and overexposure errors, respectively.
The zero-gain relative EV is equivalent to the original exposure settings applied
onboard the camera during capture time.
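For illustration, below is a minimal sketch of how such relative-EV renderings could be emulated with the open-source rawpy and imageio packages. The dataset itself was produced with the Adobe Camera Raw SDK [39], which is not reproduced here; this sketch only approximates the camera ISP emulation, and the file names are placeholders.

```python
# Hypothetical sketch: emulating relative EVs with rawpy/imageio instead of the
# Adobe Camera Raw SDK used for the dataset. The rendering is only an
# approximation of a camera ISP.
import rawpy
import imageio

RELATIVE_EVS = [-1.5, -1.0, 0.0, 1.0, 1.5]  # relative EVs used in the dataset

def render_with_exposure_errors(dng_path, out_prefix):
    for ev in RELATIVE_EVS:
        # Re-open the raw file per rendering to keep each run independent.
        with rawpy.imread(dng_path) as raw:
            srgb = raw.postprocess(
                exp_shift=2.0 ** ev,   # linear gain equivalent to the EV offset
                no_auto_bright=True,   # keep the exposure change visible
                use_camera_wb=True,    # camera white balance, as a camera would apply
                output_bps=8,          # 8-bit sRGB-like output
            )
        imageio.imwrite(f"{out_prefix}_EV{ev:+.1f}.png", srgb)
```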
Rather than using our rendered images with +0 relative EV, we use the images
manually retouched by an expert photographer (referred to as Expert C in [38])
as our target correctly exposed ground truth images.
The reason behind this choice is that a significant number of images contain
backlighting or partial exposure errors in the original exposure capture settings.
The expert adjustments were performed in the ProPhoto RGB color space [38]
(rather than raw-RGB), which we converted to a standard 8-bit sRGB color
space encoding.
In total, we generated 24,330 8-bit sRGB images with different digital ex-
posure settings. We discarded a small number of images that had misalignment
with their corresponding ground truth image. These misalignments were due to
different usage of the DNG crop area metadata by Adobe Camera Raw SDK
and the expert. Our dataset is divided into three sets: (i) training set of 17,675
images, (ii) validation set of 750 images, and (iii) testing set of 5,905 images.
The training, validation, and testing sets are drawn from disjoint subsets of the
FiveK dataset, so no image is shared across the three sets. Fig. 2 shows examples
of our generated 8-bit sRGB
images and the corresponding properly exposed 8-bit sRGB reference images.

Fig. 4: Overview of our image exposure correction architecture. We propose a
coarse-to-fine deep network to progressively correct exposure errors in 8-bit
sRGB images. Our network first corrects the global color captured at the final
level of the Laplacian pyramid and then the subsequent frequency layers. (The
diagram shows n encoder-decoder sub-networks with U-Net skip connections,
linked by 2×2 stride-2 transposed conv upsampling layers, and supervised by
the pyramid, reconstruction, and adversarial losses.)

4 Our Method

Given an 8-bit sRGB input image, I, rendered with the incorrect exposure set-
ting, our method aims to produce an output image, Y, with fewer exposure
errors than those in I. As we target both over- and underexposed errors, our
input image, I, is expected to contain regions of nearly over- or under-saturated
values with corrupted color and detail information. We propose to correct color
and detail errors of I in a sequential manner. Specifically, we propose to pro-
cess a multi-resolution representation of I, rather than directly dealing with the
original form of I. We use the Laplacian pyramid [42] as our multiresolution
decomposition, which is derived from the Gaussian pyramid of I.

4.1 Coarse-to-Fine Exposure Correction

Let X represent the Laplacian pyramid of I with n levels, such that X(l) is the
lth level of X. The last level of this pyramid (i.e., X(n) ) captures low-frequency
information of I, while the first level (i.e., X(1) ) captures the high-frequency
information. Such frequency levels can be categorized into: (i) global color infor-
mation of I stored in the low-frequency level and (ii) image coarse-to-fine details
stored in the mid- and high-frequency levels. These levels can be later used to
reconstruct the full-color image I.
Fig. 3 motivates our coarse-to-fine approach to exposure correction. Figs.
3-(A) and (B) show an example overexposed image and its corresponding well-
exposed target, respectively. As observed, a significant exposure correction can
be obtained by using only the low-frequency layer (i.e., the global color infor-
mation) of the target image in the Laplacian pyramid reconstruction process,
as shown in Fig. 3-(C). We can then improve the final image by enhancing the
details in a sequential way by correcting each level of the Laplacian pyramid, as
shown in Fig. 3-(D). Practically, we do not have access to the properly exposed
image in Fig. 3-(B) at the inference stage, and thus our goal is to predict the
missing color/detail information of each level in the Laplacian pyramid.
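The observation in Fig. 3 can be reproduced with a small sketch such as the one below, using OpenCV; it assumes two aligned 8-bit images of identical size, and the file names are placeholders.

```python
# Sketch of the Fig. 3 experiment: build Laplacian pyramids of an improperly
# exposed image and its reference, swap the low-frequency level(s), and
# reconstruct. Assumes OpenCV and two aligned images of identical size.
import cv2
import numpy as np

def laplacian_pyramid(img, n=4):
    """Levels 1..n-1 are band-pass residuals; level n is the low-frequency base."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(n - 1):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)   # high/mid-frequency residual
        cur = down
    pyr.append(cur)            # low-frequency (global color) level
    return pyr

def collapse(pyr):
    cur = pyr[-1]
    for level in reversed(pyr[:-1]):
        cur = cv2.pyrUp(cur, dstsize=(level.shape[1], level.shape[0])) + level
    return np.clip(cur, 0, 255).astype(np.uint8)

bad = cv2.imread("overexposed.jpg")          # placeholder file names
ref = cv2.imread("properly_exposed.jpg")
X, T = laplacian_pyramid(bad), laplacian_pyramid(ref)

X[-1] = T[-1]                  # swap the last level only -> Fig. 3-(C)
fig3_c = collapse(X)
X[-2] = T[-2]                  # also swap the next level -> Fig. 3-(D)
fig3_d = collapse(X)
```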

Inspired by this observation and the success of coarse-to-fine architectures
for various other computer vision tasks (e.g., [43–46]), we design a DNN that
corrects the global color and detail information of I in a sequential manner
using the Laplacian pyramid decomposition. The remaining part of this section
explains the technical details of our model (Sec. 4.2), including details of the
losses (Sec. 4.3), implementation and training (Sec. 4.4), and inference phase
(Sec. 4.5).

4.2 Coarse-to-Fine Network

Our image exposure correction architecture sequentially processes the n-level
Laplacian pyramid X of image I to produce the final corrected image Y. The
proposed model consists of n sub-networks. Each of these sub-networks is a U-
Net-like architecture [47] with untied weights. We allocate the network capacity
in the form of weights based on how significantly each sub-problem (i.e., global
color correction and detail enhancement) contributes to our final result. Fig. 4
provides an overview of our network. As shown, the largest (in terms of weights)
sub-network in our architecture is dedicated to processing the global color infor-
mation in I (i.e., X(n) ). This sub-network (shown in yellow in Fig. 4) processes
the low-frequency level X(n) and produces an upscaled image Y(n) . The upscal-
ing process scales up the output of our sub-network by a factor of two using
strided transposed convolution with trainable weights. Next, we add the first
mid-frequency level X(n−1) to Y(n) to be processed by the second sub-network
in our model. This sub-network enhances the corresponding details of the cur-
rent level and produces a residual layer that is then added to Y(n) + X(n−1)
to reconstruct image Y(n−1) , which is equivalent to the corresponding Gaussian
pyramid level n − 1. This refinement-upsampling process proceeds until the final
output image, Y, is produced. Our network is fully differentiable and thus can
be trained in an end-to-end manner.
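The chaining described above can be summarized by the schematic PyTorch sketch below. SubUNet is a simplified placeholder for the encoder-decoder of Fig. 6-(A); the layer counts and channel widths here are assumptions, not the released model.

```python
# Schematic sketch of the coarse-to-fine chaining described above.
import torch
import torch.nn as nn

class SubUNet(nn.Module):
    """Placeholder for the U-Net-like encoder-decoder of Fig. 6-(A)."""
    def __init__(self, ch=24):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 3, 3, padding=1),
        )
    def forward(self, x):
        return self.body(x)

class CoarseToFineNet(nn.Module):
    def __init__(self, n_levels=4):
        super().__init__()
        self.subnets = nn.ModuleList(SubUNet() for _ in range(n_levels))
        # One 2x2, stride-2 transposed conv per upscaling step (n-1 of them).
        self.upsamplers = nn.ModuleList(
            nn.ConvTranspose2d(3, 3, kernel_size=2, stride=2)
            for _ in range(n_levels - 1)
        )

    def forward(self, laplacian_pyr):
        # laplacian_pyr[0] is X(1) (finest); laplacian_pyr[-1] is X(n) (coarsest).
        y = self.upsamplers[0](self.subnets[0](laplacian_pyr[-1]))  # corrected global color
        for i, x_level in enumerate(reversed(laplacian_pyr[:-1]), start=1):
            y = y + x_level                    # add the next frequency band
            y = y + self.subnets[i](y)         # predicted residual refinement
            if i < len(self.subnets) - 1:
                y = self.upsamplers[i](y)      # upscale toward the next level
        return y
```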

4.3 Losses

We train our model end-to-end to minimize the following loss function:

L = Lrec + Lpyr + Ladv , (1)

where Lrec denotes the reconstruction loss, Lpyr the pyramid loss, and Ladv the
adversarial loss.

Reconstruction Loss: We use the L1 loss function between the reconstructed and
properly exposed reference images. This loss can be expressed as follows:

Lrec = Σ_{p=1}^{3hw} |Y(p) − T(p)| ,                                   (2)

where h and w denote the height and width of the training image, respectively,
and p is the index of each pixel in our corrected image, Y, and the corresponding
properly exposed reference image, T. The remaining loss terms are described
below.

Fig. 5: Multiscale losses. Shown are the outputs of each sub-network trained
with and without the pyramid loss (Eq. 3), alongside the input image, its 4-level
Laplacian pyramid, and the properly exposed reference image.

Pyramid Loss: To guide each sub-network to follow the Laplacian pyramid re-
construction procedure, we introduce dedicated losses at each pyramid level. Let
T(l) denote the lth level of the Gaussian pyramid of our reference image, T,
after upsampling by a factor of two. We use a simple interpolation process for
the upsampling operation [27]. Our pyramid loss is computed as follows:
Lpyr = Σ_{l=2}^{n} 2^(l−2) Σ_{p=1}^{3 hl wl} |Y(l)(p) − T(l)(p)| ,     (3)

where hl and wl are twice the height and width of the lth level of the Laplacian
pyramid of the training image, and p indexes each pixel in our corrected image
at the lth level, Y(l), and in the properly exposed reference image at the same
level, T(l). The pyramid loss not only gives a principled interpretation of the
task of each sub-network but also results in fewer visual artifacts compared to
training using only the reconstruction loss (see Fig. 5). Notice that without the
intermediate pyramid losses, the multi-scale reconstructions deviate widely from
the intermediate Gaussian targets.
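A compact PyTorch sketch of the reconstruction loss (Eq. 2) and the pyramid loss (Eq. 3) is given below. It assumes the model exposes its intermediate outputs Y(l), and it approximates the Gaussian-pyramid targets with average-pool downsampling followed by bilinear upsampling rather than the interpolation of [27].

```python
# Sketch of the reconstruction loss (Eq. 2) and pyramid loss (Eq. 3).
# Assumes batched tensors of shape (N, 3, H, W) and that the model returns
# intermediate outputs ys = [Y(1), ..., Y(n)].
import torch
import torch.nn.functional as F

def gaussian_targets(t, n):
    """Target T(l) for l=2..n: downsample the reference l-1 times, then upsample x2."""
    targets, cur = {}, t
    for level in range(2, n + 1):
        cur = F.avg_pool2d(cur, 2)                     # approximate Gaussian pyramid step
        targets[level] = F.interpolate(cur, scale_factor=2, mode="bilinear",
                                       align_corners=False)
    return targets

def exposure_losses(ys, t):
    """ys[0] is the final output Y = Y(1); ys[l-1] is Y(l)."""
    n = len(ys)
    l_rec = torch.abs(ys[0] - t).sum()                 # Eq. (2), summed over 3hw pixels
    t_lvls = gaussian_targets(t, n)
    l_pyr = sum(2 ** (l - 2) * torch.abs(ys[l - 1] - t_lvls[l]).sum()
                for l in range(2, n + 1))              # Eq. (3)
    return l_rec, l_pyr
```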

Adversarial Loss: To perceptually enhance the reconstruction of the corrected
image in terms of realism and appeal, we also consider an adversarial loss as a
regularizer. This adversarial loss term can be described by the following
equation [48]:

Ladv = −3hwn log (S (D (Y))) , (4)

where S is the sigmoid function and D is a discriminator DNN that is trained
together with our main network.

Fig. 6: Details of the architectures used in our work. (A) Encoder-decoder
architecture [47] used to design our sub-networks in the main network: 3×3 conv
layers (stride 1, padding 1) with Leaky ReLU activations, 2×2 max-pooling layers
with stride 2, 2×2 transposed conv upsampling layers, skip connections via depth
concatenation, and a final 1×1 conv layer; m and L denote the number of output
channels of the first encoder level and the number of encoder/decoder levels, and
W and H denote the input width and height. (B) Discriminator architecture used
in our adversarial training: 4×4 conv layers with stride 2, Leaky ReLU activations,
and batch normalization, producing a single scalar output.

4.4 Implementation and Training Details


In our implementation, we use a Laplacian pyramid with four levels (i.e., n = 4)
and thus we have four sub-networks in our model. We provide the implemen-
tation details of our network, the discriminator network used in the adversarial
training process, and the training details below.

4.4.1 Main Network Our main network consists of four sub-networks with
∼7M parameters trained in an end-to-end manner. Each sub-network accepts a
different representation of the input image extracted from the Laplacian pyramid
decomposition. The first sub-network is a four-layer encoder-decoder network
with skip connections (i.e., U-Net-like architecture [47]). The output of the first
convolutional (conv) layer has 24 channels. Our first sub-network has ∼4.4M
learnable parameters and it accepts the low-frequency band level of the Laplacian
pyramid, i.e., X(4) . The result of the first sub-network is then upscaled using a
2×2×3 transposed conv layer with three output channels and a stride of two.
This processed layer is then added to the first mid-frequency band level of the
Laplacian pyramid (i.e., X(3) ) and is fed to the second sub-network.
The second sub-network is a three-layer encoder-decoder network with skip
connections. It has 24 channels in the first conv layer of the encoder, with a total
of ∼1.1M learnable parameters. The second sub-network processes the upscaled
input from the first sub-network and outputs a residual layer, which is then added
back to the input to the second sub-network followed by a 2×2×3 transposed
conv layer with three output channels and a stride of two. The result is added to
10 M. Afifi, K. G. Derpanis, B. Ommer, and M. S. Brown

the second mid-frequency band level of the Laplacian pyramid (i.e., X(2) ) and is
fed to the third sub-network, which generates a new residual that is added back
again to the input of this sub-network.
The third sub-network has the same design as the second network. Finally,
the result is added to the high-frequency band level of the Laplacian pyramid
(i.e., X(1) ) and is fed to the fourth sub-network to produce the final processed
image.
The final sub-network is a three-layer encoder-decoder network with skip
connections and has ∼482.2K learnable parameters, where the output of the
first conv layer in its encoder has 16 channels. We provide the details of the
main encoder-decoder architecture of each sub-network in Fig. 6-(A).
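The sketch below illustrates one such encoder-decoder, following the Fig. 6-(A) description (3×3 convs with Leaky ReLU, 2×2 max-pooling, 2×2 transposed-conv upsampling, concatenation-based skip connections, and a final 1×1 conv). The normalization and exact channel choices are assumptions, not the released architecture.

```python
# Sketch of a per-sub-network encoder-decoder in the style of Fig. 6-(A).
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
    )

class EncoderDecoder(nn.Module):
    def __init__(self, m=24, levels=4):
        super().__init__()
        chs = [m * 2 ** i for i in range(levels)]             # m, 2m, 4m, ...
        self.enc, cin = nn.ModuleList(), 3
        for c in chs:
            self.enc.append(conv_block(cin, c))
            cin = c
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(chs[i], chs[i - 1], 2, stride=2)
            for i in range(levels - 1, 0, -1)
        )
        self.dec = nn.ModuleList(
            conv_block(2 * chs[i - 1], chs[i - 1])            # concat(skip, upsampled)
            for i in range(levels - 1, 0, -1)
        )
        self.out = nn.Conv2d(chs[0], 3, 1)                    # final 1x1 conv

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.out(x)
```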

4.4.2 Discriminator Network In the adversarial training of our network,
we use a light-weight discriminator network with ∼1M learnable parameters.
We provide the details of the discriminator in Fig. 6-(B). Notice that unlike
our main network, we resize all input image patches to have 256 × 256 pixels
before being processed by the discriminator. The output of the last layer in our
discriminator is a single scalar value which is then used in our loss during the
optimization.

4.4.3 Training Details We use He et al.’s method [49] to initialize the weights
of our encoder and decoder conv layers, while the bias terms are initialized to
zero. We minimize our loss functions using the Adam optimizer [50] with a decay
rate β1 = 0.9 for the exponential moving averages of the gradient and a decay
rate β2 = 0.999 for the squared gradient. We use a learning rate of 10−4 to update
the parameters of our main network and a learning rate of 10−5 to update our
discriminator’s parameters.
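A sketch of the corresponding initialization and optimizer setup is shown below; main_net and discriminator are stand-in modules rather than the actual networks, and the scheduler is only an example of the 0.5 learning-rate decay.

```python
# Sketch of weight initialization (He et al. [49], zero biases) and the Adam
# optimizers with the learning rates stated above. The networks are stand-ins.
import torch
import torch.nn as nn

def init_weights(module):
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="leaky_relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

main_net = nn.Sequential(nn.Conv2d(3, 24, 3, padding=1))      # stand-in for the model
discriminator = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2))   # stand-in for Fig. 6-(B)
main_net.apply(init_weights)

g_opt = torch.optim.Adam(main_net.parameters(), lr=1e-4, betas=(0.9, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-5, betas=(0.9, 0.999))
# Example of a 0.5 learning-rate decay schedule (step size in epochs).
g_sched = torch.optim.lr_scheduler.StepLR(g_opt, step_size=10, gamma=0.5)
```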
We train our network on patches with different dimensions. Training begins
without the adversarial loss, Ladv , then Ladv is added to enhance the results
of our initial training [51]. Specifically, we begin our training without Ladv on
176,590 patches with dimensions of 128 × 128 pixels extracted randomly from
our training images for 40 epochs. The mini-batch size is set to 32. The learning
rate is decayed by a factor of 0.5 after the first 20 epochs. Then, we continue
training on another 105,845 patches with dimensions of 256×256 pixels for 30
epochs with a mini-batch size of eight. At this stage, we train our main network
without Ladv for 15 epochs and continue training for another 15 epochs with
Ladv . The learning rates for the main network and the discriminator network
are decayed by a factor of 0.5 every 10 epochs. Finally, we fine-tune the trained
networks on another 69,515 training patches with dimensions of 512×512 pixels
for 20 epochs with a mini-batch size of four and a learning rate decay of 0.5
applied every five epochs.
We discard any training patches that have an average intensity less than 0.02
or higher than 0.98. We also discard homogeneous patches that have a gradient
magnitude less than 0.06. We randomly left-right flip training patches for data
augmentation.
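The patch-selection rules above can be sketched as follows; how the thresholds are measured (mean intensity and mean gradient magnitude in [0, 1]) is our assumption about the published criteria.

```python
# Sketch of the patch-selection rules: discard nearly black/white patches and
# nearly homogeneous patches, then apply random horizontal flips.
import numpy as np

def keep_patch(patch):
    """patch: float array in [0, 1] of shape (H, W, 3)."""
    mean_intensity = patch.mean()
    if mean_intensity < 0.02 or mean_intensity > 0.98:
        return False
    gray = patch.mean(axis=2)
    gy, gx = np.gradient(gray)
    if np.hypot(gx, gy).mean() < 0.06:     # assumed: mean gradient magnitude
        return False
    return True

def augment(patch, target, rng=np.random):
    if rng.rand() < 0.5:                   # random left-right flip
        patch, target = patch[:, ::-1], target[:, ::-1]
    return patch, target
```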

In the adversarial training, we optimize both the main network and the dis-
criminator in an iterative manner. At each optimization step, the learnable pa-
rameters of each network are updated to minimize its own loss function. The
discriminator is trained to minimize the following loss function [48]:

Ldsc = r (T) + c (Y) , (5)

where r (T) refers to the discriminator loss of recognizing the properly exposed
reference image T, while c (Y) refers to the discriminator loss of recognizing our
corrected image Y. The r (T) and c (Y) loss functions are given by the following
equations:
r (T) = − log (S (D (T))) , (6)

c (Y) = − log (1 − S (D (Y))) . (7)
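The following sketches one adversarial optimization step implementing Eqs. (4)–(7) in PyTorch; model, disc, g_opt, and d_opt are placeholders, and the reconstruction/pyramid losses of Sec. 4.3 are abbreviated to the L1 term.

```python
# Sketch of one adversarial optimization step (Eqs. 4-7). In practice, a
# numerically stable BCE-with-logits formulation is equivalent to the explicit
# log-sigmoid terms used here.
import torch

def adversarial_step(model, disc, g_opt, d_opt, pyramid, target, n_levels=4):
    # --- discriminator update: Eq. (5) = Eq. (6) + Eq. (7) ---
    y = model(pyramid).detach()
    d_loss = -torch.log(torch.sigmoid(disc(target))).mean() \
             - torch.log(1.0 - torch.sigmoid(disc(y))).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update: L1 reconstruction term plus Eq. (4) ---
    y = model(pyramid)
    _, _, h, w = y.shape
    adv = -3 * h * w * n_levels * torch.log(torch.sigmoid(disc(y))).mean()
    g_loss = torch.abs(y - target).sum() + adv    # Lrec + Ladv (Lpyr omitted here)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```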

4.5 Inference Stage

Our network is fully convolutional and can process input images with different
resolutions. While our model requires a reasonable memory size (∼7M param-
eters), processing high-resolution images requires a high computational power
that may not always be available. Furthermore, processing images with consid-
erably higher resolution (e.g., 16-megapixel) than the range of resolutions used
in the training process can affect our model’s robustness with large homogeneous
image regions. This issue arises because our network was trained on a certain
range of effective receptive fields, which is very low compared to the receptive
fields required for images with very high resolution. To that end, we use the bilat-
eral guided upsampling method [52] to process high-resolution images. First, we
resize the input test image to have a maximum dimension of 512 pixels. Then, we
process the downsampled version of the input image using our model, followed
by applying the fast upsampling technique [52] with a bilateral grid of 22×22×8
cells. This process allows us to process a 16-megapixel image in ∼4.5 seconds on
average. This time includes ∼0.5 seconds to run our network on an NVIDIA
GeForce GTX 1080 GPU and ∼4 seconds on an Intel Xeon E5-1607 @ 3.10 GHz
machine for the guided upsampling process. Note that the runtime of the guided
upsampling step can be significantly improved with a Halide implementation [53].
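A sketch of this inference path is given below. For simplicity it replaces the bilateral guided upsampling of [52] with a per-pixel gain-map upsampling, which is only an illustrative stand-in for the actual upsampler.

```python
# Sketch of high-resolution inference: run the model on a downsampled copy
# (max dimension 512), then transfer the correction to the full-resolution image
# via an upsampled gain map (a simplified stand-in for bilateral guided upsampling).
import cv2
import numpy as np

def correct_high_res(img_bgr, run_model, max_dim=512, eps=1e-4):
    """img_bgr: uint8 HxWx3; run_model: callable on a float image in [0, 1]."""
    h, w = img_bgr.shape[:2]
    scale = max_dim / max(h, w)
    small = cv2.resize(img_bgr, (round(w * scale), round(h * scale)),
                       interpolation=cv2.INTER_AREA)
    small_f = small.astype(np.float32) / 255.0
    corrected_small = run_model(small_f)              # network output in [0, 1]
    gain = corrected_small / (small_f + eps)          # low-resolution per-pixel gain
    gain_full = cv2.resize(gain, (w, h), interpolation=cv2.INTER_LINEAR)
    out = np.clip(img_bgr.astype(np.float32) / 255.0 * gain_full, 0, 1)
    return (out * 255).astype(np.uint8)
```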

5 Empirical Evaluation

We compare our method against several existing methods for exposure correction
and image enhancement. We first present quantitative results and comparisons
in Sec. 5.1, followed by qualitative comparisons in Sec. 5.2. Finally, we present
ablation studies performed to validate our architecture and loss function in Sec.
5.3.

Fig. 7: We evaluate the results of input images against all five expert
photographers’ edits from the FiveK dataset [38]. (The figure shows an example
input image with poor exposure, our result, and the renderings by Experts A–E.)

5.1 Quantitative Results

To evaluate our method, we use our test set, which consists of 5,905 images
rendered with different exposure settings, as described in Sec. 3. Specifically, our
test set includes 3,543 well-exposed/overexposed images rendered with +0, +1,
and +1.5 relative EVs, and 2,362 underexposed images rendered with −1 and
−1.5 relative EVs.
We adopt the following three standard metrics to evaluate the pixel-wise
accuracy and the perceptual quality of our results: (i) peak signal-to-noise ratio
(PSNR), (ii) structural similarity index measure (SSIM) [54], and (iii) perceptual
index (PI) [55]. The PI is given by:

PI = 0.5(10 − Ma + NIQE), (8)

where both Ma [56] and NIQE [57] are no-reference image quality metrics.
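These metrics can be computed as sketched below, assuming a recent scikit-image; Ma [56] and NIQE [57] are taken as precomputed scores, since neither metric ships with scikit-image.

```python
# Sketch of the evaluation metrics: PSNR and SSIM via scikit-image, and the
# perceptual index of Eq. (8) from precomputed Ma and NIQE scores.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(result, reference):
    """result, reference: uint8 HxWx3 images."""
    psnr = peak_signal_noise_ratio(reference, result, data_range=255)
    ssim = structural_similarity(reference, result, channel_axis=-1, data_range=255)
    return psnr, ssim

def perceptual_index(ma_score, niqe_score):
    return 0.5 * ((10.0 - ma_score) + niqe_score)   # Eq. (8); lower is better
```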
For the pixel-wise error metrics – namely, PSNR and SSIM – we compare the
results not only against the properly exposed rendered images by Expert C but
also with all five expert photographers in the MIT-Adobe FiveK dataset [38].
Though the expert photographers may render the same image in different ways
due to differences in the camera-based rendering settings (e.g., white balance,
tone mapping), a common characteristic over all rendered images by the expert
photographers is that they all have fairly proper exposure settings [38] (see
Fig. 7). For this reason, we evaluate our method against the five expert rendered
images, as they all represent satisfactorily exposed reference images.
We also evaluate a variety of previous non-learning and learning-based meth-
ods on our test set for comparison: histogram equalization (HE) [14], contrast-
limited adaptive histogram equalization (CLAHE) [16], the weighted variational
model (WVM) [59], the low-light image enhancement method (LIME) [3, 60],
HDR CNN [30], DPED models [35], deep photo enhancer (DPE) models [9], the
high-quality exposure correction method (HQEC) [4], RetinexNet [5], and deep
underexposed photo enhancer (UPE) [7]. To render the reconstructed HDR im-
ages generated by the HDR CNN method [30] back into LDR, we tested both the
deep reciprocating HDR transformation method (RHT) [34] and Adobe Photo-
shop’s (PS) HDR tool [58].

Fig. 8: Qualitative results of correcting overexposed images. Shown are the input
images, results from the DPED [35], our results, and the corresponding ground
truth images.

Table 1 summarizes the quantitative results obtained by each method. As
shown in the top portion of the table, our method achieves the best results
for overexposed images under all metrics. In the underexposed image correction
setting, our results (middle portion of table) are on par with the state-of-the-art
methods. Finally, in contrast to most of the existing methods, the results in the
bottom portion of the table show that our method can effectively deal with both
types of exposure errors.

For completeness, we further evaluate our method on the following standard
image datasets used by previous low-light image enhancement methods:
(i) LIME (10 images) [3], (ii) NPE (75 images) [24], (iii) VV (24 images) [61], and
(iv) DICM (44 images) [62]. Similar to previous methods, we use the NIQE
perceptual score [57] for evaluation. Table 2 compares results by our method and
the following methods: LIME [3, 60], WVM [59], RetinexNet (RNet) [5], “kin-
dling the darkness” method (KinD) [6], enlighten GAN (EGAN) [63], and deep
bright-channel prior (BCP) [64]. As can be seen in Table 2, our method generally
achieves perceptually superior results in correcting low-light 8-bit images.

Method               Expert A       Expert B       Expert C       Expert D       Expert E       Avg.           PI
                     PSNR   SSIM    PSNR   SSIM    PSNR   SSIM    PSNR   SSIM    PSNR   SSIM    PSNR   SSIM
HE [14] ∗ 16.140 0.686 16.277 0.672 16.531 0.699 16.643 0.669 17.321 0.691 16.582 0.683 2.351
CLAHE [16] ∗ 13.934 0.568 14.689 0.586 14.453 0.584 15.116 0.593 15.850 0.612 14.808 0.589 2.270
WVM [59] ∗ 12.355 0.624 13.147 0.656 12.748 0.645 14.059 0.669 15.207 0.690 13.503 0.657 2.342
LIME [3, 60] ∗ 09.627 0.549 10.096 0.569 9.875 0.570 10.936 0.597 11.903 0.626 10.487 0.582 2.412
HDR CNN [30] w/ RHT [34] 13.151 0.475 13.637 0.478 13.622 0.497 14.177 0.479 14.625 0.503 13.842 0.486 4.284
HDR CNN [30] w/ PS [58] 14.804 0.651 15.622 0.689 15.348 0.670 16.583 0.685 18.022 0.703 16.076 0.680 2.248
DPED (iPhone) [35] 12.680 0.562 13.422 0.586 13.135 0.581 14.477 0.596 15.702 0.630 13.883 0.591 2.909
DPED (BlackBerry) [35] 15.170 0.621 16.193 0.691 15.781 0.642 17.042 0.677 18.035 0.678 16.444 0.662 2.518
DPED (Sony) [35] 16.398 0.672 17.679 0.707 17.378 0.697 17.997 0.685 18.685 0.700 17.627 0.692 2.740
DPE (HDR) [9] 14.399 0.572 15.219 0.573 15.091 0.593 15.692 0.581 16.640 0.626 15.408 0.589 2.417
DPE (U-FiveK) [9] 14.314 0.615 14.958 0.628 15.075 0.645 15.987 0.647 16.931 0.667 15.453 0.640 2.630
DPE (S-FiveK) [9] 14.786 0.638 15.519 0.649 15.625 0.668 16.586 0.664 17.661 0.684 16.035 0.661 2.621
HQEC [4] ∗ 11.775 0.607 12.536 0.631 12.127 0.627 13.424 0.652 14.511 0.675 12.875 0.638 2.387
RetinexNet [5] 10.149 0.570 10.880 0.586 10.471 0.595 11.498 0.613 12.295 0.635 11.059 0.600 2.933
Deep UPE [7] 10.047 0.532 10.462 0.568 10.307 0.557 11.583 0.591 12.639 0.619 11.008 0.573 2.428
Our method w/o Ladv 18.976 0.743 19.767 0.731 19.980 0.768 18.966 0.716 19.056 0.727 19.349 0.737 2.189
Our method w/ Ladv 18.874 0.738 19.569 0.718 19.788 0.760 18.823 0.705 18.936 0.719 19.198 0.728 2.183
−1 and −1.5 relative EVs (2,362 underexposed images)
HE [14] ∗ 16.158 0.683 16.293 0.669 16.517 0.692 16.632 0.665 17.280 0.684 16.576 0.679 2.486
CLAHE [16] ∗ 16.310 0.619 17.140 0.646 16.779 0.621 15.955 0.613 15.568 0.608 16.350 0.621 2.387
WVM [59] ∗ 17.686 0.728 19.787 0.764 18.670 0.728 18.568 0.729 18.362 0.724 18.615 0.735 2.525
LIME [3, 60] ∗ 13.444 0.653 14.426 0.672 13.980 0.663 15.190 0.673 16.177 0.694 14.643 0.671 2.462
HDR CNN [30] w/ RHT [34] 14.547 0.456 14.347 0.427 14.068 0.441 13.025 0.398 11.957 0.379 13.589 0.420 5.072
HDR CNN [30] w/ PS [58] 17.324 0.692 18.992 0.714 18.047 0.696 18.377 0.689 19.593 0.701 18.467 0.698 2.294
DPED (iPhone) [35] 18.814 0.680 21.129 0.712 20.064 0.683 19.711 0.675 19.574 0.676 19.858 0.685 2.894
DPED (BlackBerry) [35] 19.519 0.673 22.333 0.745 20.342 0.669 19.611 0.683 18.489 0.653 20.059 0.685 2.633
DPED (Sony) [35] 18.952 0.679 20.072 0.691 18.982 0.662 17.450 0.629 15.857 0.601 18.263 0.652 2.905
DPE (HDR) [9] 17.625 0.675 18.542 0.705 18.127 0.677 16.831 0.665 15.891 0.643 17.403 0.673 2.340
DPE (U-FiveK) [9] 19.130 0.709 19.574 0.674 19.479 0.711 17.924 0.665 16.370 0.625 18.495 0.677 2.571
DPE (S-FiveK) [9] 20.153 0.738 20.973 0.697 20.915 0.738 19.050 0.688 17.510 0.648 19.720 0.702 2.564
HQEC [4] ∗ 15.801 0.692 17.371 0.718 16.587 0.700 17.090 0.705 17.675 0.716 16.905 0.706 2.532
RetinexNet [5] 11.676 0.607 12.711 0.611 12.132 0.621 12.720 0.618 13.233 0.637 12.494 0.619 3.362
Deep UPE [7] 17.832 0.728 19.059 0.754 18.763 0.745 19.641 0.737 20.237 0.740 19.106 0.741 2.371
Our method w/o Ladv 19.432 0.750 20.590 0.739 20.542 0.770 18.989 0.723 18.874 0.727 19.685 0.742 2.344
Our method w/ Ladv 19.475 0.751 20.546 0.730 20.518 0.768 18.935 0.715 18.756 0.719 19.646 0.737 2.342
Combined over and underexposed images (5,905 images)
HE [14] ∗ 16.148 0.685 16.283 0.671 16.525 0.696 16.639 0.668 17.305 0.688 16.580 0.682 2.405
CLAHE [16] ∗ 14.884 0.589 15.669 0.610 15.383 0.599 15.452 0.601 15.737 0.610 15.425 0.602 2.317
WVM [59] ∗ 14.488 0.665 15.803 0.699 15.117 0.678 15.863 0.693 16.469 0.704 15.548 0.688 2.415
LIME [3, 60] 11.154 0.591 11.828 0.610 11.517 0.607 12.638 0.628 13.613 0.653 12.150 0.618 2.432
HDR CNN [30] w/ RHT [34] 13.709 0.467 13.921 0.458 13.800 0.474 13.716 0.446 13.558 0.454 13.741 0.460 4.599
HDR CNN [30] w/ PS [58] 15.812 0.667 16.970 0.699 16.428 0.681 17.301 0.687 18.650 0.702 17.032 0.687 2.267
DPED (iPhone) [35] 15.134 0.609 16.505 0.636 15.907 0.622 16.571 0.627 17.251 0.649 16.274 0.629 2.903
DPED (BlackBerry) [35] 16.910 0.642 18.649 0.713 17.606 0.653 18.070 0.679 18.217 0.668 17.890 0.671 2.564
DPED (Sony) [35] 17.419 0.675 18.636 0.701 18.020 0.683 17.554 0.660 17.778 0.663 17.881 0.676 2.806
DPE (HDR) [9] 15.690 0.614 16.548 0.626 16.305 0.626 16.147 0.615 16.341 0.633 16.206 0.623 2.417
DPE (U-FiveK) [9] 16.240 0.653 16.805 0.646 16.837 0.671 16.762 0.654 16.707 0.650 16.670 0.655 2.606
DPE (S-FiveK) [9] 16.933 0.678 17.701 0.668 17.741 0.696 17.572 0.674 17.601 0.670 17.510 0.677 2.621
HQEC [4] ∗ 13.385 0.641 14.470 0.666 13.911 0.656 14.891 0.674 15.777 0.692 14.487 0.666 2.445
RetinexNet [5] 10.759 0.585 11.613 0.596 11.135 0.605 11.987 0.615 12.671 0.636 11.633 0.607 3.105
Deep UPE [7] 13.161 0.610 13.901 0.642 13.689 0.632 14.806 0.649 15.678 0.667 14.247 0.640 2.405
Our method w/o Ladv 19.158 0.746 20.096 0.734 20.205 0.769 18.975 0.719 18.983 0.727 19.483 0.739 2.251
Our method w/ Ladv 19.114 0.743 19.960 0.723 20.080 0.763 18.868 0.709 18.864 0.719 19.377 0.731 2.247

Table 1: Quantitative evaluation on our introduced test set. The best results
are highlighted with green and bold. The second- and third-best results are
highlighted in yellow and red, respectively. We compare each
method with properly exposed reference image sets rendered by five expert pho-
tographers [38]. For each method, we present peak signal-to-noise ratio (PSNR),
structural similarity index measure (SSIM) [54], and perceptual index (PI) [55].
We denote methods designed for underexposure correction in gray. Non-deep
learning methods are marked by ∗. The terms U and S stand for unsupervised
and supervised, respectively. Notice that higher PSNR and SSIM values are
better, while lower PI values indicate better perceptual quality.

Fig. 9: Qualitative results of correcting underexposed images. Shown are the
input images, results from Deep UPE [7], our results, and the corresponding
ground truth images.

5.2 Qualitative Results


We compare our method qualitatively with a variety of previous methods. Note
we show results using the model trained with the adversarial loss term, as it pro-
duces perceptually superior results (see the perceptual metric results in Tables
1 and 2).
Figs. 8 and 9 show our results on different overexposed and underexposed
images, respectively. As shown, our method provides compelling results for both
exposure errors. We also compare our method with the most recent method
for overexposure correction [26] in Fig. 10 – the source code of [26] is not yet
available; the result is taken directly from the original paper [26]. As shown, our
results are arguably visually superior to the other methods, even when input
images have hard backlight conditions, as shown in the second row in Fig. 9 and
the example in Fig. 12.

Fig. 10: Qualitative comparison with HDR CNN [30] and Zhang et al. [26].

Fig. 11: Qualitative comparison with Adobe Photoshop’s local adaptation HDR
function [58] and DPE [9]. Input images are taken from Flickr (by Floris van
Lint, CC BY-NC 2.0, and Justin Chiaratti, CC BY-NC-SA 2.0).

Fig. 12: Qualitative comparison with several existing methods (HE [14], WVM [59],
HDR CNN w/ PS [30], DPED [35], DPE [9], and HQEC [4]) in correcting partially
overexposed regions due to backlighting. Input image is from Flickr (by Stéphanie
Van Laethem, CC BY 2.0).

We also ran our model on several images from Flickr that are outside our
introduced dataset, as shown in Figs. 1, 11, and 12. As with the images from
our proposed dataset, our results on the Flickr images are arguably superior to
the compared methods.

Our method produces unsatisfactory results in regions that have insufficient
semantic information, as shown in Fig. 13. For example, the input image shown
in the first row in Fig. 13 is completely saturated and contains almost no details
in the region of the man’s face. We can see that our network cannot constrain
the color inside the face region due to the lack of semantic information. It also
can be observed that our method may introduce noise when the input image has
very dark regions, as shown in the second example in Fig. 13. These challenging
conditions prove difficult for other methods as well.

Image set NPE [24] ∗ LIME [3] ∗ WVM [59] ∗ RNet [5] KinD [6] EGAN [63] DBCP [64] Ours w/o Ladv Ours w/ Ladv
LIME [3] 3.91 4.16 3.79 4.42 3.72 3.72 3.78 3.76 3.76
NPE [24] 3.95 4.26 3.99 4.49 3.88 4.11 3.18 3.20 3.18
VV set [61] 2.52 2.49 2.85 2.60 - 2.58 - 2.28 2.28
DICM [62] 3.76 3.85 3.90 4.20 - - 3.57 2.55 2.50
Avg. 3.54 3.69 3.63 3.93 3.80 3.50 3.48 2.95 2.93

Table 2: Perceptual quality evaluation. Summary of NIQE scores [57] on different
low-light image datasets. Highlights are in the same format as Table 1.

Fig. 13: Failure examples of correcting (top) overexposed and (bottom) underexposed
images. Shown are the input images and the results of Photoshop HDR [58],
DPED [35], DPE [9], and our method. The input images are taken from Flickr
(by Dr. D., CC BY-NC-SA 2.0, and eviljohnius, CC BY 2.0).

5.3 Ablation Studies


This section presents details on the ablation studies that were performed to
validate the architecture and loss function.

5.3.1 Loss Function Our loss function in Eq. 1 includes three main terms.
The first term is the standard reconstruction loss (i.e., L1 loss). The second and
third terms consist of the pyramid and adversarial losses, respectively, which are
introduced to further improve the reconstruction and perceptual quality of the
output images. In the following part of this section, we discuss the effect of these
loss terms.

Pyramid Loss Impact In Fig. 5, we show the output of each sub-network when
we train our model with and without the pyramid loss. We observe that the pyra-
mid loss helps to provide additional supervision to guide each sub-network to
follow a coarse-to-fine reconstruction. In this ablation study, we aim to quanti-
tatively evaluate the effect of the pyramid loss on our final results.
We train two light-weight models of our main network with and without our
pyramid loss term. Each model has four 3-layer U-Nets with a total of ∼4M
learnable parameters, where the number of output channels of the first encoder
in each U-Net is set to 24.
The training is performed on a sub-set of our training data for ∼150,000
iterations on 80,000 128×128 patches, ∼100,000 iterations on 40,000 256×256
patches, and ∼25,000 iterations on 25,000 512×512 patches. Table 3 shows the
results on 500 randomly selected images from our validation set. The results

Fig. 14: Comparison of results by varying the number of Laplacian pyramid levels
(n = 1, 2, and 4). Notice that higher PSNR and SSIM values are better, while
lower PI values indicate better perceptual quality.

          Pyramid loss Lpyr         Number of levels n
          w/o        w/             n = 1      n = 2      n = 4
PSNR      18.041     18.385         16.984     17.442     18.385
SSIM      0.746      0.749          0.723      0.734      0.749

Table 3: Results of our ablation study on 500 images randomly selected from
our validation set. We show the effects of: (i) the pyramid loss, Lpyr , and (ii)
the number of levels, n, in the main network. The best PSNR/SSIM values are
indicated with bold for each experiment.

show that the pyramid loss not only helps in providing a better interpretation
of the task of each sub-network but also improves the final results.

Adversarial Loss Impact In Tables 1 and 2, we show quantitative results
of our method with and without the adversarial loss term. Our trained model
with the adversarial loss term achieves better perceptual quality (i.e., lower PI
values [55]) than training without the adversarial loss term.
Fig. 15 shows qualitative comparisons of our results with and without the
adversarial loss. As shown, the network trained without the adversarial training
tends to produce darker images and slightly unrealistic colors in some cases, while
the adversarial regularization improves the perceptual quality of our results.

5.3.2 Number of Laplacian Pyramid Levels We repeat the same experimental
setup described in Sec. 5.3.1 with a varying number of Laplacian pyramid
levels (sub-networks). Specifically, we train a network with n = 1 levels—this
network is equivalent to a vanilla U-Net-like architecture [47]. Additionally, we
train another network with n = 2 (i.e., two sub-networks).
For a fair comparison, we fix the total number of parameters in each model by
changing the number of filters in the conv layers. Specifically, we set the number
of output channels of the first layer in the encoder to 48 for the trained model
with n = 1, while we decrease it to 34 for the two-sub-net model (i.e., n = 2) to
have approximately the same number of learnable parameters. Thus, the trained
model in Sec. 5.3.1 to study the pyramid loss impact and the additional two
trained models have approximately the same number of parameters.

Fig. 15: Comparisons between our results with (w/) and without (w/o) the
adversarial loss for training. Notice that higher PSNR and SSIM values are better,
while lower PI values indicate better perceptual quality. (In the two examples
shown, adversarial training improves PSNR/SSIM/PI from 16.83/0.534/2.174 to
21.18/0.668/2.117 and from 21.09/0.755/1.798 to 23.30/0.829/1.773.)

Table 3 shows the results obtained by each model on the same random val-
idation image subset used to study the pyramid loss impact in Sec. 5.3.1. Fig.
14 shows a qualitative comparison. As can be seen, the best quantitative and
qualitative results are obtained using the four-sub-net model (i.e., n = 4 levels).

6 Concluding Remarks

We proposed a coarse-to-fine deep learning model for overexposed and underexposed
image correction. We employed the Laplacian pyramid decomposition to
process input images in different frequency bands. Our method is designed to se-
quentially correct each of the Laplacian pyramid levels in a multi-scale manner,
starting with the global color in the image and progressively addressing the im-
age details. Our method is enabled by generating a large dataset of over 24,000
images rendered with different exposure errors. Each image in our introduced
dataset has a reference image properly rendered by a well-trained photographer
with appropriate exposure compensation. Through extensive evaluation, we showed that
our method produces compelling results compared to available solutions for
correcting images rendered with exposure errors. We believe that our dataset will
help future work on improving exposure correction for photographs.

References
1. Peterson, B.: Understanding exposure: How to shoot great photographs with any
camera. AmPhoto Books (2016)
2. Karaimer, H.C., Brown, M.S.: A software platform for manipulating the camera
imaging pipeline. In: ECCV. (2016)
3. Guo, X., Li, Y., Ling, H.: LIME: Low-light image enhancement via illumination
map estimation. IEEE Transactions on Image Processing 26(2) (2017) 982–993
4. Zhang, Q., Yuan, G., Xiao, C., Zhu, L., Zheng, W.S.: High-quality exposure cor-
rection of underexposed photos. In: ACM MM. (2018)
5. Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light
enhancement. In: BMVC. (2018)
6. Zhang, Y., Zhang, J., Guo, X.: Kindling the darkness: A practical low-light image
enhancer. In: ACM International Conference on Multimedia. (2019)
7. Wang, R., Zhang, Q., Fu, C.W., Shen, X., Zheng, W.S., Jia, J.: Underexposed
photo enhancement using deep illumination estimation. In: CVPR. (2019)
8. Gharbi, M., Chen, J., Barron, J.T., Hasinoff, S.W., Durand, F.: Deep bilateral
learning for real-time image enhancement. ACM Transactions on Graphics (TOG)
36(4) (2017) 118:1–118:12
9. Chen, Y.S., Wang, Y.C., Kao, M.H., Chuang, Y.Y.: Deep photo enhancer: Unpaired
learning for image enhancement from photographs with GANs. In: CVPR. (2018)
10. Chen, C., Chen, Q., Xu, J., Koltun, V.: Learning to see in the dark. In: CVPR.
(2018)
11. Hu, Y., He, H., Xu, C., Wang, B., Lin, S.: Exposure: A white-box photo post-
processing framework. ACM Transactions on Graphics (TOG) 37(2) (2018) 26:1–
26:17
12. Hasinoff, S.W., Sharlet, D., Geiss, R., Adams, A., Barron, J.T., Kainz, F., Chen,
J., Levoy, M.: Burst photography for high dynamic range and low-light imaging
on mobile cameras. ACM Transactions on Graphics (TOG) 35(6) (2016) 1–12
13. Liba, O., Murthy, K., Tsai, Y.T., Brooks, T., Xue, T., Karnad, N., He, Q., Barron,
J.T., Sharlet, D., Geiss, R., Hasinoff, S.W., Pritch, Y., Levoy, M.: Handheld mobile
photography in very low light. ACM Transactions on Graphics (TOG) 38(6) (2019)
1–16
14. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Addison-Wesley Longman
Publishing Co., Inc. (2001)
15. Pizer, S.M., Amburn, E.P., Austin, J.D., Cromartie, R., Geselowitz, A., Greer,
T., ter Haar Romeny, B., Zimmerman, J.B., Zuiderveld, K.: Adaptive histogram
equalization and its variations. Computer Vision, Graphics, and Image Processing
39(3) (1987) 355–368
16. Zuiderveld, K.: Contrast limited adaptive histogram equalization. In: Graphics
Gems IV. (1994) 474–485
17. Celik, T., Tjahjadi, T.: Contextual and variational contrast enhancement. IEEE
Transactions on Image Processing 20(12) (2011) 3431–3441
18. Lee, C., Lee, C., Kim, C.S.: Contrast enhancement based on layered difference
representation of 2D histograms. IEEE Transactions on Image Processing 22(12)
(2013) 5372–5384
19. Yuan, L., Sun, J.: Automatic exposure correction of consumer photographs. In:
ECCV. (2012)
20. Yu, R., Liu, W., Zhang, Y., Qu, Z., Zhao, D., Zhang, B.: DeepExposure: Learning
to expose photos with asynchronously reinforced adversarial learning. In: NeurIPS.
(2018)
21. Park, J., Lee, J.Y., Yoo, D., So Kweon, I.: Distort-and-recover: Color enhancement
using deep reinforcement learning. In: CVPR. (2018)
22. Land, E.H.: The retinex theory of color vision. Scientific American 237(6) (1977)
108–129
23. Jobson, D.J., Rahman, Z., Woodell, G.A.: A multiscale retinex for bridging the
gap between color images and the human observation of scenes. IEEE Transactions
on Image Processing 6(7) (1997) 965–976
24. Wang, S., Zheng, J., Hu, H.M., Li, B.: Naturalness preserved enhancement algo-
rithm for non-uniform illumination images. IEEE Transactions on Image Process-
ing 22(9) (2013) 3538–3548
25. Meylan, L., Susstrunk, S.: High dynamic range image rendering with a retinex-
based adaptive filter. IEEE Transactions on Image Processing 15(9) (2006) 2820–
2830
26. Zhang, Q., Nie, Y., Zheng, W.S.: Dual illumination estimation for robust exposure
correction. In: Computer Graphics Forum. (2019)
27. Mertens, T., Kautz, J., Van Reeth, F.: Exposure fusion: A simple and practical
alternative to high dynamic range photography. In: Computer Graphics Forum.
(2009)
28. Kalantari, N.K., Ramamoorthi, R.: Deep high dynamic range imaging of dynamic
scenes. ACM Transactions on Graphics (TOG) 36(4) (2017) 144–1
29. Endo, Y., Kanamori, Y., Mitani, J.: Deep reverse tone mapping. ACM Transactions
on Graphics (TOG) 36(6) (2017) 177:1–177:10
30. Eilertsen, G., Kronander, J., Denes, G., Mantiuk, R., Unger, J.: HDR image
reconstruction from a single exposure using deep CNNs. ACM Transactions on
Graphics (TOG) 36(6) (2017) 178:1–178:15
31. Moriwaki, K., Yoshihashi, R., Kawakami, R., You, S., Naemura, T.: Hybrid loss for
learning single-image-based HDR reconstruction. arXiv preprint arXiv:1812.07134
(2018)
32. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from
photographs. In: ACM SIGGRAPH. (1997)
33. Cai, J., Gu, S., Zhang, L.: Learning a deep single image contrast enhancer from
multi-exposure images. IEEE Transactions on Image Processing 27(4) (2018)
2049–2062
34. Yang, X., Xu, K., Song, Y., Zhang, Q., Wei, X., Lau, R.W.: Image correction via
deep reciprocating HDR transformation. In: CVPR. (2018)
35. Ignatov, A., Kobyshev, N., Timofte, R., Vanhoey, K., Van Gool, L.: DSLR-quality
photos on mobile devices with deep convolutional networks. In: ICCV. (2017)
36. Ignatov, A., Kobyshev, N., Timofte, R., Vanhoey, K., Van Gool, L.: WESPE:
Weakly supervised photo enhancer for digital cameras. In: CVPR Workshops.
(2018)
37. Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine
Learning Research 9 (2008) 2579–2605
38. Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global
tonal adjustment with a database of input / output image pairs. In: CVPR. (2011)
39. Adobe: Color and camera raw. https://helpx.adobe.com/ca/photoshop-elements/using/color-camera-raw.html Accessed: 2020-03-05.
40. Schewe, J., Fraser, B.: Real World Camera Raw with Adobe Photoshop CS5.
Pearson Education (2010)
41. Afifi, M., Price, B., Cohen, S., Brown, M.S.: When color constancy goes wrong:
Correcting improperly white-balanced images. In: CVPR. (2019)
42. Burt, P., Adelson, E.: The Laplacian pyramid as a compact image code. IEEE
Transactions on Communications 31(4) (1983) 532–540
43. Denton, E.L., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models
using a Laplacian pyramid of adversarial networks. In: NeurIPS. (2015)
44. Shaham, T.R., Dekel, T., Michaeli, T.: SinGAN: Learning a generative model from
a single natural image. In: ICCV. (2019)
45. Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep Laplacian pyramid networks
for fast and accurate super-resolution. In: CVPR. (2017)
46. Ma, R., Hu, H., Xing, S., Li, Z.: Efficient and fast real-world noisy image denoising
by combining pyramid neural network and two-pathway unscented Kalman filter.
IEEE Transactions on Image Processing 29(1) (2020) 3927–3940
47. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomed-
ical image segmentation. In: MCCAI. (2015)
48. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., Bengio, Y.: Generative adversarial nets. In: NeurIPS. (2014)
49. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-
level performance on ImageNet classification. In: ICCV. (2015)
50. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
51. Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided
person image generation. In: NeurIPS. (2017)
52. Chen, J., Adams, A., Wadhwa, N., Hasinoff, S.W.: Bilateral guided upsampling.
ACM Transactions on Graphics (TOG) 35(6) (2016) 1–8
53. Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S.:
Halide: A language and compiler for optimizing parallelism, locality, and recom-
putation in image processing pipelines. In: ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation. (2013)
54. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assess-
ment: From error visibility to structural similarity. IEEE Transactions on Image
Processing 13(4) (2004) 600–612
55. Blau, Y., Mechrez, R., Timofte, R., Michaeli, T., Zelnik-Manor, L.: The 2018 PIRM
challenge on perceptual image super-resolution. In: ECCV Workshops. (2018)
56. Ma, C., Yang, C.Y., Yang, X., Yang, M.H.: Learning a no-reference quality metric
for single-image super-resolution. Computer Vision and Image Understanding 158
(2017) 1–16
57. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image
quality analyzer. IEEE Signal Processing Letters 20(3) (2012) 209–212
58. Dayley, L.D., Dayley, B.: Photoshop CS5 Bible. John Wiley & Sons (2010)
59. Fu, X., Zeng, D., Huang, Y., Zhang, X.P., Ding, X.: A weighted variational model
for simultaneous reflectance and illumination estimation. In: CVPR. (2016)
60. Guo, X.: LIME: A method for low-light image enhancement. In: ACM MM. (2016)
61. Vonikakis, V.: Busting image enhancement and tone-mapping algorithms.
https://sites.google.com/site/vonikakis/datasets Accessed: 2020-03-05.
62. Lee, C., Lee, C., Kim, C.S.: Contrast enhancement based on layered difference
representation. In: ICIP. (2012)
63. Jiang, Y., Gong, X., Liu, D., Cheng, Y., Fang, C., Shen, X., Yang, J., Zhou, P.,
Wang, Z.: EnlightenGAN: Deep light enhancement without paired supervision.
arXiv preprint arXiv:1906.06972 (2019)
64. Lee, H., Sohn, K., Min, D.: Unsupervised low-light image enhancement using bright
channel prior. IEEE Signal Processing Letters 27 (2020) 251–255
