
CGI2018 manuscript No.

(will be inserted by the editor)

Point Based Rendering Enhancement via Deep Learning


Giang Bui · Truc Le · Brittany Morago · Ye Duan

December 16, 2018

Giang Bui, Truc Le, Ye Duan
University of Missouri, Columbia

Brittany Morago
University of North Carolina Wilmington

Abstract  Current state-of-the-art point rendering techniques such as splat rendering generally require very high resolution point clouds in order to create high quality photo realistic renderings. These can be very time consuming to acquire and oftentimes also require high-end expensive scanners. This paper proposes a novel deep learning based approach that can generate high resolution photo realistic point renderings from low resolution point clouds. More specifically, we propose to use co-registered high quality photographs as the ground truth data to train the deep neural network for point based rendering. The proposed method can generate high quality point rendering images very efficiently and can be used for interactive navigation of large scale 3D scenes as well as image-based localization. Extensive quantitative evaluations on both synthetic and real datasets show that the proposed method outperforms state-of-the-art methods.

1 Introduction

Large scale 3D scene modeling and visualization are fundamental building blocks for active research fields such as Virtual Reality, Augmented Reality, Autonomous Driving, etc. Despite decades of research in this area, how to model the real world efficiently and accurately remains a very challenging task.

The advancement of 3D depth sensors, such as LIDAR (Light Detection And Ranging) scanners, has provided an effective alternative to the traditional CAD-based and image-based approaches for 3D modeling. When rendering 3D reconstructions directly from an acquired point cloud, the discrete resolution leads to either holes in the reconstruction or, with more advanced hole-filling point-based rendering methods, to a low-frequency appearance, as no detail information is available between the point samples. One existing way to circumvent this problem is texture splatting, but that technique requires access to high-resolution camera images taken alongside the (coarser) 3D scan.

In this paper, we propose an alternative approach: the adaptation of a CNN (Convolutional Neural Network) borrowed from existing work on image super-resolution, to take a coarse point-based rendering and up-sample it to a plausible high-resolution image. We demonstrate that the proposed method can generate high quality point rendering images very efficiently and can be used for interactive navigation of large scale 3D scenes as well as image-based localization with significantly reduced overhead.

The rest of the paper is structured as follows. Section 2 discusses the related work. Section 3 describes the proposed method. The experimental results are presented in Section 4, followed by our conclusion in Section 5.

2 Related Works

Point-based splat rendering is a technique for rendering a smooth surface with approximated piece-wise linear splats. In order to cover the gaps between the points, a circular disc is assigned to each sample point, with a normal vector ni and radius ri computed from the local geometry. The surface splats serve as a linear approximation to a smooth surface if constructed properly. Zwicker et al. [52] proposed a signal-theoretically motivated approach to reconstruct smooth and correctly band-limited images from splats. Botsch et al. introduced a multi-pass rendering approach which averages the colors and normals of overlapping splats [2]. However, those approaches may generate blurred images due to gradient suppression. Sibbing et al. [38] proposed two post-processing methods to further improve the results of splat rendering: Intensity Completion preserves the intensities of the non-gap pixels and automatically finds color transitions between the non-gap and gap pixels, while Gradient Completion requires additional gradient information from intermediate images to reconstruct the image more faithfully. Moreover, when additional texture images are available along with projection matrices, they also proposed Texture Splat Rendering, which improves the rendered image by propagating as much information as possible from a set of texture images. That is to say, instead of using a single color for a whole splat, they use a projection matrix to project the fragments of the splat into a texture image and obtain the color information. However, obtaining the projection matrices of texture images requires non-trivial 2D-3D registration, and storing all the texture images together with a large point cloud requires a lot of memory. Similar to the Intensity Completion and Gradient Completion of [38], we also propose to conduct post-processing image enhancement based on the 3-pass splat rendering results of [2]. Compared with the Intensity Completion and Gradient Completion methods, our results are much better. Compared with Texture Splat Rendering, our method does not need to store extra texture images while still generating better image quality. Fig. 1 shows a side-by-side comparison between the proposed method and the photograph, 3-pass rendering [2], Intensity Completion, Gradient Completion, and Texture Splat Rendering. More examples are shown in Figs. 4 and 5.

Image Super-Resolution (SR) is a set of methods to reconstruct a high-resolution image from either a single image or multiple images. Image super-resolution methods can be classified into 3 categories: interpolation-based methods, reconstruction-based methods, and example-based methods. According to [48], example-based methods show superior performance by modeling the mapping from low-resolution (LR) to high-resolution (HR) image patches. This mapping is then applied to a new LR image to obtain the most likely HR output [6, 4, 51]. These methods either exploit internal similarities of the same image [5, 10, 11, 15] or learn mapping functions from external training image pairs [51, 47, 6, 17, 21, 36, 42, 43, 49]. Glasner et al. [11] proposed a method to combine example-based SR constraints and classical SR constraints in a unified framework which allows for inter-patch search; the parent of the search result is copied to the appropriate location in the high-resolution image. Freedman et al. [10] proposed a method that adopts a local search using a multi-step coarse-to-fine algorithm. Since the patches extracted from multiple scaled images may not always be sufficiently expressive to cover the textural appearance variations, Huang et al. [15] extended the self-similarity based SR method by allowing geometric variations in the patch searching scheme. The sparse-coding-based methods [51, 47] are the representative external example-based SR methods. Yang et al. [51] proposed a super-resolution algorithm that learns a pair of over-complete dictionaries under the assumption that both the input and output patches have the same mixing coefficients over their corresponding dictionaries.

Besides traditional super-resolution methods [6, 4, 50, 51, 47, 10, 11], convolutional neural networks (CNNs) have recently demonstrated remarkable performance in the image SR field, thanks to their ability to exploit contextual information through their receptive field to recover missing high frequency components. Inspired by traditional sparse-coding based SR [51], Dong et al. [8] proposed a shallow SRCNN consisting of three layers which correspond to patch representation, non-linear mapping, and reconstruction. Kim et al. [20] proposed a deeply recursive CNN that utilizes a very large image context compared to previous SR methods with a single recursive layer. Kim et al. [19] proposed a residual network structure with 16 layers built by concatenating 64 3 × 3 kernels. These SR methods work directly in the high-resolution (HR) space by first applying a simple interpolation method (e.g. bicubic interpolation) to up-sample the LR image to the desired size, then feeding it through a deep neural network to obtain a visually satisfying result. While most deep learning based single image super-resolution (SISR) methods work directly in the HR space, other methods operate in the LR space and only go back to the HR space in the very last layers. Shi et al. [37] presented the first CNN in which the extracted feature maps are computed in the LR space and an efficient sub-pixel convolution is used to upscale the final LR feature maps to the HR image. Improving over SRCNN, Dong et al. [9] introduced a fast version of SRCNN, named FSRCNN, which can reach up to 43 fps on a generic GPU. Ledig et al. [24] proposed SRResNet, employing the ResNet architecture of He et al. [14], which successfully solves the time and memory issues with good performance. By removing the batch normalization layers from SRResNet, Lim et al. [27] showed superior performance over the state-of-the-art methods and won the NTIRE 2017 Super-Resolution Challenge [41].

Recently, there has been a lot of interest in Generative Adversarial Networks (GANs) [12], which can learn to produce samples that imitate the dataset according to the discriminator network. Mathieu et al. [31] supplemented a squared error loss with both a GAN loss and an image gradient-based similarity to improve the image sharpness of video prediction. The ability of GANs to produce high-quality images is also demonstrated in the works of Denton et al. [7] and Radford et al. [34]. In the work of Johnson et al. [18] and Ledig et al. [24], the authors proposed a perceptual loss function based on the VGG network [39] and obtained results that are more convincing than the ones obtained with a traditional low level mean square error (MSE). In this paper, we employ the GAN network for point based rendering enhancement, which will be explained in more detail in the following section.

3 Proposed Method

Fig. 2 shows the overview of our proposed method. The algorithm takes a splat image as input and feeds it through a neural network to obtain a high quality image. As shown in Fig. 1, the deep learning result is almost indistinguishable from the photograph taken by a camera. Besides filling holes, it generates more natural results, whereas splat rendering suffers from a blurring effect and intensity completion and gradient completion cannot fill the big holes due to the lack of constraints.

3.1 Input data

There are several ways to obtain a color point cloud of a 3D reconstructed scene. A traditional approach is to use SfM (Structure from Motion) [3, 13, 40] to generate a point cloud from a collection of densely sampled images. The generated point cloud can be used for many tasks such as image localization [16, 26], classification and segmentation [33, 35, 45, 29]. Another common approach is to use LIDAR laser scanners that produce highly accurate coordinate measurements. The RGB color information can come from imagery collected at the time of the LIDAR survey. By placing a laser scanner at several sample locations, a large-scale 3D color point cloud can be obtained with relative ease. In this paper, we focus on LIDAR point clouds.

3.2 Splat rendering

Point-based splat rendering has been proven to be a flexible and efficient technique for rendering 3D objects due to its simplicity. The key idea is to approximate a surface by piece-wise planar ellipses, or splats, in the object space and render them to the image space. This technique is widely used in the literature and typically needs an associated normal, a color, and a radius ri for each point pi to render the points as small discs. That information can be obtained from the neighbors around each point. There are two common approaches to obtain this neighborhood information, radius search and k-nearest neighbors search (KNN); both can be retrieved efficiently using a k-d tree. However, radius search is not suitable for LIDAR data due to its non-uniform distribution: points tend to be very dense on surfaces close to the scanning location and become sparse on surfaces further away. In contrast to radius search, KNN is a naturally adaptive method better suited for this situation. Once the neighborhood information is obtained, the point cloud's normals can be approximated by Principal Component Analysis (PCA), which selects the eigenvector corresponding to the smallest eigenvalue of the covariance matrix [23].

Next, we estimate a radius for each point. Since the point normal is assigned to that of the plane it belongs to, we still have to determine the radius r so that we can render a gap-free surface. We do not want it to be too big, as this would require many overlapping regions to be rendered, which is computationally inefficient. We propose a heuristic approach similar to [38]. First, for each sample point pi with corresponding normal ni, we determine the k nearest neighbors using KNN and define Ri as the furthest distance from a neighbor to the sample point; the sphere at pi with radius Ri therefore covers all k nearest neighbors. In the second step, we form a circle at pi with radius Ri and normal ni and project onto it all neighbors whose normals are consistent with that of the sample point (the angle deviation is less than 5°). Lastly, we divide the circle of projected points into 12 sectors and, for each non-empty sector, we only keep the closest projection point. The radius assigned to each point guarantees that any point is covered by at least one splat of the other points. Since the point distribution is quite uniform, scaling that radius by a smaller factor reduces the blurring effect while still guaranteeing that the gaps between points are covered. For that reason, we set the splat radius ri to 0.7 times the maximum distance of those kept projection points. Although the computed splat radius depends on how many nearest neighbors are considered, we have found that it is not sensitive to the parameter k. Empirically we set k = 30 for all models in the dataset.
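For illustration, the following is a minimal NumPy/SciPy sketch of this heuristic (PCA normals plus the sector-based radius estimate). It is not the implementation used in the paper; the function name and the use of scipy.spatial.cKDTree are our own choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals_and_radii(points, k=30, scale=0.7, angle_deg=5.0, n_sectors=12):
    """Per-point PCA normals and splat radii (sketch of the heuristic in Sec. 3.2)."""
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1)      # nearest neighbor is the point itself
    dists, idx = dists[:, 1:], idx[:, 1:]

    # PCA normal: direction of the smallest eigenvalue of the local covariance.
    normals = np.empty_like(points)
    for i, nbrs in enumerate(idx):
        nb = points[nbrs] - points[nbrs].mean(axis=0)
        _, _, vt = np.linalg.svd(nb, full_matrices=False)
        normals[i] = vt[-1]

    cos_thresh = np.cos(np.radians(angle_deg))
    radii = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        n = normals[i]
        # keep neighbors whose normals deviate by less than angle_deg (sign-agnostic)
        keep = np.abs(normals[nbrs] @ n) > cos_thresh
        offs = points[nbrs[keep]] - points[i]
        proj = offs - np.outer(offs @ n, n)       # project neighbors onto the splat plane
        if len(proj) == 0:
            radii[i] = scale * dists[i].max()     # fall back to R_i
            continue
        # local 2D frame in the plane, then bin the projections into sectors
        u = np.cross(n, [0.0, 0.0, 1.0])
        if np.linalg.norm(u) < 1e-6:
            u = np.cross(n, [0.0, 1.0, 0.0])
        u /= np.linalg.norm(u)
        v = np.cross(n, u)
        ang = np.arctan2(proj @ v, proj @ u)
        sector = ((ang + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
        r = np.linalg.norm(proj, axis=1)
        closest = [r[sector == s].min() for s in range(n_sectors) if np.any(sector == s)]
        radii[i] = scale * max(closest)           # 0.7 x max distance of the kept points
    return normals, radii
```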

Since each splat has a solid color (determined from its center point), sharp edges may exist between splats. Following [38], we implement a 3-pass rendering technique to blend the colors and normals of neighboring splats. In the first pass, called visibility splatting, the splats are rendered without color information in order to fill the depth buffer. In the second, blending pass, after the object is slightly shifted towards the viewer by ε (0.05), a simple depth test is used to remove all splats that are far behind the visible surfaces; for each virtual image pixel, this pass sums up the colors and normals of the splats that lie in the proximity of the visible surface and perspectively project onto the respective pixel. Finally, the normalization pass normalizes the summed-up colors and normals by dividing them by the accumulated weights. Although the 3-pass rendering technique can give visually pleasant rendering images, it smooths out the gradient information and hence blurs the image due to the use of blending. In what follows, we describe the use of a deep-learning based approach as a post-processing step to alleviate these unwanted effects.
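In practice the three passes run on the GPU, but the blending logic can be summarized on the CPU as below. This is an illustrative sketch only, assuming the rasterized splat fragments are already given as flat (pixel index, depth, weight, color) arrays; it is not the shader code used in the paper.

```python
import numpy as np

def blend_splat_fragments(pix, depth, weight, color, n_pixels, eps=0.05):
    """Sketch of the blending + normalization passes of 3-pass splat rendering.

    pix    : (F,) int   pixel index of each splat fragment
    depth  : (F,) float fragment depth
    weight : (F,) float blending weight of the fragment inside its splat
    color  : (F,3)      fragment color
    """
    # Pass 1 (visibility splatting): nearest depth per pixel, no color.
    z_buffer = np.full(n_pixels, np.inf)
    np.minimum.at(z_buffer, pix, depth)

    # Pass 2 (blending): accumulate fragments within eps of the visible surface.
    visible = depth <= z_buffer[pix] + eps
    color_acc = np.zeros((n_pixels, 3))
    weight_acc = np.zeros(n_pixels)
    np.add.at(color_acc, pix[visible], weight[visible, None] * color[visible])
    np.add.at(weight_acc, pix[visible], weight[visible])

    # Pass 3 (normalization): divide summed colors by the accumulated weights.
    covered = weight_acc > 0
    color_acc[covered] /= weight_acc[covered, None]
    return color_acc, covered
```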

Fig. 1: From left to right, different rendering techniques: (1) Photograph. (2) Splat rendering. (3) Intensity completion. (4)
Gradient completion. (5) Texture splat rendering. (6) (Ours) Deep-Learning based rendering with splat rendering of (2) as
input.

Fig. 2: Proposed method pipeline.

3.3 Deep-Learning based Super-Resolution with Generative Adversarial Network

Following Goodfellow et al. [12], in the context of generative adversarial networks, we solve the adversarial min-max problem:

\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})} \left[ \log D_{\theta_D}(I^{HR}) \right] + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})} \left[ \log\left(1 - D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big)\right) \right]   (1)

where G_{\theta_G} and D_{\theta_D} are the generator and discriminator networks, parameterized by \theta_G and \theta_D respectively. In our problem, the generator network G is the network that predicts a HR image given a LR image. The general idea behind this formulation is that it allows one to train a generative model G with the goal of fooling a differentiable discriminator D that is trained to distinguish super-resolved images from real images. That is to say, with this approach our generator can learn to predict outputs that are highly similar to real images and thus difficult to classify by D.
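Equation (1) is optimized in the usual way by alternating gradient steps on D and G. The TensorFlow sketch below shows one such alternating step; the networks, optimizers, and the assumption that D ends with a sigmoid are placeholders for illustration, not the specific architecture described in Section 3.3.1.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()   # assumes D outputs probabilities (sigmoid)

@tf.function
def adversarial_step(G, D, g_opt, d_opt, lr_batch, hr_batch):
    """One alternating update of the min-max problem in Eq. (1)."""
    # Discriminator step: maximize log D(I_HR) + log(1 - D(G(I_LR))).
    with tf.GradientTape() as tape:
        fake = G(lr_batch, training=True)
        d_real = D(hr_batch, training=True)
        d_fake = D(fake, training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))

    # Generator step: fool D, i.e. push D(G(I_LR)) towards 1.
    with tf.GradientTape() as tape:
        fake = G(lr_batch, training=True)
        d_fake = D(fake, training=True)
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```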

3.3.1 Generator network structure

Our network, named deep network for super-resolution with deeply supervised nets (SRDSN), is illustrated in Fig. 3 (top row). The network takes a splat rendering of the LR image as input and progressively predicts intermediate HR images. We supervise all the intermediate results to alleviate the effect of vanishing/exploding gradients. Our network contains two parts: (i) a cascade network and (ii) a deeply supervised network.

Fig. 3: SRDSN-GAN. (Top) Generator network structure. The network consists of multiple blocks with deep supervision; every intermediate result is penalized by a Euclidean loss function l. The layout of each network block shows the number of feature maps followed by the kernel size of each convolutional layer (e.g. 64@3x3 stands for 64 3x3 kernels). Our final network consists of four blocks (M = 4) and 20 layers (d = 20) within each block. (Bottom) Discriminator network structure with the corresponding number of feature maps (n) and stride (s). The network takes a set of natural images to measure the similarity of the generator output image and the ground truth image. For the SRDSN-GAN network, the training loss (Eq. 5) is computed by both the generator network and the discriminator network.

Cascade network: The network has multiple blocks of similar structure. Each block, consisting of d convolutional layers each followed by a PReLU (Parametric ReLU) layer except for the last one, takes an image (1 or 3 channels) and produces an intermediate HR image. This intermediate result is the input to the next block and is supervised by the deeply supervised network.

Deeply supervised network: We supervise all the outputs of the network blocks using a deeply supervised nets (DSN) structure [25]. The DSN can be considered as a network regularization that informs the intermediate layers about the final objective, rather than relying on the final layer to back-propagate the information to its predecessors. A similar idea of supervising the intermediate layers of a convolutional neural network can be found in [46] and [20].

Generator output: Denote by x the input low-resolution image and by ŷ_m, m = 1, 2, ..., M, the M predicted intermediate outputs. The output of the generator network is averaged over all the intermediate outputs, ŷ = Σ_{m=1}^{M} α_m ŷ_m, where α_m indicates the relative importance of the m-th intermediate output; e.g. setting α_m = 0 for all m < M turns the model into a predictor with only a single output at the top.
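The sketch below illustrates this cascade-plus-deep-supervision structure: M blocks of d convolutions with PReLU, each block emitting an intermediate HR prediction that is both supervised and passed to the next block, with the final output taken as the weighted average of the intermediates. It follows the 64@3x3 layout of Fig. 3 but is an assumed re-implementation, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def srdsn_block(x, d=20, filters=64, channels=3):
    """One SRDSN block: d conv layers (PReLU after all but the last) -> intermediate HR image."""
    for _ in range(d - 1):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.PReLU(shared_axes=[1, 2])(x)
    return layers.Conv2D(channels, 3, padding="same")(x)

def build_srdsn(M=4, d=20, channels=3, alphas=None):
    """Deeply supervised generator: every intermediate output is returned for its own loss."""
    inp = layers.Input(shape=(None, None, channels))      # splat rendering of the LR image
    x, intermediates = inp, []
    for _ in range(M):
        x = srdsn_block(x, d=d, channels=channels)        # intermediate HR prediction
        intermediates.append(x)                           # supervised by a Euclidean loss
    alphas = alphas or [1.0 / M] * M                      # relative importance of each output
    y_hat = layers.Lambda(
        lambda ys: tf.add_n([a * y for a, y in zip(alphas, ys)]))(intermediates)
    return tf.keras.Model(inp, intermediates + [y_hat])
```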

3.3.2 Adversarial Network Architecture

For the discriminator network, we modify the network proposed by Ledig et al. [24] to simplify the architecture. We remove the batch normalization layers from the network, as Lim et al. [27] presented in their work; it is better to remove them since batch normalization layers get rid of range flexibility by normalizing the features. We also replace the LeakyReLU layers by PReLU layers to allow the network to adaptively learn the coefficients of the negative parts. The network consists of 8 convolution layers, with feature depth increasing by factors of 2 and feature resolution decreasing each time the feature depth is doubled. The discriminator network (shown in Fig. 3, bottom row) is trained to solve the min-max optimization problem described in Equation 1.
3.3.3 Loss function

Given training image pairs \{(x_n, y_n)\}_{n=1}^{N}, our goal is to optimize both the generator network G_{\theta_G} and the discriminator network D_{\theta_D} simultaneously. Similar to [24], our loss function consists of a mean square error (MSE) loss (Eq. 2), a VGG [39] loss (Eq. 3), and an adversarial loss component (Eq. 4).

l_{MSE}(\theta_G) = \frac{1}{N} \sum_{n=1}^{N} \left\| G_{\theta_G}(x_n) - y_n \right\|^2   (2)

l_{VGG/i,j}(\theta_G) = \frac{1}{N} \sum_{n=1}^{N} \left\| vgg_{i,j}(y_n) - vgg_{i,j}\big(G_{\theta_G}(x_n)\big) \right\|^2   (3)

where vgg_{i,j}(\cdot) is the feature map obtained by the j-th convolution before the i-th max-pooling layer within the VGG network.

l_{Gen}(\theta_D) = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}(x_n)\big)   (4)

The final loss function to be minimized is

\mathcal{L}(\theta_G, \theta_D) = \gamma_1 l_{MSE}(\theta_G) + \gamma_2 l_{VGG/i,j}(\theta_G) + \gamma_3 l_{Gen}(\theta_D)   (5)

where \gamma_i (i = 1, 2, 3) are given weighting parameters.

It is worth mentioning that l_{MSE} alone can give satisfactory results, as demonstrated in other deep learning super-resolution approaches [9, 8, 19, 20]. However, it results in a lack of detail around edge regions. By incorporating the perceptual loss l_{VGG} and the adversarial loss l_{Gen} with appropriate scale factors, the final loss gives more perceptual and natural results. Empirically, we choose γ1 = 1, γ2 = 1.0e-5 and γ3 = 1.0e-3.
γ2 = 1.0e − 5 and γ3 = 1.0e − 3. outer region.

4.2 Implementation details

Data samples: To prepare the training data, we sample 400K patches from the splat rendering images with a stride of 10 and crop the corresponding HR patches from the ground truth images. For the synthetic dataset, it is trivial to crop the HR patch corresponding to a LR patch since we control the camera parameters of the virtual views. However, we cannot do the same for the real datasets due to the numerical error involved when recovering the camera's intrinsic and extrinsic parameters from natural images. To overcome this issue, for each image, we recover the rotation and translation, and then generate the splat and texture splat images. Since the texture splat rendering has high quality, we can do dense-SIFT matching between the texture splat image and the natural image. We extract all the patches at the matched SIFT feature locations, with SIFT's orientations and scales, in both the splat images and the natural images to form image pairs for training.
Data augmentation: Inspired by [44], we augment the training data in two ways: (i) Rotation: rotate images by 90°, 180° and 270°; (ii) Flipping: flip images horizontally or vertically. Together these steps lead to an augmented training set that is 8 times larger than the original data set.
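A minimal version of this ×8 augmentation (four rotations, each optionally flipped, applied identically to the LR/HR pair) is sketched below; it is an illustration, not the paper's data pipeline.

```python
import numpy as np

def augment_pair(lr, hr):
    """Yield the 8 augmented versions of an (LR, HR) training pair: 4 rotations x 2 flips."""
    for k in range(4):                                # 0, 90, 180, 270 degree rotations
        lr_r, hr_r = np.rot90(lr, k), np.rot90(hr, k)
        yield lr_r, hr_r
        yield np.fliplr(lr_r), np.fliplr(hr_r)        # horizontal flip of the rotated pair
```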
Training details: We implement our network using the deep learning framework TensorFlow [1]. For simplicity, we denote our proposed SRDSN network as SRDSN(d, M), where d is the number of convolutional layers in a block and M is the number of blocks in the model. We first train the SRDSN with the MSE loss. Empirically, we choose the following hyper-parameters: batch size (64), patch size (48), and convolutional filters and biases randomly initialized as described in [9]. We adopt the adjustable gradient clipping of [19] to boost the convergence rate while suppressing exploding gradients. Specifically, the gradients are clipped to [-θ/γ, θ/γ], where γ is the current learning rate and θ = 0.01 is the gradient clipping parameter. Furthermore, we use the Adam optimizer [22] with an initial learning rate of 10^-4. Next, we train the SRDSN-GAN. We increase the patch size to 244 and reduce the batch size to 8 due to the GPU memory limitation. All the other parameters are kept the same as for the SRDSN. Training takes four days on a PC using an Nvidia TitanX GPU.
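The adjustable gradient clipping scales the clipping interval with the current learning rate, so early training (large γ) clips aggressively while later training barely clips at all. A hedged TensorFlow sketch of one optimization step, assuming a scalar learning rate and θ = 0.01 as above:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # initial learning rate 10^-4
THETA = 0.01                                               # gradient clipping parameter

def train_step(model, loss_fn, lr_batch, hr_batch):
    """One SRDSN step with adjustable gradient clipping to [-theta/gamma, theta/gamma]."""
    with tf.GradientTape() as tape:
        loss = loss_fn(model(lr_batch, training=True), hr_batch)
    grads = tape.gradient(loss, model.trainable_variables)
    gamma = optimizer.learning_rate                        # current learning rate
    bound = THETA / gamma
    grads = [tf.clip_by_value(g, -bound, bound) for g in grads]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```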
It is worth mentioning that in training, we use image patches (e.g. 256×256) for both the generator and discriminator networks. In testing, we feed an image of arbitrary size to the generator only. Thus, our method can work on any size of input image.
4.3 Comparison with state-of-the-art rendering techniques

For quantitative comparison, we use the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) indices. As mentioned in Section 4.1, it is trivial to compute those indices for the synthetic testing set. For the real data set, we follow the same methodology used for generating the real training data. To compute the PSNR and SSIM of a rendered image against its corresponding natural image, we perform dense-SIFT feature matching, extract the patches on both images at the matched locations, and compute their PSNR and SSIM respectively; the PSNR and SSIM of the two images are the averages of all patch-level values. Table 1 shows the PSNR and SSIM indices on the designed testing sets. For the synthetic testing set Set1, texture splat rendering obtains an average PSNR of 33.18 and an average SSIM of 0.96, SRDSN 33.35 and 0.97, whereas SRDSN-GAN reaches 33.37 and 0.97. For the real testing set Set2, although the texture splat image can render quite well in highly textured regions, it performs worse than SRDSN-GAN in terms of PSNR and SSIM (26.24 and 0.86) because it cannot fill the holes in the data. Both the SRDSN and the SRDSN-GAN can fill holes in the image since they learn the missing information from the training data. The average PSNR and SSIM of SRDSN-GAN on Set2 are 27.32 and 0.89, whereas those of SRDSN are 25.44 and 0.84. Some visual examples from the two testing sets are shown in Fig. 4 and Fig. 5.

4.4 Interactive 3D Scene Navigation

The proposed method can be used to support interactive navigation of large scale 3D scenes represented by colored point clouds. Since the generator network (SRDSN) is quite deep (up to 80 layers), it cannot render every frame in real time. We have designed a novel rendering pipeline to avoid this issue. Imagine that we are navigating a very large point cloud such as an urban scene. Instead of doing super-resolution for every frame, we only do super-resolution on several synthetic views which cover the whole scene; those views are selected visually during the interactive navigation. The high-resolution versions of those synthetic views are kept as texture images along with the Model, View, and Projection matrices (known as MVP). Using those matrices and texture images, we can perform texture splat rendering, which runs in real time on ordinary computers. We repeat this process every time the user enters a new scene. Single-frame super-resolution is very fast on modern GPUs, as shown in Table 2. Moreover, this process can be done off-line, transparently to the user. Thus, the cost of rendering is reduced to the cost of texture splat rendering. Table 3 shows more details of rendering times for different sizes of point clouds. An example of real-time 3D scene navigation can be seen in the video https://www.youtube.com/watch?v=94I4IV0i mc.

Fig. 4: Comparisons with other methods on synthetic testing set (Set1) for the three examples. From left to right and top
to bottom: OpenGL Point Rendering, Texture Mesh, Intensity Completion, Gradient Completion, Splat Rendering, Texture
Splat, SRDSN, and SRDSN-GAN.

Table 1: Quantitative comparison of the approaches on the testing sets.

Method                      |  Set1 PSNR  Set1 SSIM  |  Set2 PSNR  Set2 SSIM
Splat rendering             |    23.5       0.84     |    22.98      0.78
Intensity Completion [38]   |    24.78      0.89     |    22.72      0.78
Gradient Completion [38]    |    28.56      0.96     |    25.12      0.84
Texture splat rendering     |    33.18      0.96     |    26.24      0.86
SRDSN                       |    33.35      0.97     |    25.44      0.84
SRDSN-GAN                   |    33.37      0.97     |    27.32      0.89

Fig. 5: Comparisons with other methods on real testing set (Set2) for the three examples. From left to right and top to
bottom: OpenGL Point Rendering, Photograph, Intensity Completion, Gradient Completion, Splat Rendering, Texture Splat,
SRDSN, and SRDSN-GAN.

Table 2: Running time of SRDSN with different image sizes.

Image size         |  480 × 640  |  768 × 1024  |  1080 × 1920  |  2016 × 3940
Running time (ms)  |     259     |     780      |     1356      |     4324

Table 3: Rendering time (fps) of interactive 3D scene navigation with different resolutions.

No. of points  |  460 × 640  |  768 × 1024  |  1080 × 1920  |  2016 × 3940
500K           |     87      |      67      |      64       |      56
1000K          |     58      |      40      |      38       |      30
2000K          |     44      |      36      |      31       |      26
3000K          |     37      |      30      |      28       |      23
4000K          |     29      |      24      |      21       |      16

4.5 Image matching with synthetic views

In this section, we demonstrate the ability to extract SIFT features [30] for matching from the generated synthetic images. Since we have the mesh images of all the SketchUp models, we choose 5 mesh images for each model to match with the synthetic images. In total, we have 390 mesh images for the synthetic data. For the real LiDAR data, we chose 10 images taken with a variety of modern devices (Nikon D90, Nikon D3100, and cell phones) to form the real data.
Similar to [38], we measure the average number of SIFT features extracted from the images. In order to measure the repetitiveness and the distinctiveness of the extracted features, we use the SIFT ratio test: two SIFT features f1, f2, one in a mesh image and one in a synthetic image, are considered a match if their descriptors pass the ratio test ||des(f1) − des(f2)|| < 0.7 · ||des(f1) − des(f2')|| for all features f2' also extracted from the synthetic image. We match each query photo against all synthetic views rendered from the same scene. For each feature on a synthetic view, we back-project it to the 3D point cloud. The resulting 2D-3D correspondences are then used to estimate the pose of each query image. The Avg. Inliers columns of Table 4 give the number of SIFT features that are used to compute the camera pose. We notice that our proposed method outperforms the other methods on the synthetic data set in the number of extracted features, number of inliers, and number of repetitive features, whereas its performance on our real data set is slightly worse than texture splat rendering, as shown in Table 4. Recall that for the real data set we do not have a mesh model, thus we use the texture splat images as ground truth images.
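The ratio test above can be reproduced with any SIFT implementation; the OpenCV sketch below matches features from a mesh image to a synthetic view and keeps only the matches that pass Lowe's ratio test with threshold 0.7. It is an illustration with our own parameter names, not the evaluation code used for Table 4.

```python
import cv2

def ratio_test_matches(mesh_img, synth_img, ratio=0.7):
    """Match SIFT features from a mesh image to a synthetic view using the ratio test."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(mesh_img, None)
    kp2, des2 = sift.detectAndCompute(synth_img, None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = []
    for pair in matcher.knnMatch(des1, des2, k=2):    # best and second-best candidate
        if len(pair) < 2:
            continue
        m, m2 = pair
        # ||des(f1) - des(f2)|| < ratio * ||des(f1) - des(f2')||
        if m.distance < ratio * m2.distance:
            matches.append((kp1[m.queryIdx], kp2[m.trainIdx]))
    return matches
```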
4.6 Location recognition using synthetic views

Following [38], in this section we also demonstrate that the generated synthetic views can be used to recognize new images taken from viewpoints substantially different from the photos included in the scans. We collect 200 synthetic images from 3 synthetic models to form Set 1, and 50 images taken by camera on our campus (Set 2), together with their locations. For each query image, we extract SIFT features and match against all the synthetic views from the same scan. We assume that the query image has a tagged GPS location so that we can limit the search to one scan. Since the correspondences between the synthetic views and the point cloud are known, we can establish 2D-3D correspondences which are then used for pose estimation using the five-point algorithm [32] with RANSAC. Similar to [38], we consider a match to be an inlier to an estimated pose if the reprojection error is below 4 pixels, and regard a novel image as localized if we can find a pose with at least 12 inliers. Those meta parameters are chosen empirically.

Fig. 6 shows the result of the experiment. As we can see, all of the methods perform well on both datasets. Splat super-resolution obtains the highest recognition rates, which are 98% and 92% on Set 1 and Set 2 respectively. These are 16% and 20% improvements over the baseline splat rendering method.

5 Conclusion

In this paper, we have proposed a novel framework for point-based rendering that incorporates a deep learning based super-resolution technique into point rendering methods. The network takes a splat rendering image as input and generates the corresponding high-quality image. Our network, which is based on a deeply supervised stacked model with a generative adversarial network, has demonstrated superior performance in terms of accuracy compared to other methods.

Generating a photo realistic rendering of the real 3D world is a holy grail in computer graphics and remains a challenge even after several decades of research. Our proposed method is a small step in this direction. In the future we would like to build on the idea proposed in this paper, i.e. using realistic high quality natural images as the ground truth data to train the neural network to learn to generate rendering images indistinguishable from real images. Currently, the proposed method works in the 2D image space. We would like to extend the proposed network to work directly in 3D and to generate high resolution 3D point clouds using low resolution colored point clouds and single and/or multiple high quality images as input. That would be very useful for 3D scene modeling. Another direction we would like to pursue is to investigate shallower networks that can achieve similar results at much faster speed, so that the output can be generated in real time.

Table 4: Feature extraction evaluation.

Method                  |  Set1 Avg. SIFT  Avg. Repeatability  Avg. Inliers  |  Set2 Avg. SIFT  Avg. Repeatability  Avg. Inliers
Splat rendering         |      1224             28.26              10.62     |      2845             13.67               8.45
Intensity completion    |      1135             31.36              10.38     |      2754             12.93               9.34
Gradient completion     |      1345             43.40              19.41     |      3255             34.83              14.45
Texture splat           |      1456             61.91              21.35     |      3524             43.53              18.53
Splat super-resolution  |      1554             64.74              23.49     |      3552             47.91              19.45

Fig. 6: The percentage of query images that can be localized using synthetic views with different rendering techniques.

Acknowledgment

We would like to thank Sebastian Lipponer for providing the open source code [28] on which our splat rendering implementation is mainly based, and for all the suggestions during the implementation. We would like to thank Qing Lei and Xu Wang for helping us generate the video. We also thank Roger Kiew, Fan Gao and Chuhang Wang for helping us generate the training data.

Compliance with Ethical Standards

Conflict of Interest: Giang Bui declares that he has no conflict of interest. Truc Le declares that he has no conflict of interest. Brittany Morago declares that she has no conflict of interest. Ye Duan declares that he has no conflict of interest.

References

1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org
2. Botsch, M., Hornung, A., Zwicker, M., Kobbelt, L.: High-quality surface splatting on today's GPUs. In: Proceedings Eurographics/IEEE VGTC Symposium Point-Based Graphics, 2005, pp. 17–141. IEEE (2005)
3. Brown, M., Lowe, D.G.: Unsupervised 3D object recognition and reconstruction in unordered datasets. In: 3-D Digital Imaging and Modeling, 2005 (3DIM 2005), Fifth International Conference on, pp. 56–63. IEEE (2005)
4. Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding. In: Computer Vision and Pattern Recognition, 2004 (CVPR 2004), Proceedings of the 2004 IEEE Computer Society Conference on, vol. 1, pp. I–I. IEEE (2004)
5. Cui, Z., Chang, H., Shan, S., Zhong, B., Chen, X.: Deep network cascade for image super-resolution. In: European Conference on Computer Vision, pp. 49–64. Springer (2014)
6. Dai, D., Timofte, R., Van Gool, L.: Jointly optimized regressors for image super-resolution. In: Computer Graphics Forum, vol. 34, pp. 95–104. Wiley Online Library (2015)
7. Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Advances in Neural Information Processing Systems, pp. 1486–1494 (2015)
8. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 295–307 (2015)

9. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: European Conference on Computer Vision, pp. 391–407. Springer (2016)
10. Freedman, G., Fattal, R.: Image and video upscaling from local self-examples. ACM Transactions on Graphics (TOG) 30(2), 12 (2011)
11. Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 349–356. IEEE (2009)
12. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
13. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2003)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
15. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5197–5206. IEEE (2015)
16. Irschara, A., Zach, C., Frahm, J.M., Bischof, H.: From structure-from-motion point clouds to fast location recognition. In: Computer Vision and Pattern Recognition, 2009 (CVPR 2009), IEEE Conference on, pp. 2599–2606. IEEE (2009)
17. Jia, K., Wang, X., Tang, X.: Image transformation based on learning dictionaries across image spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(2), 367–380 (2013)
18. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision, pp. 694–711. Springer (2016)
19. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. arXiv preprint arXiv:1511.04587 (2015)
20. Kim, J., Lee, J.K., Lee, K.M.: Deeply-recursive convolutional network for image super-resolution. arXiv preprint arXiv:1511.04491 (2015)
21. Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(6), 1127–1133 (2010)
22. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
23. Kobbelt, L., Botsch, M.: A survey of point-based techniques in computer graphics. Computers & Graphics 28(6), 801–814 (2004)
24. Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802 (2016)
25. Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: Artificial Intelligence and Statistics, pp. 562–570 (2015)
26. Li, Y., Snavely, N., Huttenlocher, D.P.: Location recognition using prioritized feature matching. In: European Conference on Computer Vision, pp. 791–804. Springer (2010)
27. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1132–1140. IEEE (2017)
28. Lipponer, S.: Surface splatting. https://github.com/sebastianlipponer/surface_splatting (2015)
29. Liu, Y., Xiong, Y.: Automatic segmentation of unorganized noisy point clouds based on the Gaussian map. Computer-Aided Design 40(5), 576–594 (2008)
30. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
31. Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015)
32. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6), 756–770 (2004)
33. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1(2), 4 (2017)
34. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
35. Savva, M., Yu, F., Su, H., Aono, M., Chen, B., Cohen-Or, D., Deng, W., Su, H., Bai, S., Bai, X., et al.: SHREC'16 track: Large-scale 3D shape retrieval from ShapeNet Core55. In: Proceedings of the Eurographics Workshop on 3D Object Retrieval (2016)
36. Schulter, S., Leistner, C., Bischof, H.: Fast and accurate image upscaling with super-resolution forests. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3799 (2015)
37. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883 (2016)
38. Sibbing, D., Sattler, T., Leibe, B., Kobbelt, L.: SIFT-realistic rendering. In: International Conference on 3D Vision, pp. 56–63 (2013)
39. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
40. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3D. In: ACM Transactions on Graphics (TOG), vol. 25, pp. 835–846. ACM (2006)
41. Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L., Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M., et al.: NTIRE 2017 challenge on single image super-resolution: Methods and results. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1110–1121. IEEE (2017)
42. Timofte, R., De Smet, V., Van Gool, L.: Anchored neighborhood regression for fast example-based super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1920–1927 (2013)
43. Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution. In: Asian Conference on Computer Vision, pp. 111–126. Springer (2014)
44. Timofte, R., Rothe, R., Van Gool, L.: Seven ways to improve example-based single image super resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1865–1873 (2016)
45. Vinyals, O., Bengio, S., Kudlur, M.: Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 (2015)
46. Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403 (2015)
47. Yang, C.Y., Huang, J.B., Yang, M.H.: Exploiting self-similarities for single frame super-resolution. In: Proceedings of the Asian Conference on Computer Vision, pp. 497–510 (2011)
48. Yang, C.Y., Ma, C., Yang, M.H.: Single-image super-resolution: A benchmark. In: European Conference on Computer Vision, pp. 372–386. Springer (2014)
49. Yang, J., Wang, Z., Lin, Z., Cohen, S., Huang, T.: Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21(8), 3467–3478 (2012)
50. Yang, J., Wright, J., Huang, T., Ma, Y.: Image super-resolution as sparse representation of raw image patches. In: Computer Vision and Pattern Recognition, 2008 (CVPR 2008), IEEE Conference on, pp. 1–8. IEEE (2008)

51. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19(11), 2861–2873 (2010)
52. Zwicker, M., Pfister, H., Van Baar, J., Gross, M.: Surface splatting. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 371–378. ACM (2001)

Giang Bui received the B.S. and M.S. degrees from the Viet-
nam National University of Hanoi in 2004 and 2007, respec-
tively. He is currently pursuing the Ph.D degree at the Uni-
versity of Missouri, Columbia. He was a Research Assistant
with the Computer Graphics and Image Understanding Lab-
oratory under the supervision of Dr. Y. Duan. His research
interests include image and video processing, 3-D computer
vision, and machine learning.

Truc Le received his B.S. in Computer Science in 2012 from the University of Science, VNU-HCM of Vietnam. He
is currently pursuing the Ph.D degree at the University of
Missouri, Columbia in the Computer Graphics and Image
Understanding Laboratory under the supervision of Dr. Ye
Duan. His research interests include Computer Graphics, 3D
computer vision, and machine learning.

Brittany Morago received the B.S. degree in digital arts and sciences from the University of Florida in 2010, and
the Ph.D. degree in computer science from the University
of Missouri, Columbia, in 2016. She is currently an Assis-
tant Professor with the Department of Computer Science,
University of North Carolina at Wilmington. Her research
interests include computer vision and graphics. She was a
recipient of the NSFGRF and GAANN fellowships.

Ye Duan received the B.A. degree in mathematics from Peking
University in 1991, and the M.S. degree in mathematics from
Utah State University in 1996, and the M.S. and Ph.D. de-
grees in computer science from the State University of New
York, Stony Brook, in 1998 and 2003, respectively. From
2003 to 2009, he was an Assistant Professor of Computer
Science with the University of Missouri, Columbia. He is
currently an Associate Professor of Computer Science with
the University of Missouri, Columbia. His research inter-
ests include computer graphics and visualization, biomedi-
cal imaging, and computer vision.
