License: arXiv.org perpetual non-exclusive license
arXiv:2407.16369v2 [cs.CV] 24 Jul 2024
Online ID: 1056

Category: IEEE VIS Short Paper

FCNR: Fast Compressive Neural Representation of Visualization Images

Yunfei Lu e-mail: ylu25@nd.edu    Pengfei Gu e-mail: pgu@nd.edu    Chaoli Wang e-mail: chaoli.wang@nd.edu University of Notre Dame
Abstract

We present FCNR, a fast compressive neural representation for tens of thousands of visualization images under varying viewpoints and timesteps. The existing NeRVI solution, albeit enjoying a high compression ratio, incurs slow speeds in encoding and decoding. Built on the recent advances in stereo image compression, FCNR assimilates stereo context modules and joint context transfer modules to compress image pairs. Our solution significantly improves encoding and decoding speed while maintaining high reconstruction quality and a satisfactory compression ratio. To demonstrate its effectiveness, we compare FCNR with state-of-the-art neural compression methods, including E-NeRV, HNeRV, NeRVI, and ECSIC. The source code can be found at https://github.com/YunfeiLu0112/FCNR.

Introduction

Generating vast datasets has become indispensable for analyzing complex phenomena across diverse fields. As data acquisition reaches unprecedented scales, managing and interpreting such enormous amounts of data becomes increasingly challenging. Scientific visualization helps us visually comprehend complicated patterns, identify trends, and extract meaningful insights from simulation data, which are often time-varying. When dealing with a time-varying volumetric dataset, we can produce numerous visualization images, including those generated through isosurface rendering (IR) or direct volume rendering (DVR). These images correspond to various parameters, such as timesteps and camera views, offering a thorough representation of the data. We tackle the issue of managing large quantities of these visualization images, which occupy many gigabytes of storage and could surpass the original data size. This scenario poses significant constraints, including storage cost, network transmission, image access, and interactive display when conveying the visualization output. Hence, efficient compression and sharing of these visualization images become a necessity. Recent developments in DL4SciVis [20] provide a viable direction.

To achieve this goal, we present FCNR, a fast compressive neural representation for tens of thousands of visualization images. FCNR takes a pair of visualization images with nearby views as input. It encodes the image pair into quantized bitstreams via entropy coding while exploiting the similarity between the two images. It then decodes the bitstreams to reconstruct the images. FCNR can efficiently compress a vast array of images derived from time-varying datasets under different viewing and timestep parameters in a relatively short time. We evaluate FCNR on multiple datasets, quantitatively and qualitatively, and compare it with state-of-the-art deep learning compression baselines, including E-NeRV, HNeRV, NeRVI, and ECSIC, to demonstrate its superiority. Our contributions are as follows:

  • simultaneously compressing a pair of images with similar views based on joint context transfer modules (JCTMs), which extract mutual information from the whole images;

  • incorporating viewpoints and timesteps into stereo context modules (SCMs) to accommodate our compression scenario while improving entropy estimation of one encoded image using another as the context;

  • improving encoding and decoding speeds significantly while maintaining high reconstruction quality and a satisfactory compression ratio expressed in bits per pixel (BPP);

  • leveraging the interpolation ability by compressing all the images of the given dataset after training on its subset, which also expedites the encoding process.

1 Related Work

In recent years, implicit neural representation (INR) has been extensively studied for image and video compression [5, 12, 3, 25, 4, 14, 11]. NeRV [5] takes as input an image index, generates the image embedding via multilayer perceptrons (MLPs) and convolutional layers, and outputs the whole image. E-NeRV [12] improves the NeRV architecture by identifying the redundant parts and decomposing the image-wise INR into distinct spatial and temporal contexts, accelerating convergence while maintaining high performance. CNeRV [3] enables internal generalization using content-adaptive embedding, which compactly encodes visual information. D-NeRV [25] represents various videos using the same model, which takes sampled key-frames as input for clip-specific content encoding and outputs video frames with a motion-aware decoder. HNeRV [4] resolves the content-agnostic issue and the unbalanced parameter distribution of NeRV by storing videos in small, content-adaptive frame embeddings and utilizing a learned decoder. NIRVANA [14] proposes patch-wise prediction to accommodate videos with varying spatial and temporal resolutions using the same architecture. HiNeRV [11] addresses the limited representation capability of INR and refines the model compression pipeline with adaptive parameter weighting and quantization-aware training.

In scientific visualization, INR has been applied to data generation and compression tasks [9, 19, 18]. Gu et al. [8] extended INR to visualization image compression, which is much more challenging due to the need for accommodating viewpoint and timestep parameters and more significant differences between neighboring rendering images than video frames. The proposed NeRVI achieves neural representations with a high compression ratio and leads to good image fidelity using mask loss. However, the rather slow encoding speed restricts its use in practice, especially when compressing high-resolution visualization images in a large collection.

Figure 1: Overview of FCNR. The encoder ($E$) encodes $x_l$ and $x_r$ to bitstreams with hyper-encoder ($h_E$), quantization ($Q$), and arithmetic encoder ($AE$). The decoder ($D$) then reconstructs $\hat{x}_l$ and $\hat{x}_r$ through quantized latents ($\hat{y}_l$ and $\hat{y}_r$) with arithmetic decoder ($AD$) and hyper-decoder ($h_D$).

Figure 2: The detailed structure of each module. (a) encoder ($E$) and hyper-encoder ($h_E$); (b) decoder ($D$) and hyper-decoder ($h_D$); (c) SCM ($\mathrm{cont}_z$) and SCM ($\mathrm{cont}_y$).

In contrast, stereo-image compression methods seek to compress image pairs simultaneously by exploiting their similarities (i.e., mutual information) using neural networks. For instance, Liu et al. [13] presented DSIC, which computes a dense warp field and feeds features from the left image after warping into the encoder and decoder of the right image. Deng et al. [7] designed HESIC, which improves DSIC by applying a rigid image-space homography transform. Wödlinger et al. [22] introduced SASIC that enhances a conventional single-image compression backbone model. To accommodate finer local displacements between images, it incorporates latent-domain global shift and subtraction as well as stereo attention modules in the decoder. Based on SASIC, Wödlinger et al. [21] further developed ECSIC that augments the architecture with stereo cross-attention modules (SCAMs) and SCMs. Zhang et al. [24] proposed LDMIC, a simple and effective cross-attention-based JCTM utilizing the decoder’s cross-attention mechanism to capture global inter-view correlations efficiently. In this work, we assimilate the notion of jointly compressing image pairs by exploiting their mutual information. We design our network based on ECSIC, incorporate the JCTM from LDMIC for the more complicated visualization image compression task, and present FCNR, a fast solution for compressive neural representation.

2 FCNR

Given a time-varying dataset $Y = \{Y_1, Y_2, \ldots, Y_T\}$, where $T$ is the number of timesteps, with a predefined isovalue or transfer function, we produce a set of visualization images. Each volume $Y_t$, $t \in [1, T]$, is represented as a subset of images $X_t$ associated with different camera views. We aim to learn a mapping that encodes the input images into latent representations, which are quantized, compressed, and decoded to reconstruct the images.

Networks for stereo image compression often consist of the main autoencoder and hyperprior autoencoder. We adapt this structure to compress IR or DVR images under different views and timesteps. As shown in Figure 1, the input to our model is a pair of images $x_l$ and $x_r$ ($l$ and $r$ denote left and right) from $X_t$, with the two neighboring views $(\theta, \varphi_l)$ and $(\theta, \varphi_r)$. We first use the encoder $E$ to encode $x_l$ and $x_r$ into the latents $y_l$ and $y_r$. Then, we estimate the latent entropy parameters $\psi^y_l$ and $\psi^y_r$ using the hyper-encoder $h_E$ and hyper-decoder $h_D$ to produce quantized latents $\hat{y}_l$ and $\hat{y}_r$. Finally, we utilize the decoder $D$ to reconstruct the images $\hat{x}_l$ and $\hat{x}_r$ from $\hat{y}_l$ and $\hat{y}_r$. We improve the structure with SCMs [21] (which aid in the prediction of $\psi^y_r$ from $\hat{y}_l$ and hyper-latent entropy parameters $\psi^z_r$ from hyper-latent $\hat{z}_l$) and JCTMs [24] (which exploit the feature-space inter-view correlations brought by overlapping viewpoints in visualization images for generating more informative representations).

Encoding modules and quantization. We develop our $E$ and $h_E$ based on the structure proposed in ECSIC. Each consists of several convolutional (Conv) layers, a JCTM, and parametric ReLU (PReLU) activation functions [10]. Their detailed structures are shown in Figure 2 (a). Unlike the epipolar assumption in ECSIC, the transformations between two visualization images in a pair are much more complex, involving the 3D rotation of the volume. Since SCAM [21] only computes cross-attention between the corresponding epipolar lines, it fails to fully capture the inter-view information brought by the more complicated transformations. Therefore, we replace SCAMs with JCTMs to exploit mutual information globally. Given $x_l$ and $x_r$, $E$ computes the main latent representations $y_l$ and $y_r$ by

$$y_i = E(x_i), \quad i \in \{l, r\}. \tag{1}$$

$h_E$ accepts $y_l$ and $y_r$ and generates the hyper-latents $z_l$ and $z_r$

$$z_i = h_E(y_i), \quad i \in \{l, r\}. \tag{2}$$

Introducing $z$ for entropy estimation is an effective way to model the dependencies among the elements of $y$, which are assumed to be conditionally independent given $z$ [2]. $y_l$, $y_r$, $z_l$, and $z_r$ then go through a quantization process. For example,

$$\hat{y}_i = \mathrm{round}(y_i - \mu^y_i) + \mu^y_i, \quad i \in \{l, r\}, \tag{3}$$

where $\mu^y_i$ is the estimated mean of the distribution of $y_i$. A similar process is applied to $z_l$ and $z_r$ to generate quantized hyper-latents $\hat{z}_l$ and $\hat{z}_r$. To make the process differentiable, we use approximate quantization by adding the uniform noise $\epsilon \sim \mathcal{U}(-0.5, 0.5)$ for the rate loss [1].

$$\tilde{y}_i = y_i + \epsilon, \quad i \in \{l, r\}. \tag{4}$$

Thus, the density function of $\tilde{y}_i$ is a continuous relaxation of the probability mass function of $\hat{y}_i$, allowing the differential entropy of $\tilde{y}_i$ to approximate the entropy of $\hat{y}_i$. Additionally, independent uniform noise approximates the quantization error, modeling its marginal moments for distortion measurement. $z_l$ and $z_r$ follow a similar process. For the distortion loss, we employ a straight-through-estimation quantization [15].
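The two quantization paths of Eqs. (3) and (4) can be illustrated with a minimal sketch in plain Python. The toy latent values and means below are hypothetical, and the straight-through estimator used for the distortion loss is not shown because it requires an automatic-differentiation framework.

```python
import random

def quantize(y, mu):
    """Eq. (3): round the mean-shifted latent, then shift it back.

    Note: Python's round() rounds exact .5 ties to the nearest even integer;
    a deep learning framework may break ties differently.
    """
    return [round(v - m) + m for v, m in zip(y, mu)]

def add_uniform_noise(y, rng):
    """Eq. (4): add U(-0.5, 0.5) noise as a training-time stand-in for rounding."""
    return [v + rng.uniform(-0.5, 0.5) for v in y]

# Hypothetical latent values and predicted means, for illustration only.
y = [0.2, 3.1, -2.4]
mu = [0.1, 1.5, -0.6]

y_hat = quantize(y, mu)                            # used at test time
y_tilde = add_uniform_noise(y, random.Random(0))   # used for the rate loss

print(y_hat)
print(y_tilde)
```

In both cases the perturbed value stays within 0.5 of the original latent, which is why the noisy proxy is a reasonable surrogate for hard rounding during training.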

Decoding modules and entropy model. As shown in Figure 2 (b), $D$ and $h_D$ each have Conv layers, a JCTM, PReLU activation functions, and transposed convolutional (ConvT) layers for upsampling. $\hat{z}_l$ and $\hat{z}_r$ are stored as side information to help predict $\psi^y_l$ and $\psi^y_r$ of $\hat{y}_l$ and $\hat{y}_r$. The distribution of each latent representation is modeled by a Laplacian distribution with parameters $\psi = (\mu, b)$. The process of distribution modeling and parameter estimation corresponds to the arithmetic encoders (AEs) and arithmetic decoders (ADs) shown in Figure 1. We model the distribution of $\hat{z}_l$ by a channel-wise Laplacian distribution with parameters $\psi^z_l$ computed from visualization parameters $(t_l, \theta_l, \varphi_l)$

$$\psi^z_l = \mathrm{MLP}(\mathrm{PE}(t_l, \theta_l, \varphi_l)), \tag{5}$$

where PE denotes positional encoding which projects $(t_l, \theta_l, \varphi_l)$ into a higher-dimensional space

$$\mathrm{PE}(u) = (\sin(b^0 \pi u), \cos(b^0 \pi u), \dots, \sin(b^{L-1} \pi u), \cos(b^{L-1} \pi u)), \tag{6}$$

$$\mathrm{PE}(t, \theta, \varphi) = (\mathrm{PE}(t), \mathrm{PE}(\theta), \mathrm{PE}(\varphi)). \tag{7}$$

Here, we set $b = 1.25$ and $L = 8$. Since our visualization image pairs have greater variations than the stereo image pairs compressed by ECSIC, computing distribution parameters from visualization parameters can help mitigate the gap by allowing for detailed quantitative differences between images and improve model performance. The distributions of $\hat{z}_r$, $\hat{y}_l$, and $\hat{y}_r$ are modeled by factorized Laplacian distributions. We further reduce the bitrate by conditioning the distributions of $\hat{y}_r$ and $\hat{z}_r$ on the information from $\hat{y}_l$ and $\hat{z}_l$ with SCMs $\mathrm{cont}_y$ and $\mathrm{cont}_z$, whose structures are shown in Figure 2 (c). In this way, $\psi^z_r$ is predicted from $\hat{z}_l$ and parameters $\phi^z_r$

$$\psi^z_r = \mathrm{cont}_z(\hat{z}_l, \phi^z_r), \tag{8}$$

where, likewise, $\phi^z_r$ is generated with $(t_r, \theta_r, \varphi_r)$

$$\phi^z_r = \mathrm{MLP}(\mathrm{PE}(t_r, \theta_r, \varphi_r)). \tag{9}$$
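Equations (5) and (9) both pass a positional encoding of the visualization parameters to an MLP. The following sketch implements only the encoding of Eqs. (6) and (7) with the reported $b = 1.25$ and $L = 8$. The function names are hypothetical, and the assumption that $t$, $\theta$, and $\varphi$ are normalized to a small range before encoding is ours, not stated in the text.

```python
import math

def pe_scalar(u, b=1.25, L=8):
    """Eq. (6): sin/cos pairs at frequencies b^0 * pi, ..., b^(L-1) * pi."""
    out = []
    for k in range(L):
        angle = (b ** k) * math.pi * u
        out.extend((math.sin(angle), math.cos(angle)))
    return out

def pe_params(t, theta, phi, b=1.25, L=8):
    """Eq. (7): concatenate the encodings of timestep and the two view angles."""
    return pe_scalar(t, b, L) + pe_scalar(theta, b, L) + pe_scalar(phi, b, L)

# Hypothetical normalized inputs (t, theta, phi), for illustration only.
features = pe_params(0.5, 0.25, -0.75)
print(len(features))  # 3 parameters * 2 functions * L = 48 features
```

With these settings, each image's three scalar parameters become a 48-dimensional vector, which is what the MLP would consume to produce the Laplacian parameters.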

Then $h_D$ computes $\phi^y_l$ and $\phi^y_r$ from quantized hyper-latents

$$\phi^y_i = h_D(\hat{z}_i), \quad i \in \{l, r\}. \tag{10}$$

In the main branch, we set $\psi^y_l = \phi^y_l$. For $\psi^y_r$, we similarly apply the SCM

$$\psi^y_r = \mathrm{cont}_y(\hat{y}_l, \phi^y_r). \tag{11}$$

Finally, $D$ reconstructs $\hat{x}_l$ and $\hat{x}_r$ from $\hat{y}_l$ and $\hat{y}_r$

$$\hat{x}_i = D(\hat{y}_i), \quad i \in \{l, r\}. \tag{12}$$
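
To make the role of the Laplacian parameters $\psi = (\mu, b)$ concrete, the sketch below estimates how many bits an arithmetic coder would need for one quantized latent value. It assumes the probability of a quantized value is the Laplacian mass on a unit-width bin around it, which is a common choice in learned compression but is not spelled out in this section; the function names are hypothetical.

```python
import math

def laplace_cdf(x, mu, b):
    """Cumulative distribution function of a Laplacian with mean mu and scale b."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)

def bits(y_hat, mu, b):
    """Estimated code length in bits: -log2 of the mass on [y_hat - 0.5, y_hat + 0.5]."""
    p = laplace_cdf(y_hat + 0.5, mu, b) - laplace_cdf(y_hat - 0.5, mu, b)
    return -math.log2(max(p, 1e-12))  # clamp to avoid log of zero

# A well-predicted mean (zero residual) costs fewer bits than a poor one.
well_predicted = bits(3.0, mu=3.0, b=1.0)
poorly_predicted = bits(3.0, mu=1.0, b=1.0)
print(well_predicted, poorly_predicted)
```

This is the intuition behind conditioning the right image on the left: if the context modules predict $\mu$ closer to the actual latent, the residual shrinks and so does the bitrate.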

Table 1: The resolution and total sampled images of each dataset. “# st” denotes the number of timesteps we subsample from the dataset.

| dataset | resolution ($x \times y \times z \times t$) | # views | # st | # images |
|---|---|---|---|---|
| vortex [17] | $128 \times 128 \times 128 \times 90$ | 812 | 30 | 24360 |
| Tangaroa [16] | $300 \times 180 \times 120 \times 150$ | 812 | 30 | 24360 |
| tornado [6] | $128 \times 128 \times 128 \times 48$ | 812 | 30 | 24360 |

Table 2: Average PSNR (dB) and LPIPS, BPP, and total ET (hours) and DT (seconds). Each case has 24360 images with a resolution of $1024 \times 1024$. The best ones are highlighted in bold.

| dataset | method | IR PSNR↑ | IR LPIPS↓ | IR BPP↓ | IR ET↓ | IR DT↓ | DVR PSNR↑ | DVR LPIPS↓ | DVR BPP↓ | DVR ET↓ | DVR DT↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| vortex | E-NeRV | 27.17 | 0.1432 | 0.0031 | 149.46 | 1925.47 | 21.77 | 0.1154 | 0.0031 | 151.58 | 1939.19 |
| vortex | HNeRV | 21.15 | 0.2701 | 0.0019 | 72.72 | 391.25 | 20.17 | 0.2022 | 0.0019 | 69.63 | 916.33 |
| vortex | NeRVI | 26.80 | 0.1386 | 0.0356 | 251.46 | 3965.49 | 24.30 | 0.0602 | 0.0356 | 255.56 | 2982.23 |
| vortex | ECSIC | 36.27 | 0.0980 | 0.0915 | 1.20 | 259.90 | 34.77 | 0.0139 | 0.1437 | 1.20 | 250.66 |
| vortex | FCNR | 37.47 | 0.1025 | 0.0693 | 1.18 | 269.67 | 34.85 | 0.0132 | 0.1212 | 1.19 | 296.70 |
| Tangaroa | E-NeRV | 25.96 | 0.0093 | 0.0031 | 147.52 | 4362.98 | 25.43 | 0.1103 | 0.0031 | 147.13 | 4203.47 |
| Tangaroa | HNeRV | 23.91 | 0.1690 | 0.0015 | 57.07 | 4872.01 | 24.17 | 0.1759 | 0.0015 | 71.50 | 1239.88 |
| Tangaroa | NeRVI | 28.16 | 0.0750 | 0.0356 | 181.13 | 2015.16 | 26.39 | 0.0964 | 0.0359 | 247.33 | 2137.02 |
| Tangaroa | ECSIC | 37.82 | 0.0149 | 0.0895 | 1.18 | 306.94 | 34.61 | 0.0153 | 0.1405 | 1.20 | 246.52 |
| Tangaroa | FCNR | 38.12 | 0.0145 | 0.0709 | 1.18 | 319.84 | 34.45 | 0.0177 | 0.1109 | 1.17 | 211.73 |
| tornado | E-NeRV | 36.72 | 0.1389 | 0.0031 | 50.14 | 4643.82 | 35.09 | 0.0038 | 0.0031 | 50.34 | 5157.63 |
| tornado | HNeRV | 34.53 | 0.0544 | 0.0015 | 28.54 | 5359.20 | 31.97 | 0.0506 | 0.0016 | 30.75 | 1804.20 |
| tornado | NeRVI | 38.21 | 0.0359 | 0.0356 | 90.56 | 9609.40 | 36.27 | 0.0336 | 0.0356 | 88.99 | 2030.20 |
| tornado | ECSIC | 36.30 | 0.0700 | 0.0580 | 0.40 | 231.71 | 36.51 | 0.0501 | 0.0982 | 0.42 | 180.17 |
| tornado | FCNR | 38.07 | 0.0685 | 0.0280 | 0.39 | 326.91 | 37.35 | 0.0386 | 0.0359 | 0.39 | 352.00 |
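
To put the BPP column in perspective, the sketch below converts a BPP value into an approximate storage size for one full image collection. It takes the image count and resolution from Tables 1 and 2. The 8-bit RGB baseline is our assumption, as is treating BPP as an average over every pixel of every image; whether the reported BPP also accounts for model weights is not stated here.

```python
def compressed_bytes(num_images, width, height, bpp):
    """Total compressed size in bytes implied by an average bits-per-pixel value."""
    return num_images * width * height * bpp / 8.0

num_images = 812 * 30   # views x subsampled timesteps (Table 1)
width = height = 1024   # image resolution (Table 2)

# Assumed uncompressed baseline: 3 bytes per pixel (8-bit RGB).
raw_bytes = num_images * width * height * 3

# FCNR on vortex IR images: 0.0693 BPP (Table 2).
fcnr_bytes = compressed_bytes(num_images, width, height, 0.0693)

print(f"raw:  {raw_bytes / 1e9:.1f} GB")
print(f"FCNR: {fcnr_bytes / 1e6:.1f} MB")
```

Under these assumptions, a collection that would take tens of gigabytes uncompressed shrinks to a few hundred megabytes at the BPP values reported for ECSIC and FCNR.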

Loss functions. Image compression models can be optimized for a weighted sum of the rate and distortion losses [1]

Ltotal=LR+λLD,subscript𝐿totalsubscript𝐿𝑅𝜆subscript𝐿𝐷L_{\rm total}=L_{R}+\lambda L_{D},italic_L start_POSTSUBSCRIPT roman_total end_POSTSUBSCRIPT = italic_L start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT + italic_λ italic_L start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT , (13)

where $L_R$ is the rate loss, $L_D$ is the distortion loss, and $\lambda \in \mathbb{R}$ is a trade-off weight. $L_D$ is the expectation of the mean squared errors between the input and reconstructed images

$$L_D = \mathbb{E}_{x_l, x_r \sim p_x}\big[\left\|x_l - \hat{x}_l\right\|_2^2 + \left\|x_r - \hat{x}_r\right\|_2^2\big]. \tag{14}$$

Following the rate loss in ECSIC, $L_R$ is the expected sum of the cross entropy between the predicted distribution of our entropy model and the true distribution of the latents or hyper-latents

$$\begin{split}
L_R = \mathbb{E}_{x_l, x_r \sim p_x}\big[ &-\log_2 p(\hat{z}_l \mid \phi^z_l) \\
&-\log_2 p(\hat{z}_r \mid \Phi_{{\rm cont}_z}, \phi^z_r, \hat{z}_l) \\
&-\log_2 p(\hat{y}_l \mid \Phi_{h_D}, \hat{z}_r, \hat{z}_l) \\
&-\log_2 p(\hat{y}_r \mid \Phi_{{\rm cont}_y}, \Phi_{h_D}, \hat{y}_l, \hat{z}_r, \hat{z}_l)\big],
\end{split} \tag{15}$$

where $\Phi_{{\rm cont}_z}$, $\Phi_{{\rm cont}_y}$, and $\Phi_{h_D}$ denote the parameters of ${\rm cont}_z$, ${\rm cont}_y$, and $h_D$, respectively.
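As a rough illustration of how the rate and distortion terms combine, the following Python sketch evaluates the total loss for a single image pair from given symbol likelihoods and pixel values. It is not the authors' implementation: the function names (`rate_bits`, `mse`, `total_loss`), the toy inputs, and the value of `lam` are assumptions made for this example. A real model would obtain the likelihoods from its entropy model and average the loss over a batch.

```python
import math


def rate_bits(likelihoods):
    """Estimated code length in bits: sum of -log2 p over all symbols."""
    return sum(-math.log2(p) for p in likelihoods)


def mse(x, x_hat):
    """Mean squared error, as described in the text. The distortion formula
    writes a squared L2 norm, which differs by a constant factor per image."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def total_loss(likelihood_groups, x_l, x_hat_l, x_r, x_hat_r, lam):
    """L_total = L_R + lam * L_D for one left/right image pair."""
    rate = sum(rate_bits(group) for group in likelihood_groups)
    distortion = mse(x_l, x_hat_l) + mse(x_r, x_hat_r)
    return rate + lam * distortion


# Toy likelihoods for the four terms of the rate loss: z_l, z_r, y_l, y_r.
groups = [[0.5, 0.25], [0.5], [0.25], [0.125]]  # 3 + 1 + 2 + 3 = 9 bits
# Toy "images" flattened to two pixels each (hypothetical values).
loss = total_loss(groups, [0.0, 1.0], [0.0, 0.5], [1.0, 1.0], [1.0, 1.0], lam=8.0)
print(loss)  # 9 bits + 8 * 0.125 = 10.0
```

A larger `lam` penalizes distortion more heavily and therefore favors higher quality at a higher bitrate.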

[Figure 3 image grid. Columns: (a) E-NeRV, (b) HNeRV, (c) NeRVI, (d) ECSIC, (e) FCNR, (f) GT. Rows, top to bottom: vortex IR, Tangaroa IR, tornado IR, vortex DVR, Tangaroa DVR, tornado DVR.]

Figure 3: Decompressed IR and DVR images. The datasets are vortex, Tangaroa, and tornado, respectively.

3 Results and Discussion

Datasets, setting, and training. Table 1 shows the three datasets we experimented with. We picked 30 consecutive timesteps for each dataset. To ensure an even distribution of viewpoints around the volume data, we placed cameras at the vertices of an icosphere, which approximates a sphere with equilateral triangles. We selected a subdivision level that yields 812 vertices, producing 812 images per timestep (24360 images over the 30 timesteps) for compression. All images have a resolution of 1024 × 1024. For ECSIC and FCNR, which require image pairs, we sorted the images at each timestep first by $\theta$ and then, to break ties, by $\varphi$. We then chose each image with an even index $j$ (starting from 0) as the left image and the image with index $j+1$ as its right counterpart. Since ECSIC and FCNR can generalize to unseen images, we reduced the number of training images to 1/6 (evenly selecting 1/2 of the sampled views and 1/3 of the timesteps), i.e., 4060 images. We evaluated these two models on all 24360 images during inference.
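The pairing and subsampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the random `(theta, phi)` tuples stand in for the real icosphere viewpoints, and "evenly selecting" is interpreted here as keeping every other pair and every third timestep, which the text does not spell out. The helper name `make_pairs` is hypothetical.

```python
import random


def make_pairs(views):
    """Sort views by theta, then phi, and pair even index j with j + 1."""
    ordered = sorted(views, key=lambda v: (v[0], v[1]))
    return [(ordered[j], ordered[j + 1]) for j in range(0, len(ordered) - 1, 2)]


# Hypothetical stand-in for the 812 icosphere viewpoints: random (theta, phi).
rng = random.Random(0)
views = [(rng.uniform(-90.0, 90.0), rng.uniform(-180.0, 180.0)) for _ in range(812)]
pairs = make_pairs(views)  # 406 left/right pairs per timestep

timesteps = list(range(30))
# Assumed interpretation of "evenly selecting": every other pair (1/2 of the
# views) and every third timestep (1/3 of the timesteps).
train_pairs = pairs[::2]
train_timesteps = timesteps[::3]

num_total = len(views) * len(timesteps)
num_train = 2 * len(train_pairs) * len(train_timesteps)
print(num_total, num_train)  # 24360 4060
```

Under this interpretation, the counts match the figures in the text: 812 × 30 = 24360 images overall and 406 × 10 = 4060 training images.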

We implemented FCNR in PyTorch. We set the number of channels in the encoding and decoding modules to 192, the number of latent channels to 48, and the number of attention heads in JCTM to 2. All experiments were run on an NVIDIA A40 GPU. We used the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with a learning rate of $10^{-4}$ and a batch size of 1. We trained FCNR for 3 epochs on the vortex and Tangaroa datasets and for 1 epoch on the tornado dataset.

Baselines and evaluation metrics. We compared FCNR with three state-of-the-art INR-based methods, E-NeRV [12], HNeRV [4], and NeRVI [8], and one stereo image compression method, ECSIC [21]. We extended E-NeRV to take all of $(t, \theta, \varphi)$ as input; the parameters are first normalized to $[0, 1]$ and then fed to the network after PE. All INR-based methods were trained until convergence for a fair comparison. ECSIC was trained for the same number of epochs as FCNR on each dataset. For quantitative evaluation in the image space, we employed peak signal-to-noise ratio (PSNR) and learned perceptual image patch similarity (LPIPS) [23]. We also report bits per pixel (BPP), encoding time (ET), and decoding time (DT).
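For readers less familiar with these metrics, the Python sketch below shows the standard definitions of PSNR and BPP on toy data. It assumes pixel values in [0, 1] and flattened images; the function names are chosen for this example and are not taken from the FCNR code base. LPIPS is omitted because it requires a pretrained network.

```python
import math


def mse(x, x_hat):
    """Mean squared error between two flattened images."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def psnr(x, x_hat, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value is the peak pixel value."""
    err = mse(x, x_hat)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def bits_per_pixel(num_bits, num_images, width, height):
    """Total compressed size in bits divided by the total number of pixels."""
    return num_bits / (num_images * width * height)


def bytes_per_image(bpp, width=1024, height=1024):
    """Average compressed size of one image implied by a given BPP."""
    return bpp * width * height / 8


# Toy two-pixel image with an error of 0.1 per pixel: MSE is about 0.01,
# so PSNR is about 20 dB.
print(psnr([0.0, 1.0], [0.1, 0.9]))
# A BPP of 0.0280 at 1024 x 1024 corresponds to roughly 3.7 KB per image.
print(bytes_per_image(0.0280))
```

Higher PSNR means lower pixel-wise error, while lower BPP means a smaller compressed representation per pixel.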

Table 3: Average PSNR (dB), LPIPS, and BPP, and total ET (hours) and DT (seconds) of ECSIC, its two modifications, and FCNR on the tornado IR dataset.
method    PSNR↑  LPIPS↓  BPP↓    ET↓   DT↓
ECSIC     36.30  0.0700  0.0580  0.40  231.71
JCT-only  37.81  0.0685  0.0666  0.39  270.15
PE-only   37.19  0.0708  0.0370  0.40  228.91
FCNR      38.07  0.0685  0.0280  0.39  326.91
Figure 4: PSNR comparison of all methods on the vortex IR dataset. E-NeRV, HNeRV, and NeRVI were all trained for 200 epochs. Both ECSIC and FCNR were trained for 3 epochs.

Results. Table 2 compares FCNR with state-of-the-art baselines quantitatively. Figure 3 shows the decompressed rendering images for all datasets under chosen $(t, \theta, \varphi)$. All images are cropped for closer comparison.

FCNR achieves the highest PSNR and lowest LPIPS in most cases, even on images unseen during training, demonstrating its interpolation ability. Although INR-based methods (E-NeRV, HNeRV, and NeRVI) yield lower BPP, their reconstruction quality is limited, as depicted in Figure 3. The vortex and Tangaroa datasets vary greatly across timesteps, posing greater reconstruction challenges. HNeRV produces the blurriest images. The results of E-NeRV and NeRVI are much clearer. However, they still exhibit artifacts and distortions when multiple components overlap in the rendered images and miss some components when the images become complex. For the tornado dataset, all methods generate clear images, but FCNR generates images with better PSNR and LPIPS than E-NeRV and HNeRV. Though NeRVI achieves the highest PSNR and the lowest LPIPS on the tornado IR dataset, it fails to reconstruct high-frequency details, as shown in Figure 3. By contrast, FCNR generates images with the best visual quality, the clearest high-frequency details, and the fewest artifacts.

Moreover, INR-based baselines are at a clear disadvantage in speed: their encoding takes 48.36× to 232.21× as long as FCNR's, and their decoding takes 1.45× to 29.39× as long. This is because E-NeRV and NeRVI need substantial training and more parameters to restore the lost information and reconstruct images from the rather limited, low-dimensional input, while HNeRV's need to learn both input embeddings and decoder weights leads to a more complex design. In contrast, like ECSIC, FCNR drastically improves encoding and decoding speed by using the images directly for compression and reconstruction and exploiting the mutual information between them. Compared with ECSIC, FCNR achieves very close, high-quality results for all datasets, with slight improvements in PSNR and LPIPS in the majority of cases. While ECSIC performs similarly to FCNR in image quality and encoding and decoding speed, its BPP is 18.56% (vortex DVR dataset) to 173.54% (tornado DVR dataset) higher than FCNR's across all cases. This bitrate saving is what makes FCNR stand out from ECSIC.
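The ranges quoted in this paragraph can be reproduced directly from Table 2. The short Python check below uses values copied from that table; the dictionary layout and function names are choices made for this example only.

```python
INR_METHODS = ("E-NeRV", "HNeRV", "NeRVI")

# (ET in hours, DT in seconds) copied from Table 2 for each case.
TIMES = {
    "vortex IR": {"E-NeRV": (149.46, 1925.47), "HNeRV": (72.72, 391.25),
                  "NeRVI": (251.46, 3965.49), "FCNR": (1.18, 269.67)},
    "vortex DVR": {"E-NeRV": (151.58, 1939.19), "HNeRV": (69.63, 916.33),
                   "NeRVI": (255.56, 2982.23), "FCNR": (1.19, 296.70)},
    "Tangaroa IR": {"E-NeRV": (147.52, 4362.98), "HNeRV": (57.07, 4872.01),
                    "NeRVI": (181.13, 2015.16), "FCNR": (1.18, 319.84)},
    "Tangaroa DVR": {"E-NeRV": (147.13, 4203.47), "HNeRV": (71.50, 1239.88),
                     "NeRVI": (247.33, 2137.02), "FCNR": (1.17, 211.73)},
    "tornado IR": {"E-NeRV": (50.14, 4643.82), "HNeRV": (28.54, 5359.20),
                   "NeRVI": (90.56, 9609.40), "FCNR": (0.39, 326.91)},
    "tornado DVR": {"E-NeRV": (50.34, 5157.63), "HNeRV": (30.75, 1804.20),
                    "NeRVI": (88.99, 2030.20), "FCNR": (0.39, 352.00)},
}

# (ECSIC BPP, FCNR BPP) copied from Table 2 for each case.
BPP = {
    "vortex IR": (0.0915, 0.0693), "vortex DVR": (0.1437, 0.1212),
    "Tangaroa IR": (0.0895, 0.0709), "Tangaroa DVR": (0.1405, 0.1109),
    "tornado IR": (0.0580, 0.0280), "tornado DVR": (0.0982, 0.0359),
}


def slowdown_range(index):
    """Min and max of (INR time / FCNR time); index 0 is ET, index 1 is DT."""
    ratios = [case[m][index] / case["FCNR"][index]
              for case in TIMES.values() for m in INR_METHODS]
    return round(min(ratios), 2), round(max(ratios), 2)


def bpp_overhead_range():
    """Min and max of how much higher ECSIC's BPP is than FCNR's, in percent."""
    overheads = [(ecsic / fcnr - 1.0) * 100.0 for ecsic, fcnr in BPP.values()]
    return round(min(overheads), 2), round(max(overheads), 2)


print(slowdown_range(0))     # (48.36, 232.21)
print(slowdown_range(1))     # (1.45, 29.39)
print(bpp_overhead_range())  # (18.56, 173.54)
```

The extremes come from HNeRV on Tangaroa IR and NeRVI on tornado IR for encoding, HNeRV on vortex IR and NeRVI on tornado IR for decoding, and the vortex DVR and tornado DVR cases for BPP.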

Figure 4 compares FCNR and baseline methods in PSNR during training on the vortex IR dataset. It shows that all methods have been trained until convergence to ensure a fair comparison. The comparison highlights the effectiveness and efficiency of FCNR.

Ablation study. We performed an ablation study on the tornado IR dataset to show how FCNR differs from ECSIC. We compared FCNR and ECSIC with two architectural modifications of ECSIC: JCT-only and PE-only. For the JCT-only case, we modified ECSIC by replacing all SCAMs with JCTMs. For the PE-only case, we extended ECSIC by transforming $(t_l, \theta_l, \varphi_l)$ to the input $\psi_l^z$ of $x_l$'s AD and $(t_r, \theta_r, \varphi_r)$ to the corresponding parameter $\phi_r^z$ of ${\rm cont}_z$ through PE and MLP.
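To make the "PE" step concrete, here is a minimal Python sketch of one common way to encode visualization parameters before passing them to an MLP. It assumes a NeRF-style sinusoidal encoding, a hypothetical number of frequencies, and hypothetical value ranges for $\theta$ and $\varphi$; this section does not specify the exact encoding, so none of these choices should be read as the paper's. The MLP itself is not shown.

```python
import math


def normalize(value, lo, hi):
    """Map value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def positional_encoding(p, num_freqs=4):
    """Assumed sinusoidal PE: [sin(2^k * pi * p), cos(2^k * pi * p)] per k."""
    encoded = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * p
        encoded.extend([math.sin(angle), math.cos(angle)])
    return encoded


def encode_view(t, theta, phi, num_timesteps=30, num_freqs=4):
    """Normalize (t, theta, phi) to [0, 1] and concatenate their encodings."""
    params = [
        normalize(t, 0, num_timesteps - 1),
        normalize(theta, -90.0, 90.0),    # assumed range in degrees
        normalize(phi, -180.0, 180.0),    # assumed range in degrees
    ]
    features = []
    for p in params:
        features.extend(positional_encoding(p, num_freqs))
    return features


features = encode_view(t=15, theta=0.0, phi=90.0)
print(len(features))  # 3 parameters * 4 frequencies * 2 = 24
```

The resulting feature vector would then serve as the MLP input that produces the entropy-model parameters for the left or right image.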

[Figure 5 image grid. Columns: (a) ECSIC, (b) JCT-only, (c) PE-only, (d) FCNR, (e) GT.]

Figure 5: First row: decompressed images of FCNR and ECSIC with variations. Second row: zoom-ins for closer examination.

Table 3 shows that both modifications lead to improvements in image quality measured by PSNR (JCT-only and PE-only) and LPIPS (JCT-only). Moreover, though JCT-only leads to higher BPP than ECSIC, PE-only lowers BPP. By incorporating both modifications, FCNR achieves even lower BPP with the best PSNR and LPIPS. Figure 5 demonstrates visual improvements in image quality. JCT-only yields images with smoother surfaces and more natural lighting, and PE-only reduces visual artifacts to some extent and enhances lighting concentration. As the zoom-ins indicate, FCNR further improves border clearness, lighting concentration, and color consistency, generating an image closest to GT.
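The size of these effects relative to ECSIC can be read off Table 3 with a little arithmetic. The Python snippet below uses values copied from that table; the variable names are chosen for this example.

```python
# (PSNR in dB, BPP) copied from Table 3 (tornado IR dataset).
ABLATION = {
    "ECSIC": (36.30, 0.0580),
    "JCT-only": (37.81, 0.0666),
    "PE-only": (37.19, 0.0370),
    "FCNR": (38.07, 0.0280),
}

base_psnr, base_bpp = ABLATION["ECSIC"]
summary = {}
for name, (psnr_db, bpp) in ABLATION.items():
    if name == "ECSIC":
        continue
    psnr_gain = round(psnr_db - base_psnr, 2)                   # dB over ECSIC
    bpp_change = round((bpp - base_bpp) / base_bpp * 100.0, 2)  # percent
    summary[name] = (psnr_gain, bpp_change)
    print(name, psnr_gain, bpp_change)
# JCT-only 1.51 14.83
# PE-only 0.89 -36.21
# FCNR 1.77 -51.72
```

In other words, JCT-only buys most of the PSNR gain at a roughly 15% higher bitrate, PE-only cuts the bitrate by about a third, and combining both in FCNR roughly halves the bitrate while giving the largest PSNR gain.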

Discussion. Our results demonstrate that FCNR can compress a large collection of visualization images with high fidelity within a short time. It is much more promising than INR-based methods when image quality and encoding and decoding speed matter more than compression ratio. Though ECSIC can achieve high-quality compression in a similar timeframe, FCNR achieves a higher compression ratio.

4 Conclusions and Future Work

We present FCNR, a novel method for neural compression of visualization images that borrows insights from stereo image compression frameworks. ECSIC reduces the bitrate by using SCMs to learn the distributions of the right image from the left image. We integrate this model with JCTMs to extract mutual information globally and incorporate visualization parameters to capture finer quantitative differences between images, further improving image quality and compression ratio. Compared with state-of-the-art INR-based methods, FCNR provides previously unavailable interpolation ability and substantially shorter encoding and decoding times. Compared with ECSIC, FCNR achieves a higher compression ratio and slightly better reconstruction quality.

The future work of FCNR can be summarized as follows. First, given the substantial differences between stereo images and visualization images, designing a more tailored model architecture for visualization images is necessary for further gains in quality and speed. Second, FCNR lags behind INR-based methods in terms of compression ratio. Our method will be more promising if BPP can be reduced to the same level as INR-based methods. Finally, more visualization parameters, such as isovalues and transfer functions, may be included, and a better fusion of these parameters with images is worthy of exploration.

Acknowledgements.
This research was supported in part by the U.S. National Science Foundation through grants IIS-1955395, IIS-2101696, OAC-2104158, and IIS-2401144, and the U.S. Department of Energy through grant DE-SC0023145. The authors would like to thank the anonymous reviewers for their insightful comments.

References

  • [1] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. In Proceedings of International Conference on Learning Representations, 2017.
  • [2] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston. Variational image compression with a scale hyperprior. In Proceedings of International Conference on Learning Representations, 2018.
  • [3] H. Chen, M. Gwilliam, B. He, S.-N. Lim, and A. Shrivastava. CNeRV: Content-adaptive neural representation for visual data. In Proceedings of British Machine Vision Conference, pp. 510:1–510:20, 2022.
  • [4] H. Chen, M. Gwilliam, S.-N. Lim, and A. Shrivastava. HNeRV: A hybrid neural representation for videos. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 10270–10279, 2023. doi: 10.1109/CVPR52729.2023.00990
  • [5] H. Chen, B. He, H. Wang, Y. Ren, S.-N. Lim, and A. Shrivastava. NeRV: Neural representations for videos. In Proceedings of Advances in Neural Information Processing Systems, pp. 21557–21568, 2021.
  • [6] R. A. Crawfis and N. Max. Texture splats for 3D scalar and vector field visualization. In Proceedings of IEEE Visualization Conference, pp. 261–267, 1993. doi: 10.1109/VISUAL.1993.398877
  • [7] X. Deng, W. Yang, R. Yang, M. Xu, E. Liu, Q. Feng, and R. Timofte. Deep homography for efficient stereo image compression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1501, 2021. doi: 10.1109/CVPR46437.2021.00154
  • [8] P. Gu, D. Z. Chen, and C. Wang. NeRVI: Compressive neural representation of visualization images for communicating volume visualization results. Computers & Graphics, 116:216–227, 2023. doi: 10.1016/J.CAG.2023.08.024
  • [9] J. Han and C. Wang. CoordNet: Data generation and visualization generation for time-varying volumes via a coordinate-based neural network. IEEE Transactions on Visualization and Computer Graphics, 29(12):4951–4963, 2023. doi: 10.1109/TVCG.2022.3197203
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of IEEE International Conference on Computer Vision, pp. 1026–1034, 2015. doi: 10.1109/ICCV.2015.123
  • [11] H. M. Kwan, G. Gao, F. Zhang, A. Gower, and D. Bull. HiNeRV: Video compression with hierarchical encoding based neural representation. In Proceedings of Advances in Neural Information Processing Systems, 2023.
  • [12] Z. Li, M. Wang, H. Pi, K. Xu, J. Mei, and Y. Liu. E-NeRV: Expedite neural video representation with disentangled spatial-temporal context. In Proceedings of European Conference on Computer Vision, pp. 267–284, 2022. doi: 10.1007/978-3-031-19833-5_16
  • [13] J. Liu, S. Wang, and R. Urtasun. DSIC: Deep stereo image compression. In Proceedings of IEEE International Conference on Computer Vision, pp. 3136–3145, 2019. doi: 10.1109/ICCV.2019.00323
  • [14] S. R. Maiya, S. Girish, M. Ehrlich, H. Wang, K. S. Lee, P. Poirson, P. Wu, C. Wang, and A. Shrivastava. NIRVANA: Neural implicit representations of videos with adaptive networks and autoregressive patch-wise modeling. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 14378–14387, 2023. doi: 10.1109/CVPR52729.2023.01382
  • [15] D. Minnen and S. Singh. Channel-wise autoregressive entropy models for learned image compression. In Proceedings of IEEE International Conference on Image Processing, pp. 3339–3343, 2020. doi: 10.1109/ICIP40778.2020.9190935
  • [16] S. Popinet, M. Smith, and C. Stevens. Experimental and numerical study of the turbulence characteristics of airflow around a research vessel. Journal of Atmospheric and Oceanic Technology, 21(10):1575–1589, 2004. doi: 10.1175/1520-0426(2004)021<1575:EANSOT>2.0.CO;2
  • [17] D. Silver and X. Wang. Tracking and visualizing turbulent 3D features. IEEE Transactions on Visualization and Computer Graphics, 3(2):129–141, 1997. doi: 10.1109/2945.597796
  • [18] K. Tang and C. Wang. ECNR: Efficient compressive neural representation of time-varying volumetric datasets. In Proceedings of IEEE Pacific Visualization Conference, pp. 72–81, 2024. doi: 10.1109/PACIFICVIS60374.2024.00017
  • [19] K. Tang and C. Wang. STSR-INR: Spatiotemporal super-resolution for time-varying multivariate volumetric data via implicit neural representation. Computers & Graphics, 119:103874, 2024. doi: 10.1016/J.CAG.2024.01.001
  • [20] C. Wang and J. Han. DL4SciVis: A state-of-the-art survey on deep learning for scientific visualization. IEEE Transactions on Visualization and Computer Graphics, 29(8):3714–3733, 2023. doi: 10.1109/TVCG.2022.3167896
  • [21] M. Wödlinger, J. Kotera, M. Keglevic, J. Xu, and R. Sablatnig. ECSIC: Epipolar cross attention for stereo image compression. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, pp. 3436–3445, 2024.
  • [22] M. Wödlinger, J. Kotera, J. Xu, and R. Sablatnig. SASIC: Stereo image compression with latent shifts and stereo attention. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 661–670, 2022. doi: 10.1109/CVPR52688.2022.00074
  • [23] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018. doi: 10.1109/CVPR.2018.00068
  • [24] X. Zhang, J. Shao, and J. Zhang. LDMIC: Learning-based distributed multi-view image coding. In Proceedings of International Conference on Learning Representations, 2023.
  • [25] Q. Zhao, M. S. Asif, and Z. Ma. DNeRV: Modeling inherent dynamics via difference neural representation for videos. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 2031–2040, 2023. doi: 10.1109/CVPR52729.2023.00202