In SceneTex, the target mesh is first projected to a given viewpoint via a rasterizer. Then, we render an RGB image with the proposed multiresolution texture field module. Specifically, each rasterized UV coordinate is used to sample UV embeddings from a multiresolution texture. Afterward, the UV embeddings are mapped to a 768 x 768 x 3 RGB image via a cross-attention texture decoder. We then compress the rendered RGB image into a 96 x 96 x 4 latent feature with a pre-trained VAE encoder. Finally, the Variational Score Distillation (VSD) loss is computed on this latent feature to update the texture field.
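As a rough sketch, one per-view pass might look like the following in PyTorch. This is a minimal illustration, not SceneTex's released code: `render_and_encode`, `texture_field`, and `decoder` are hypothetical placeholders for the components described above, and the VAE is assumed to follow the diffusers `AutoencoderKL` interface (which downsamples 768 x 768 by 8x to a 96 x 96 x 4 latent).

```python
import torch

def render_and_encode(uv_map, mask, texture_field, decoder, vae):
    """One forward pass of the texture rendering pipeline (hypothetical names).

    uv_map: (768, 768, 2) rasterized UV coordinates for the current view
    mask:   (768, 768) boolean foreground mask from the rasterizer
    """
    uv = uv_map[mask]                # (N, 2) UV coordinates at valid pixels
    uv_emb = texture_field(uv)       # (N, C) multiresolution UV embeddings

    rgb = torch.zeros(*uv_map.shape[:2], 3, device=uv.device)
    # Cross-attention texture decoder (sketched after the next paragraph)
    # maps embeddings to per-pixel colors in [0, 1].
    rgb[mask] = decoder(uv_emb)

    # Compress the rendered view with a frozen, pre-trained VAE encoder
    # (assumed diffusers-style AutoencoderKL; expects inputs in [-1, 1]).
    img = rgb.permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0   # (1, 3, 768, 768)
    latent = vae.encode(img).latent_dist.sample()          # (1, 4, 96, 96)
    latent = latent * vae.config.scaling_factor
    return rgb, latent  # the latent feeds the VSD loss against the diffusion prior
```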
To ensure style consistency, SceneTex incorporates a cross-attention texture decoder. For each rasterized UV coordinate, we apply a UV instance mask to select the texture features of the corresponding instance, yielding the rendering UV embeddings for the rasterized locations in the view. At the same time, we extract texture features at pre-sampled UVs scattered across the instance as the reference UV embeddings. A multi-head cross-attention module then produces instance-aware UV embeddings, treating the rendering UV embeddings as the Query and the reference UV embeddings as the Key and Value. Finally, a shared MLP maps the instance-aware UV embeddings to RGB values in the rendered view.
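A compact sketch of such a decoder is shown below, built on PyTorch's `nn.MultiheadAttention`. The class name, feature dimension, and MLP layout are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CrossAttnTextureDecoder(nn.Module):
    """Illustrative cross-attention texture decoder (hypothetical layout)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Shared MLP mapping instance-aware embeddings to RGB in [0, 1].
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 3), nn.Sigmoid(),
        )

    def forward(self, render_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
        # render_emb: (1, N, C) UV embeddings at rasterized locations -> Query
        # ref_emb:    (1, M, C) embeddings at pre-sampled instance UVs -> Key/Value
        instance_aware, _ = self.attn(query=render_emb, key=ref_emb, value=ref_emb)
        return self.mlp(instance_aware)  # (1, N, 3) per-pixel RGB
```

In practice the Query/Key/Value pairs would be formed per instance, using the UV instance mask to group each view's rasterized locations with the reference samples of the same instance.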
@misc{chen2023scenetex,
      title={SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors},
      author={Dave Zhenyu Chen and Haoxuan Li and Hsin-Ying Lee and Sergey Tulyakov and Matthias Nießner},
      year={2023},
      eprint={2311.17261},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}