License: arXiv.org perpetual non-exclusive license
arXiv:2312.04563v1 [cs.CV] 07 Dec 2023

Visual Geometry Grounded Deep Structure From Motion

Jianyuan Wang1,212{}^{1,2}start_FLOATSUPERSCRIPT 1 , 2 end_FLOATSUPERSCRIPT    Nikita Karaev 1,212{}^{1,2}start_FLOATSUPERSCRIPT 1 , 2 end_FLOATSUPERSCRIPT    Christian Rupprecht11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT    David Novotny22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT   
11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTVisual Geometry Group, University of Oxford                 22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTMeta AI
Abstract

Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.

Refer to caption
Figure 1: Reconstruction of In-the-wild Photos with VGGSfM, displaying estimated point clouds (in blue) and cameras (orange).
Refer to caption
Figure 2: Overview of VGGSfM. Our method extracts 2D tracks from input images, reconstructs cameras using image and track features, initializes a point cloud based on these tracks and camera parameters, and applies a bundle adjustment layer for reconstruction refinement. The whole framework is fully differentiable and designed for end-to-end training.

1 Introduction

Reconstructing the camera parameters and the 3D structure of a scene from a set of unconstrained 2D images is a long-standing problem in the computer vision community. Among many other applications [62, 33, 91, 9, 36], it has recently emerged as an important component of learning neural fields [56, 41, 11, 34, 44, 88]. The problem is usually solved via the Structure-from-Motion (SfM) framework which estimates the 3D point cloud (Structure) and the parameters of each camera (Motion) in the scene. State-of-the-art methods [46, 31] follow the incremental SfM paradigm whose origins can be traced back to the early 2000s [68, 28]. It usually begins with a small set of correspondence-rich images as initialization, and gradually adds more views into the reconstruction, through keypoint detection, matching, verification, image registration, triangulation, bundle adjustment (BA), and so on [69, 2, 25, 32].

Recent research efforts have predominantly revolved around leveraging the power of deep learning techniques to enhance specific elements within the original pipeline while preserving the incremental SfM framework as a whole. For instance, SuperPoint and SuperGlue [67, 18] focus on improving keypoint detection and matching. Pixel-perfect SfM [46] proposes deep feature-metric refinement to adjust both keypoints and bundles. Detector-free feature matching methods [75, 87, 13] bypass early keypoint detection by means of attention, which is powerful in poorly textured scenes. Detector-free SfM [31] builds a coarse SfM model through quantized detector-free matches and then iteratively refines it with multi-view consistency constraints. These advancements successfully combine deep learning approaches (such as deep feature matching) with well-established hand-engineered components, such as the incremental camera registration of COLMAP [69].

The widespread success of end-to-end training warrants the question of what benefits it can bring to long-standing frameworks such as SfM. Naturally, it is often difficult to assess the merits of new approaches when compared with decades of continuous improvements. Nonetheless, in this paper, we answer this question by introducing a fully-differentiable SfM pipeline, dubbed Visual Geometry Grounded Deep Structure From Motion (VGGSfM), which trains in an end-to-end manner. We find that this allows the pipeline to be simpler than prior frameworks while achieving better or comparable performance. Training end-to-end allows each component to generate outputs that facilitate the task of its successor.

To build a fully-differentiable pipeline, we make several substantial changes to the SfM procedures and overall obtain better performance. Specifically, our model builds on recent advances in deep 2D point tracking [27, 19, 20, 39] to directly extract reliable pixel-accurate tracks. This simplifies the correspondence estimation step in traditional SfM, which first estimates pairwise matches and then connects them into tracks. Then, based on the image and track features, VGGSfM estimates all cameras jointly via a Transformer [84], and subsequently all 3D points. Different from Incremental SfM, this approach is simpler and easier to differentiate as it does not depend on a discrete, combinatorial correspondence chaining step. Finally, for bundle adjustment, we replace the commonly employed non-differentiable Ceres solver [3] with the fully differentiable second-order Theseus solver [63].

Hence, we fuse all the SfM components into a single fully differentiable reconstruction function f𝑓fitalic_f. Besides that, in our experiments, we also show that the individual modules perform well in isolation. Ultimately, end-to-end training yields another performance improvement, that surpasses the performance of isolated components.

We evaluate VGGSfM for the task of camera pose estimation on the CO3Dv2 [64] and IMC Phototourism [38] datasets, and for 3D triangulation on the ETH3D [70] dataset. Our method attains strong performance on all benchmarks. At the same time, we conduct in-the-wild reconstruction to validate the generalization ability of our proposed framework, as shown in Fig. 1.

2 Related Work

Structure from Motion

is a fundamental problem in computer vision and has been investigated for decades [28, 62, 60]. The classical pipelines usually solve the SfM problem in a global [58, 92, 16] or incremental [74, 2, 24, 93, 69] manner. Both of which are usually based on pairwise image keypoint matching. Incremental SfM is arguably the most widely adopted strategy (e.g., the popular framework COLMAP [69]). Therefore, in the following sections, we refer to incremental SfM as “classical” or “traditional” SfM. We defer the discussion of global SfM to the supplementary.

Traditional SfM frameworks often start by detecting keypoints and feature descriptors [50, 51, 5, 54]. They then search for image pairs with overlapping frusta by matching these keypoints across different images (e.g., with a nearest-neighbour search) [49, 2, 69]. These image pairs are further verified via two-view epipolar geometry or homography [28] through RANSAC [23]. Then, a pair or a small set of images is carefully selected for initialization. New images are gradually registered by solving the Perspective-n𝑛nitalic_n-Point (PnP) problem [52], followed by triangulating 3D points, and bundle adjustment [81]. This process is iterated until all the frames are either registered or discarded. 2D correspondences (multi-view tracks) are the basis of the whole process, however, they are usually simply constructed by chaining two-view matches [69].

Many deep-learning approaches have been proposed to enhance this framework. For example, [96, 18, 82] provide better keypoint detection and  [67, 12, 47, 71, 37] focus on matching. Furthermore, detector-free matching methods [75, 87, 13] propose to avoid sparse keypoint detection by building semi-dense matches via self and cross attention. Some studies improve the performance of RANSAC by making it trainable [7, 6, 89]. Recent state-of-the-art methods are PixSfM [46] and the concurrent Detector-free SfM (DFSfM) [31]. PixSfM refines the tracks and structure estimated by COLMAP through feature-metric keypoint adjustment and feature-metric bundle adjustment. Detector-free SfM proposes to first build a coarse SfM model using detector-free matches and COLMAP (or other frameworks), and then to iteratively refine the tracks and the structure of the coarse model by enforcing multi-view consistency.

Recently, fully differentiable SfM pipelines have also been explored. They usually use deep neural networks to regress camera poses and depths [99, 83, 77, 90, 85, 78, 80]. Although using an approximation of bundle adjustment [77, 90, 80], these methods suffer from limited generalizability and scalability (very few input frames) [99, 83, 77, 85, 90], or rely on temporal relationship [78, 80]. Meanwhile, some methods are category-specific [94, 53, 95]. The recent efforts on deep camera pose estimation can scale up to more than 50 frames, but they do not reconstruct the scene [97, 72, 86, 43].

Point tracking.

Since VGGSfM proposes a novel point tracker, next, we review recent advances in this field. Inspired by the optical-flow architecture of RAFT [79], PIPs [27] revisited point tracking, a task related to Particle Video [66], and proposed a highly accurate tracker of isolated points in a video. TAP-Vid [19] (i.e., “Tracking Any Point”) introduced a benchmark for point tracking and a baseline model, which was later improved in TAPIR [20] by integrating the iterative update mechanism from PIPs. PointOdyssey [98] simplified PIPs and proposed a benchmark for the long-term version of point tracking. CoTracker [39] closed the gap between single point tracking and dense Optical Flow with joint point tracking. However, these works are designed for videos, i.e. temporally-ordered sequences of frames. In our point tracker, given the input frames are unordered, we do not assume a temporal relationship between input frames. We therefore process all frames jointly, avoiding windowed inference of [39]. Since SfM relies on highly accurate correspondences, our tracks are further refined in a coarse-to-fine manner to achieve sub-pixel accuracy.

3 Method

In this section, we describe the components of VGGSfM and how they are composed in a fully differentiable pipeline. An overview of our framework is shown in Fig. 2.

Problem setting

Given a set of free-form images observing a scene, VGGSfM estimates their corresponding camera parameters and the 3D scene shape represented as a point cloud. Formally, given a tuple =(I1,,INI)subscript𝐼1subscript𝐼subscript𝑁𝐼\mathcal{I}=\big{(}I_{1},...,I_{N_{I}}\big{)}caligraphic_I = ( italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_I start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) of NIsubscript𝑁𝐼N_{I}\in\mathbb{N}italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ∈ blackboard_N RGB images Ii3×H×Wsubscript𝐼𝑖superscript3𝐻𝑊I_{i}\in\mathbb{R}^{3\times H\times W}italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × italic_H × italic_W end_POSTSUPERSCRIPT, VGGSfM estimates the corresponding camera projection matrices 𝒫=(P1,,PNI|Pi3×4)𝒫subscript𝑃1conditionalsubscript𝑃subscript𝑁𝐼subscript𝑃𝑖superscript34\mathcal{P}=\big{(}P_{1},...,P_{N_{I}}\big{|}P_{i}\subset\mathbb{R}^{3\times 4% }\big{)}caligraphic_P = ( italic_P start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_P start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT | italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⊂ blackboard_R start_POSTSUPERSCRIPT 3 × 4 end_POSTSUPERSCRIPT ) and the scene cloud X={𝐱j}j=1N𝐱𝑋superscriptsubscriptsuperscript𝐱𝑗𝑗1subscript𝑁𝐱X=\{\mathbf{x}^{j}\}_{j=1}^{N_{\mathbf{x}}}italic_X = { bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT of N𝐱subscript𝑁𝐱N_{\mathbf{x}}\in\mathbb{N}italic_N start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT ∈ blackboard_N 3D points 𝐱j3superscript𝐱𝑗superscript3\mathbf{x}^{j}\in\mathbb{R}^{3}bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT. Each projection matrix Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT consists of extrinsics (pose) gi𝕊𝔼(3)subscript𝑔𝑖𝕊𝔼3g_{i}\in\mathbb{SE}(3)italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_S blackboard_E ( 3 ) and intrinsics Ki3×3subscript𝐾𝑖superscript33K_{i}\in\mathbb{R}^{3\times 3}italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT.

A 3D point 𝐱jsuperscript𝐱𝑗\mathbf{x}^{j}bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT can be projected to the i𝑖iitalic_i-th camera yielding a 2D screen coordinate 𝐲ij=Pi(𝐱j)λKi𝐱^ij;λ+formulae-sequencesuperscriptsubscript𝐲𝑖𝑗subscript𝑃𝑖superscript𝐱𝑗similar-to𝜆subscript𝐾𝑖superscriptsubscript^𝐱𝑖𝑗𝜆subscript\mathbf{y}_{i}^{j}=P_{i}(\mathbf{x}^{j})\sim\lambda K_{i}\hat{\mathbf{x}}_{i}^% {j};\lambda\in\mathbb{R}_{+}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ∼ italic_λ italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT over^ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ; italic_λ ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT, where 𝐱^ij=gi𝐱jsuperscriptsubscript^𝐱𝑖𝑗subscript𝑔𝑖superscript𝐱𝑗\hat{\mathbf{x}}_{i}^{j}=g_{i}\mathbf{x}^{j}over^ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT is the world coordinate 𝐱jsuperscript𝐱𝑗\mathbf{x}^{j}bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT expressed in view-coordinates of the i𝑖iitalic_i-th camera. The projection of the point 𝐱jsuperscript𝐱𝑗\mathbf{x}^{j}bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT to all input cameras is a track Tj=((y1j,v1j),(yNIj,vNIj))superscript𝑇𝑗superscriptsubscript𝑦1𝑗superscriptsubscript𝑣1𝑗subscriptsuperscript𝑦𝑗subscript𝑁𝐼subscriptsuperscript𝑣𝑗subscript𝑁𝐼T^{j}=\big{(}(y_{1}^{j},v_{1}^{j})...,(y^{j}_{N_{I}},v^{j}_{N_{I}})\big{)}italic_T start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = ( ( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) … , ( italic_y start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_v start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ) consisting of NIsubscript𝑁𝐼N_{I}italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT matching 2D points 𝐲ij2superscriptsubscript𝐲𝑖𝑗superscript2\mathbf{y}_{i}^{j}\in\mathbb{R}^{2}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, and their corresponding binary indicators vij{0,1}superscriptsubscript𝑣𝑖𝑗01v_{i}^{j}\in\{0,1\}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ { 0 , 1 } denoting visibility of the j𝑗jitalic_j-th point in the i𝑖iitalic_i-th camera. We denote 𝒯i={Ti1,,TiNT}subscript𝒯𝑖subscriptsuperscript𝑇1𝑖superscriptsubscript𝑇𝑖subscript𝑁𝑇\mathcal{T}_{i}=\{T^{1}_{i},...,T_{i}^{N_{T}}\}caligraphic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_T start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , … , italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT } as the set of all tracks Tijsubscriptsuperscript𝑇𝑗𝑖T^{j}_{i}italic_T start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in the i𝑖iitalic_i-th camera.

3.1 Method overview

VGGSfM implements SfM via a single function fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT

fθ()=𝒫,Xsubscript𝑓𝜃𝒫𝑋f_{\theta}(\mathcal{I})=\mathcal{P},X\vspace{-1mm}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( caligraphic_I ) = caligraphic_P , italic_X (1)

accepting the set of NIsubscript𝑁𝐼N_{I}italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT scene images \mathcal{I}caligraphic_I and outputting the camera parameters 𝒫𝒫\mathcal{P}caligraphic_P and the scene point cloud X𝑋Xitalic_X. Importantly, fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT is fully differentiable and, as such, its parameters θ𝜃\thetaitalic_θ are learned by minimizing the training loss \mathcal{L}caligraphic_L:

θ=argminθs=1S(fθ(s),𝒫s,𝒯s,Xs),superscript𝜃subscriptargmin𝜃superscriptsubscript𝑠1𝑆subscript𝑓𝜃subscript𝑠subscriptsuperscript𝒫𝑠subscriptsuperscript𝒯𝑠superscriptsubscript𝑋𝑠\theta^{\star}=\operatorname*{arg\,min}_{\theta}\sum_{s=1}^{S}\mathcal{L}(f_{% \theta}(\mathcal{I}_{s}),\mathcal{P}^{\star}_{s},\mathcal{T}^{\star}_{s},X_{s}% ^{\star}),\vspace{-1mm}italic_θ start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT caligraphic_L ( italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( caligraphic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) , caligraphic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , caligraphic_T start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) , (2)

summing over S𝑆S\in\mathbb{N}italic_S ∈ blackboard_N training image sets ssubscript𝑠\mathcal{I}_{s}caligraphic_I start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT annotated with ground-truth cameras 𝒫ssuperscriptsubscript𝒫𝑠\mathcal{P}_{s}^{\star}caligraphic_P start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT, tracks 𝒯ssuperscriptsubscript𝒯𝑠\mathcal{T}_{s}^{\star}caligraphic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT, and point clouds Xssuperscriptsubscript𝑋𝑠X_{s}^{\star}italic_X start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT. We defer the details of \mathcal{L}caligraphic_L to Sec. 3.5 and, in the following paragraphs, discuss the architecture of fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT.

The reconstruction function

Following traditional SfM [69], VGGSfM decomposes the reconstruction function fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT into four seamless stages: 1) point tracking T, 2) initial camera estimator 𝔗𝒫subscript𝔗𝒫\mathfrak{T}_{\mathcal{P}}fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT, 3) triangulator 𝔗Xsubscript𝔗𝑋\mathfrak{T}_{X}fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT and, 4) Bundle Adjustment BA, as follows:

𝒯𝒯\displaystyle\mathcal{T}caligraphic_T =𝚃()absent𝚃\displaystyle=\texttt{T}(\mathcal{I})= T ( caligraphic_I ) (3)
𝒫^^𝒫\displaystyle\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG =𝔗𝒫(,𝒯)absentsubscript𝔗𝒫𝒯\displaystyle=\mathfrak{T}_{\mathcal{P}}(\mathcal{I},\mathcal{T})= fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT ( caligraphic_I , caligraphic_T )
X^^𝑋\displaystyle\hat{X}over^ start_ARG italic_X end_ARG =𝔗X(𝒯,𝒫^)absentsubscript𝔗𝑋𝒯^𝒫\displaystyle=\mathfrak{T}_{X}(\mathcal{T},\hat{\mathcal{P}})= fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( caligraphic_T , over^ start_ARG caligraphic_P end_ARG )
𝒫,X𝒫𝑋\displaystyle\mathcal{P},Xcaligraphic_P , italic_X =𝙱𝙰(𝒯,𝒫^,X^).absent𝙱𝙰𝒯^𝒫^𝑋\displaystyle=\texttt{BA}(\mathcal{T},\hat{\mathcal{P}},\hat{X}).= BA ( caligraphic_T , over^ start_ARG caligraphic_P end_ARG , over^ start_ARG italic_X end_ARG ) .

The tracker T estimates 2D tracks 𝒯𝒯\mathcal{T}caligraphic_T given input images \mathcal{I}caligraphic_I. Subsequently, 𝔗𝒫subscript𝔗𝒫\mathfrak{T}_{\mathcal{P}}fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT and 𝔗Xsubscript𝔗𝑋\mathfrak{T}_{X}fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT provide initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG and an initial point cloud X^^𝑋\hat{X}over^ start_ARG italic_X end_ARG respectively. Finally, BA enhances accuracy by refining the cameras and 3D points together.

3.2 Tracking

Establishing precise 2D correspondences is important for achieving accurate 3D reconstruction. Traditionally, SfM frameworks first estimate pairwise image-to-image correspondences that are later chained into multi-image tracks T𝑇Titalic_T [46, 69, 31]. Here, typically only point-pair matching benefits from learned components [18, 82, 67, 47], while the chaining of pairwise correspondences remains a hand-engineered process.

Instead, VGGSfM significantly simplifies SfM correspondence tracking by employing a deep feed-forward tracking function. It accepts a collection of images and directly outputs a set of reliable point trajectories across all images. We achieve this by exploiting recent advances in video point tracking methods [27, 19, 20, 39]. Although developed for video-point tracking, these methods are inherently appropriate for SfM which requires a very compact set of highly accurate tracks (e.g., dense optical flow is too memory-demanding). Furthermore, point trackers mitigate the potential errors (sometimes called drift) caused by chaining of pairwise matches. However, as we describe below, our design differs from video trackers because SfM, which accepts free-form images, cannot assume temporal smoothness or ordering, and requires sub-pixel accuracy.

Refer to caption
Figure 3: Architecture of Tracker T. We adopt a coarse-to-fine design for the tracker. The coarse tracker first locates the approximate positions of corresponding points, and the fine tracker then refines these initial predictions.

Tracker architecture

The design of our tracker T follows [27, 39], and is illustrated in Fig. 3. More specifically, given NTsubscript𝑁𝑇N_{T}italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT query points {𝐲^i1,𝐲^iNT}subscriptsuperscript^𝐲1𝑖subscriptsuperscript^𝐲subscript𝑁𝑇𝑖\{\hat{\mathbf{y}}^{1}_{i},...\hat{\mathbf{y}}^{N_{T}}_{i}\}{ over^ start_ARG bold_y end_ARG start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , … over^ start_ARG bold_y end_ARG start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } in a frame Iisubscript𝐼𝑖I_{i}italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we bilinearly sample their corresponding query descriptors {𝐦i1,,𝐦iNT}subscriptsuperscript𝐦1𝑖subscriptsuperscript𝐦subscript𝑁𝑇𝑖\{\mathbf{m}^{1}_{i},...,\mathbf{m}^{N_{T}}_{i}\}{ bold_m start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , … , bold_m start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } from image feature maps output by a 2D CNN. Then, each query descriptor is correlated with the feature maps of all NIsubscript𝑁𝐼N_{I}italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT frames at different spatial resolutions, which constructs a cost-volume pyramid. Flattening the latter yields tokens VNT×NI×C𝑉superscriptsubscript𝑁𝑇subscript𝑁𝐼𝐶V\in\mathbb{R}^{N_{T}\times N_{I}\times C}italic_V ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × italic_C end_POSTSUPERSCRIPT, where C𝐶Citalic_C is the total number of elements in the cost-volume pyramid. Feeding the tokens to a Transformer, we obtain tracks 𝒯={Tj}j=1NT𝒯superscriptsubscriptsuperscript𝑇𝑗𝑗1subscript𝑁𝑇\mathcal{T}=\{T^{j}\}_{j=1}^{N_{T}}caligraphic_T = { italic_T start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. Recall that each track Tjsuperscript𝑇𝑗T^{j}italic_T start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT comprises the set of NIsubscript𝑁𝐼N_{I}italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT tracked 2D locations 𝐲ijsuperscriptsubscript𝐲𝑖𝑗\mathbf{y}_{i}^{j}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT together with predicted visibility indicators vijsuperscriptsubscript𝑣𝑖𝑗v_{i}^{j}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT.

It is worth noting that, differently from [27, 39], our tracker does not assume temporal continuity. Therefore, we avoid the sliding window approach and, instead, attend to all the input frames together. Furthermore, unlike in [39], we predict each track independently of others. This allows to track a larger number of points at test time leading to increased density of reconstructed point clouds.

Predicting tracking confidence

In SfM, it is crucial to filter out any outlier correspondences as they can negatively impact the subsequent reconstruction stages. To this end, we enhance the tracker with the ability to estimate confidence for each track-point prediction.

More specifically, we leverage the aleatoric uncertainty [40, 59] model which predicts variance σijsuperscriptsubscript𝜎𝑖𝑗\mathbf{\sigma}_{i}^{j}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT together with each 2D track point 𝐲ijsuperscriptsubscript𝐲𝑖𝑗\mathbf{y}_{i}^{j}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT, so that the resulting normal distribution 𝒩(𝐲ij|𝐲ij,σij)𝒩conditionalsuperscriptsubscript𝐲𝑖𝑗superscriptsubscript𝐲𝑖𝑗superscriptsubscript𝜎𝑖𝑗\mathcal{N}(\mathbf{y}_{i}^{j\star}|\mathbf{y}_{i}^{j},\mathbf{\sigma}_{i}^{j})caligraphic_N ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j ⋆ end_POSTSUPERSCRIPT | bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) tightly peaks around each ground-truth 2D track point 𝐲ijsuperscriptsubscript𝐲𝑖𝑗\mathbf{y}_{i}^{j\star}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j ⋆ end_POSTSUPERSCRIPT. Hence, during training, the 1subscript1\ell_{1}roman_ℓ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT/2subscript2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss, originally used in video point tracking, is replaced with the (negated) logarithm of the latter probability evaluated at each ground truth point 𝐲ijsuperscriptsubscript𝐲𝑖𝑗\mathbf{y}_{i}^{j\star}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j ⋆ end_POSTSUPERSCRIPT. Once trained, the confidence measure 1/σij1superscriptsubscript𝜎𝑖𝑗1/\sigma_{i}^{j}1 / italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT is proportional to the inverse of the predicted variance. In practice, we assume a diagonal covariance matrix resulting in horizontal and vertical uncertainties σijsuperscriptsubscript𝜎𝑖𝑗\mathbf{\sigma}_{i}^{j}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT.

Coarse-to-fine tracking

Moreover, since SfM requires highly accurate (pixel or sub-pixel level) correspondences, we employ a coarse-to-fine point-tracking strategy. As described above, we first coarsely track image points using feature maps that fully cover the input images I𝐼Iitalic_I. Then, we form P×P𝑃𝑃P\times Pitalic_P × italic_P patches by cropping input images around the coarse point estimates and execute the tracking again to obtain a sub-pixel estimate. Recall that, differently from the chained matching of traditional SfM, our tracker is fully differentiable. This enables back-propagating the gradient of the training loss \mathcal{L}caligraphic_L through the whole framework to the tracker parameters. This reinforces the synergy between the tracking and the ensuing stages, which are described next.

3.3 Learnable camera & point initialization

As discussed above, a traditional SfM pipeline [69, 46] usually relies on an an incremental loop, which often initializes with a correspondence-rich image pair, gradually registers new frames, enlarges the point cloud, and conducts joint optimization (e.g., BA). However, although the framework has been fortified in robustness and accuracy through decades of improvements, this cumulative process has inevitably led to increased complexity. Furthermore, Incremental SfM is largely non-differentiable which complicates end-to-end learning from annotated data.

Thus, in pursuit of simplicity and differentiability, our method departs from the classical SfM scheme. Inspired by recent advances in deep camera pose estimators [86, 43, 97], we propose to initialize the cameras and the point cloud with a pair of deep Transformer [84] networks. Importantly, we register all cameras and reconstruct all scene points collectively in a non-incremental differentiable fashion.

Learnable camera initializer

The predictor of initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG is implemented as a deep Transformer architecture 𝔗𝒫subscript𝔗𝒫\mathfrak{T}_{\mathcal{P}}fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT:

𝒫^=𝔗𝒫({ϕ(Ii)|Ii},{d𝒫(yij)|Ti𝒯yijTi}).^𝒫subscript𝔗𝒫conditional-setitalic-ϕsubscript𝐼𝑖subscript𝐼𝑖conditional-setsuperscript𝑑𝒫superscriptsubscript𝑦𝑖𝑗for-allsubscript𝑇𝑖𝒯for-allsuperscriptsubscript𝑦𝑖𝑗subscript𝑇𝑖\hat{\mathcal{P}}=\mathfrak{T}_{\mathcal{P}}(\{\phi(I_{i})|I_{i}\in\mathcal{I}% \},\{d^{\mathcal{P}}(y_{i}^{j})|\forall T_{i}\in\mathcal{T}~{}\forall y_{i}^{j% }\in T_{i}\}).over^ start_ARG caligraphic_P end_ARG = fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT ( { italic_ϕ ( italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_I } , { italic_d start_POSTSUPERSCRIPT caligraphic_P end_POSTSUPERSCRIPT ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) | ∀ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_T ∀ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } ) . (4)

It accepts a set of tokens comprising global ResNet50 [29] features ϕ(Ii)italic-ϕsubscript𝐼𝑖\phi(I_{i})italic_ϕ ( italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) of input images \mathcal{I}caligraphic_I, and the set of descriptors d𝒫(𝐲ij)superscript𝑑𝒫superscriptsubscript𝐲𝑖𝑗d^{\mathcal{P}}(\mathbf{y}_{i}^{j})italic_d start_POSTSUPERSCRIPT caligraphic_P end_POSTSUPERSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) of track points 𝐲ijTi𝒯superscriptsubscript𝐲𝑖𝑗subscript𝑇𝑖𝒯\mathbf{y}_{i}^{j}\in T_{i}\in\mathcal{T}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_T. Here, each track descriptor is output by an auxiliary branch of the tracker T. Given these inputs, 𝔗𝒫subscript𝔗𝒫\mathfrak{T}_{\mathcal{P}}fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT first applies cross-attention between the global image feature (query) and the track-descriptor (key-value) pairs yielding NI=||subscript𝑁𝐼N_{I}=|\mathcal{I}|italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT = | caligraphic_I | tokens per scene. The output of cross-attention is then concatenated with an embedding of a preliminary camera estimated by the 8-point algorithm taking the correspondences between track points 𝐲ijsuperscriptsubscript𝐲𝑖𝑗\mathbf{y}_{i}^{j}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT as input. Finally, we feed this concatenation to a Transformer trunk resulting in the initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG.

Learnable triangulation

Given initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG and 2D tracks 𝒯𝒯\mathcal{T}caligraphic_T, the triangulator outputs the initial point cloud X^^𝑋\hat{X}over^ start_ARG italic_X end_ARG. Similar to the camera predictor, the triangulator is a Transformer 𝔗Xsubscript𝔗𝑋\mathfrak{T}_{X}fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT

X^=𝔗X({dX(yij)|Ti𝒯yijTi})^𝑋subscript𝔗𝑋conditional-setsuperscript𝑑𝑋superscriptsubscript𝑦𝑖𝑗for-allsubscript𝑇𝑖𝒯for-allsuperscriptsubscript𝑦𝑖𝑗subscript𝑇𝑖\hat{X}=\mathfrak{T}_{X}(\{d^{X}(y_{i}^{j})|\forall T_{i}\in\mathcal{T}\forall y% _{i}^{j}\in T_{i}\})over^ start_ARG italic_X end_ARG = fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( { italic_d start_POSTSUPERSCRIPT italic_X end_POSTSUPERSCRIPT ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) | ∀ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_T ∀ italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } ) (5)

accepting descriptors dX(𝐲ij)superscript𝑑𝑋superscriptsubscript𝐲𝑖𝑗d^{X}(\mathbf{y}_{i}^{j})italic_d start_POSTSUPERSCRIPT italic_X end_POSTSUPERSCRIPT ( bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) each comprising a tracker feature, and a positional harmonic embedding [55] of points 𝐱¯jX¯superscript¯𝐱𝑗¯𝑋\bar{\mathbf{x}}^{j}\in\bar{X}over¯ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ over¯ start_ARG italic_X end_ARG from a preliminary point cloud X¯¯𝑋\bar{X}over¯ start_ARG italic_X end_ARG. The preliminary point cloud is formed via closed-form multi-view Direct Linear Transform (DLT) 3D triangulation [28] of the tracks 𝒯𝒯\mathcal{T}caligraphic_T given the initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG. Please refer to the supplementary for a detailed description of both initializers.

3.4 Bundle adjustment

Given the tracks 𝒯𝒯\mathcal{T}caligraphic_T (Sec. 3.2), initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG, and the initial point cloud X^^𝑋\hat{X}over^ start_ARG italic_X end_ARG (Sec. 3.3), Bundle Adjustment BA minimizes the reprojection loss BAsubscriptBA\mathcal{L}_{\text{BA}}caligraphic_L start_POSTSUBSCRIPT BA end_POSTSUBSCRIPT[28, 69, 2, 1]:

X,𝒫𝑋𝒫\displaystyle X,\mathcal{P}italic_X , caligraphic_P =𝙱𝙰(𝒯,X^,𝒫^)=argminX,𝒫BAabsent𝙱𝙰𝒯^𝑋^𝒫subscriptargmin𝑋𝒫subscriptBA\displaystyle=\texttt{BA}(\mathcal{T},\hat{X},\hat{\mathcal{P}})=\operatorname% *{arg\,min}_{X,\mathcal{P}}\mathcal{L}_{\text{BA}}= BA ( caligraphic_T , over^ start_ARG italic_X end_ARG , over^ start_ARG caligraphic_P end_ARG ) = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_X , caligraphic_P end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT BA end_POSTSUBSCRIPT (6)
BAsubscriptBA\displaystyle\mathcal{L}_{\text{BA}}caligraphic_L start_POSTSUBSCRIPT BA end_POSTSUBSCRIPT =i=1NIj=1N𝐱vijPi(𝐱j)yij,absentsuperscriptsubscript𝑖1subscript𝑁𝐼superscriptsubscript𝑗1subscript𝑁𝐱superscriptsubscript𝑣𝑖𝑗normsubscript𝑃𝑖superscript𝐱𝑗superscriptsubscript𝑦𝑖𝑗\displaystyle=\sum_{i=1}^{N_{I}}\sum_{j=1}^{N_{\mathbf{x}}}v_{i}^{j}\|P_{i}(% \mathbf{x}^{j})-y_{i}^{j}\|,= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT bold_x end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∥ italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∥ ,

summing over all reprojection errors Pi(𝐱j)𝐲ijnormsubscript𝑃𝑖superscript𝐱𝑗superscriptsubscript𝐲𝑖𝑗\|P_{i}(\mathbf{x}^{j})-\mathbf{y}_{i}^{j}\|∥ italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) - bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∥ each comprising the distance between the projection Pi(𝐱j)subscript𝑃𝑖superscript𝐱𝑗P_{i}(\mathbf{x}^{j})italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) of the point cloud 𝐱jXsuperscript𝐱𝑗𝑋\mathbf{x}^{j}\in Xbold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ italic_X to camera Pi𝒫subscript𝑃𝑖𝒫P_{i}\in\mathcal{P}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_P, and the i𝑖iitalic_i-th 2D point 𝐲ijTjsuperscriptsubscript𝐲𝑖𝑗superscript𝑇𝑗\mathbf{y}_{i}^{j}\in T^{j}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT of the track Tjsuperscript𝑇𝑗T^{j}italic_T start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT. Additionally, the error terms are filtered out if the corresponding points have low visibility, low confidence, or do not fit the geometric constraints defined by [69]. Points with large reprojections errors are also filtered [74, 93, 69]. More details are provided in the supplementary material.

Differentiable Levenberg-Marquardt

Following common practice [69, 46], we minimize Eq. 6 with second-order Levenberg-Marquardt (LM) optimizer [57]. However, optimizing the main training loss (Eq. 2) via backpropagation requires differentiability of Eq. 6 which is non-trivial. Therefore, we leverage the recently proposed Theseus library [63] which exploits the implicit function theorem to backpropagate through deep networks with nested optimization loops.

Refer to caption
Figure 4: Camera and point-cloud reconstructions of VGGSfM on Co3D (left) and IMC Phototourism (right).

3.5 Method details

Camera parameterization

Each camera pose P𝒫𝑃𝒫P\in\mathcal{P}italic_P ∈ caligraphic_P is parameterized with 8-degrees of freedom: the quaternion q(R)4𝑞𝑅superscript4q(R)\in\mathbb{R}^{4}italic_q ( italic_R ) ∈ blackboard_R start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT of the rotation R𝕊𝕆(3)𝑅𝕊𝕆3R\in\mathbb{SO}(3)italic_R ∈ blackboard_S blackboard_O ( 3 ) and the translation 𝐭3𝐭superscript3\mathbf{t}\in\mathbb{R}^{3}bold_t ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT components of P𝑃Pitalic_P’s extrinsics g𝕊𝔼(3)𝑔𝕊𝔼3g\in\mathbb{SE}(3)italic_g ∈ blackboard_S blackboard_E ( 3 ), and the logarithm ln(𝔣)𝔣\ln(\mathfrak{f})\in\mathbb{R}roman_ln ( fraktur_f ) ∈ blackboard_R of the camera focal length 𝔣+𝔣superscript\mathfrak{f}\in\mathbb{R}^{+}fraktur_f ∈ blackboard_R start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT. Given these values, the 3×4343\times 43 × 4 pose matrix is defined as P=KR[𝕀3|𝐭]𝑃𝐾𝑅delimited-[]conditionalsubscript𝕀3𝐭P=KR[\mathbb{I}_{3}|\mathbf{t}]italic_P = italic_K italic_R [ blackboard_I start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT | bold_t ], with the calibration matrix K=[𝔣,0,px;0,𝔣,py;0,0,1]3×3𝐾𝔣0subscript𝑝𝑥0𝔣subscript𝑝𝑦001superscript33K=[\mathfrak{f},0,p_{x};0,\mathfrak{f},p_{y};0,0,1]\subset\mathbb{R}^{3\times 3}italic_K = [ fraktur_f , 0 , italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ; 0 , fraktur_f , italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ; 0 , 0 , 1 ] ⊂ blackboard_R start_POSTSUPERSCRIPT 3 × 3 end_POSTSUPERSCRIPT (row major order). Following standard practice [69, 46], we set the principal-point coordinates px,pysubscript𝑝𝑥subscript𝑝𝑦p_{x},p_{y}\in\mathbb{R}italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ∈ blackboard_R to the center of the image.

Training loss

The training loss \mathcal{L}caligraphic_L (Eq. 2) is defined as:

(fθ(),𝒫,𝒯,X)=j=1NT|𝐱j𝐱j|ϵ+|𝐱j𝐱^j|ϵ+subscript𝑓𝜃superscript𝒫superscript𝒯superscript𝑋superscriptsubscript𝑗1subscript𝑁𝑇subscriptsuperscript𝐱absent𝑗superscript𝐱𝑗italic-ϵlimit-fromsubscriptsuperscript𝐱absent𝑗superscript^𝐱𝑗italic-ϵ\displaystyle\mathcal{L}(f_{\theta}(\mathcal{I}),\mathcal{P}^{\star},\mathcal{% T}^{\star},X^{\star})=\sum_{j=1}^{N_{T}}|\mathbf{x}^{\star j}-\mathbf{x}^{j}|_% {\epsilon}+|\mathbf{x}^{\star j}-\hat{\mathbf{x}}^{j}|_{\epsilon}+caligraphic_L ( italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( caligraphic_I ) , caligraphic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT , caligraphic_T start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT , italic_X start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT | bold_x start_POSTSUPERSCRIPT ⋆ italic_j end_POSTSUPERSCRIPT - bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT | start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT + | bold_x start_POSTSUPERSCRIPT ⋆ italic_j end_POSTSUPERSCRIPT - over^ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT | start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT + (7)
+i=1NIe𝒫(Pi,Pi)+e𝒫(Pi,P^i)i=1NIj=1NTlog𝒩(𝐲ij|𝐲ij,σij)superscriptsubscript𝑖1subscript𝑁𝐼subscript𝑒𝒫subscriptsuperscript𝑃𝑖subscript𝑃𝑖subscript𝑒𝒫subscriptsuperscript𝑃𝑖subscript^𝑃𝑖superscriptsubscript𝑖1subscript𝑁𝐼superscriptsubscript𝑗1subscript𝑁𝑇𝒩conditionalsubscriptsuperscript𝐲𝑗𝑖superscriptsubscript𝐲𝑖𝑗superscriptsubscript𝜎𝑖𝑗\displaystyle+\sum_{i=1}^{N_{I}}e_{\mathcal{P}}(P^{\star}_{i},P_{i})+e_{% \mathcal{P}}(P^{\star}_{i},\hat{P}_{i})-\sum_{i=1}^{N_{I}}\sum_{j=1}^{N_{T}}% \log\mathcal{N}(\mathbf{y}^{j\star}_{i}|\mathbf{y}_{i}^{j},\sigma_{i}^{j})+ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_e start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT ( italic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_e start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT ( italic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_log caligraphic_N ( bold_y start_POSTSUPERSCRIPT italic_j ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT )

Here, |𝐱j𝐱j,t|ϵsubscriptsuperscript𝐱absent𝑗superscript𝐱𝑗𝑡italic-ϵ|\mathbf{x}^{\star j}-\mathbf{x}^{j,t}|_{\epsilon}| bold_x start_POSTSUPERSCRIPT ⋆ italic_j end_POSTSUPERSCRIPT - bold_x start_POSTSUPERSCRIPT italic_j , italic_t end_POSTSUPERSCRIPT | start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT and |𝐱j𝐱^j|ϵsubscriptsuperscript𝐱absent𝑗superscript^𝐱𝑗italic-ϵ|\mathbf{x}^{\star j}-\hat{\mathbf{x}}^{j}|_{\epsilon}| bold_x start_POSTSUPERSCRIPT ⋆ italic_j end_POSTSUPERSCRIPT - over^ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT | start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT evaluate the ϵitalic-ϵ\epsilonitalic_ϵ-thresholded pseudo-Huber loss ||ϵ|\cdot|_{\epsilon}| ⋅ | start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT [10] between the ground truth 3D points 𝐱jsuperscript𝐱absent𝑗\mathbf{x}^{\star j}bold_x start_POSTSUPERSCRIPT ⋆ italic_j end_POSTSUPERSCRIPT and the initial and BA-refined 3D points 𝐱^jXsuperscript^𝐱𝑗𝑋\hat{\mathbf{x}}^{j}\in Xover^ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ italic_X, 𝐱jX^superscript𝐱𝑗^𝑋\mathbf{x}^{j}\in\hat{X}bold_x start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ over^ start_ARG italic_X end_ARG respectively. The camera errors e𝒫(Pi,Pi)subscript𝑒𝒫subscriptsuperscript𝑃𝑖subscript𝑃𝑖e_{\mathcal{P}}(P^{\star}_{i},P_{i})italic_e start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT ( italic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and e𝒫(Pi,P^i)subscript𝑒𝒫subscriptsuperscript𝑃𝑖subscript^𝑃𝑖e_{\mathcal{P}}(P^{\star}_{i},\hat{P}_{i})italic_e start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT ( italic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) compare the predicted initial pose P^i𝒫^subscript^𝑃𝑖^𝒫\hat{P}_{i}\in\hat{\mathcal{P}}over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ over^ start_ARG caligraphic_P end_ARG and bundle-adjusted camera pose Pi𝒫subscript𝑃𝑖𝒫P_{i}\in\mathcal{P}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_P to the ground truth camera annotation Pi𝒫isubscriptsuperscript𝑃𝑖subscriptsuperscript𝒫𝑖P^{\star}_{i}\in\mathcal{P}^{\star}_{i}italic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_P start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Here, e𝒫(P,P)subscript𝑒𝒫𝑃superscript𝑃e_{\mathcal{P}}(P,P^{\prime})italic_e start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT ( italic_P , italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) is defined as the Huber-loss ||ϵ|\cdot|_{\epsilon}| ⋅ | start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT between the parameterizations of poses P𝑃Pitalic_P, Psuperscript𝑃P^{\prime}italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. Finally, log𝒩(𝐲ij|𝐲ij,σij)𝒩conditionalsubscriptsuperscript𝐲𝑗𝑖superscriptsubscript𝐲𝑖𝑗superscriptsubscript𝜎𝑖𝑗\log\mathcal{N}(\mathbf{y}^{j\star}_{i}|\mathbf{y}_{i}^{j},\sigma_{i}^{j})roman_log caligraphic_N ( bold_y start_POSTSUPERSCRIPT italic_j ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) computes the likelihood of a ground-truth track point 𝐲ijTisubscriptsuperscript𝐲𝑗𝑖subscriptsuperscript𝑇𝑖\mathbf{y}^{j\star}_{i}\in T^{\star}_{i}bold_y start_POSTSUPERSCRIPT italic_j ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_T start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT under a probabilistic track-point estimate defined by a 2D gaussian with mean and variance predictions 𝐲ijsuperscriptsubscript𝐲𝑖𝑗\mathbf{y}_{i}^{j}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT and σijsuperscriptsubscript𝜎𝑖𝑗\sigma_{i}^{j}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT respectively (i.e. the aleatoric uncertainty model described in Sec. 3.2).

4 Experiments

In this section, we first introduce the datasets together with the protocols for training and evaluation. Then, we provide comparison to existing methods and ablation studies.

Refer to caption
Figure 5: Qualitative Evaluation of Tracking. In each row, the left-most frame contains the query image with query points (crosses). The predicted track points 𝐲ijsuperscriptsubscript𝐲𝑖𝑗\mathbf{y}_{i}^{j}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT (dots) are shown to the right. The top-right part also highlights our track-point confidence predictions (described in Sec. 3.2), illustrated as ellipses with extent proportional to the predicted variance σijsuperscriptsubscript𝜎𝑖𝑗\mathbf{\sigma}_{i}^{j}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT. Note how the uncertainty corresponds to an expectation, e.g., a keypoint covering a vertical stripe has higher uncertainty along the y𝑦yitalic_y-axis.
Co3D Dataset Incremental SfM Deep
Method COLMAP (SP + SG) PixSfM (SP + SG) RelPose[97] PoseReg[86] RelPose++[43] PoseDiffusion[86] Ours w/o Joint Ours
RRE@15°°\degree° 31.6 33.7 57.1 53.2 82.3 80.5 88.2 92.1
RTE@15°°\degree° 27.3 32.9 - 49.1 77.2 79.8 83.4 88.3
AUC@30°°\degree° 25.3 30.1 - 45.0 65.1 66.5 70.7 74.0
Table 1: Camera Pose Estimation on Co3D, where the proposed method outperforms previous methods by a large margin. Ours w/o Joint indicates the variant of our framework without training all components jointly.

Datasets.

Following prior work [86, 46, 31], we evaluate camera pose estimation on Co3Dv2 [64] and IMC Phototourism datasets [38], and 3D triangulation on ETH3D [70]. Co3D is an object-centric dataset comprising turnable-style videos from 51 categories of MS COCO [45]. IMC Phototourism, provided by Image Matching Challenge [38], contains 8 testing scenes and 3 validation scenes of famous landmarks. Generally, the Co3D scenes have much wider baselines, making them challenging for traditional frameworks such as COLMAP, while the IMC samples often have sufficiently overlapping fursta, which is where COLMAP excels. ETH3D provides highly accurate point clouds (captured by laser scanner) for 13 indoor and outdoor scenes, and hence is suitable for the evaluation of triangulation.

Training.

For the model evaluated on IMC Phototourism and ETH3D, we follow the protocol of  [67, 46, 31] and train on the MegaDepth dataset [42]. MegaDepth contains 1M crowd-sourced images depicting 196 tourism landmarks, auto-annotated with SfM tools. Hyper-parameters are tuned using the IMC validation set. As in prior work [47, 31], some scenes of MegaDepth are excluded from training due to their low quality or due to overlap with the IMC test set. For Co3Dv2, we conduct training and evaluation on 41 categories as in [86, 97, 43].

We chose a multi-stage training strategy for better stability. We first train the tracker T on the synthetic Kubric dataset [26] following the training protocol of [19, 39]. Then, the tracker is fine-tuned solely on Co3D or MegaDepth, depending on the test dataset. Subsequently, we train the camera initializer, with the tracker frozen. Next, the triangulator is trained with the aforementioned two components held frozen. Finally, all components are trained end-to-end. In all stages, we randomly sample a training batch of 3 to 30 frames.

Testing.

Given input test frames \mathcal{I}caligraphic_I, we first select the query frame by identifying the image that is closest to all others based on the cosine similarity between global descriptors extracted by an off-the-shelf ResNet50 [29]. Then, we extract SuperPoint and SIFT keypoints from the query frame to serve as the query points for the tracker T. Although our method can track any query point, it performs better when the queries are distinctive. To improve accuracy, we iterate the whole reconstruction function fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT multiple times until reaching sub-pixel BA reprojection error 𝙱𝙰subscript𝙱𝙰\mathcal{L}_{\texttt{BA}}caligraphic_L start_POSTSUBSCRIPT BA end_POSTSUBSCRIPT. After the first iteration, the query image for each subsequent iteration is the one that is farthest from the query image of the previous iteration (as measured by the ResNet50 cosine similarity). The re-projections of the point cloud from the current iteration initialize the tracking in the next iteration.

4.1 Camera pose estimation

IMC Dataset Method AUC@3°°\degree° AUC@5°°\degree° AUC@10°°\degree°
Deep DeepSFM 10.27 19.36 31.35
PoseDiffusion 12.31 23.17 36.82
Incremental SfM COLMAP (SIFT+NN) 23.58 32.66 44.79
PixSfM (SIFT + NN) 25.54 34.80 46.73
PixSfM (LoFTR) 44.06 56.16 69.61
PixSfM (SP + SG) 45.19 57.22 70.47
DFSfM (LoFTR) 46.55 58.74 72.19
Deep Ours w/o Joint 38.23 51.60 68.35
Ours 45.23 58.89 73.92
Table 2: Camera Pose Estimation on IMC. Our method achieves better accuracy than state-of-the-art Incremental SfM approaches on 2 out of 3 AUC thresholds.

Following [86, 46, 31, 38], we report the metric area-under-curve (AUC) to evaluate camera pose accuracy. In Co3D, similar to [86], we also report the relative rotation error (RRE) and relative translation error (RTE). More specifically, for every pair of input frames, we compute the angular translation and rotation error, which are later compared to a threshold yielding accuracies RTE and RRE respectively. For a range of thresholds, AUC picks the lower between RRE and RTE, and outputs the area under the accuracy-threshold curve. The results on IMC and CO3D are presented in Tab. 1 and Tab. 2 respectively. For a fair comparison on IMC, we finetune DeepSFM [90] and PoseDiffusion [86] on MegaDepth using their open-source code. The results of Incremental SfM methods are copied from Detector-free SfM [31].

Results indicate that VGGSfM outperforms existing methods by a large margin (+9 accuracy points for each metric) on the CO3D dataset. Here, traditional SfM pipelines suffer because of the wide baselines between test frames. On IMC, with a good overlap between views, traditional SfM remains superior to recent data-driven deep-learning pipelines [90, 86]. Our end-to-end trained VGGSfM, however, outperforms all other methods on AUC@10 and AUC@5, and ranks second on AUC@3, convincingly demonstrating its ability to perform well in both narrow- and wide-baseline regimes. The accuracy and completeness of our point clouds can be further qualitatively evaluated in Fig. 4.

4.2 3D triangulation

We evaluate the accuracy and completeness of 3D triangulation on the ETH3D dataset [70] using the same protocol as [22, 46, 31], which triangulates points with fixed camera poses and intrinsics. Results are shown in Tab. 3, where metrics are averaged over all scenes. Our VGGSfM achieves better accuracy and completeness than all baselines (PatchFlow [22], PixSfM [46], and DFSfM [31]), regardless of which keypoint detection or matching method they use. This is especially obvious from the completeness attained at the 5cm threshold, with our 33.96% compared to 29.54% of the best prior work.

Method Accuracy (%percent\%%) Completeness (%percent\%%)
1cm 2cm 5cm 1cm 2cm 5cm
PatchFlow (LoFTR) 66.73 78.73 89.93 3.48 11.34 30.96
PixSfM (LoFTR) 74.42 84.08 92.63 2.91 9.39 27.31
PixSfM (SIFT + NN) 76.18 85.60 93.16 0.17 0.71 3.29
PixSfM (SP + SG) 79.01 87.04 93.80 0.75 2.77 11.28
DFSfM (LoFTR) 80.38 89.01 95.83 3.73 11.07 29.54
Ours 80.62 89.49 96.52 4.52 13.11 33.96
Table 3: 3D Triangulation on ETH3D [70] reporting the accuracy and completeness metrics at different thresholds.

4.3 Ablation study

End-to-end Training.

As reported in Tab. 1 and Tab. 2, the end-to-end joint training of the whole framework is important for achieving state-of-the-art performance. Specifically, comparing VGGSfM to an ablation which lacks end-to-end training (Ours w/o joint) we record an improvement from 70.7% AUC@30 to 74.0% on the Co3D dataset, and 68.35% AUC@10 to 73.92% on the IMC dataset. This demonstrates the benefits of our fully-differentiable design, and the synergy between its components.

Tracking or Pairwise Matching.

We compare the performance of our predicted tracks to pairwise matching methods on the IMC dataset. Specifically, we split our 2D tracks into pairwise matches and feed these matches to PixSfM. Also, we construct 2D tracks by pairwise matching (based on the open-source implementation of PixSfM) and feed them to our framework. It is worth noting that tracks from pairwise matching have a lot of “holes” because pairwise matching cannot guarantee proper point tracking. We fix these holes by setting their locations as the point locations in the query frame, marking them as invisible. At the same time, for fair comparison, cameras are still initialized using our tracks, because SP+SG cannot provide track features to our camera initializer. The results are shown in Tab. 4. Although COLMAP (the basis of PixSfM) is designed and carefully engineered for pairwise matching, our tracks achieve a slightly better result than the state-of-the-art matching option SP+SG. Instead, directly feeding SP+SG tracks to our framework leads to a performance drop. We attribute this to the fact that using SP+SG tracks cannot benefit from the joint training. We also provide a qualitative evaluation of our tracking accuracy in Fig. 5 on the IMC dataset.

PixSfM(SP + SG) PixSfM(Our Tracks) Ours(SP + LG) Ours
AUC@10°°\degree° 70.47 70.62 68.78 73.92
Table 4: Tracking or Pairwise Matching. We respectively provide tracks predicted by our tracker or matches estimated by SP+SG to PixSfM and to our VGGSfM.

Camera Initializer and Triangulator.

We also validate the design of our camera initializer 𝔗𝒫subscript𝔗𝒫\mathfrak{T}_{\mathcal{P}}fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT and triangulator 𝔗Xsubscript𝔗𝑋\mathfrak{T}_{X}fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT on the IMC dataset. As reported in Tab. 5, AUC@10 drops clearly if we replace them with alternatives, proving that they provide sufficiently accurate initialization for our global bundle adjustment BA, without the need for incremental camera registration.

PoseDiffusion Our Camera Initializer
DLT 62.18 69.42
Our Triangulator 66.37 73.92
Table 5: Ablation Study for Camera Initializer and Triangulator. A clear performance drop is observed when replacing our camera initializer by deep camera prediction method PoseDiffusion, or replacing triangulator by DLT.

Coarse-to-fine Tracking.

As dicussed above, accurate correspondences are important for structure from motion. We demonstrate the significance of our coarse-to-fine tracking mechanism for the method performance. By conducting an ablation study where the fine tracker is removed, we observe a significant performance drop on the IMC dataset, with AUC@10 dropping from 73.92% to 62.30%.

5 Conclusion

In this paper, we have presented VGGSfM, a fully differentiable SfM approach. We find that even long-standing pipelines, such as Structure-from-Motion, benefit from a learned adaptation between their components. This allows VGGSfM to be simpler than traditional SfM frameworks while achieving better performance across benchmark datasets. Moreover, our framework is fully implemented in Python, which will allow for easy modification and improvements in the future. While VGGSfM already achieves good performance, it cannot yet compete with established pipelines in all application domains. For example, it currently lacks the capability to process thousands of images as in traditional SfM frameworks. Nonetheless, we find differentiable SfM a promising direction of research, and our approach lays the foundation for further advances.

Appendix A Implementation Details

Training

As discussed in the main manuscript, the training process involves multiple stages. We first train the tracker T on the synthetic Kubric dataset, then separately train the tracker T, camera initializer 𝔗𝒫subscript𝔗𝒫\mathfrak{T}_{\mathcal{P}}fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT, and triangulator 𝔗Xsubscript𝔗𝑋\mathfrak{T}_{X}fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT on Co3D or MegaDepth, and finally jointly train the whole framework on Co3D or MegaDepth. We use the AdamW [48] optimizer with a cyclic learning rate scheduler [73] where each cycle spans 30303030 epochs. The learning rate is 0.00010.00010.00010.0001 for the joint training phase and 0.00050.00050.00050.0005 for all prior stages. We train the model on 32323232 NVIDIA A100 (80808080GB) GPUs until convergence. The batch size varies for each iteration because we randomly sample 3333 to 30303030 frames for each scene (batch) as in [86]. The training on the synthetic Kubric dataset takes about one day. The separate training of the tracker T, camera initializer 𝔗𝒫subscript𝔗𝒫\mathfrak{T}_{\mathcal{P}}fraktur_T start_POSTSUBSCRIPT caligraphic_P end_POSTSUBSCRIPT, and triangulator 𝔗Xsubscript𝔗𝑋\mathfrak{T}_{X}fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT takes two days, two days, and one day respectively. The final joint training takes one day. For training, we track 256256256256 query points and run bundle adjustment for 5555 optimization steps. We use gradient clipping to ensure stable training, which constrains the gradients’ norm to a maximum value of 1111. Additionally, we normalize the ground-truth cameras in the same way as in [86], and the point cloud correspondingly.

Moreover, we augment the samples using a combination of augmentation transformations. This includes color jittering (brightness, contrast, saturation, and hue) with a 65%percent6565\%65 % probability, Gaussian Blur with a 50%percent5050\%50 % probability, and a 15%percent1515\%15 % chance of converting images to grayscale. Please note that different frames from a single scene will receive different augmentations. Images are resized to 512×512512512512\times 512512 × 512 with zero padding. Ground truth tracks that remain invisible in over 50%percent5050\%50 % of the frames are excluded from the training for the tracker T. For the MegaDepth dataset, similar to  [67, 47], we construct the training batches by only sampling frames with an overlap score with the query frame exceeding 0.10.10.10.1. Here, overlap scores are derived from the pre-processing steps outlined in [21].

Inference Time

On a single NVIDIA A100 80808080GB GPU, given 25252525 frames and 4096409640964096 query points, the inference of the tracker, camera initializer, and triangulator takes around 4.34.34.34.3, 0.90.90.90.9, and 0.20.20.20.2 seconds respectively. In comparison, the popular pairwise matching variant SuperPoint + SuperGlue usually takes around 20202020 seconds. In the bundle adjustment process, each optimization step requires approximately 0.70.70.70.7 seconds. For each run of the whole reconstruction function fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT (as discussed in the main manuscript, fθsubscript𝑓𝜃f_{\theta}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT is run multiple times until reaching sub-pixel BA reprojection error), bundle adjustment is executed for 30303030 steps, unless early convergence is achieved.

Tracker

We use the 2D convolutional architecture from [39, 27] as the backbone of our tracker. Specifically, for the coarse tracker, this structure consists of an initial convolutional layer with a 7×7777\times 77 × 7 kernel and stride of 2222, followed by eight residual blocks with 3×3333\times 33 × 3 kernels and instance normalization. Finally, the architecture concludes with a pair of convolutional layers, one using a 3×3333\times 33 × 3 kernel and the other a 1×1111\times 11 × 1 kernel. This backbone outputs a 128128128128-dimensional feature map reducing the spatial resolution by a factor of 8888. We use 5555 levels of correlation pyramids where each level uses a correlation radius of 4444. Therefore, the tokens (flattened cost volume) VNT×NI×C𝑉superscriptsubscript𝑁𝑇subscript𝑁𝐼𝐶V\in\mathbb{R}^{N_{T}\times N_{I}\times C}italic_V ∈ blackboard_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × italic_C end_POSTSUPERSCRIPT have a feature dimension of C=5×(2×4+1)2=405𝐶5superscript2412405C=5\times(2\times 4+1)^{2}=405italic_C = 5 × ( 2 × 4 + 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 405. The tokens are subsequently processed by a transformer with eight self-attention layers with a hidden dimension of 512512512512 and 8888 heads. Finally, a multilayer perceptron (MLP) is applied to predict the point location y𝑦yitalic_y, visibility v𝑣vitalic_v, and inverse confidence σ𝜎\sigmaitalic_σ. The architecture uses GELU activation functions.

The architecture of the fine tracker is similar to the coarse tracker but shallower. The backbone of the fine tracker consists of one 3×3333\times 33 × 3 convolution layer, two residual blocks with 3×3333\times 33 × 3 kernels and instance normalization, and one 1×1111\times 11 × 1 convolution layer. The correlation pyramid of the fine tracker uses 3333 levels and each level uses a radius of 3333, which leads to tokens with a feature dimension of 3×(2×3+1)2=1473superscript23121473\times(2\times 3+1)^{2}=1473 × ( 2 × 3 + 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 147. The shallow transformer uses four self-attention layers, with a hidden dimension of 384384384384 and 4444 heads.

Following [39, 27], we train the tracker with 4444 iterative updates and evaluate it with 6666 iterative updates.

Refer to caption
Figure 6: Architecture of Camera Initializer. Generally, we use the image features, track features, and the harmonic embedding of preliminary cameras to predict camera parameters. These parameters, represented in an NI×8subscript𝑁𝐼8N_{I}\times 8italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 8 matrix, comprise a quaternion (4444 dimensions), a translation vector (3333 dimensions), and a focal length (1111 dimension).

Camera Initializer

The camera initializer (Fig. 6) takes frames I𝐼Iitalic_I and track features d𝒫superscript𝑑𝒫d^{\mathcal{P}}italic_d start_POSTSUPERSCRIPT caligraphic_P end_POSTSUPERSCRIPT as input, and outputs initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG. We extract features from the input images in a multi-scale manner as in [86]. However, we use ResNet [30] instead of DINO [8] as the camera initializer backbone, because we empirically found that DINO is harder to train jointly with other components. Each image is mapped to a 512512512512-dimensional feature vector ϕ(Ii)italic-ϕsubscript𝐼𝑖\phi(I_{i})italic_ϕ ( italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). Since the track features carry information about the image-to-image correspondence which provides grounding for camera-pose estimation, we fuse the stack of track features d𝒫(𝐲)superscript𝑑𝒫𝐲d^{\mathcal{P}}(\mathbf{y})italic_d start_POSTSUPERSCRIPT caligraphic_P end_POSTSUPERSCRIPT ( bold_y ), with shape NT×NI×256subscript𝑁𝑇subscript𝑁𝐼256N_{T}\times N_{I}\times 256italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 256, into the NI×512subscript𝑁𝐼512N_{I}\times 512italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 512 image features ϕ(I)italic-ϕ𝐼\phi(I)italic_ϕ ( italic_I ) with 4444 cross-attention layers with 4444 heads. This results in a NI×512subscript𝑁𝐼512N_{I}\times 512italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 512 global image descriptor.

Similar to the tracker, we adopt an iterative update mechanism inside the camera initializer. For each update, we obtain a set of 8888-dimensionsal preliminary camera representations and map them to 128128128128 dimensions with a positional harmonic embedding [55]. We then concatenate the global image descriptors and the embedding of the preliminary cameras, use an MLP to project the concatenated features to 512512512512 dimensions, and feed the latter to a trunk transformer. The trunk transformer consists of 8888 self-attention layers (transformer encoder) with 4444 heads, whose hidden dimension is 512512512512. The trunk transformer’s output is further processed with another MLP layer, which predicts the camera parameters. This procedure is repeated four times. In the first run, the preliminary cameras are derived from each frame’s relative camera pose to the query frame, which is computed from tracks using the 8-point algorithm. Following the approach of COLMAP [69], the focal lengths are initialized based on the longer side of the image size. In subsequent runs, the preliminary cameras (intrinsic and extrinsic) are the result of the previous prediction. In this process, the trunk transformer is run four times while the feature backbone is only run once.

It is noteworthy that the traditional 8-point algorithm is commonly used in conjunction with RANSAC to filter out noisy matches. In our approach, we employ a batched 8-point algorithm to approximate a similar effect to RANSAC while avoiding a time-consuming for loop. For each scene, we randomly select 20202020 sets, each comprising 50505050 point pairs. We then apply the 8-point algorithm to these sets in parallel, yielding 20202020 relative camera poses. Similar to RANSAC, we calculate the inlier count for each camera pose candidate using all available point pairs. A point pair is considered as an inlier if its Sampson epipolar error is less than 0.60.60.60.6 divided by the image width in pixels. Ultimately, the camera pose candidate with the highest number of inliers is selected. Our implementation of the 8-point algorithm is based on [89].

Refer to caption
Figure 7: Architecture of Triangulator. We first estimate a preliminary point cloud using the camera parameters and tracks. Subsequently, we calculate the distance from this preliminary point cloud to all camera rays, as well as identify the nearest points on these rays. This information (along with the preliminary point cloud) is concatenated to the track features and fed into a transformer to predict the point cloud X^normal-^𝑋\hat{X}over^ start_ARG italic_X end_ARG.

Triangulator

Given camera parameters and tracks, the triangulator (Fig. 7) 𝔗Xsubscript𝔗𝑋\mathfrak{T}_{X}fraktur_T start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT initially estimates a preliminary point cloud X¯¯𝑋\bar{X}over¯ start_ARG italic_X end_ARG (of size NT×3subscript𝑁𝑇3N_{T}\times 3italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × 3) using a closed-form multi-view Direct Linear Transform (DLT) for 3D triangulation. Furthermore, for each frame and the corresponding 2D point, a camera ray is computed. The distance from this camera ray to the associated 3D point in X¯¯𝑋\bar{X}over¯ start_ARG italic_X end_ARG, along with the nearest point on the camera ray, are calculated. This results in the preliminary point cloud X¯¯𝑋\bar{X}over¯ start_ARG italic_X end_ARG with shape NT×3subscript𝑁𝑇3N_{T}\times 3italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × 3, the ray distance with shape NT×NI×1subscript𝑁𝑇subscript𝑁𝐼1N_{T}\times N_{I}\times 1italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 1 and nearest points to camera rays of shape NT×NI×3subscript𝑁𝑇subscript𝑁𝐼3N_{T}\times N_{I}\times 3italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 3. These vectors are then concatenated (resulting in a tensor of shape NT×NI×7subscript𝑁𝑇subscript𝑁𝐼7N_{T}\times N_{I}\times 7italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 7) and embedded into a 256-dimensional space (NT×NI×256subscript𝑁𝑇subscript𝑁𝐼256N_{T}\times N_{I}\times 256italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 256) through positional encoding. The embedded vectors are further concatenated with the track feature d𝒫(𝐲)superscript𝑑𝒫𝐲d^{\mathcal{P}}(\mathbf{y})italic_d start_POSTSUPERSCRIPT caligraphic_P end_POSTSUPERSCRIPT ( bold_y ), leading to a shape of NT×NI×512subscript𝑁𝑇subscript𝑁𝐼512N_{T}\times N_{I}\times 512italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT × 512. Averaging over the NIsubscript𝑁𝐼N_{I}italic_N start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT dimension yields a descriptor for the point cloud with dimensions NT×512subscript𝑁𝑇512N_{T}\times 512italic_N start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT × 512. This descriptor is input into a transformer comprising 4444 self-attention layers, each with 4444 heads and a hidden dimension of 384384384384. The output of the transformer is processed by a two-layer MLP (the hidden dimension is 256256256256) to estimate X^^𝑋\hat{X}over^ start_ARG italic_X end_ARG.

Outlier Filtering

It is important to filter out noisy correspondences in SfM, especially for BA optimization. For our framework, first, we drop 2D points with a visibility score v<0.6𝑣0.6v<0.6italic_v < 0.6 or variance σ>1𝜎1\sigma>1italic_σ > 1 (horizontally or vertically). Then, we use the preliminary cameras estimated by the 8-point algorithm and the initial cameras 𝒫^^𝒫\hat{\mathcal{P}}over^ start_ARG caligraphic_P end_ARG to remove correspondences with a Sampson epipolar error of more than 0.80.80.80.8 divided by the image width. Following Bundler [74] and COLMAP[69], we also require that at least one pair within each track has a triangulation angle of more than 3333 degrees. Otherwise, the track (and the associated 3D point) is discarded. Moreover, for bundle adjustment, the 2D points with a reprojection error of more than 3333 pixels are removed. Tracks with less than 3 points are discarded as well. It is worth mentioning that Homography verification [69] does not seem to be important for our framework, although it is common in incremental SfM.

Appendix B Discussions and Ablation

Global SfM

As discussed in the Related Work section of the main manuscript, there are two popular approaches for SfM: incremental and global. Global SfM approaches [4, 14, 17, 58, 76, 65, 61, 35, 16, 15] usually predict the parameters for all the cameras at the same time and only perform bundle adjustment once. These methods often use rotation averaging and translation averaging to align pairwise relative camera poses into a consistent coordinate system. Our proposed method bears similarities to global SfM. However, it diverges in several key aspects: (1) unlike global SfM, which relies on pairwise matching (akin to incremental SfM), our method directly predicts tracks; (2) instead of rotation averaging and translation averaging in global SfM, we use a learnable network to predict camera parameters; (3) we iteratively apply the reconstruction function multiple times during testing, with bundle adjustment at each iteration. Besides these differences, our method is complementary to global SfM.

w/o Filtering w/o BA Ours
AUC@10°°\degree° 2.31 18.34 73.92
RRE@5°°\degree° 8.17 70.25 95.61
RTE@5°°\degree° 5.42 39.42 81.03
Table 6: Ablation Study for Bundle Adjustment. We try the setting without using bundle adjustment, or using bundle adjustment but not filtering the correspondences.

Bundle Adjustment

Bundle adjustment is a key component for accurate SfM. As shown in Tab. 6, without bundle adjustment, we observe a clear performance drop, with AUC@10 from 73.9273.9273.9273.92 to 18.3418.3418.3418.34. BA is also known to be strongly susceptible to noisy inputs. Indeed, using bundle adjustment without track filtering (described in previous paragraphs) destroys the estimate and, as such, reduces the AUC@10 nearly to zero. Notably, even without bundle adjustment, our framework’s estimation remains relatively robust; for instance, the rotation errors for over 70%percent7070\%70 % of image pairs remain within 5555 degrees (RRE@5°°\degree° >70%absentpercent70>70\%> 70 %). However, executing bundle adjustment without track filtering results in incorrect optimization of camera parameters, whose RRE@5°°\degree° is also just around 8%percent88\%8 %. At the same time, please note that all the methods in Table 2 of the main manuscript use bundle adjustment or its approximation. For example, PoseDiffusion [86] uses geometry-guided sampling and DeepSfM [90] adopts a special form of bundle adjustment. Without geometry-guided sampling, the AUC@10 of PoseDiffusion is around 11%percent1111\%11 %.

Refer to caption
Figure 8: Histogram of Tracking Errors of the scene British Museum 10 bag 000 on the IMC dataset. The horizontal axis denotes error in pixels, while the vertical axis shows the percentage (%) for each bin.

Track Error Distribution

We present a histogram in Fig. 8 to depict the distribution of tracking errors for the scene British Museum 10 bag 000 within the IMC dataset, consisting of 10 images. As indicated, the distribution’s peak, represented by the orange dashed line, approximately aligns with 0.40.40.40.4 pixels, while the median, depicted by the blue dash-dotted line, is around 0.60.60.60.6 pixels. Notably, most of the tracking predictions maintain an error margin of less than 3333 pixels, highlighting the accuracy of our method. Some predictions even approach a near-zero error margin. Invisible points (e.g., occluded or outside the view) are not included in this histogram.

AUC@10°°\degree° RRE@5°°\degree° RTE@5°°\degree°
PiPs [27] 43.27 82.15 51.39
Our Trakcer 73.92 95.61 81.03
Table 7: Ablation Study for Tracking. We try the video tracking method PiPs [27] in our framework, which shows a clear performance drop .

Video Tracking

To verify the effect of our proposed tracker, we also try to use the video tracking method PiPs inside our framework. The results, presented in Table 7, reveal a noticeable decline in performance when using PiPs as opposed to our tracker. This contrast underlines the effectiveness of our proposed tracking solution.

References

  • Agarwal et al. [2010] Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part II 11, pages 29–42. Springer, 2010.
  • Agarwal et al. [2011] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011.
  • Agarwal et al. [2022] Sameer Agarwal, Keir Mierle, and The Ceres Solver Team. Ceres Solver, 2022.
  • Arie-Nachimson et al. [2012] Mica Arie-Nachimson, Shahar Z Kovalsky, Ira Kemelmacher-Shlizerman, Amit Singer, and Ronen Basri. Global motion estimation from point matches. In 2012 Second international conference on 3D imaging, modeling, processing, visualization & transmission, pages 81–88. IEEE, 2012.
  • Bay et al. [2008] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-Up Robust Features (SURF). CVIU, 110(3), 2008.
  • Brachmann and Rother [2019] Eric Brachmann and Carsten Rother. Neural-guided ransac: Learning where to sample model hypotheses. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4322–4331, 2019.
  • Brachmann et al. [2017] Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. Dsac-differentiable ransac for camera localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6684–6692, 2017.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  • Carrivick et al. [2016] Jonathan L Carrivick, Mark W Smith, and Duncan J Quincey. Structure from Motion in the Geosciences. John Wiley & Sons, 2016.
  • Charbonnier et al. [1997] Pierre Charbonnier, Laure Blanc-féraud, Gilles Aubert, and Michel Barlaud. Deterministic edge-preserving regularization in computed imaging. IEEE Trans. Image Processing, 6:298–311, 1997.
  • Chen et al. [2022a] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022a.
  • Chen et al. [2021] Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, and Long Quan. Learning to match features with seeded graph matching network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6301–6310, 2021.
  • Chen et al. [2022b] Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer. In European Conference on Computer Vision, pages 20–36. Springer, 2022b.
  • Crandall et al. [2012] David J Crandall, Andrew Owens, Noah Snavely, and Daniel P Huttenlocher. Sfm with mrfs: Discrete-continuous optimization for large-scale structure from motion. IEEE transactions on pattern analysis and machine intelligence, 35(12):2841–2853, 2012.
  • Cui et al. [2017] Hainan Cui, Xiang Gao, Shuhan Shen, and Zhanyi Hu. Hsfm: Hybrid structure-from-motion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1212–1221, 2017.
  • Cui and Tan [2015] Zhaopeng Cui and Ping Tan. Global structure-from-motion by similarity averaging. In Proceedings of the IEEE International Conference on Computer Vision, pages 864–872, 2015.
  • Cui et al. [2015] Zhaopeng Cui, Nianjuan Jiang, Chengzhou Tang, and Ping Tan. Linear global translation estimation with feature tracks. arXiv preprint arXiv:1503.01832, 2015.
  • DeTone et al. [2018] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018.
  • Doersch et al. [2022] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens Continente, Kucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking any point in a video. In NeurIPS Datasets Track, 2022.
  • Doersch et al. [2023] Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. arXiv preprint arXiv:2306.08637, 2023.
  • Dusmanu et al. [2019] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 8092–8101, 2019.
  • Dusmanu et al. [2020] Mihai Dusmanu, Johannes L Schönberger, and Marc Pollefeys. Multi-view optimization of local feature geometry. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 670–686. Springer, 2020.
  • Fischler and Bolles [1981] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • Frahm et al. [2010] Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, et al. Building rome on a cloudless day. In Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11, pages 368–381. Springer, 2010.
  • Furukawa et al. [2010] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards Internet-scale multi-view stereo. In Proc. CVPR. IEEE, 2010.
  • Greff et al. [2022] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, and Andrea Tagliasacchi. Kubric: a scalable dataset generator. 2022.
  • Harley et al. [2022] Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. In ECCV, 2022.
  • Hartley and Zisserman [2000] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2023] Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-free structure from motion. In arxiv, 2023.
  • Heinly et al. [2015] Jared Heinly, Johannes L. Schonberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the world* in six days *(as captured by the yahoo 100 million image dataset). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Iglhaut et al. [2019] Jakob Iglhaut, Carlos Cabo, Stefano Puliti, Livia Piermattei, James O’Connor, and Jacqueline Rosette. Structure from motion photogrammetry in forestry: A review. Current Forestry Reports, 5:155–168, 2019.
  • Jiang et al. [2022] Hanwen Jiang, Zhenyu Jiang, Kristen Grauman, and Yuke Zhu. Few-view object reconstruction with unknown categories and camera poses. ArXiv, 2212.04492, 2022.
  • Jiang et al. [2013] Nianjuan Jiang, Zhaopeng Cui, and Ping Tan. A global linear method for camera pose registration. In Proceedings of the IEEE international conference on computer vision, pages 481–488, 2013.
  • Jiang et al. [2020] San Jiang, Cheng Jiang, and Wanshou Jiang. Efficient structure from motion for large-scale uav images: A review and a comparison of sfm tools. ISPRS Journal of Photogrammetry and Remote Sensing, 167:230–251, 2020.
  • Jiang et al. [2021] Wei Jiang, Eduard Trulls, Jan Hosang, Andrea Tagliasacchi, and Kwang Moo Yi. Cotr: Correspondence transformer for matching across images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6207–6217, 2021.
  • Jin et al. [2021] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021.
  • Karaev et al. [2023] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. 2023.
  • Kendall and Gal [2017] Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Proc. NeurIPS, 2017.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
  • Li and Snavely [2018] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018.
  • Lin et al. [2023] Amy Lin, Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose++: Recovering 6d poses from sparse-view observations. arXiv preprint arXiv:2305.04926, 2023.
  • Lin et al. [2021] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In IEEE International Conference on Computer Vision (ICCV), 2021.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In Proc. ECCV, 2014.
  • Lindenberger et al. [2021] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. arXiv.cs, abs/2108.08291, 2021.
  • Lindenberger et al. [2023] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. arXiv preprint arXiv:2306.13643, 2023.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lou et al. [2012] Yin Lou, Noah Snavely, and Johannes Gehrke. Matchminer: Efficient spanning structure mining in large image collections. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part II 12, pages 45–58. Springer, 2012.
  • Lowe [1999] David G. Lowe. Object Recognition from Local Scale-Invariant Features. In Proc. ICCV, 1999.
  • Lowe [2004] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2), 2004.
  • Lu [2018] Xiao Xin Lu. A review of solutions for perspective-n-point problem in camera pose estimation. In Journal of Physics: Conference Series, page 052009. IOP Publishing, 2018.
  • Ma et al. [2022] Wei-Chiu Ma, Anqi Joyce Yang, Shenlong Wang, Raquel Urtasun, and Antonio Torralba. Virtual correspondence: Humans as a cue for extreme-view geometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15924–15934, 2022.
  • Matas et al. [2002] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In Proc. BMVC, 2002.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Proc. ECCV, 2020.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Moré [2006] Jorge J Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical analysis: proceedings of the biennial Conference held at Dundee, June 28–July 1, 1977, pages 105–116. Springer, 2006.
  • Moulon et al. [2013] Pierre Moulon, Pascal Monasse, and Renaud Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Proceedings of the IEEE international conference on computer vision, pages 3248–3255, 2013.
  • Novotny et al. [2017] David Novotny, Diane Larlus, and Andrea Vedaldi. Learning 3d object categories by looking around them. In Proc. ICCV, 2017.
  • Oliensis [2000] John Oliensis. A critique of structure-from-motion algorithms. Computer Vision and Image Understanding, 80(2):172–214, 2000.
  • Ozyesil and Singer [2015] Onur Ozyesil and Amit Singer. Robust camera location estimation by convex programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2674–2683, 2015.
  • Özyeşil et al. [2017] Onur Özyeşil, Vladislav Voroninski, Ronen Basri, and Amit Singer. A survey of structure from motion*. Acta Numerica, 26:305–364, 2017.
  • Pineda et al. [2022] Luis Pineda, Taosha Fan, Maurizio Monge, Shobha Venkataraman, Paloma Sodhi, Ricky TQ Chen, Joseph Ortiz, Daniel DeTone, Austin Wang, Stuart Anderson, et al. Theseus: A library for differentiable nonlinear optimization. Advances in Neural Information Processing Systems, 35:3801–3818, 2022.
  • Reizenstein et al. [2021] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.
  • Rother [2003] Rother. Linear multiview reconstruction of points, lines, planes and cameras using a reference plane. In Proceedings Ninth IEEE International Conference on Computer Vision, pages 1210–1217. IEEE, 2003.
  • Sand and Teller [2008] Peter Sand and Seth Teller. Particle video: Long-range motion estimation using point trajectories. International journal of computer vision, 80:72–91, 2008.
  • Sarlin et al. [2020] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020.
  • Schaffalitzky and Zisserman [2002] Frederik Schaffalitzky and Andrew Zisserman. Multi-view Matching for Unordered Image Sets, or ”How Do I Organize My Holiday Snaps?”. In Proc. ECCV, 2002.
  • Schönberger and Frahm [2016] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Schops et al. [2017] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017.
  • Shi et al. [2022] Yan Shi, Jun-Xiong Cai, Yoli Shavit, Tai-Jiang Mu, Wensen Feng, and Kai Zhang. Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12517–12526, 2022.
  • Sinha et al. [2023] Samarth Sinha, Jason Y Zhang, Andrea Tagliasacchi, Igor Gilitschenski, and David B Lindell. Sparsepose: Sparse-view camera pose regression and refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21349–21359, 2023.
  • Smith and Topin [2019] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, pages 369–386. SPIE, 2019.
  • Snavely et al. [2006] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In ACM siggraph 2006 papers, pages 835–846. 2006.
  • Sun et al. [2021] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8922–8931, 2021.
  • Sweeney et al. [2015] Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. Optimizing the viewing graph for structure-from-motion. In Proceedings of the IEEE international conference on computer vision, pages 801–809, 2015.
  • Tang and Tan [2018] Chengzhou Tang and Ping Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.
  • Teed and Deng [2018] Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605, 2018.
  • Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
  • Teed and Deng [2021] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021.
  • Triggs et al. [2000] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle Adjustment - A Modern Synthesis. In Proc. ICCV Workshop, 2000.
  • Tyszkiewicz et al. [2020] Michał Tyszkiewicz, Pascal Fua, and Eduard Trulls. Disk: Learning local features with policy gradient. Advances in Neural Information Processing Systems, 33:14254–14265, 2020.
  • Ummenhofer et al. [2017] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. Demon: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5038–5047, 2017.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Proc. NeurIPS, 2017.
  • Wang et al. [2021a] Jianyuan Wang, Yiran Zhong, Yuchao Dai, Stan Birchfield, Kaihao Zhang, Nikolai Smolyanskiy, and Hongdong Li. Deep two-view structure-from-motion revisited. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 8953–8962, 2021a.
  • Wang et al. [2023] Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9773–9783, 2023.
  • Wang et al. [2022] Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. Matchformer: Interleaving attention in transformers for feature matching. In Proceedings of the Asian Conference on Computer Vision, pages 2746–2762, 2022.
  • Wang et al. [2021b] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--- -: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021b.
  • Wei et al. [2023] Tong Wei, Yash Patel, Alexander Shekhovtsov, Jiri Matas, and Daniel Barath. Generalized differentiable ransac. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17649–17660, 2023.
  • Wei et al. [2020] Xingkui Wei, Yinda Zhang, Zhuwen Li, Yanwei Fu, and Xiangyang Xue. Deepsfm: Structure from motion via deep bundle adjustment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 230–247. Springer, 2020.
  • Westoby et al. [2012] Matthew J Westoby, James Brasington, Niel F Glasser, Michael J Hambrey, and Jennifer M Reynolds. ‘structure-from-motion’photogrammetry: A low-cost, effective tool for geoscience applications. Geomorphology, 179:300–314, 2012.
  • Wilson and Snavely [2014] Kyle Wilson and Noah Snavely. Robust global translations with 1dsfm. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13, pages 61–75. Springer, 2014.
  • Wu [2013] Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013, pages 127–134. IEEE, 2013.
  • Wu et al. [2020] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–10, 2020.
  • Wu et al. [2023] Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rupprecht, and Andrea Vedaldi. MagicPony: Learning articulated 3d animals in the wild. 2023.
  • Yi et al. [2016] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In Proc. ECCV, 2016.
  • Zhang et al. [2022] Jason Y Zhang, Deva Ramanan, and Shubham Tulsiani. Relpose: Predicting probabilistic relative rotation for single objects in the wild. In ECCV, pages 592–611. Springer, 2022.
  • Zheng et al. [2023] Yang Zheng, Adam W Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19855–19865, 2023.
  • Zhou et al. [2017] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1851–1858, 2017.