
Fast Robust Large-scale Mapping from Video and Internet Photo Collections

Jan-Michael Frahm a, Marc Pollefeys b,a, Svetlana Lazebnik a, Christopher Zach b, David Gallup a, Brian Clipp a, Rahul Raguram a, Changchang Wu a, Tim Johnson a

a University of North Carolina at Chapel Hill, NC, USA
b Eidgenössische Technische Hochschule Zürich

Abstract

Fast automatic modeling of large-scale environments has been enabled through recent research in computer vision. In this paper we present a system approaching the fully automatic modeling of large-scale environments. Our system takes as input either a video stream or a collection of photographs, which can be readily obtained from Internet photo-sharing websites such as Flickr. The design of the system is aimed at achieving high computational performance in the modeling. This is achieved through algorithmic optimizations for efficient robust estimation, a two-stage stereo estimation that reduces the computational cost while maintaining competitive modeling results, and the use of image-based recognition for efficient grouping of corresponding images in the photo collection. The second major improvement in computational speed is achieved through parallelization and execution on commodity graphics hardware. These improvements lead to real-time video processing and to reconstruction from tens of thousands of images within the span of a day on a single commodity computer. We demonstrate the performance of both systems on a variety of real-world data.

Preprint submitted to ISPRS, January 22, 2010


Keywords:

1. Introduction

The fully automatic modeling of large-scale environments has long been a research goal in photogrammetry and computer vision. Detailed 3D models automatically acquired from the real world have many uses, including civil and military planning, mapping, virtual tourism, games, and movies. In this paper we present a system approaching the fully automatic modeling of large-scale environments, either from video or from photo collections.

Our system has been designed for efficiency and scalability. GPU implementations of SIFT feature matching, KLT feature tracking, gist feature extraction and multi-view stereo allow our system to rapidly process large amounts of video and photographs. For large photo collections, iconic image clustering allows the dataset to be broken into small, closely related parts. The parts are then processed and merged, allowing tens of thousands of photos to be registered. Similarly, for videos, loop detection finds intersections in the camera path, which reduces drift for long sequences and allows multiple videos to be registered.

Recently, mapping systems like Microsoft Bing Maps and Google Earth have started to use 3D models of cities in their visualizations. Currently these systems still require a human in the loop to deliver models of reasonable quality. Nevertheless, they already achieve impressive results, modeling large areas with regular updates. However, these models are of very low complexity and do not provide enough detail for ground-level viewing. Furthermore, their availability is restricted to only a small number of cities across the globe. On the other hand, our system can automatically produce 3D models from ground-level images with ground-level detail, and its efficiency and scalability present the possibility of reconstructing 3D models for all the world's cities quickly and cheaply.

Figure 1: Left: an overview model of Chapel Hill reconstructed from 1.3 million video frames on a single PC in 11.5 hrs. Right: a model of the Statue of Liberty; the reconstruction algorithm registered 9025 cameras out of 47238 images downloaded from the Internet.

In this paper we will review the details of our system and present 3D models of large areas reconstructed from thousands to millions of images. These models are produced with state-of-the-art computational efficiency, yet are competitive with the state of the art in quality.

2. Related Work

In recent years there has been considerable progress in large-scale reconstruction from video for urban environments and aerial data [1, 2, 3], as well as significant progress on reconstruction from Internet photo collections. Here we discuss the most related work in the areas of fast structure from motion, camera registration from Internet photo collections, real-time dense stereo, scene summarization, and landmark recognition.

Besides the purely image-based approaches, there is also work on modeling geometry by combining cameras and active range scanners. Früh and Zakhor [4] proposed a mobile system mounted on a vehicle to capture large amounts of data while driving in urban environments. Earlier systems by Stamos and Allen [5] and El-Hakim et al. [6] constrained the scanning laser to a few pre-determined viewpoints. In contrast to these comparatively expensive active systems, our approach for real-time reconstruction from video uses only cameras, leveraging the methodology developed by the computer vision community over the last two decades [7, 8].

The first step in our systems is to establish correspondences between the video frames or between the different images of the photo collection, respectively. We use two different approaches for establishing correspondences. For video data we use an extended KLT tracker [9] that exploits the inherent parallelism of the tracking problem to improve computational performance through execution on the graphics processor (GPU). The specific approach used is introduced in detail by Zach et al. [10].

In the case of Internet photo collections we have to address the challenge of dataset collection: starting with the heavily contaminated output of an Internet image search query, extract a high-precision subset of images that are actually relevant to the query. Existing approaches to this problem [11, 12, 13, 14, 15] consider general visual categories not necessarily related by rigid 3D structure. These techniques use statistical models to combine different kinds of 2D image features (texture, color, keypoints), as well as text and tags. The approach by Philbin et al. [16, 17] uses a loose spatial consistency constraint. In contrast, our system enforces a rigid 3D scene structure shared by the images.

Our system determines the camera registration from images or video frames. In general there are two classes of methods used to determine the camera registration. The first class leverages the work in multiple view geometry and typically alternates between robustly estimating camera poses and 3D point locations directly [18, 19, 20, 21]. Often bundle adjustment [22] is used in the process to refine the estimate. The other class of methods uses an extended Kalman filter to estimate both camera motion and 3D point locations jointly as the state of the filter [23, 24].

For reconstruction from Internet photo collections our system uses a hierarchical reconstruction method, starting from a set of canonical or iconic views representing different viewpoints as well as different parts of the scene. The issue of canonical view selection has been addressed both in the psychology [25, 26] and the computer vision literature [27, 28, 29, 30, 31, 32]. Simon et al. [31] observed that community photo collections provide a likelihood distribution over the viewpoints from which people prefer to take photographs. Hence, canonical view selection identifies prominent clusters or modes of this distribution. Simon et al. find these modes by clustering images based on the output of local feature matching and epipolar geometry verification between every pair of images in the dataset, through a 3D registration of the data. While this solution is effective, it is computationally expensive, and it treats scene summarization as a by-product of 3D reconstruction. By contrast, we view summarization as an image organization step that precedes 3D reconstruction, and we take advantage of the observation that a good subset of representative or "iconic" images can be obtained using relatively simple 2D appearance-based techniques [29, 33].

Given that [34, 35] provide excellent surveys of the work on binocular and multiple-view stereo, we refer the reader to these articles. Our system for reconstruction from video streams extends the plane-sweep idea of Collins [36] to dense geometry estimation. This approach permits efficient execution on the GPU, as proposed originally by Yang and Pollefeys [37]. A number of approaches have been proposed that specifically target urban environments, mainly exploiting the fact that these scenes predominantly consist of planar surfaces [38, 39, 40], or enforcing even stricter orthogonality constraints [41]. To ensure computational feasibility, large-scale systems generate partial reconstructions, which are afterwards merged into a common model. Conflicts and errors in the partial reconstructions are identified and resolved during the merging process. Most approaches so far target data produced by range finders, where noise levels and the fraction of outliers are typically significantly lower than those of passive stereo. Turk and Levoy [42] proposed to remove the overlapping mesh parts after registering the triangular meshes. A similar approach was proposed by Soucy and Laurendeau [43]. Volumetric model merging was proposed by Hilton et al. [44] and Curless and Levoy [45], and later extended by Wheeler et al. [46]. Koch et al. [47] presented a volumetric approach for fusion as part of an uncalibrated 3D modeling system. Given depth maps for all images, the depth estimates for all pixels are projected into the voxelized 3D space. Each depth estimate votes for a voxel in a probabilistic way, and the final surfaces are extracted by thresholding the 3D probability volume.

Recently there have been a number of systems covering the large-scale reconstruction of urban environments or landmarks. The 4D Atlanta project of Schindler et al. [48, 49] models the evolution of an urban environment over time from an unordered collection of photos. To achieve fast processing and provide efficient visualization, Cornelis et al. [50] proposed a ruled-surface model for urban environments. Specifically, in their approach the facades are modeled as ruled surfaces parallel to the gravity vector.

One of the first works to demonstrate 3D reconstruction of landmarks from Internet photo collections is the Photo Tourism system [51, 52]. This system achieves high-quality reconstruction results with the help of exhaustive pairwise image matching and global bundle adjustment after inserting each new view. Unfortunately, this process becomes very computationally expensive for large data sets, and it is particularly inefficient for heavily contaminated collections, most of whose images cannot be registered to each other. Since the introduction of the seminal Photo Tourism system, several researchers have developed SfM methods that exploit the redundancy in community photo collections to make reconstruction more efficient. In [53], a technique for out-of-core bundle adjustment is proposed which takes advantage of this redundancy by locally optimizing the "hot spots" and then connecting the local solutions into a global one. Snavely et al. [54] find skeletal sets of images from the collection whose reconstruction provides a good approximation to a reconstruction involving all the images. One remaining limitation of this work is that it still requires, as an initial step, the exhaustive computation of all two-view relationships in the dataset. Exploiting the independence of the pairwise evaluations, Agarwal et al. [55] address this computational challenge by using a computing cluster with up to 500 cores to reduce the computation time significantly. Similar to our system, they use image recognition techniques like approximate nearest neighbor search [56] and query expansion [57] to reduce the number of candidate relations. In contrast, our system uses single-image descriptors to efficiently identify correlated images, leading to an even smaller number of candidates. Images can then be registered on a single computation core in time comparable to the 500 cores of Agarwal et al.

3. Overview

In this paper we describe our system for real-time computation of 3D models from ground reconnaissance video and for efficient 3D reconstruction from Internet photo collections downloaded from photo-sharing websites like Flickr (www.flickr.com). The design of the system aims at efficient reconstruction through parallelization of the algorithms used, enabling their execution on the GPU. The processing pipelines for the two types of input algorithmically share a large number of methods, with some adaptations accounting for the significantly different characteristics of the data sources. In this section we briefly discuss these different characteristics and then introduce the main building blocks of the system.

Our method for 30 Hz real-time reconstruction from video uses the video streams of multiple cameras mounted on a vehicle to obtain a 3D reconstruction of the scene. Our system can operate from video alone, but additional sensors, such as differential GPS and an inertial navigation system (INS), reduce accumulated error (drift) and provide registration to the Earth's coordinate system. Our system efficiently fuses the GPS, INS and camera measurements into an absolute camera position. Additionally, our system exploits the temporal order of the video frames, which implies a spatial relationship between the frames, to efficiently perform the camera motion tracking and the dense geometry computation.

In contrast to video, in Internet photo collections we encounter a high number of photos that do not share a rigid geometry with the images of the scene. These images are often mislabeled images from the database, artistic images of the scene, duplicates, or parodies of the scene. Examples of these different classes of unrelated images are shown in Figure 2. To quantify the amount of unrelated images in Internet photo collections, we labeled a random subset of three downloaded datasets. Besides the dataset for the Statue of Liberty (47238 images with approximately 40% outliers), we downloaded two more datasets by searching for "San Marco" (45322 images with approximately 60% outliers) and "Notre Dame" (11928 images with approximately 49% outliers). The iconic clustering employed by our system can efficiently handle such high outlier ratios, which is essential for handling large Internet photo collections.

Figure 2: Images downloaded from the Internet by searching for "Statue of Liberty" which do not represent the Statue of Liberty in New York.

Despite the differences in the input data, our systems share a common algorithmic structure, as shown in Algorithm 1:

• The local correspondence estimation establishes correspondences for salient feature points to neighboring views. In the case of video these correspondences are determined through KLT tracking on the GPU with simultaneous estimation of the camera gain (Section 4.1.1). For Internet photo collections, our system first establishes a spatial neighborhood relationship for each image through a clustering of single-image descriptors, which effectively clusters close viewpoints. The feature points used are SIFT features [58], due to their enhanced invariance to larger viewpoint changes (Section 4.1).

• Camera pose/motion estimation from local correspondences is robustly performed through an efficient RANSAC technique called ARRSAC (Section 4.1.2). It determines the inlier correspondences and the camera position with respect to the previous image and the known model. In the case of available GPS data, our system uses the local correspondences to perform sensor fusion with the GPS and INS data (Section 4.1.3).

• Global correspondence estimation is performed to enable global drift correction. This step searches beyond the temporal neighbors in video, and beyond the neighbors in the cluster, for frames with overlap to the current frame (Section 4.2).

• Bundle adjustment uses the global correspondences to reduce the accumulated drift of the camera registrations from the local camera pose estimates (Section 4.3).

• Dense geometry estimation in real time is performed by our 3D reconstruction from video system to extract the depth of all pixels from the camera (Section 5). Our system uses a two-stage estimation strategy. First we use efficient GPU-based stereo to determine the depth map for every frame. Then we use the temporal redundancy of the stereo estimations to filter out erroneous depth estimates and fuse the correct ones.

• Model extraction is performed to extract a triangular mesh representing the scene from the depth maps.

The next section will first introduce our algorithm for real-time 3D reconstruction from video. Afterwards, in Section 4.4, we will detail how our algorithm differs for camera registration of Internet photo collections.



Algorithm 1 Reconstruction Scheme
  for all images/frames do
    Local correspondence estimation with immediate spatial neighbors
    Camera pose/motion estimation from local correspondences
  end for
  for all registered images/frames do
    Global correspondence estimation to non-neighbors to identify path intersections and overlaps not initially found
    Bundle adjustment for global error mitigation using the global and local correspondences
  end for
  for all frames in video do
    Dense geometry estimation using stereo and depth map fusion
    Model extraction
  end for

4. Camera Pose Estimation

In this section we discuss the methods used for camera registration in our system. We first introduce the methods for camera registration of video sequences in Section 4.1. These methods inherently use the known temporal relationships of the frames for camera motion estimation, providing efficient constraints for camera registration. Second, in Section 4.4 we discuss the extension of the camera registration to Internet photo collections, to address the challenge of an unordered data set.



4.1. Camera Pose from Video

Our system for the registration of video streams exploits the fact that in a video sequence the observed motion between consecutive frames is typically small, namely a few pixels. This allows our system to treat correspondence estimation as a tracking problem, as discussed in Section 4.1.1. From these correspondences we estimate the camera motion through our recently proposed efficient adaptive real-time random sample consensus method (ARRSAC) [59], introduced in Section 4.1.2. Alternatively, if GPS and inertial measurements are available, the camera motion is estimated through a Kalman filter, efficiently fusing visual correspondences and the six-DOF measurements of the camera pose, as introduced in Section 4.1.3.

4.1.1. Local Correspondences

Reconstructing the structure of a scene from images begins with finding corresponding features between pairs of images. In a video sequence we can take advantage of the temporal ordering and small camera motion between frames to speed up correspondence finding. Kanade-Lucas-Tomasi (KLT) feature tracking [60, 61] is a differential tracking method that first finds strong corner features in a video frame. These strong features are then tracked using a linear approximation of the image gradients over a small window of pixels surrounding the feature. One of the limitations of this approach is that feature motion of more than one pixel cannot be differentially tracked. A scale-space approach based on an image pyramid is generally used to alleviate this problem: if a feature's motion is more than a pixel at the finest scale, we can still track it because its motion will be less than one pixel at a coarser image scale.



Building image pyramids and tracking features are both highly parallel operations and so can be programmed efficiently on the graphics processor (GPU). Building an image pyramid is simply a series of convolution operations with Gaussian blur kernels followed by downsampling. Differential tracking can be implemented in parallel by assigning each feature to a separate processor. With typically on the order of 1024 features to track from frame to frame in a 1024×768 image, we achieve a great deal of parallelism in this way.
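To make the tracking step concrete, here is a minimal CPU sketch of pyramidal KLT using OpenCV; the system described here uses a GPU implementation of the same scheme, and the video path and parameter values below are illustrative assumptions, not the paper's settings.

```python
import cv2

cap = cv2.VideoCapture("input.avi")  # placeholder video path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Detect strong corner features (Shi-Tomasi), on the order of 1024 per frame.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1024,
                              qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: coarse-to-fine tracking over several pyramid
    # levels handles motion larger than one pixel at the finest scale.
    nxt, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, pts, None, winSize=(21, 21), maxLevel=4)
    good = status.ravel() == 1
    pts = nxt[good].reshape(-1, 1, 2)  # in practice, re-detect when too few survive
    prev_gray = gray
```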

One major limitation of standard KLT tracking is that it uses the absolute difference between the windows of pixels in two images to calculate the feature track update. When a large change in intensity occurs between frames, this can cause the tracking to fail. This happens frequently in videos shot outdoors, for example when the camera moves from light into shadow. Kim et al. [62] developed a gain-adaptive KLT tracker that measures the change in mean intensity of all the tracked features and compensates for this change in the KLT update equations. Estimating the gain can also be performed on the GPU at minimal additional cost compared to standard GPU KLT.
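The gain compensation idea can be sketched roughly as follows. This is a crude simplification of the joint estimation of Kim et al. [62], approximating the frame-to-frame gain as the ratio of mean intensities over the tracked feature windows; the window size is an assumption.

```python
import numpy as np

def estimate_gain(prev_gray, gray, pts_prev, pts_next, half=10):
    """Approximate relative gain as the ratio of mean intensities over the
    windows of all tracked features (simplified from the joint estimation)."""
    num, den = 0.0, 0.0
    h, w = gray.shape
    for (x0, y0), (x1, y1) in zip(pts_prev, pts_next):
        x0, y0, x1, y1 = int(round(x0)), int(round(y0)), int(round(x1)), int(round(y1))
        # Skip features whose window would leave the image.
        if half <= x0 < w - half and half <= y0 < h - half \
                and half <= x1 < w - half and half <= y1 < h - half:
            num += gray[y1 - half:y1 + half, x1 - half:x1 + half].mean()
            den += prev_gray[y0 - half:y0 + half, x0 - half:x0 + half].mean()
    return num / den if den > 0 else 1.0

# The new frame would then be divided by the gain before the KLT update:
# gray_corrected = np.clip(gray / gain, 0, 255).astype(np.uint8)
```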

4.1.2. Robust Pose Estimation

The estimated local correspondences typically contain a significant portion of erroneous matches. To determine the correct camera position, we apply a Random Sample Consensus (RANSAC) algorithm [63] to estimate the relative camera motion through the essential matrix [20]. While highly robust, the RANSAC algorithm can also be computationally very expensive, with a runtime that is exponential in the outlier ratio and model complexity. Our recently proposed Adaptive Real-Time Random Sample Consensus (ARRSAC) algorithm [64] is capable of providing accurate real-time estimation over a wide range of inlier ratios.

ARRSAC moves away from the traditional hypothesize-and-verify framework of RANSAC by adopting a more parallel approach, along the lines of preemptive RANSAC [65]. However, while preemptive RANSAC makes the limiting assumption that a good estimate of the inlier ratio is available beforehand, ARRSAC is capable of efficiently adapting to the contamination level of the data. This results in high computational savings.

In real-time scenarios, the goal of a robust estimation algorithm is to deliver the best possible solution within a fixed time budget. While a parallel, or breadth-first, approach is indeed a natural formulation for real-time applications, adapting the number of hypotheses to the contamination level of the data requires an estimate of the inlier ratio, which necessitates the adoption of a depth-first scan. The ARRSAC framework retains the benefits of both approaches (i.e., bounded run-time as well as adaptivity) by operating in a partially depth-first manner, as detailed in Algorithm 2.

The performance of ARRSAC was evaluated in detail on synthetic and real data, over a wide range of inlier ratios and dataset sizes. From the results in [64], it can be seen that the ARRSAC approach produces significant computational speed-ups while simultaneously providing accurate robust estimation in real time. For the case of epipolar geometry estimation, for instance, ARRSAC operates at estimation speeds between 55 and 350 Hz, which represents speedups ranging from 11.1 to 305.7 compared to standard RANSAC.



Algorithm 2 The ARRSAC algorithm
  Partition the data into k equal-size random partitions
  for all k partitions do
    if #hypotheses less than required then
      Non-uniform sampling for hypothesis generation
    end if
    Hypothesis evaluation using the Sequential Probability Ratio Test (SPRT) [66]
    Perform local optimization to generate additional hypotheses
    Use surviving hypotheses to provide an estimate of the inlier ratio
    Use inlier ratio estimate to determine number of hypotheses needed
    for all hypotheses do
      if SPRT(hypothesis) is valid then
        Update inlier ratio estimate based on the current best hypothesis
      end if
    end for
  end for
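ARRSAC's SPRT evaluation, partitioning, and local optimization are beyond a short listing, but the adaptive-termination idea it builds on can be sketched as a plain RANSAC loop whose hypothesis budget shrinks as the inlier-ratio estimate improves. Here `fit_model` and `count_inliers` are hypothetical problem-specific callbacks, not functions from the paper.

```python
import math
import random

def adaptive_ransac(data, fit_model, count_inliers,
                    sample_size, confidence=0.99, max_hyps=10000):
    """Plain adaptive RANSAC: the number of hypotheses needed shrinks as
    the inlier-ratio estimate from the best model so far improves."""
    best_model, best_inliers = None, 0
    needed = max_hyps
    i = 0
    while i < needed and i < max_hyps:
        model = fit_model(random.sample(data, sample_size))
        if model is not None:
            inliers = count_inliers(model, data)
            if inliers > best_inliers:
                best_model, best_inliers = model, inliers
                w = best_inliers / len(data)  # current inlier ratio estimate
                if w >= 1.0:
                    break
                # Standard stopping criterion: enough samples so that an
                # all-inlier sample was drawn with the requested confidence.
                needed = math.log(1 - confidence) / math.log(1 - w ** sample_size)
        i += 1
    return best_model
```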

4.1.3. Camera Pose Estimation through Fusion

In some instances we may have not only image or video data but also geo-location or orientation information. This additional information can be derived from global positioning system (GPS) sensors as well as from inertial navigation systems containing accelerometers and gyroscopes. We can combine data from these other sensors with image correspondences to get a more accurate pose (rotation and translation) than we might get with vision or geo-location/orientation alone.

In order to fuse image correspondences with other types of sensor measurements, we must normalize all of the measurements so that they have comparable units. For example, without this normalization one cannot compare pixels of reprojection error with meters of error relative to a GPS measurement. Our sensor fusion approach, first described in [67], fuses measurements from a GPS/INS unit with image correspondences using an extended Kalman filter (EKF). Our EKF maintains a state comprising the current camera pose, velocity and rotation rate, as well as 3D features corresponding to features extracted and tracked from image to image in a video. The EKF uses a smooth motion model to predict the next camera pose. Then the system updates the current camera pose and 3D feature estimates based on the difference between the predicted and measured feature positions and the GPS/INS measurements.

These corrections are weighted both by the EKF's estimate of its certainty in the camera pose and 3D points (its covariance matrix) and by a measurement error model for the sensors. For the feature measurements we assume additive Gaussian noise. For the GPS/INS measurements we assume a Gaussian distribution of rotation errors and use a constant bias model for the position error. The Gaussian distribution of rotation error is justified by the fact that the GPS/INS we use delivers post-processed data with extremely high accuracy and precision. However, the position error was seen to occasionally grow when the vehicle was stationary. This may be due to multipath error on the GPS signals, as we were working in an urban environment surrounded by two- to three-story buildings. Using the constant bias model for position error, the image features could keep the estimate of the vehicle position stationary while the vehicle was truly not moving, even as the GPS/INS reported that the vehicle moved up 10 cm in one example.
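As an illustration of the covariance-weighted fusion, here is a minimal linear Kalman filter on position and velocity with a GPS position measurement. The actual EKF additionally carries orientation, rotation rate, and 3D features with nonlinear camera measurement models, and all noise values below are assumptions.

```python
import numpy as np

dt = 1.0 / 30.0  # video frame interval (assuming 30 Hz capture)

# State: [x, y, z, vx, vy, vz]; constant-velocity (smooth motion) model.
F = np.eye(6)
F[:3, 3:] = dt * np.eye(3)
Q = 1e-3 * np.eye(6)                          # process noise (assumed)
H = np.hstack([np.eye(3), np.zeros((3, 3))])  # GPS observes position only
R = 0.05 ** 2 * np.eye(3)                     # GPS noise, 5 cm std (assumed)

def step(x, P, gps_pos):
    # Predict with the motion model.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: the Kalman gain weights the correction by the relative
    # certainty of the prediction (P) versus the measurement (R).
    y = gps_pos - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(6) - K @ H) @ P
    return x, P
```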

4.2. Global Correspondences

Since our system creates the camera registration sequentially from the input frames, the obtained registrations are always subject to drift. Each small inaccuracy in motion estimation propagates forward, and the absolute positions and motions become inaccurate. It is therefore necessary to perform a global optimization step afterwards to remove the drift. This requires constraints that are capable of removing drift. Such constraints can come from global pose measurements like GPS, but it is more interesting to exploit internal consistencies such as loops and intersections of the camera path. In this section we therefore discuss solutions to the challenging task of detecting loops and intersections and using them for global optimization.

Registering the camera with respect to the previously estimated path provides an estimate of the accumulated drift error. For robustness, the detection of path self-intersections can rely only on the views themselves and not on the estimated camera motion, whose drift is unbounded. Our method determines path intersections by evaluating the similarity of salient image features in the current frame to the features in all previous views. We found that SIFT features [58] serve as good salient features for our purposes. For fast computation of SIFT features we use our SiftGPU (available online: http://cs.unc.edu/~ccwu/siftgpu/), which can, for example, extract SIFT features at 16 Hz from 1024 × 768 images on an NVidia GTX 280. To enhance robustness for video streams, we could use local geometry (described in Section 5) to employ our viewpoint-invariant VIP features [68].

To avoid a computationally prohibitive exhaustive search over all other frames, we use a vocabulary tree [69] to find a small set of potentially corresponding views. The vocabulary tree provides computationally efficient indexing of the set of previous views. It provides a mapping from the feature space to an integer called a visual word. The union of the visual words then defines the set of possibly corresponding images through an inverted file.
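The quantization-plus-inverted-file idea can be illustrated at toy scale as follows; a flat codebook stands in for the hierarchical vocabulary tree here, which is a simplification made for brevity.

```python
import numpy as np
from collections import defaultdict

class VisualIndex:
    def __init__(self, codebook):
        self.codebook = codebook            # (num_words, 128) centers in SIFT space
        self.inverted = defaultdict(set)    # visual word -> set of frame ids

    def quantize(self, descriptors):
        # Map each descriptor to its nearest codebook entry (its visual word).
        d2 = ((descriptors[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def add(self, frame_id, descriptors):
        for w in np.unique(self.quantize(descriptors)):
            self.inverted[w].add(frame_id)

    def query(self, descriptors, top_k=10):
        # Vote: frames sharing many visual words with the query rank highest.
        votes = defaultdict(int)
        for w in self.quantize(descriptors):
            for f in self.inverted[w]:
                votes[f] += 1
        return sorted(votes, key=votes.get, reverse=True)[:top_k]
```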

Determining the visual words for the features extracted from the query image requires traversal of the vocabulary tree for each extracted feature in the current view, including a number of comparisons of the query feature with the node descriptors. The features of the query image are handled independently, hence the tree traversal can be performed in parallel for each feature. To optimize performance, we employ an in-house CUDA-based approach executed on the GPU for faster determination of the respective visual words. The speed-up from the GPU is about 20× on a GeForce GTX 280 versus a CPU implementation executed on an Intel Pentium D 3.2 GHz. This allows more descriptor comparisons than in [69], i.e., a deeper tree with a smaller branching factor can be replaced by a shallower tree with a significantly higher number of branches. As pointed out in [70], a broader tree yields a more uniform and therefore more representative sampling of the high-dimensional descriptor space. To further increase the performance of the global correspondence search described above, we use our recently proposed index compression method [71].

Since the vocabulary tree only delivers a list of previous views potentially overlapping with the current view, we perform a geometric verification of the returned results. First we generate putative matches on the GPU. Afterwards, the potential correspondences are tested for a valid two-view relationship between the views through ARRSAC (Section 4.1.2).

4.3. Bundle Adjustment

As the initial sparse model is subject to drift due to the incremental nature of the estimation process, the reprojection error of the global correspondences is typically higher than the error of the local correspondences if drift occurred. In order to obtain results with the highest possible accuracy, an additional refinement procedure, generally referred to as bundle adjustment, is necessary. It is a non-linear optimization method which jointly optimizes the camera calibration, the respective camera registrations, and the 3D points. It incorporates all available knowledge (initial camera parameters and poses, image correspondences, and optionally other applicable constraints) and minimizes a global error criterion over a large set of adjustable parameters, in particular the camera poses, the 3D structure, and optionally the intrinsic calibration parameters. In general, bundle adjustment delivers 3D models with sub-pixel accuracy for our system; e.g., an initial mean reprojection error on the order of pixels is typically reduced to 0.2 or even 0.1 pixels mean error. Thus the estimated two-view geometry is better aligned with the actual image features, and dense 3D models show significantly higher precision. The major challenges of bundle adjustment are the huge number of variables in the optimization problem and (to a lesser extent) the non-linearity and non-convexity of the objective function. The first issue is addressed by sparse representation of matrices and by utilizing appropriate numerical methods. Sparse techniques are still an active research topic in the computer vision and photogrammetry community [72, 73, 74, 75]. Our implementation of sparse bundle adjustment (available in source at http://cs.unc.edu/~cmzach/opensource.html) largely follows [72] by utilizing sparse Cholesky decomposition methods in combination with a suitable column reordering scheme [76]. For large data sets this approach is considerably faster than bundle adjustment implementations using dense matrix factorization after applying the Schur complement method (e.g. [77]).
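As a toy illustration of the objective being minimized (not the paper's sparse solver), the sketch below refines a tiny two-camera, four-point problem with scipy.optimize.least_squares; all sizes and values are illustrative assumptions, and the gauge freedom of the problem is left unfixed.

```python
import numpy as np
from scipy.optimize import least_squares

def rodrigues(rvec):
    """Axis-angle to rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs, f):
    """Reprojection residuals over all (camera, point) observations."""
    poses = params[:n_cams * 6].reshape(n_cams, 6)   # rvec (3) + t (3) per camera
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for c, p, uv in zip(cam_idx, pt_idx, obs):
        R, t = rodrigues(poses[c, :3]), poses[c, 3:]
        X = R @ pts[p] + t
        res.append(f * X[:2] / X[2] - uv)            # simple pinhole model
    return np.concatenate(res)

# Tiny synthetic problem: 2 cameras observing 4 points (values illustrative).
pts_gt = np.array([[0, 0, 5.0], [1, 0, 6.0], [0, 1, 5.5], [1, 1, 6.5]])
cam_idx = np.repeat([0, 1], 4)
pt_idx = np.tile(np.arange(4), 2)
poses0 = np.array([[0, 0, 0, 0, 0, 0], [0, 0.05, 0, -0.5, 0, 0.0]])
# Generate noise-free observations by evaluating residuals against zeros.
obs = residuals(np.hstack([poses0.ravel(), pts_gt.ravel()]),
                2, 4, cam_idx, pt_idx, np.zeros((8, 2)), f=800.0).reshape(8, 2)

x0 = np.hstack([poses0.ravel(), (pts_gt + 0.05).ravel()])  # perturbed start
sol = least_squares(residuals, x0, args=(2, 4, cam_idx, pt_idx, obs, 800.0))
```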

Our implementation requires less than 20 minutes to perform full bundle adjustment on a 1700-image data set (which amounts to 0.7 seconds per image). For video sequences, the bundle adjustment can be performed incrementally, adjusting only the most recently computed poses with respect to the existing, already-adjusted camera path. This allows bundle adjustment to run in real time, albeit with slightly higher error than a full offline bundle adjustment. This bundle adjustment technique is also used extensively by our system for processing photo collections, as described hereafter.

4.4. Camera Pose from Image Collections

This section introduces the extension of the methods described in Section 4.1 to accommodate Internet photo collections. The main difference to video streams is the unknown initial relationship between the images: while for video the temporal order implies a spatial relationship between frames, photo collections downloaded from the Internet do not have any order. In fact, these collections are typically highly contaminated with outliers. Hence, in contrast to video, we first need to establish the spatial relationship between the images. In principle we could use feature-based techniques similar to the methods of Philbin et al. and Zheng et al. [16, 17] to detect related frames within the photo collection. These methods use computationally very intensive indexing based on local image features (keypoints), followed by loose spatial verification; their spatial constraints are given by a 2D affine transformation or by filters based on the proximity of local features. Our method instead effectively enforces 3D structure-from-motion (SfM) constraints on the dataset. Similarly, methods like [51, 52] enforce 3D SfM constraints on the full set of registered frames. These methods first exhaustively evaluate all possible pairs for a valid epipolar geometry and then enforce the stronger multi-view geometry constraints. Our method avoids the prohibitively expensive exhaustive pairwise matching by using an initial stage in which images are grouped using global image features, prior to indexing based on local features. This gives us a large gain in efficiency and an improved ability to group similar compositions, mainly corresponding to similar viewpoints. We then enforce the SfM constraints on these groups, which reduces the complexity of the computation by orders of magnitude. Snavely et al. [54] also reduced the complexity of the SfM computation by minimizing the number of image pairs for which it is computed to a minimal set expected to yield the same overall reconstruction.

4.4.1. Efficiently Finding Corresponding Images in Photo Collections

To efficiently identify related images, our system uses the gist feature [78], which encodes the spatial layout and perceptual properties of the image. The gist feature was found to be effective for grouping images by perceptual similarity and for retrieving structurally similar scenes [79, 80]. To achieve high computational performance, we developed a highly parallel gist feature extraction on the GPU. It derives a gist descriptor for each image as the concatenation of two independently computed sub-vectors. The first sub-vector is computed by convolving a grayscale version of the image, downsampled to 128 × 128, with Gabor filters at 3 different scales (with 8, 8 and 4 orientations for the three scales, respectively). The filter responses are aggregated to a 4 × 4 spatial resolution by downsampling each convolution result to a 4 × 4 patch and concatenating the results, yielding a 320-dimensional vector. In addition, we augment this gist descriptor with color information, consisting of a subsampled L*a*b image at 4 × 4 spatial resolution. We thus obtain a 368-dimensional vector as the representation of each image in the dataset. The implementation on the GPU improves the computation time by a factor of 100 compared to a CPU implementation. For detailed timings of the gist computation, see Table 1.
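A minimal CPU sketch of such a descriptor follows. Only the structure (20 oriented filter responses pooled on a 4 × 4 grid plus a 4 × 4 L*a*b thumbnail, 320 + 48 = 368 dimensions) follows the text; the Gabor kernel parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def gist_descriptor(bgr):
    gray = cv2.resize(cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY), (128, 128))
    gray = gray.astype(np.float32) / 255.0
    parts = []
    # Three scales with 8, 8 and 4 orientations (20 filters in total).
    for ksize, sigma, lam, n_orient in [(11, 2.0, 4.0, 8),
                                        (21, 4.0, 8.0, 8),
                                        (31, 8.0, 16.0, 4)]:
        for i in range(n_orient):
            theta = np.pi * i / n_orient
            kern = cv2.getGaborKernel((ksize, ksize), sigma, theta, lam, 0.5)
            resp = np.abs(cv2.filter2D(gray, cv2.CV_32F, kern))
            # Aggregate each filter response to a 4x4 spatial grid.
            parts.append(cv2.resize(resp, (4, 4)).ravel())
    # Color sub-vector: 4x4 subsampled L*a*b image (48 dimensions).
    lab = cv2.resize(cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB), (4, 4))
    parts.append(lab.astype(np.float32).ravel())
    return np.concatenate(parts)  # 20 * 16 + 48 = 368 dimensions
```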

In the next step we use the fact that photos taken from nearby viewpoints with similar camera orientation have similar gist descriptors. Hence, to identify viewpoint clusters, we use k-means clustering of the gist descriptors. At this point we aim for an over-segmentation, since that best reduces our computational complexity. We empirically found that searching for 10% as many clusters as images yields a sufficient over-segmentation. As shown in Table 1, the clustering of the gist descriptors can be executed very efficiently. This is key to the computational efficiency of our system, since this early grouping allows us to limit all further geometric verification, avoiding an exhaustive search over the whole dataset.
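With an off-the-shelf k-means implementation (an assumption; the system uses its own efficient implementation), the viewpoint clustering step reduces to a few lines:

```python
import numpy as np
from sklearn.cluster import KMeans

# descriptors: (N, 368) array of gist vectors, one row per image.
descriptors = np.load("gist.npy")                  # placeholder input file
k = max(1, len(descriptors) // 10)                 # 10% as many clusters as images
labels = KMeans(n_clusters=k, n_init=4).fit_predict(descriptors)
```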

The clustering successfully identifies the popular viewpoints, although it is sensitive to image variation such as clutter (for example, people in front of the camera), lighting conditions, and camera zoom. We found that the large clusters are typically almost outlier-free (see Figure 3 for an example), while the smaller clusters experience significantly more noise or belong to the unrelated images in the photo collection. The clusters now define a loose ordering of the dataset that defines corresponding images as being in the same cluster. The next step employs this ordering to efficiently identify overlapping images as well as outlier images on a per-cluster basis, using the same techniques (Section 4.1.2) as in the case of video streams.

Figure 3: A large gist cluster for the Statue of Liberty dataset. We show the hundred images closest to the cluster mean. The effectiveness of the low-dimensional gist representation can be gauged by noting that even without enforcing geometric consistency, these clusters display a remarkable degree of structural similarity.

4.4.2. Iconic Image Selection

While the clusters define a local relationship, our method still needs to establish a global relationship between the photos. To facilitate an efficient global registration, we further summarize the dataset. The objective of the next step is to find a representative (iconic) image for each cluster of views. This iconic is later used for efficient global registration. The second objective is to enforce a valid two-view geometry with respect to the iconic for each image in the cluster, in order to remove unrelated images.

Given the typically much larger feature motion in photo collections, we use SIFT features [58], as in the global correspondence computation of Section 4.2. To establish correspondences for each considered pair, putative feature matching is performed on the GPU. In general, the feature matching process may be structured as a matrix multiplication between large and dense matrices. Our approach thus consists of a call to dense matrix multiplication in the CUBLAS library (http://developer.download.nvidia.com/compute/cuda/1_0/CUBLAS_Library_1.0.pdf), with subsequent instructions to apply the distance ratio test [58].
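This matrix-multiplication formulation can be sketched in NumPy (standing in for the CUBLAS call): with unit-normalized descriptors, one dense product yields all pairwise similarities, from which the two nearest neighbors needed for the ratio test follow. The 0.8 ratio is a common choice, not necessarily the paper's setting.

```python
import numpy as np

def match_ratio_test(desc1, desc2, ratio=0.8):
    """desc1: (N, 128), desc2: (M, 128) SIFT descriptors with unit-norm rows
    (M >= 2). Returns index pairs passing the distance ratio test."""
    sims = desc1 @ desc2.T                       # the dense matrix product
    # For unit vectors, a larger dot product means a smaller Euclidean
    # distance: d^2 = 2 - 2 * dot.
    order = np.argsort(-sims, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc1))
    d_best = np.sqrt(np.maximum(0, 2 - 2 * sims[rows, best]))
    d_second = np.sqrt(np.maximum(0, 2 - 2 * sims[rows, second]))
    keep = d_best < ratio * d_second
    return np.stack([rows[keep], best[keep]], axis=1)
```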

Next we use our ARRSAC algorithm from Section 4.1.2 to validate the two-view relationship between the views. To achieve robustness to degenerate data, we combine ARRSAC with our RANSAC-based robust model selection QDEGSAC [81].

Our method first finds the triplet of images with mutually valid two-view relationships whose gist descriptors are closest to the gist cluster center. The image with the highest number of inliers in this set is selected as the iconic image of the cluster. Then all remaining images in the cluster are tested for a valid two-view relationship to the iconic. Hence, after this geometric verification, each cluster only contains images showing the same scene, and each cluster is visually represented by its iconic image.

Typically there is a significant number of images which are not geometrically consistent with the other images in their cluster, or which fall into clusters that are not geometrically consistent. Through evaluation we found that these rejected images still contain a significant number of images related to the scene of interest. To register these images, we perform a re-clustering using the clustering from Section 4.4.1, followed by the above geometric verification. Afterwards, all still-unregistered images are tested against the five iconic images nearest in gist space. Using our labeling of the dataset subsets, we verified that we typically register well over 90% of the frames belonging to the scene of interest.

4.4.3. Iconic Image Registration

After using the clusters to establish a local order and to remove unrelated images, the next step aims at establishing a global relationship between the images. For efficiency reasons we use the small number of iconic images to bootstrap the global registration of all images. We first compute the exhaustive pairwise two-view relationships for all iconics using ARRSAC from Section 4.1.2. This defines a graph of pairwise global relationships between the different iconics, which we call the iconic scene graph. The edges of the graph carry a cost corresponding to the number of mutual matches between the images. This graph represents the global registration of the iconics for the dataset, based on the 2D constraints given by the epipolar geometry or the homography between the views.
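A sketch of building this graph and extracting its connected components, each of which seeds one 3D sub-model, follows. Here `verify_pair` stands for the ARRSAC-based two-view verification and is a hypothetical callback, and the 18-match threshold is borrowed from the merging step described later.

```python
import itertools
import networkx as nx

def build_iconic_scene_graph(iconics, verify_pair, min_matches=18):
    """iconics: list of image ids; verify_pair(a, b) -> number of inlier
    matches (0 if no valid two-view relationship was found)."""
    g = nx.Graph()
    g.add_nodes_from(iconics)
    # Exhaustive pairwise verification is affordable over the iconics only.
    for a, b in itertools.combinations(iconics, 2):
        n = verify_pair(a, b)
        if n >= min_matches:
            g.add_edge(a, b, weight=n)
    # Each connected component becomes one independent 3D sub-model.
    return [g.subgraph(c).copy() for c in nx.connected_components(g)]
```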

Next, our system performs an incremental structure-from-motion process on the iconic scene graph to obtain a registration of the iconic images. First an initial image pair is selected which has a sufficiently accurate registration, as defined by the roundness of the uncertainty ellipsoids of the triangulated 3D points [82]. Then the remaining images are added by selecting the image with the next strongest set of correspondences, until no further iconic can be added to the 3D model. This yields the 3D sub-model for the component of the iconic scene graph corresponding to the initially selected pair. Each completed sub-model is refined by bundle adjustment as described in Section 4.3.

Typically the iconic scene graph contains multiple independent or weakly connected components. These components correspond, for example, to parts of the scene that have no visual connection to the other scene parts, to interior parts of the model, or to other scenes consistently mislabeled (an example would be Ellis Island, which is often also labeled as "Statue of Liberty"; see Figure 4). Hence we repeat the 3D modeling until all images are registered or none of the remaining images has any connection to any of the 3D sub-models.

To obtain more complete 3D sub-models than is possible through registering the iconics alone, our system afterwards searches for images in the clusters that support a match between the iconics of two clusters. For efficiency reasons, we only test, for each image in a cluster, whether it matches the iconic of the other cluster. If the match to the target iconic is sufficiently strong (empirically determined as more than 18 features), the matching information of this image is stored. Once a sufficient number of connections is identified, our method uses all constraints provided by the matches to merge the two sub-models to which the clusters belong. The merging is performed by estimating the similarity transformation between the sub-models. This merging process is continued until no more merges are possible.
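The similarity estimation between sub-models can be sketched with the closed-form least-squares method of Umeyama (an assumption, since the paper does not name its estimator); in practice such an estimate would be wrapped in a robust loop over the 3D point correspondences.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity so that dst ~ s * R @ src + t, estimated
    from corresponding 3D points. src, dst: (N, 3) arrays, N >= 3."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                  # guard against a reflection solution
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```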

4.4.4. Cluster Camera Registration

After the registration of the iconic images, we have a valid global registration for the images of the iconic scene graph. In the last step, we aim at extending the registration to all cluster images in the clusters corresponding to the registered iconics. This process takes advantage of feature matches between the non-iconic images and their respective iconics: it uses the 2D matches to the iconic to determine the 2D-3D correspondences for the image, and then applies ARRSAC (Section 4.1.2) to determine the camera's registration. Given the incremental image registration, drift inherently accumulates during this process. Hence we periodically apply bundle adjustment to perform a global error correction; in our current system, bundle adjustment is performed after every 25 images added to a sub-model. This saves significant computational resources while in practice keeping the accumulated error at a reasonable level. Detailed results of our 3D reconstruction algorithm are shown in Figures 4-5.

Figure 4: Left: view of the 3D model and the 9025 registered cameras. Right: model from the interior of Ellis Island erroneously labeled as "Statue of Liberty".

5. Dense Geometry

The methods described above for video and for photo collections provide a registration of the camera poses of all video frames or photographs. These external and internal camera calibrations are the output of the sparse reconstruction step of both systems. For photo collections, they allow the generation of an image-based browsing experience as shown in the "Photo Tourism" papers [51, 52]. The camera registration can also be used as input to dense stereo methods. The method of [83] can extract dense scene geometry, but with significant computational effort, and therefore only for small scenes. Next we introduce our real-time stereo system for video streams, which is capable of reconstructing large scenes.

Figure 5: Left: 3D reconstruction of the San Marco dataset with 10338 cameras. Right: reconstruction from the Notre Dame dataset with 1300 registered cameras.

Dataset   Gist    Gist clustering   Geometric   Re-clustering   Registered views   Total time
Liberty   4 min   21 min            56 min      3.3 hr          13888              4 hr 42 min
SM        4 min   19 min            24 min      2.76 hr         12253              3 hr 34 min
ND        1 min   3 min             22 min      1.0 hr          3058               1 hr 28 min

Table 1: Computation times for the photo collection reconstruction for the Statue of Liberty dataset, the San Marco dataset (SM) and the Notre Dame dataset (ND).

We adopt a stereo/fusion approach for dense geometry reconstruction. For every frame of video, a depthmap is generated using a real-time, GPU-based multi-view planesweep stereo algorithm. The planesweep algorithm, comprised primarily of image warping operations, is highly efficient on the GPU [62]. Because it is multi-view, it is quite robust and can efficiently handle occlusions, a major problem in stereo. The stereo depthmaps are then fused using a visibility-based depthmap fusion method, which also runs on the GPU. The fusion method removes outliers from the depthmaps and combines depth estimates to enhance their accuracy. Because of the redundancy in video (each surface is imaged multiple times), fused depthmaps only need to be produced for a subset of the video frames. This reduces processing time as well as the 3D model size, as discussed in Section 6.

The planesweep stereo algorithm [36] computes a depthmap by testing a family of plane hypotheses and, for each depthmap pixel, recording the distance to the plane with the best photoconsistency score. One view is designated as the reference, and a depthmap is computed for that view. The remaining views are designated matching views. For each plane, all matching views are back-projected onto the plane and then projected into the reference view. This mapping, known as a plane homography [84], can be performed very efficiently on the GPU. Once the matching views are warped, the photoconsistency score is computed. We use a sum of absolute differences (SAD) photoconsistency measure. To handle occlusions, the matching views are divided into a left and a right subset, and the best matching score of the two subsets is kept [85]. Before the difference is computed, the image intensities are multiplied by the relative exposure (gain) computed during KLT tracking (Section 4.1.1). The stereo algorithm can produce a 512 × 384 depthmap from 11 images (10 matching, 1 reference) and 48 plane hypotheses at a rate of 42 Hz on an Nvidia GTX 280.
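As a concrete CPU-side illustration of the sweep just described: for a fronto-parallel plane at depth d with normal n = (0, 0, 1) in the reference frame, the reference-to-matching-view homography is H = K(R + t nᵀ/d)K⁻¹. The sketch below is a simplified stand-in for the GPU implementation; it assumes shared intrinsics and already gain-corrected grayscale images, and it sums costs over all matching views instead of keeping the best of the left/right subsets.

```python
import cv2
import numpy as np

def plane_sweep_depthmap(ref, matches, K, poses, depths, win=11):
    """CPU sketch of multi-view plane-sweep stereo with fronto-parallel planes.

    ref:     (H, W) float32 reference image, gain-corrected grayscale
    matches: list of matching images (same size, gain-corrected)
    K:       3x3 intrinsics shared by all views (assumption)
    poses:   list of (R, t) with X_m = R @ X_r + t mapping reference-camera
             coordinates into each matching camera
    depths:  1D array of plane depths to test
    """
    h, w = ref.shape
    n = np.array([0.0, 0.0, 1.0])              # fronto-parallel plane normal
    Kinv = np.linalg.inv(K)
    best_cost = np.full((h, w), np.inf, np.float32)
    depthmap = np.zeros((h, w), np.float32)
    for d in depths:
        cost = np.zeros((h, w), np.float32)
        for img_m, (R, t) in zip(matches, poses):
            # plane-induced homography from the reference view to this view
            H = K @ (R + np.outer(t, n) / d) @ Kinv
            # WARP_INVERSE_MAP samples img_m at H @ x_ref for each ref pixel
            warped = cv2.warpPerspective(
                img_m, H, (w, h),
                flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
            # window-aggregated SAD (occlusion subset splitting omitted)
            cost += cv2.boxFilter(np.abs(ref - warped), -1, (win, win))
        better = cost < best_cost
        best_cost[better] = cost[better]
        depthmap[better] = d                   # record winning plane depth
    return depthmap, best_cost
```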

After the depthmaps have been computed, a fused depthmap is computed for a reference view using the surrounding stereo depthmaps [86]. All points from the depthmaps are projected into the reference view, and for each pixel the point with the highest stereo confidence is chosen. Stereo confidence is computed based on the matching scores from the planesweep:

C(x) = \sum_{\pi \neq \pi_0} e^{-(\mathrm{SAD}(x,\pi) - \mathrm{SAD}(x,\pi_0))^2}    (1)

where x is the pixel in question, π is a plane from the planesweep, and π₀ is the best plane chosen by the planesweep. This confidence measure prefers depth estimates with low uncertainty, where the matching score of the best plane is much lower than the scores for all other planes. After the most confident point is chosen for each pixel, it is scored according to mutual support, free-space violations, and occlusions. Supporting points are those that fall within 5% of the chosen point's depth value. Occlusions are caused by points that would make the chosen point invisible in the reference view, and free-space violations are points that are made invisible in some other view by the chosen point. If the number of occlusions and free-space violations exceeds the number of support points, the chosen point is discarded and the depth value is marked as invalid. If the point survives, it is replaced with a new point computed as the average of the chosen point and its support points. See Figure 6.
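Equation (1) translates directly into a per-pixel reduction over the plane-sweep cost volume, as in the sketch below. Note that, as printed, a sharply peaked cost curve yields a small sum, so a selection rule based on this quantity must treat smaller values as more distinctive (or invert it), and a scale factor inside the exponent is common in practice; both observations are our reading, not statements from the paper.

```python
import numpy as np

def fusion_confidence(sad_volume, best_idx):
    """Evaluate Eq. (1) per pixel from a plane-sweep cost volume.

    sad_volume: (P, H, W) SAD scores, one slice per plane hypothesis
    best_idx:   (H, W) integer index of the winning plane pi_0 per pixel
    """
    best = np.take_along_axis(sad_volume, best_idx[None], axis=0)  # (1, H, W)
    diff = sad_volume - best
    # sum over pi != pi_0: the pi == pi_0 term contributes exp(0) = 1
    return np.exp(-diff ** 2).sum(axis=0) - 1.0
```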

Currently our dense geometry method is used only for video. The temporal sequence of the video makes it easy to select nearby views for stereo matching and depthmap fusion. For photo collections, a view selection strategy would be needed which identifies views with similar content and compatible appearance (day, night, summer, winter, etc.). This is left as future work.

Figure 6: Our depthmap fusion method combines multiple stereo depthmaps (the middle image shows an example depthmap) to produce a single fused depthmap (shown on the right) with greater accuracy and less redundancy.

6. Model Extraction

Once a fused depthmap has been computed, the next step is to convert it into a texture-mapped 3D polygonal mesh. This can be done trivially by overlaying the depthmap image with a quadrilateral mesh and assigning each vertex the 3D position given by the depthmap. The mesh can also be trivially texture-mapped with the color image for the corresponding depthmap. However, a number of improvements can be made.

First, depth discontinuities must be detected to prevent the mesh from joining discontinuous foreground and background objects. We simply use a threshold on the magnitude of the depth gradient to determine whether a mesh edge may be created between two vertices. Afterwards, connected components with a small number of vertices are discarded.
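On a regular depthmap grid this discontinuity test reduces to thresholding differences between neighboring depth values, as in the minimal sketch below; the threshold value is scene-dependent and not specified in the paper.

```python
import numpy as np

def edge_masks(depth, grad_thresh):
    """Allow a mesh edge between adjacent vertices only where the depth
    gradient magnitude stays below grad_thresh (an assumed parameter).

    Returns boolean masks for horizontal and vertical quad edges.
    """
    dx = np.abs(np.diff(depth, axis=1))   # gradient between column neighbors
    dy = np.abs(np.diff(depth, axis=0))   # gradient between row neighbors
    return dx < grad_thresh, dy < grad_thresh
```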

Second, surfaces are discarded which have already been reconstructed in previous fused depthmaps. The surface reconstructed from the previous fused depthmap in the video sequence is projected into the current depthmap. Any vertices that fall within a threshold of the previous surface are marked as invalid. Their neighboring vertices are then adjusted to coincide with the edge of the previous surface to prevent any cracks. See Figure 7.

Figure 7: Models are extracted from the fused depthmaps. From left to right: original mesh, simplified mesh, textured mesh.
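One possible realization of this overlap test projects the previously fused surface into the current view and invalidates vertices whose depths agree within a tolerance; the 2% relative threshold below is an assumed placeholder, not a value from the paper.

```python
import numpy as np

def invalidate_overlap(prev_pts, depth_cur, K, R, t, rel_thresh=0.02):
    """Mark current-depthmap vertices invalid where the previously fused
    surface already covers them.

    prev_pts:  (N, 3) 3D points of the previous fused surface (world coords)
    depth_cur: (H, W) current fused depthmap
    K, R, t:   intrinsics and world-to-camera pose of the current view
    """
    h, w = depth_cur.shape
    cam = R @ prev_pts.T + t.reshape(3, 1)   # previous surface in current camera
    cam = cam[:, cam[2] > 1e-6]              # keep points in front of the camera
    uv = K @ cam
    u = np.round(uv[0] / uv[2]).astype(int)
    v = np.round(uv[1] / uv[2]).astype(int)
    z = cam[2]
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[ok], v[ok], z[ok]
    valid = np.ones((h, w), bool)
    covered = np.abs(depth_cur[v, u] - z) < rel_thresh * z
    valid[v[covered], u[covered]] = False    # surface reconstructed earlier
    return valid
```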

Third, the mesh is simplified by merging neighboring quadrilaterals that are nearly co-planar. In practice we use a top-down approach: we start with a single quadrilateral spanning the whole depthmap. If the distance from the quadrilateral to any of its contained points is greater than a threshold, the quadrilateral is split. This is repeated recursively until every quadrilateral faithfully represents its underlying points. The splitting is performed along the longest direction, or at random if the quadrilateral is square. This method represents planar and quasi-planar surfaces with fewer polygons and greatly reduces the size of the output models. See Figure 8.

Figure 8: Several 3D models reconstructed from video sequences.
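The top-down splitting scheme of the third step can be sketched as a short recursion. Here we approximate "distance to the quadrilateral" by the distance to a least-squares plane fitted to the quad's contained points, which is one reasonable reading of the paper's criterion; the threshold and bookkeeping are illustrative.

```python
import numpy as np

def simplify(depth_pts, x0, y0, x1, y1, thresh, quads, rng=None):
    """Top-down quad split over [x0, x1] x [y0, y1] (inclusive grid bounds).

    depth_pts: (H, W, 3) 3D points from the fused depthmap.
    quads:     output list of (x0, y0, x1, y1) leaf quadrilaterals.
    """
    if rng is None:
        rng = np.random.default_rng()
    pts = depth_pts[y0:y1 + 1, x0:x1 + 1].reshape(-1, 3)
    # least-squares plane through the contained points (SVD of centered pts)
    c = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - c)
    n = Vt[-1]                                 # normal = smallest singular vector
    dist = np.abs((pts - c) @ n)
    if dist.max() <= thresh or (x1 - x0 <= 1 and y1 - y0 <= 1):
        quads.append((x0, y0, x1, y1))         # quad represents points faithfully
        return
    w_, h_ = x1 - x0, y1 - y0
    # split along the longest direction, or at random if square
    if w_ > h_ or (w_ == h_ and rng.random() < 0.5):
        xm = (x0 + x1) // 2
        simplify(depth_pts, x0, y0, xm, y1, thresh, quads, rng)
        simplify(depth_pts, xm, y0, x1, y1, thresh, quads, rng)
    else:
        ym = (y0 + y1) // 2
        simplify(depth_pts, x0, y0, x1, ym, thresh, quads, rng)
        simplify(depth_pts, x0, ym, x1, y1, thresh, quads, rng)
```

Called as `simplify(points, 0, 0, W - 1, H - 1, thresh, quads)` on the full depthmap grid, this collects the leaf quadrilaterals of the adaptive mesh.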


7. Conclusions

In this paper we presented methods for real-time reconstruction from video and for fast reconstruction from Internet photo collections on a single PC. We demonstrated the performance of these methods on a variety of large-scale datasets, confirming their computational efficiency. We achieved this through consistent parallelization of all computations involved, enabling execution on the graphics card as a highly parallel processor. Additionally, for the modeling from Internet photo collections, we combine recognition-based constraints with geometric constraints, leading to image registration that is orders of magnitude more efficient than existing systems.

Acknowledgements. We would like to acknowledge the DARPA UrbanScape project, NSF Grant IIS-0916829 and other funding from the US government, as well as our collaborators David Nister, Amir Akbarzadeh, Philippos Mordohai, Paul Merrell, Chris Engels, Henrik Stewenius, Brad Talton, Liang Wang, Qingxiong Yang, Ruigang Yang, Greg Welch, Herman Towles and Xiaowei Li.

[1] A. Fischer, T. Kolbe, F. Lang, A. Cremers, W. Förstner, L. Plümer, V. Steinhage, Extracting buildings from aerial images using hierarchical aggregation in 2D and 3D, Computer Vision and Image Understanding 72 (2) (1998) 185–203.

[2] A. Gruen, X. Wang, CC-Modeler: A topology generator for 3-D city models, ISPRS Journal of Photogrammetry & Remote Sensing 53 (5) (1998) 286–295.

[3] Z. Zhu, A. Hanson, E. Riseman, Generalized parallel-perspective stereo mosaics from airborne video, IEEE Trans. on Pattern Analysis and Machine Intelligence 26 (2) (2004) 226–237.

[4] C. Früh, A. Zakhor, An automated method for large-scale, ground-based city model acquisition, Int. J. of Computer Vision 60 (1) (2004) 5–24.

[5] I. Stamos, P. Allen, Geometry and texture recovery of scenes of large scale, Computer Vision and Image Understanding 88 (2) (2002) 94–118.

[6] S. El-Hakim, J.-A. Beraldin, M. Picard, A. Vettore, Effective 3D modeling of heritage sites, in: 4th International Conference on 3D Imaging and Modeling, 2003, pp. 302–309.

[7] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.

[8] R. I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd Edition, Cambridge University Press, 2004.

[9] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Int. Joint Conf. on Artificial Intelligence, 1981, pp. 674–679.

[10] C. Zach, D. Gallup, J. Frahm, Fast gain-adaptive KLT tracking on the GPU, in: CVPR Workshop on Visual Computer Vision on GPUs (CVGPU), 2008.

[11] R. Fergus, P. Perona, A. Zisserman, A visual category filter for Google images, in: ECCV, 2004.

[12] T. Berg, D. Forsyth, Animals on the web, in: CVPR, 2006.

[13] L.-J. Li, G. Wang, L. Fei-Fei, OPTIMOL: automatic object picture collection via incremental model learning, in: CVPR, 2007.

[14] F. Schroff, A. Criminisi, A. Zisserman, Harvesting image databases from the web, in: ICCV, 2007.

[15] B. Collins, J. Deng, L. Kai, L. Fei-Fei, Towards scalable dataset construction: An active learning approach, in: ECCV, 2008.

[16] J. Philbin, A. Zisserman, Object mining using a matching graph on very large image collections, in: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008.

[17] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, H. Neven, Tour the world: Building a web-scale landmark recognition engine, in: CVPR, 2009.

[18] P. Beardsley, A. Zisserman, D. Murray, Sequential updating of projective and affine structure from motion, Int. J. Computer Vision 23 (3) (1997) 235–259.

[19] A. Fitzgibbon, A. Zisserman, Automatic camera recovery for closed or open image sequences, in: European Conf. on Computer Vision, 1998, pp. 311–326.

[20] D. Nistér, An efficient solution to the five-point relative pose problem, IEEE Trans. on Pattern Analysis and Machine Intelligence 26 (6) (2004) 756–777.

[21] D. Nistér, O. Naroditsky, J. Bergen, Visual odometry for ground vehicle applications, Journal of Field Robotics 23 (1).

[22] American Society of Photogrammetry, Manual of Photogrammetry, 5th Edition, ASPRS Publications, 2004.

[23] A. Azarbayejani, A. Pentland, Recursive estimation of motion, structure, and focal length, IEEE Trans. on Pattern Analysis and Machine Intelligence 17 (6) (1995) 562–575. doi:10.1109/34.387503.

[24] S. Soatto, P. Perona, R. Frezza, G. Picci, Recursive motion and structure estimation with complete error characterization, in: Int. Conf. on Computer Vision and Pattern Recognition, 1993, pp. 428–433.

[25] S. Palmer, E. Rosch, P. Chase, Canonical perspective and the perception of objects, Attention and Performance IX (1981) 135–151.

[26] V. Blanz, M. Tarr, H. Bülthoff, What object attributes determine canonical views?, Perception 28 (5) (1999) 575–600.

[27] T. Denton, M. Demirci, J. Abrahamson, A. Shokoufandeh, S. Dickinson, Selecting canonical views for view-based 3-D object recognition, in: ICPR, 2004, pp. 273–276.

[28] P. Hall, M. Owen, Simple canonical views, in: BMVC, 2005, pp. 839–848.

[29] T. L. Berg, D. Forsyth, Automatic ranking of iconic images, Tech. rep., University of California, Berkeley (2007).

[30] Y. Jing, S. Baluja, H. Rowley, Canonical image selection from the web, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, 2007.

[31] I. Simon, N. Snavely, S. M. Seitz, Scene summarization for online image collections, in: ICCV, 2007.

[32] T. L. Berg, A. C. Berg, Finding iconic images, in: 2nd Internet Vision Workshop at IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[33] R. Raguram, S. Lazebnik, Computing iconic summaries of general visual concepts, in: Workshop on Internet Vision, CVPR, 2008.

[34] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. of Computer Vision 47 (1-3) (2002) 7–42.

[35] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, R. Szeliski, A comparison and evaluation of multi-view stereo reconstruction algorithms, in: Int. Conf. on Computer Vision and Pattern Recognition, 2006, pp. 519–528.

[36] R. Collins, A space-sweep approach to true multi-image matching, in: Int. Conf. on Computer Vision and Pattern Recognition, 1996, pp. 358–363.

[37] R. Yang, M. Pollefeys, Multi-resolution real-time stereo on commodity graphics hardware, in: Int. Conf. on Computer Vision and Pattern Recognition, 2003, pp. 211–217.

[38] T. Werner, A. Zisserman, New techniques for automated architectural reconstruction from photographs, in: European Conf. on Computer Vision, 2002, pp. 541–555.

[39] P. Burt, L. Wixson, G. Salgian, Electronically directed "focal" stereo, in: Int. Conf. on Computer Vision, 1995, pp. 94–101.

[40] S. N. Sinha, D. Steedly, R. Szeliski, Piecewise planar stereo for image-based rendering, in: International Conference on Computer Vision (ICCV), 2009.

[41] Y. Furukawa, B. Curless, S. M. Seitz, R. Szeliski, Manhattan-world stereo, in: Computer Vision and Pattern Recognition (CVPR), 2009.

[42] G. Turk, M. Levoy, Zippered polygon meshes from range images, in: SIGGRAPH, 1994, pp. 311–318.

[43] M. Soucy, D. Laurendeau, A general surface approach to the integration of a set of range views, IEEE Trans. on Pattern Analysis and Machine Intelligence 17 (4) (1995) 344–358.

[44] A. Hilton, A. Stoddart, J. Illingworth, T. Windeatt, Reliable surface reconstruction from multiple range images, in: European Conf. on Computer Vision, 1996, pp. 117–126.

[45] B. Curless, M. Levoy, A volumetric method for building complex models from range images, SIGGRAPH 30 (1996) 303–312.

[46] M. Wheeler, Y. Sato, K. Ikeuchi, Consensus surfaces for modeling 3D objects from multiple range images, in: Int. Conf. on Computer Vision, 1998, pp. 917–924.

[47] R. Koch, M. Pollefeys, L. Van Gool, Robust calibration and 3D geometric modeling from large collections of uncalibrated images, in: DAGM, 1999, pp. 413–420.

[48] G. Schindler, P. Krishnamurthy, F. Dellaert, Line-based structure from motion for urban environments, in: 3DPVT, 2006.

[49] G. Schindler, F. Dellaert, S. Kang, Inferring temporal order of images from 3D structure, in: Int. Conf. on Computer Vision and Pattern Recognition, 2007.

[50] N. Cornelis, K. Cornelis, L. Van Gool, Fast compact city modeling for navigation pre-visualization, in: Int. Conf. on Computer Vision and Pattern Recognition, 2006.

[51] N. Snavely, S. M. Seitz, R. Szeliski, Photo tourism: Exploring photo collections in 3D, in: SIGGRAPH, 2006, pp. 835–846.

[52] N. Snavely, S. M. Seitz, R. Szeliski, Modeling the world from Internet photo collections, International Journal of Computer Vision 80 (2) (2008) 189–210.

[53] K. Ni, D. Steedly, F. Dellaert, Out-of-core bundle adjustment for large-scale 3D reconstruction, in: ICCV, 2007.

[54] N. Snavely, S. M. Seitz, R. Szeliski, Skeletal sets for efficient structure from motion, in: CVPR, 2008.

[55] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, R. Szeliski, Building Rome in a day, in: ICCV, 2009.

[56] S. Arya, D. Mount, N. Netanyahu, R. Silverman, A. Wu, An optimal algorithm for approximate nearest neighbor searching in fixed dimensions, Journal of the ACM 45 (1998) 891–923.

[57] O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, Total recall: Automatic query expansion with a generative feature model for object retrieval, in: ICCV, 2007.

[58] D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. of Computer Vision 60 (2) (2004) 91–110.

[59] R. Raguram, J.-M. Frahm, M. Pollefeys, A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus, in: ECCV, 2008.

[60] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: International Joint Conference on Artificial Intelligence (IJCAI), 1981, pp. 674–679. URL: citeseer.ist.psu.edu/lucas81iterative.html

[61] J. Shi, C. Tomasi, Good features to track, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994.

[62] S. Kim, D. Gallup, J.-M. Frahm, A. Akbarzadeh, Q. Yang, R. Yang, D. Nistér, M. Pollefeys, Gain adaptive real-time stereo streaming, in: Int. Conf. on Vision Systems, 2007.

[63] M. Fischler, R. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Comm. of the ACM 24 (6) (1981) 381–395.

[64] R. Raguram, J.-M. Frahm, M. Pollefeys, A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus, in: Proc. ECCV, 2008.

[65] D. Nistér, Preemptive RANSAC for live structure and motion estimation, in: Proc. ICCV, Vol. 1, 2003, pp. 199–206.

[66] J. Matas, O. Chum, Randomized RANSAC with sequential probability ratio test, in: Proc. ICCV, 2005, pp. 1727–1732.

[67] M. Pollefeys, D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, Detailed real-time urban 3D reconstruction from video, Int. Journal of Computer Vision, special issue on Modeling Large-Scale 3D Scenes.

[68] C. Wu, F. Fraundorfer, J.-M. Frahm, M. Pollefeys, 3D model search and pose estimation from single images using VIP features, in: Computer Vision and Pattern Recognition Workshops, 2008, pp. 1–8. doi:10.1109/CVPRW.2008.4563037.

[69] D. Nistér, H. Stewénius, Scalable recognition with a vocabulary tree, in: Proceedings of IEEE CVPR, Vol. 2, 2006, pp. 2161–2168. doi:10.1109/CVPR.2006.264.

[70] G. Schindler, M. Brown, R. Szeliski, City-scale location recognition, in: Proceedings of IEEE CVPR, 2007.

[71] A. Irschara, C. Zach, J.-M. Frahm, H. Bischof, From structure-from-motion point clouds to fast location recognition, in: Proceedings of IEEE CVPR, 2009, pp. 2599–2606.

[72] F. Dellaert, M. Kaess, Square root SAM: Simultaneous localization and mapping via square root information smoothing, International Journal of Robotics Research.

[73] C. Engels, H. Stewénius, D. Nistér, Bundle adjustment rules, in: Photogrammetric Computer Vision (PCV), 2006.

[74] K. Ni, D. Steedly, F. Dellaert, Out-of-core bundle adjustment for large-scale 3D reconstruction, in: IEEE International Conference on Computer Vision (ICCV), 2007.

[75] M. Kaess, A. Ranganathan, F. Dellaert, iSAM: Incremental smoothing and mapping, IEEE Transactions on Robotics.

[76] T. A. Davis, J. R. Gilbert, S. Larimore, E. Ng, A column approximate minimum degree ordering algorithm, ACM Transactions on Mathematical Software 30 (3) (2004) 353–376.

[77] M. Lourakis, A. Argyros, The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm, Tech. Rep. 340, Institute of Computer Science - FORTH (2004).

[78] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, IJCV 42 (3) (2001) 145–175.

[79] J. Hays, A. A. Efros, Scene completion using millions of photographs, in: SIGGRAPH, 2007.

[80] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, C. Schmid, Evaluation of GIST descriptors for web-scale image search, in: International Conference on Image and Video Retrieval, 2009.

[81] J.-M. Frahm, M. Pollefeys, RANSAC for (quasi-)degenerate data (QDEGSAC), in: CVPR, Vol. 1, 2006, pp. 453–460.

[82] C. Beder, R. Steffen, Determining an initial image pair for fixing the scale of a 3D reconstruction from an image sequence, in: Proc. DAGM, 2006, pp. 657–666.

[83] M. Goesele, N. Snavely, B. Curless, H. Hoppe, S. M. Seitz, Multi-view stereo for community photo collections, in: ICCV, 2007.

[84] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.

[85] S. Kang, R. Szeliski, J. Chai, Handling occlusions in dense multi-view stereo, in: Int. Conf. on Computer Vision and Pattern Recognition, 2001, pp. 103–110.

[86] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nistér, M. Pollefeys, Real-time visibility-based fusion of depth maps, in: Proceedings of International Conf. on Computer Vision, 2007.
