
Summary Page

Do not exceed one page. Failure to include the summary page will result in the automatic rejection of the paper.

Answer each of the following questions in no more than 2-3 sentences:

1. Is this a system paper or a regular paper?

Regular paper.

2. If it is a system paper, please explain the contribution.

3. If it is a regular paper:

(a) What is the main contribution in terms of theory, algorithms and approach?

The paper addresses the alignment of 3D scenes observed from different cameras with large viewpoint differences. It has two main contributions. First, we introduce a novel viewpoint-independent feature based on texture and local geometry. The description contains the 3D position of the feature, the local surface normal, the scale of the feature, the 3D orientation of the feature, and the SIFT descriptor of the normalized feature texture. Second, we propose a novel, efficient matching algorithm that exploits the unique properties of our viewpoint-invariant features. The rich set of attributes of the novel features enables us to compute the 3D scene alignment from a single match.

(b) Describe the types of experiments and the novelty of the results. If applicable, provide comparison to the state of the art in this area.

We demonstrate the performance of our novel features and the efficient matching on multiple real-world examples and compare them with state-of-the-art image-based matching built on the popular SIFT features. Our algorithm clearly outperforms the traditional image-based techniques for large viewpoint changes of the camera. In this situation our novel viewpoint-invariant feature matching still achieves reliable matches, as shown in our evaluation. Due to the rich feature description we are able to align even surfaces with small overlap.

Figure 1. 3D model alignment of the models from two cameras. The red lines show the inlier matches. Please note that the second model is translated for visualization purposes.


3D Model Matching with Viewpoint-Invariant Patches (VIP)

Submitted to ICCV 07

Changchang Wu, Xiaowei Li, Jan-Michael Frahm and Marc Pollefeys
Department of Computer Science,
University of North Carolina at Chapel Hill, USA
{ccwu,xwli,jmf,marc}@cs.unc.edu

Abstract

The paper introduces a novel class of viewpoint-independent local features and novel algorithms to use them for 3D scene alignment. The advantages of the novel viewpoint-invariant patches (VIP) are: 1) a single VIP correspondence uniquely defines the 3D similarity transformation between the two VIP features, and 2) the novel features are invariant to 3D camera motion. In the paper we use these properties to introduce an efficient matching scheme for 3D scene alignment. The algorithm is based on a hierarchical RANSAC scheme which tests the components of the similarity transformation sequentially to allow matching and 3D scene alignment. We evaluate the novel features on real data with known ground truth information.

1. Introduction

In recent years, there have been significant research efforts towards fast and large-scale 3D scene reconstruction from video. Recent systems show real-time performance [1]. Large-scale reconstruction from video only is vulnerable to accumulated errors even with bundle adjustment. To avoid accumulated errors, the reconstruction inherently needs to recognize previously reconstructed scene parts and determine the similarity transformation between the current reconstruction and previous reconstructions. This similarity transformation is equivalent to the accumulated drift. Traditionally, image-based matching is used to provide the loop-closing constraints to bundle adjustment. Utilizing textures in 3D model matching is a hard problem due to the irregularity of 3D structure. Typically these image-based methods suffer under the large viewpoint changes that often occur. In urban modeling, for example, the path often crosses at an intersection, which typically means that the viewing direction changes by about 90°.

The paper introduces novel viewpoint-invariant patches (VIP) that provide the necessary properties to determine the similarity transformation between two 3D scenes. The VIPs are extracted from images based on their known geometry, and the detection works in Euclidean space to achieve projective invariance. Given an interesting 3D point, a normalized image patch is first generated by viewpoint normalization, which produces a local snapshot from a fixed orthographic viewpoint. This local snapshot can be viewed as an ortho-texture¹ of the 3D model. DoG extrema detection is then performed in this patch to check for the existence of a VIP keypoint. The normalized image patches of VIP keypoints are then encoded by SIFT descriptors [5].

3D models are then transformed into a set of VIPs, each of which has a 3D position, patch scale, surface normal, local gradient orientation, and a SIFT descriptor. The rich information of VIP correspondences makes them convenient for 3D similarity transformation estimation. One VIP correspondence is sufficient to compute a full similarity transformation by comparing the 3D positions, normals, orientations and scales of the two features. The scale and rotation components of the VIP correspondence are consistent with the relative scale and rotation between the two 3D models. Moreover, they are independent and can be tested separately and efficiently. These advantages lead to an efficient hierarchical exhaustive hypothesis test (EHT) scheme, which delivers a transformation by which 3D textured models can be stitched automatically.

The remainder of the paper is organized as follows: The related work is discussed in Section 2. Afterwards, Section 3 introduces the viewpoint-invariant patch and discusses its properties. An efficient VIP detector for urban scenes is discussed in Section 4. In Section 5 our novel hierarchical EHT is introduced. The novel algorithms are evaluated in Section 6.

¹ Ortho-texture: representation of the texture that is projected onto the surface with orthographic projection.



2. Related Work

Many image features have been developed to achieve invariance to similarity or affine transformations for wide-baseline matching. Lowe's SIFT keypoints [5] are among the most popular. The SIFT detector obtains the feature scale in scale space and the feature orientation from the local image gradients, and it generates normalized image patches to achieve similarity invariance. The SIFT descriptor is also a strong tool for representing the normalized image patch; it is used by many feature detectors, including affine-covariant features, and we also use SIFT to represent our VIPs. Affine-covariant features go further to achieve invariance to affine transformations, and [6] gives a good comparison of several such features. In this paper, we go even further and detect projectively invariant features in images based on 3D structure. Projective invariance follows naturally if the detector operates in Euclidean space.

With the advances in structure-from-motion techniques and active sensors in the last decade, the interest in the alignment of 3D models has grown. Fitzgibbon and Zisserman proposed a hierarchical structure-from-motion (SfM) scheme [2] to align local 3D scene models from consecutive triplets. The technique exploited 3D correspondences from common 2D tracks in consecutive triplets to compute the similarity transformation for alignment. The proposed technique works well for the small viewpoint changes between triplets which are typically observed in video. The approach by Snavely et al. [7] uses the well-known SIFT features to automatically extract wide-baseline salient features from photo collections. Then a robust matching and a succeeding bundle adjustment are used to determine the camera positions. Afterwards, the cameras and the images are used to provide an image alignment for the scene. These methods are based on texture only.

If a rough alignment transformation is known, ICP-based methods are widely used to compute the alignment by iteratively minimizing the sum of distances between closest points. For a successful alignment, the local reconstructions need to be as accurate as possible, which is often not the case. This limits the use of ICP-based techniques to refining a rough alignment. Obviously, ICP and its variants consider only 3D point positions.

There has also been some work on aligning 3D models with extracted geometric entities called spin images [3]. Stamos and Leordeanu used mainly planar regions and the 3D lines on them for 3D scene alignment [8]. The approach uses a pair of matched infinite lines on the two local 3D geometries to extract the in-plane rotation of the lines on the planar patches. The translation between the models is computed from the endpoints of the lines, estimated as the vector that connects the mid-points of the matching lines. In general, two pairs of matched 3D lines give a unique solution, which makes them a suitable minimal fit for a RANSAC scheme. These methods only deal with geometric entities. Along with ICP, these are methods based on geometry.

There are also methods based on texture and geometry. Liu et al. [4] extended their work to align 3D points from SfM to range data. They first register several images independently to the range data by matching vanishing points. Then the registered images are used as common points of the range data and a model from SfM. In the final step a robust alignment is computed by minimizing the distance between the range data and the geometry obtained from structure from motion. After the alignment, photorealistic texture is mapped onto the 3D surface models.

In [12], Zhao and Nistér propose a technique to align 3D point clouds from SfM and 3D sensors. They start the scheme by registering two images, fixing a rough transformation, and then refining with ICP.

Wyngaerd et al. did extensive work on stitching partially reconstructed models. In [11], they extract and match bitangent curve pairs from images by their invariant characteristics. Aligning these curves gives an initialization for more precise methods such as ICP. In [9], they use symmetric characteristics of surface patches, which helps to match the patches more accurately. This method also provides initial values for more accurate approaches. In [10], texture and shape information guide each other to look for better regions to match, which gives a more reliable initial alignment.

In comparison, our method belongs to the texture-and-geometry based methods, but the geometric relationship we use is more constrained than the ones mentioned above. Our scheme first exploits the invariance of feature descriptors for putative VIP matching. Then we use the local coordinates defined on a single VIP pair to give a unique solution for the whole similarity transformation that aligns the 3D scenes, which enables matching features from largely different viewpoints.

3. Viewpoint-Invariant Patch (VIP)

In this section we describe our novel features for 3D scene alignment. Viewpoint-invariant patches (VIPs) are features that can be extracted from textured 3D models and are invariant to 3D similarity transformations. We propose to use them to robustly align several 3D models of the same scene recorded from different viewpoints. In this paper we mostly consider 3D models obtained from video, but our method is also applicable to textured 3D models obtained with LIDAR or other sensors. The invariance to 3D similarities exactly corresponds to the ambiguity of 3D models obtained from images, while models from other sensors are often known up to a 3D Euclidean transformation or better.


Conceptually, for every point on the surface we can estimate the normal and generate a local texture patch by orthogonal projection onto the tangent plane. Within the local texture patch we can then verify whether the point corresponds to a locally maximal response of the DoG (Difference-of-Gaussians) filter in both scale and space, similarly to SIFT features. Once a VIP feature is identified, its orientation within the tangent plane is determined by the dominant gradient direction, and a SIFT descriptor is computed to describe the feature. The first step in the feature detection is to obtain a viewpoint-normalized ortho-texture for each patch. This is described in the next section.

3.1. Viewpoint Normalization

Viewpoint-normalized image patches need to be generated to describe VIPs. This is similar to normalizing image patches according to scale and orientation in SIFT, and according to the fitted ellipse in affine-covariant feature detection. The viewpoint normalization can be divided into the following steps:

1. Warp the image texture onto the local tangent plane to make the image patch invariant to the camera intrinsics. For this, the corresponding image patches are rendered onto the tangent plane in space at native resolution. This step can be viewed as using a virtual camera with fixed intrinsics and an image plane coplanar with the tangent plane to generate the image patches.

2. To obtain invariance to the viewing direction, the image patches are normalized to ortho-textures. Accordingly, the projection direction of the texture is parallel to the normal direction of the tangent plane. Naturally this limits VIPs to planar surfaces, as only then does a consistent normal exist. This guarantees that the succeeding in-plane computations are valid.

3. Invariance to scale is achieved by choosing the patch size according to the local scale, which can be determined from the local surface shape or the local texture. In this paper we determine the scale from the texture information, specifically using the DoG keypoint method of Lowe's SIFT, which makes our VIPs invariant to rotation around the patch normal. To keep the VIPs distinctive, we only generate VIPs for the 3D points at which DoG keypoints are detected in the ortho-texture.

In order to generate the ortho-textures efficiently, a local coordinate system is defined for each VIP such that the local z-axis is aligned with the normal. Suppose that the similarity transformation from world coordinates to the local coordinate system of patch j is X = R_j X_j + C_j, the local patch size is (w_j, h_j), the normalized image size is (W_j, H_j), and the camera matrix of the original texture is P = K[R|T]. Then the homography from local image coordinates x_j to the original texture coordinates x_o is

    x_o ≃ K ( R ( R_j [ w_j/W_j  0  −w_j/2 ;  0  h_j/H_j  −h_j/2 ;  0  0  0 ] x_j + C_j ) + T ).    (1)

Figure 2 demonstrates the viewpoint normalization. The 2nd and 3rd columns are ortho-textures. It can be seen that the ortho-textures are very similar, which enables image-based matching of the ortho-textures. The 1st and 4th columns show the original image representations of the VIPs. There is very large distortion due to the large viewpoint change, which poses significant problems for purely image-based SIFT matching.

Figure 2. Two pairs of VIP matches. The first and fourth columns show the original texture.

Figure 3. VIPs detected on a 3D model that has two texture images; the cameras corresponding to the textures are at the bottom.
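As an illustration of Eq. (1), the following minimal sketch (assuming NumPy, a grayscale image array, and hypothetical parameter names for the patch geometry) renders a viewpoint-normalized patch by inverse-mapping every ortho-texture pixel into the original image. It is a sketch of the idea, not the authors' implementation; a real system would interpolate and vectorize the loop.

```python
import numpy as np

def patch_to_image_mapping(K, R, T, R_j, C_j, w_j, h_j, W_j, H_j):
    """Build the mapping of Eq. (1) from homogeneous patch coordinates
    x_j = (u, v, 1) to original image coordinates.

    K, R, T   : intrinsics and pose of the original camera (P = K[R|T]).
    R_j, C_j  : rotation and origin of the local patch coordinate system
                (world point X = R_j X_j + C_j).
    w_j, h_j  : metric size of the patch on the tangent plane.
    W_j, H_j  : pixel size of the viewpoint-normalized patch.
    Returns (A, b) so that the homogeneous image point is A @ x_j + b."""
    # S places the patch in the z = 0 plane of the local frame,
    # scaled to metric units and centred on the patch origin.
    S = np.array([[w_j / W_j, 0.0,       -w_j / 2.0],
                  [0.0,       h_j / H_j, -h_j / 2.0],
                  [0.0,       0.0,        0.0]])
    A = K @ R @ R_j @ S            # linear part acting on x_j
    b = K @ (R @ C_j + T)          # constant part
    return A, b

def normalize_patch(image, K, R, T, R_j, C_j, w_j, h_j, W_j, H_j):
    """Render the viewpoint-normalized (ortho-texture) patch by inverse
    mapping every patch pixel into the original image (nearest neighbour)."""
    A, b = patch_to_image_mapping(K, R, T, R_j, C_j, w_j, h_j, W_j, H_j)
    patch = np.zeros((H_j, W_j), dtype=image.dtype)
    for v in range(H_j):
        for u in range(W_j):
            x = A @ np.array([u, v, 1.0]) + b
            px, py = x[0] / x[2], x[1] / x[2]
            if 0 <= int(py) < image.shape[0] and 0 <= int(px) < image.shape[1]:
                patch[v, u] = image[int(py), int(px)]
    return patch
```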


3.2. VIP Generation

VIPs are fully defined for the points where DoG extrema are found on the ortho-texture, and a VIP can then be denoted as (pos, σ, n, ori, d), where 1) pos is its 3D position, 2) σ is the patch size, 3) n is the surface normal at this location, 4) ori is the dominant texture orientation expressed as a vector in 3D, and 5) d is the SIFT descriptor that describes the viewpoint-normalized patch.

To complete the VIP, the 3D position and surface normal are trivial to obtain once the local coordinate system is defined. The planar texture orientation can also be transformed easily to obtain the 3D orientation vector. The patch size is chosen to be proportional to the local DoG extremum scale, and the normalized image patch gives the SIFT descriptor.

The above steps extract the VIP features from images and known local 3D geometry², as delivered for example by structure from motion. Accordingly, each frame and its associated 3D geometry are transformed into a set of distinctive VIPs, and VIP matches from two different models can then be used to stitch them together.
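For concreteness, here is a minimal container for the VIP tuple (pos, σ, n, ori, d) described above, written in Python with NumPy; the class and field names are illustrative, not part of the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VIP:
    """Container for one viewpoint-invariant patch, mirroring the tuple
    (pos, sigma, n, ori, d) described above. Field names are illustrative."""
    pos: np.ndarray         # 3D position of the feature (3,)
    sigma: float            # patch scale
    normal: np.ndarray      # unit surface normal (3,)
    ori: np.ndarray         # dominant texture orientation as a unit 3D vector (3,)
    descriptor: np.ndarray  # 128-D SIFT descriptor of the normalized patch

    def local_frame(self) -> np.ndarray:
        """Orthonormal frame with columns (n, ori, ori x n), used later for
        rotation estimation between matched VIPs."""
        return np.column_stack((self.normal, self.ori,
                                np.cross(self.ori, self.normal)))
```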

3.3. VIP Matching

VIP descriptors are invariant to any 3D similarity transformation of the 3D scene, because the image patches are already normalized according to the local scale and local normal. From another point of view, the VIP position, scale, normal, and orientation are covariant with the 3D similarity transformation. That is, the transformations of the position, scale, normal, and orientation of corresponding VIP matches are consistent with the global 3D similarity transformation between the 3D models. The viewpoint invariance of VIPs already enables robust matching even under significant viewpoint changes. Additionally, the properties of the VIPs allow us to recover the 3D similarity transformation between different local 3D models to align them in a common coordinate frame.

Putative VIP matches, as in other techniques that use SIFT descriptors, can easily be obtained using nearest-neighbor search or other scalable methods if the problem size is large. After obtaining all the putative matches between two 3D scenes, a robust estimation method can select an optimized transformation from the hypotheses generated by every VIP correspondence. Since VIPs are viewpoint invariant, given correct camera matrices and correct 3D structure, the descriptor similarity between correct matches is likely to be better than for methods that do not take this factor into account.
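A minimal sketch of such a nearest-neighbor putative matching step is given below, using SciPy's cKDTree over the SIFT descriptors and a Lowe-style ratio test; the ratio test and its threshold are assumptions of this sketch rather than details given in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def putative_vip_matches(vips_a, vips_b, ratio=0.8):
    """Nearest-neighbour matching of VIP descriptors.

    vips_a, vips_b : lists of VIP objects (see the sketch above).
    ratio          : Lowe-style ratio-test threshold -- an assumption here,
                     not something the paper prescribes.
    Returns a list of (index_a, index_b) putative correspondences."""
    desc_b = np.stack([v.descriptor for v in vips_b])
    tree = cKDTree(desc_b)
    matches = []
    for i, v in enumerate(vips_a):
        # distances to the two closest descriptors in the second model
        dist, idx = tree.query(v.descriptor, k=2)
        if dist[0] < ratio * dist[1]:
            matches.append((i, int(idx[0])))
    return matches
```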

The large amount of information associated with each VIP allows us to compute the 3D similarity transformation between two 3D scenes from a single match. The ratio of the scales of two VIPs expresses the relative scale between the 3D scenes. The rotation is obtained using the normals and orientations of the VIP pair. The translation between the scenes can be obtained by examining the rotation- and scale-compensated inlier locations. It should be noted that the scale and rotation needed to bring corresponding VIP features into alignment are constant over a complete 3D model. We will use this property later to set up an efficient hierarchical EHT scheme to determine 3D similarities between models.

² Local geometry denotes the geometry that is recovered from the images by using, for example, stereo. Please note that the local geometry is usually given in the coordinate system of the first camera of the sequence, with an arbitrary scale w.r.t. the real-world motion.

4. Efficient VIP Detection

In general, the planar patch detection needs to be executed for every pixel of the image to generate the ortho-textures. Each pixel (x, y) together with the camera center C defines a ray, which is intersected with the local 3D scene geometry. The point of intersection is the corresponding 3D point for the feature. From this point and its spatial neighbors we then compute the tangent plane Π_t at the point, which for planar regions coincides with the local plane. For structures that deviate only slightly from a plane, we obtain a planar approximation of the local geometry of the patch. The extracted plane can then be used to compute the VIP feature description with respect to this plane. This method is generally valid for any scene.

VIP detection for a set of points that share the same normal can be done efficiently in one pass. Considering the local coordinate systems of VIPs that share a normal, the image coordinate transformations between them are simply 2D similarity transformations. This means that the VIP detection for all those points can be done in one pass on a larger plane patch, onto which all the points are projected, and the original VIPs can be recovered by applying a known similarity transformation.

Planes are regions where many points have a common normal and where this normal grouping can be used. In many scenarios of interest for 3D scene alignment, such as urban scenes, the plane fitting can be implemented more efficiently by searching for large scene planes. On the one hand this saves computation time, and it also improves the robustness against errors in the local geometry. Moreover, parallel planes can also be grouped for VIP detection using the same idea.

Our method uses RANSAC to extract planes from the point clouds of the 3D models. After the extraction of a 3D plane, a local coordinate system and a bounding box are determined according to the distribution of the points on the plane, which will be discussed next. The detection is repeated on the remaining 3D points until there are not sufficient points left. The plane detection can be further improved in performance by limiting the 3D points used in the plane detection step to the ones that are SIFT features in the original image. Experiments verify that the detection results are similar to considering all points.
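The following sketch shows one simple way to realize this RANSAC plane extraction with NumPy; the thresholds and stopping criterion are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def ransac_plane(points, iters=500, inlier_thresh=0.05, rng=None):
    """Fit a single dominant plane to an (N, 3) point cloud with RANSAC.
    inlier_thresh is a distance threshold in scene units; both it and the
    iteration count are illustrative choices, not values from the paper.
    Returns (normal, d, inlier_mask) for the plane n.x + d = 0."""
    rng = rng or np.random.default_rng()
    best_mask, best_plane = None, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n.dot(sample[0])
        mask = np.abs(points @ n + d) < inlier_thresh
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_plane = mask, (n, d)
    return best_plane[0], best_plane[1], best_mask

def extract_planes(points, min_points=200):
    """Repeatedly extract planes until too few points remain."""
    planes, remaining = [], points
    while len(remaining) >= min_points:
        n, d, mask = ransac_plane(remaining)
        if mask.sum() < min_points:
            break
        planes.append((n, d, remaining[mask]))
        remaining = remaining[~mask]
    return planes
```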

Normally we use SVD on the normalized points to select the planar coordinate directions. Sometimes prior knowledge about the camera pose is available, and we can improve the detection step by incorporating this prior knowledge to obtain even better-quality features. It is common in video capture that the vertical direction of the images is approximately perpendicular to the ground plane. This also means that the viewing direction of the camera and the horizontal axis of the image are parallel to the ground plane. In this case, planes that are less than 5° away from being vertical are adjusted to be vertical.

After the coordinate directions are determined, we add a filtering step that generates a bounding box for each plane to enclose a solid surface patch, which is later used to generate the ortho-textures. The filter reduces the influence of outliers and surface discontinuities due to the imperfections of the stereo reconstruction. We determine the bounding box in the horizontal and vertical directions separately as follows (a minimal code sketch is given after the steps):

1. Divide the range along the direction into N_b bins and compute a histogram of point counts.

2. Mark the bins whose count is larger than T_good% of the average count as "good". Among all bin segments in which the gap between two consecutive "good" bins is less than T_gap, find the one with the largest sum of point counts.

3. Set the bounding box along this direction to the range of this segment of bins.
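The sketch below implements the three steps for one direction; N_b, T_good and T_gap are the parameters named above, with illustrative default values (the paper does not specify them).

```python
import numpy as np

def bounding_range(coords, n_bins=64, t_good=50.0, t_gap=2):
    """Select a 1D bounding range along one planar direction by histogram
    filtering, following the three steps above.

    coords : 1D array of point coordinates projected onto the direction.
    Returns (lo, hi), the selected range along this direction."""
    counts, edges = np.histogram(coords, bins=n_bins)
    good = counts > (t_good / 100.0) * counts.mean()      # step 2: "good" bins

    best = (0, -1, -1)                                    # (sum, start, end)
    i = 0
    while i < n_bins:
        if not good[i]:
            i += 1
            continue
        # grow a segment while gaps between good bins stay below t_gap
        j, last_good = i, i
        while j + 1 < n_bins and (good[j + 1] or (j + 1 - last_good) < t_gap):
            j += 1
            if good[j]:
                last_good = j
        seg_sum = counts[i:last_good + 1].sum()
        if seg_sum > best[0]:
            best = (seg_sum, i, last_good)
        i = last_good + 1

    if best[1] < 0:                                       # no good bins found
        return coords.min(), coords.max()
    _, s, e = best
    return edges[s], edges[e + 1]                         # step 3: bin range
```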

After a set of planes has been extracted, the ortho-textures of those planes are generated first. VIP detection is then performed only on these normalized images, instead of on the entire model, to obtain planar features. As in the previous section, the features are transformed back to the original 3D plane to obtain the actual VIPs. This is a very efficient solution for the case where we are interested in planar features.

Figure 4 illustrates a result of detecting VIPs on dominant planes. The plane fitting here effectively deals with the noise of the reconstructed model and helps VIP localization. Planes on the ground are ignored because of the large viewing angle. Figure 5 shows an example of the viewpoint-normalized images, which demonstrates the performance of our method. Some artifacts are visible in the image; they result from filling the pixels that cannot be seen by the original camera with the closest border pixel.

Figure 4. VIPs detected on dominant planes. Planes such as the ground plane here are too distorted to be useful and are ignored.

Figure 5. Bottom: the normalized plane of the upper image.

5. Hierarchical RANSAC for Estimating 3D Transformation

Given the putative matches obtained from two models, robust methods can be used to estimate the similarity transformation using only one correspondence. With the 3D-similarity-invariant VIPs, hypotheses can easily be proposed by evaluating the local feature information of the VIPs. Each single VIP correspondence gives a unique 3D transformation, which allows us to try all possible samples efficiently. Furthermore, the rotation and scaling components of the similarity transformation are constant over the entire models, and they can be tested separately and efficiently by consensus voting.

5.1. 3D Similarity Transformation from One VIP Correspondence

Given a pair of matched patches (VIP_1, VIP_2), the scale σ_s is estimated as

    σ_s = σ_1 / σ_2.    (2)

The rotation matrix R_s satisfies

    (n_2, ori_2, ori_2 × n_2) R_s = (n_1, ori_1, ori_1 × n_1).    (3)

The translation T_s is

    T_s = pos_1 − σ_s R_s pos_2,    (4)

where n_i, ori_i and pos_i are the corresponding components of VIP_i. Together, σ_s, R_s and T_s form the global transformation [σ_s R_s | T_s].
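A direct transcription of Eqs. (2)-(4) into code might look as follows; it reuses the hypothetical VIP container sketched in Section 3.2 and assumes the frames are column matrices, so the rotation is recovered as R_s = F1 F2^T (equivalent to Eq. (3) up to the row/column convention of the frame tuples).

```python
import numpy as np

def similarity_from_vip_pair(vip1, vip2):
    """Hypothesize the similarity that maps model 2 into model 1 from a single
    VIP correspondence, following Eqs. (2)-(4). Uses the VIP container and
    local_frame() helper sketched earlier (an assumption of this sketch)."""
    sigma_s = vip1.sigma / vip2.sigma                 # Eq. (2): relative scale

    # Frames with columns (n, ori, ori x n); R_s rotates the frame of VIP_2
    # onto the frame of VIP_1, i.e. R_s F2 = F1 (Eq. (3) up to the row/column
    # convention of the frame matrices).
    F1, F2 = vip1.local_frame(), vip2.local_frame()
    R_s = F1 @ F2.T                                   # F2 is orthonormal, so F2^-1 = F2^T

    T_s = vip1.pos - sigma_s * (R_s @ vip2.pos)       # Eq. (4): translation
    return sigma_s, R_s, T_s

def apply_similarity(sigma_s, R_s, T_s, X):
    """Map a 3D point X from the coordinate frame of model 2 to model 1."""
    return sigma_s * (R_s @ X) + T_s
```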

5.2. Hierarchical Exhaustive Hypothesis-Test (EHT) Scheme

The scale, rotation and translation of a VIP are covariant with the global 3D similarity transformation, and the local feature scale change and rotation of a correspondence are the same as the global scaling and rotation. Solving for these components separately and hierarchically is a good way to avoid uncertainties from the different parts. It is worth noting that in our experiments all possible hypotheses are tested exhaustively, since each VIP pair yields one hypothesis and thus the size of the whole sample space is linear in the number of VIP pairs. We benefit from this both in efficiency and in accuracy.

The 3D similarity estimation is done hierarchically as follows (a minimal code sketch follows below):

1) EHT1: For each VIP match, compute its local scaling σ_cur by Eq. (2); the matches whose scaling σ_i falls into (α σ_cur, σ_cur / α) vote for it. Find the match with the largest vote, re-estimate σ_best as the mean of σ_i over all inliers, and filter outliers from the putative set.

2) EHT2: For each VIP match in the filtered putative set, compute a rotation matrix R_cur by Eq. (3); the matches whose residual ‖(n_2, ori_2, ori_2 × n_2) R_cur − (n_1, ori_1, ori_1 × n_1)‖_1 is less than a threshold thr_R vote for it. Find the match with the largest vote as the best rotation, refine R_best by linearly solving a least-squares problem obtained by stacking the equations from all inliers, and again filter outliers from the putative set. Recompute σ_best on the newly filtered putative set.

3) EHT3: For each VIP match in the filtered putative set, compute the translation t_cur together with R_best and σ_best by Eq. (4). Find the match with the largest support, as in 1) and 2), using a threshold thr_t, and refine t_best as in 2). Again filter outliers from the putative set and recompute σ_best and R_best on the filtered putative set.

4) Run a non-linear optimization to refine scaling, rotation and translation using all the inliers.

After the above scheme, the similarity transformation is [σ_best R_best | t_best].
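To make the voting idea concrete, the sketch below implements the scale stage (EHT1); the rotation and translation stages follow the same vote-then-refine pattern using the residuals of Eqs. (3) and (4). The tolerance α and the structure of the code are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def eht_scale_stage(pairs, alpha=0.8):
    """First stage of the hierarchical EHT (scale voting), as a sketch.
    pairs : list of (vip1, vip2) putative correspondences.
    alpha : voting tolerance on the scale ratio (an illustrative value).
    Returns (sigma_best, filtered_pairs)."""
    scales = np.array([v1.sigma / v2.sigma for v1, v2 in pairs])   # Eq. (2)

    # Every correspondence proposes a hypothesis; count its supporters.
    votes = [np.sum((scales > alpha * s) & (scales < s / alpha)) for s in scales]
    s_hyp = scales[int(np.argmax(votes))]

    inliers = (scales > alpha * s_hyp) & (scales < s_hyp / alpha)
    sigma_best = scales[inliers].mean()                # re-estimate from inliers
    filtered = [p for p, keep in zip(pairs, inliers) if keep]
    return sigma_best, filtered

def hierarchical_eht(pairs):
    """Outline of the full scheme: scale, then rotation, then translation are
    voted on and refined in turn; the later stages follow the same pattern as
    eht_scale_stage using the residuals of Eqs. (3) and (4)."""
    sigma_best, pairs = eht_scale_stage(pairs)
    # rotation stage (EHT2) and translation stage (EHT3) would go here,
    # followed by a non-linear refinement over all surviving inliers.
    return sigma_best, pairs
```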

6. Experimental Results

This section evaluates the novel 3D model alignment technique using the proposed viewpoint-invariant features. We applied the VIP-based 3D alignment to several reconstructed models, and it demonstrated reliable surface alignment. The models in the experiments were given as images, depth maps and camera poses along with their intrinsics. The camera poses came from separate reconstructions, each in its own coordinate system.

The first scene, shown in Fig. 6, consists of two facades of a building reconstructed from two different cameras with significantly different viewing directions of about 45°. The cameras moved along a path that went around the building. The local geometry models were computed with structure from motion. One can observe reconstruction errors due to trees in front of the building. In Fig. 6, an offset is added to the second scene model to visualize the matching VIPs. The red lines connect all the inliers; rotation and scaling have already been compensated for in this visualization. The hierarchical EHT determines 122 inliers out of 2211 putative matches. The number of putative matches is high because putative matches are generated between every two sub-model units.

The second evaluation scene, shown in Figure 7, consists of two local scene models whose camera paths intersect at an angle of 90°. The overlapping region is only a very small part of the entire models, and it is viewed from very different viewpoints in the two videos. The experiments show that our 3D model alignment reliably detects the small overlapping surface and aligns the two models. Videos in the supplemental material illustrate the details of how our algorithm works.

Table 1 shows quantitative results of the hierarchical EHT. The first row gives the number of putative matches, and the second row shows how the number of inliers decreases through the three stages of the hierarchical EHT. Both the scale and the rotation verification already remove a significant portion of the outliers. For the evaluation we measure the distances between the overlapping parts of the models; the search for the closest points is done in a very small window around a point's projection. The statistics in Table 1 demonstrate the performance of our matching.

Figure 8 shows the histogram of the log of the VIP scale ratios of the putative matches of both experiments. It can be seen that the histogram of the scale ratios of all the putative matches has a distribution similar to that of the inlier set. This similarity is caused by texture repetition (e.g., the brick wall) on the model surface itself. This is in fact quite common for many scenes, and it means that our scale test still works reasonably even if there are many outliers.

Additionally, we compared our alignment with a purely SIFT-based alignment. The difficulty for SIFT matching is the significant viewpoint change. SIFT-based image matching was also conducted on the images of the overlapping surfaces in Fig. 7, where the viewing directions of the two cameras differ by about 45°. The putative match generation for the image matching is the same as for the VIP matching. Out of the 58 putative SIFT matches in this example, we were not able to recover a fundamental matrix with RANSAC. The failure of SIFT here can be explained by the large viewpoint change, which SIFT does not naturally handle. Another fact we noticed is that the descriptor distances of VIP inliers are usually smaller than the values we see for the SIFT correspondences. This observation is consistent with the viewpoint invariance of VIPs.


Figure 6. 3D model stitching of two walls.

Figure 7. 3D model stitching with very small overlap.

Figure 8. Histograms of the log of the scale ratios.

Table 1. Hierarchical EHT running details. Note that the proposed method recovers an inlier set of 122 from 2211 initial putative matches.

                              Scene #1 (Fig. 6)              Scene #2 (Fig. 7)
Putative matches              2211                           449
Inliers per stage             1197 → 542 → 122               143 → 45 → 35
Linear refinement             mean: 0.135, median: 0.135     mean: 0.0558, median: 0.0534
Non-linear refinement         mean: 0.0801, median: 0.0752   mean: 0.0136, median: 0.0113
Surface alignment error (m)   mean: 0.806, median: 0.332     mean: 0.392, median: 0.141

7. Summary and Conclusions

The paper addresses the important problem of the alignment of 3D scenes. Our alignment is based on viewpoint-invariant patches (VIP), a novel feature description, which allows scene alignment from a single correspondence. For the matching of VIPs we introduced an exhaustive hypothesis test which exploits the fact that the different components of the similarity transformation can be evaluated independently. Through this hierarchical hypothesis-test scheme we were able to overcome the problems posed by the high amount of uncertainty in the translational alignment of any standard matching. We evaluated the proposed matching using the novel VIPs on a variety of scenes. The presented evaluation shows that the novel features advance 3D alignment techniques for video.

References

[1] A. Akbarzadeh et al. Towards urban 3D reconstruction from video. In Proceedings of 3DPVT, 2006.

[2] A. W. Fitzgibbon and A. Zisserman. Automatic camera recovery for closed or open image sequences. In Proceedings of the European Conference on Computer Vision, pages 311-326, June 1998.

[3] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433-449, May 1999.

[4] L. Liu, I. Stamos, G. Yu, G. Wolberg, and S. Zokai. Multiview geometry for texture mapping 2D images onto 3D range data. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:2293-2300, 2006.

[5] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[6] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 2005.

[7] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, pages 835-846, 2006.

[8] I. Stamos and M. Leordeanu. Automated feature-based registration of urban scenes of large scale. In Proceedings of Computer Vision and Pattern Recognition, 2003.

[9] J. V. Wyngaerd and L. J. V. Gool. Automatic crude patch registration: Toward automatic 3D model building. Computer Vision and Image Understanding, 87(1-3):8-26, 2002.

[10] J. V. Wyngaerd and L. J. V. Gool. Combining texture and shape for automatic crude patch registration. In 3DIM, pages 179-186, 2003.

[11] J. V. Wyngaerd, L. J. V. Gool, R. Koch, and M. Proesmans. Invariant-based registration of surface patches. In ICCV, pages 301-306, 1999.

[12] W.-Y. Zhao, D. Nistér, and S. C. Hsu. Alignment of continuous video onto 3D point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 2005.
