
Summary Page

Do not exceed one page. Failure to include the summary page will result in the automatic rejection of the paper.

Answer each of the following questions in no more than 2-3 sentences:

1. Is this a system paper or a regular paper?

Regular paper.

2. If it is a system paper, please explain the contribution.

3. If it is a regular paper:

(a) What is the main contribution in terms of theory, algorithms and approach?

The paper addresses the alignment of 3D scenes observed from different cameras with large viewpoint differences. It has two main contributions. First, we introduce a novel viewpoint-independent feature based on texture and local geometry. The description contains the 3D position of the feature, the local surface normal, the scale of the feature, the 3D orientation of the feature, and the SIFT descriptor of the normalized feature texture. Second, we propose a novel, efficient matching algorithm that exploits the unique properties of our viewpoint-invariant features. The rich set of attributes of the novel features enables us to compute the 3D scene alignment from a single match.

(b) Describe the types of experiments and the novelty of the results. If applicable, provide comparison to the state of the art in this area.

We demonstrate the performance of our novel features and the efficient matching on multiple real-world examples and compare them with state-of-the-art image-based matching built on the popular SIFT features. Our algorithm clearly outperforms the traditional image-based techniques for large viewpoint changes of the camera. In this situation our novel viewpoint-invariant feature matching still achieves reliable matches, as shown in our evaluation. Due to the rich feature description we are able to align even surfaces with small overlap.

Figure 1. 3D model alignment of the models from two cameras. The red lines show the inlier matches. Please note that the second model is translated for visualization purposes.


3D Model Matching with Viewpoint-Invariant Patches (VIP)

Submitted to ICCV 07

Changchang Wu, Xiaowei Li, Jan-Michael Frahm and Marc Pollefeys
Department of Computer Science,
University of North Carolina at Chapel Hill, USA
{ccwu,xwli,jmf,marc}@cs.unc.edu

Abstract

The paper introduces a novel class of viewpoint-independent local features and novel algorithms to use them for 3D scene alignment. The advantages of the novel viewpoint-invariant patches (VIP) are: 1) a single VIP correspondence uniquely defines the 3D similarity transformation between the two VIP features, and 2) the novel features are invariant to 3D camera motion. In the paper we use these properties to introduce an efficient matching scheme for 3D scene alignment. The algorithm is based on a hierarchical RANSAC scheme which tests the components of the similarity transformation sequentially to allow matching and 3D scene alignment. We evaluate the novel features on real data with known ground truth information.

1. Introduction

In recent years, there have been significant research efforts towards fast and large-scale 3D scene reconstruction from video. Recent systems show real-time performance [1]. Large-scale reconstruction from video only is vulnerable to accumulated errors even with bundle adjustment. To avoid accumulated errors, the reconstruction inherently needs to recognize previously reconstructed scene parts and determine the similarity transformation between the current reconstruction and previous reconstructions. This similarity transformation is equivalent to the accumulated drift. Traditionally, image-based matching is used to provide the loop-closing constraints to bundle adjustment. Utilizing textures in 3D model matching is a hard problem due to the irregularity of 3D structure. Typically these image-based methods suffer under the large viewpoint changes that often occur. In urban modeling, for example, the path often crosses at an intersection, which typically means that the viewing direction changes by about 90°.

The paper introduces novel viewpoint-invariant patches (VIP) that provide the necessary properties to determine the similarity transformation between two 3D scenes. The VIPs are extracted from images based on their known geometry, and the detection works in Euclidean space to achieve projective invariance. Given an interesting 3D point, a normalized image patch is first generated by viewpoint normalization, which produces a local snapshot from a fixed orthographic viewpoint. This local snapshot can be viewed as an ortho-texture¹ of the 3D model. DoG extrema detection is then performed in this patch to check for the existence of a VIP keypoint. The normalized image patches of VIP keypoints are then encoded by SIFT descriptors [5].

3D models are then transformed into a set of VIPs, each of which has a 3D position, patch scale, surface normal, local gradient orientation, and a SIFT descriptor. The rich information of VIP correspondences makes them convenient for 3D similarity transformation estimation. One VIP correspondence is sufficient to compute a full similarity transformation by comparing the 3D positions, normals, orientations and scales of the two features. The scale and rotation components of the VIP correspondence are consistent with the relative scale and rotation between the two 3D models. Moreover, they are independent and can be tested separately and efficiently. These advantages lead to an efficient hierarchical exhaustive hypothesis test (EHT) scheme, which delivers a transformation by which 3D textured models can be stitched automatically.

The remainder of the paper is organized as follows: The related work is discussed in Section 2. Afterwards, Section 3 introduces the viewpoint-invariant patch and discusses its properties. An efficient VIP detector for urban scenes is discussed in Section 4. In Section 5 our novel hierarchical EHT is introduced. The novel algorithms are evaluated in Section 6.

¹ Ortho-texture: representation of the texture that is projected onto the surface with orthographic projection.



2. Related Work

Many image features have been developed to achieve invariance to similarity or affine transformations for wide-baseline matching. Lowe's SIFT keypoints [5] are among the most popular. The SIFT detector obtains the feature scale in scale space and the feature orientation from the local image gradients, and it generates normalized image patches to achieve similarity invariance. The SIFT descriptor is also a strong tool for representing the normalized image patch; it is used by many feature detectors, including affine-covariant features, and we also use SIFT to represent our VIPs. Affine-covariant features go further to achieve invariance to affine transformations, and [6] gives a good comparison of several such features. In this paper, we go even further and detect projectively invariant features in images based on 3D structure. Projective invariance follows naturally if the detector operates in Euclidean space.

With the advances in structure-from-motion techniques and active sensors in the last decade, the interest in the alignment of 3D models has grown. Fitzgibbon and Zisserman proposed a hierarchical structure-from-motion (SfM) scheme [2] to align local 3D scene models from consecutive triplets. The technique exploited 3D correspondences from common 2D tracks in consecutive triplets to compute the similarity transformation for alignment. The proposed technique works well for the small viewpoint changes between triplets which are typically observed in video. The approach by Snavely et al. [7] uses the well-known SIFT features to automatically extract wide-baseline salient features from photo collections. Then a robust matching and a succeeding bundle adjustment are used to determine the camera positions. Afterwards, the cameras and the images are used to provide an image alignment for the scene. These methods are based on texture only.

If a rough alignment transformation is known, ICP-based methods are widely used to compute the alignment by iteratively minimizing the sum of distances between closest points. For a successful alignment, the local reconstructions need to be as accurate as possible, which is often not the case. This limits the use of ICP-based techniques to refining a rough alignment. Obviously, ICP and its variants consider only 3D point positions.

There has also been some work on aligning 3D models with extracted geometric entities called spin images [3]. Stamos and Leordeanu used mainly planar regions and the 3D lines on them for 3D scene alignment [8]. The approach uses a pair of matched infinite lines on the two local 3D geometries to extract the in-plane rotation of the lines on the planar patches. The translation between the models is computed from the endpoints of the lines, estimated as the vector that connects the mid-points of the matching lines. In general, two pairs of matched 3D lines give a unique solution, which makes them a suitable minimal fit for a RANSAC scheme. These methods only deal with geometric entities. Along with ICP, these are methods based on geometry.

There are also methods based on texture and geometry. Liu et al. [4] extended their work to align 3D points from SfM to range data. They first register several images independently to the range data by matching vanishing points. Then the registered images are used as common points of the range data and a model from SfM. In the final step a robust alignment is computed by minimizing the distance between the range data and the geometry obtained from structure from motion. After the alignment, photorealistic texture is mapped onto the 3D surface models.

In [12], Zhao and Nistér propose a technique to align 3D point clouds from SfM and 3D sensors. They start the scheme by registering two images, fixing a rough transformation, and then refining with ICP.

Wyngaerd et al. did extensive work on stitching partially reconstructed models. In [11], they extract and match bitangent curve pairs from images by their invariant characteristics. Aligning these curves gives an initialization for more precise methods such as ICP. In [9], they use symmetric characteristics of surface patches, which helps to match the patches more accurately. This method also provides initial values for more accurate approaches. In [10], texture and shape information guide each other to look for better regions to match, which gives a more reliable initial alignment.

In comparison, our method belongs to the texture-and-geometry based methods, but the geometric relationship we use is more constrained than the ones mentioned above. Our scheme first exploits the invariance of feature descriptors for putative VIP matching. Then we use the local coordinates defined on a single VIP pair to give a unique solution for the whole similarity transformation that aligns the 3D scenes, which enables matching features from largely different viewpoints.

3. Viewpoint-Invariant Patch (VIP)

In this section we describe our novel features for 3D scene alignment. Viewpoint-invariant patches (VIPs) are features that can be extracted from textured 3D models and are invariant to 3D similarity transformations. We propose to use them to robustly align several 3D models of the same scene recorded from different viewpoints. In this paper we mostly consider 3D models obtained from video, but our method is also applicable to textured 3D models obtained with LIDAR or other sensors. The invariance to 3D similarities exactly corresponds to the ambiguity of 3D models obtained from images, while models from other sensors are often known up to a 3D Euclidean transformation or better.


Conceptually, for every point on the surface we can estimate the normal and generate a local texture patch by orthogonal projection onto the tangent plane. Within the local texture patch we can then verify whether the point corresponds to a locally maximal response of the DoG (Difference-of-Gaussians) filter in both scale and space, similarly to SIFT features. Once a VIP feature is identified, its orientation within the tangent plane is determined by the dominant gradient direction, and a SIFT descriptor is computed to describe the feature. The first step in the feature detection is to obtain a viewpoint-normalized ortho-texture for each patch. This is described in the next section.

3.1. Viewpoint Normalization

Viewpoint-normalized image patches need to be generated to describe VIPs. This is similar to normalizing image patches according to scale and orientation in SIFT, and according to the fitted ellipse in affine-covariant feature detection. The viewpoint normalization can be divided into the following steps:

1. Warp the image texture onto the local tangent plane to make the image patch invariant to the camera intrinsics. For this, the corresponding image patches are rendered onto the tangent plane in space at native resolution. This step can be viewed as using a virtual camera with fixed intrinsics and an image plane coplanar with the tangent plane to generate the image patches.

2. To obtain invariance to the viewing direction, the image patches are normalized to ortho-textures. Accordingly, the projection direction of the texture is parallel to the normal direction of the tangent plane. Naturally this limits VIPs to planar surfaces, as only then does a consistent normal exist. This guarantees that the succeeding in-plane computations are valid.

3. Invariance to scale is achieved by choosing the patch size according to the local scale, which can be determined from the local surface shape or the local texture. In this paper we determine the scale from the texture information, specifically using the DoG keypoint method of Lowe's SIFT, which makes our VIPs invariant to rotation around the patch normal. To keep the VIPs distinctive, we only generate VIPs for the 3D points at which DoG keypoints are detected in the ortho-texture.

In order to generate the ortho-textures efficiently, a local coordinate system is defined for each VIP such that the local z-axis is aligned with the normal. Suppose that the similarity transformation from world coordinates to the local coordinate system of patch j is X = R_j X_j + C_j, the local patch size is (w_j, h_j), the normalized image size is (W_j, H_j), and the camera matrix of the original texture is P = K[R|T]. Then the homography from local image coordinates x_j to the original texture coordinates x_o is

    x_o ≃ K ( R ( R_j [ w_j/W_j  0  −w_j/2 ;  0  h_j/H_j  −h_j/2 ;  0  0  0 ] x_j + C_j ) + T ).    (1)

Figure 2 demonstrates the viewpoint normalization. The 2nd and 3rd columns are ortho-textures. It can be seen that the ortho-textures are very similar, which enables image-based matching of the ortho-textures. The 1st and 4th columns show the original image representations of the VIPs. There is very large distortion due to the large viewpoint change, which poses significant problems for purely image-based SIFT matching.

Figure 2. Two pairs of VIP matches. The first and fourth columns show the original texture.

Figure 3. VIPs detected on a 3D model that has two texture images; the cameras corresponding to the textures are at the bottom.
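As an illustration of Eq. (1), the following minimal sketch (assuming NumPy, a grayscale image array, and hypothetical parameter names for the patch geometry) renders a viewpoint-normalized patch by inverse-mapping every ortho-texture pixel into the original image. It is a sketch of the idea, not the authors' implementation; a real system would interpolate and vectorize the loop.

```python
import numpy as np

def patch_to_image_mapping(K, R, T, R_j, C_j, w_j, h_j, W_j, H_j):
    """Build the mapping of Eq. (1) from homogeneous patch coordinates
    x_j = (u, v, 1) to original image coordinates.

    K, R, T   : intrinsics and pose of the original camera (P = K[R|T]).
    R_j, C_j  : rotation and origin of the local patch coordinate system
                (world point X = R_j X_j + C_j).
    w_j, h_j  : metric size of the patch on the tangent plane.
    W_j, H_j  : pixel size of the viewpoint-normalized patch.
    Returns (A, b) so that the homogeneous image point is A @ x_j + b."""
    # S places the patch in the z = 0 plane of the local frame,
    # scaled to metric units and centred on the patch origin.
    S = np.array([[w_j / W_j, 0.0,       -w_j / 2.0],
                  [0.0,       h_j / H_j, -h_j / 2.0],
                  [0.0,       0.0,        0.0]])
    A = K @ R @ R_j @ S            # linear part acting on x_j
    b = K @ (R @ C_j + T)          # constant part
    return A, b

def normalize_patch(image, K, R, T, R_j, C_j, w_j, h_j, W_j, H_j):
    """Render the viewpoint-normalized (ortho-texture) patch by inverse
    mapping every patch pixel into the original image (nearest neighbour)."""
    A, b = patch_to_image_mapping(K, R, T, R_j, C_j, w_j, h_j, W_j, H_j)
    patch = np.zeros((H_j, W_j), dtype=image.dtype)
    for v in range(H_j):
        for u in range(W_j):
            x = A @ np.array([u, v, 1.0]) + b
            px, py = x[0] / x[2], x[1] / x[2]
            if 0 <= int(py) < image.shape[0] and 0 <= int(px) < image.shape[1]:
                patch[v, u] = image[int(py), int(px)]
    return patch
```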


3.2. VIP Generation

VIPs are fully defined for the points where DoG extrema are found on the ortho-texture, and a VIP can then be denoted as (pos, σ, n, ori, d), where 1) pos is its 3D position, 2) σ is the patch size, 3) n is the surface normal at this location, 4) ori is the dominant texture orientation expressed as a vector in 3D, and 5) d is the SIFT descriptor that describes the viewpoint-normalized patch.

To complete the VIP, the 3D position and surface normal are trivial to obtain once the local coordinate system is defined. The planar texture orientation can also be transformed easily to obtain the 3D orientation vector. The patch size is chosen to be proportional to the local DoG extremum scale, and the normalized image patch gives the SIFT descriptor.

The above steps extract the VIP features from images and known local 3D geometry², as delivered for example by structure from motion. Accordingly, each frame and its associated 3D geometry are transformed into a set of distinctive VIPs, and VIP matches from two different models can then be used to stitch them together.
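For concreteness, here is a minimal container for the VIP tuple (pos, σ, n, ori, d) described above, written in Python with NumPy; the class and field names are illustrative, not part of the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VIP:
    """Container for one viewpoint-invariant patch, mirroring the tuple
    (pos, sigma, n, ori, d) described above. Field names are illustrative."""
    pos: np.ndarray         # 3D position of the feature (3,)
    sigma: float            # patch scale
    normal: np.ndarray      # unit surface normal (3,)
    ori: np.ndarray         # dominant texture orientation as a unit 3D vector (3,)
    descriptor: np.ndarray  # 128-D SIFT descriptor of the normalized patch

    def local_frame(self) -> np.ndarray:
        """Orthonormal frame with columns (n, ori, ori x n), used later for
        rotation estimation between matched VIPs."""
        return np.column_stack((self.normal, self.ori,
                                np.cross(self.ori, self.normal)))
```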

3.3. VIP Matching

VIP descriptors are invariant to any 3D similarity transformation of the 3D scene, because the image patches are already normalized according to the local scale and local normal. From another point of view, the VIP position, scale, normal, and orientation are covariant with the 3D similarity transformation. That is, the transformations of the position, scale, normal, and orientation of corresponding VIP matches are consistent with the global 3D similarity transformation between the 3D models. The viewpoint invariance of VIPs already enables robust matching even under significant viewpoint changes. Additionally, the properties of the VIPs allow us to recover the 3D similarity transformation between different local 3D models to align them in a common coordinate frame.

Putative VIP matches, as in other techniques that use SIFT descriptors, can easily be obtained using nearest-neighbor search or other scalable methods if the problem size is large. After obtaining all the putative matches between two 3D scenes, a robust estimation method can select an optimized transformation from the hypotheses generated by every VIP correspondence. Since VIPs are viewpoint invariant, given correct camera matrices and correct 3D structure, the descriptor similarity between correct matches is likely to be better than for methods that do not take this factor into account.
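A minimal sketch of such a nearest-neighbor putative matching step is given below, using SciPy's cKDTree over the SIFT descriptors and a Lowe-style ratio test; the ratio test and its threshold are assumptions of this sketch rather than details given in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def putative_vip_matches(vips_a, vips_b, ratio=0.8):
    """Nearest-neighbour matching of VIP descriptors.

    vips_a, vips_b : lists of VIP objects (see the sketch above).
    ratio          : Lowe-style ratio-test threshold -- an assumption here,
                     not something the paper prescribes.
    Returns a list of (index_a, index_b) putative correspondences."""
    desc_b = np.stack([v.descriptor for v in vips_b])
    tree = cKDTree(desc_b)
    matches = []
    for i, v in enumerate(vips_a):
        # distances to the two closest descriptors in the second model
        dist, idx = tree.query(v.descriptor, k=2)
        if dist[0] < ratio * dist[1]:
            matches.append((i, int(idx[0])))
    return matches
```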

The large amount of information associated with each VIP allows us to compute the 3D similarity transformation between two 3D scenes from a single match. The ratio of the scales of two VIPs expresses the relative scale between the 3D scenes. The rotation is obtained using the normals and orientations of the VIP pair. The translation between the scenes can be obtained by examining the rotation- and scale-compensated inlier locations. It should be noted that the scale and rotation needed to bring corresponding VIP features into alignment are constant over a complete 3D model. We will use this property later to set up an efficient hierarchical EHT scheme to determine 3D similarities between models.

² Local geometry denotes the geometry that is recovered from the images by using, for example, stereo. Please note that the local geometry is usually given in the coordinate system of the first camera of the sequence, with an arbitrary scale w.r.t. the real-world motion.

4. Efficient VIP Detection

In general, the planar patch detection needs to be executed for every pixel of the image to generate the ortho-textures. Each pixel (x, y) together with the camera center C defines a ray, which is intersected with the local 3D scene geometry. The point of intersection is the corresponding 3D point for the feature. From this point and its spatial neighbors we then compute the tangent plane Π_t at the point, which for planar regions coincides with the local plane. For structures that deviate only slightly from a plane, we obtain a planar approximation of the local geometry of the patch. The extracted plane can then be used to compute the VIP feature description with respect to this plane. This method is generally valid for any scene.

VIP detection for a set of points that share the same normal can be done efficiently in one pass. Considering the local coordinate systems of VIPs that share a normal, the image coordinate transformations between them are simply 2D similarity transformations. This means that the VIP detection for all those points can be done in one pass on a larger plane patch, onto which all the points are projected, and the original VIPs can be recovered by applying a known similarity transformation.

Planes are regions where many points have a common normal and where this normal grouping can be used. In many scenarios of interest for 3D scene alignment, such as urban scenes, the plane fitting can be implemented more efficiently by searching for large scene planes. On the one hand this saves computation time, and it also improves the robustness against errors in the local geometry. Moreover, parallel planes can also be grouped for VIP detection using the same idea.

Our method uses RANSAC to extract planes from the point clouds of the 3D models. After the extraction of a 3D plane, a local coordinate system and a bounding box are determined according to the distribution of the points on the plane, which will be discussed next. The detection is repeated on the remaining 3D points until there are not sufficient points left. The plane detection can be further improved in performance by limiting the 3D points used in the plane detection step to the ones that are SIFT features in the original image. Experiments verify that the detection results are similar to considering all points.
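The following sketch shows one simple way to realize this RANSAC plane extraction with NumPy; the thresholds and stopping criterion are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def ransac_plane(points, iters=500, inlier_thresh=0.05, rng=None):
    """Fit a single dominant plane to an (N, 3) point cloud with RANSAC.
    inlier_thresh is a distance threshold in scene units; both it and the
    iteration count are illustrative choices, not values from the paper.
    Returns (normal, d, inlier_mask) for the plane n.x + d = 0."""
    rng = rng or np.random.default_rng()
    best_mask, best_plane = None, None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n.dot(sample[0])
        mask = np.abs(points @ n + d) < inlier_thresh
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_plane = mask, (n, d)
    return best_plane[0], best_plane[1], best_mask

def extract_planes(points, min_points=200):
    """Repeatedly extract planes until too few points remain."""
    planes, remaining = [], points
    while len(remaining) >= min_points:
        n, d, mask = ransac_plane(remaining)
        if mask.sum() < min_points:
            break
        planes.append((n, d, remaining[mask]))
        remaining = remaining[~mask]
    return planes
```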

Normally we use SVD on the normalized points to select the planar coordinate directions. Sometimes prior knowledge about the camera pose is available, and we can improve the detection step by incorporating this prior knowledge to obtain even better-quality features. It is common in video capture that the vertical direction of the images is approximately perpendicular to the ground plane. This also means that the viewing direction of the camera and the horizontal axis of the image are parallel to the ground plane. In this case, planes that are less than 5° away from being vertical are adjusted to be vertical.

After the coordinate directions are determined, we add a filtering step that generates a bounding box for each plane to enclose a solid surface patch, which is later used to generate the ortho-textures. The filter reduces the influence of outliers and surface discontinuities due to the imperfections of the stereo reconstruction. We determine the bounding box in the horizontal and vertical directions separately as follows (a minimal code sketch is given after the steps):

1. Divide the range along the direction into N_b bins and compute a histogram of point counts.

2. Mark the bins whose count is larger than T_good% of the average count as "good". Among all bin segments in which the gap between two consecutive "good" bins is less than T_gap, find the one with the largest sum of point counts.

3. Set the bounding box along this direction to the range of this segment of bins.
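The sketch below implements the three steps for one direction; N_b, T_good and T_gap are the parameters named above, with illustrative default values (the paper does not specify them).

```python
import numpy as np

def bounding_range(coords, n_bins=64, t_good=50.0, t_gap=2):
    """Select a 1D bounding range along one planar direction by histogram
    filtering, following the three steps above.

    coords : 1D array of point coordinates projected onto the direction.
    Returns (lo, hi), the selected range along this direction."""
    counts, edges = np.histogram(coords, bins=n_bins)
    good = counts > (t_good / 100.0) * counts.mean()      # step 2: "good" bins

    best = (0, -1, -1)                                    # (sum, start, end)
    i = 0
    while i < n_bins:
        if not good[i]:
            i += 1
            continue
        # grow a segment while gaps between good bins stay below t_gap
        j, last_good = i, i
        while j + 1 < n_bins and (good[j + 1] or (j + 1 - last_good) < t_gap):
            j += 1
            if good[j]:
                last_good = j
        seg_sum = counts[i:last_good + 1].sum()
        if seg_sum > best[0]:
            best = (seg_sum, i, last_good)
        i = last_good + 1

    if best[1] < 0:                                       # no good bins found
        return coords.min(), coords.max()
    _, s, e = best
    return edges[s], edges[e + 1]                         # step 3: bin range
```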

After a set of planes has been extracted, the ortho-textures of those planes are generated first. VIP detection is then performed only on these normalized images, instead of on the entire model, to obtain planar features. As in the previous section, the features are transformed back to the original 3D plane to obtain the actual VIPs. This is a very efficient solution for the case where we are interested in planar features.

Figure 4 illustrates a result of detecting VIPs on dominant planes. The plane fitting here effectively deals with the noise of the reconstructed model and helps VIP localization. Planes on the ground are ignored because of the large viewing angle. Figure 5 shows an example of the viewpoint-normalized images, which demonstrates the performance of our method. Some artifacts are visible in the image; they result from filling the pixels that cannot be seen by the original camera with the closest border pixel.

Figure 4. VIPs detected on dominant planes. Planes such as the ground plane here are too distorted to be useful and are ignored.

Figure 5. Bottom: the normalized plane of the upper image.

5. Hierarchical RANSAC for Estimating 3D Transformation

Given the putative matches obtained from two models, robust methods can be used to estimate the similarity transformation using only one correspondence. With the 3D-similarity-invariant VIPs, hypotheses can easily be proposed by evaluating the local feature information of the VIPs. Each single VIP correspondence gives a unique 3D transformation, which allows us to try all possible samples efficiently. Furthermore, the rotation and scaling components of the similarity transformation are constant over the entire models, and they can be tested separately and efficiently by consensus voting.

5.1. 3D Similarity Transformation from One VIP Correspondence

Given a pair of matched patches (VIP_1, VIP_2), the scale σ_s is estimated as

    σ_s = σ_1 / σ_2.    (2)

The rotation matrix R_s satisfies

    (n_2, ori_2, ori_2 × n_2) R_s = (n_1, ori_1, ori_1 × n_1).    (3)

The translation T_s is

    T_s = pos_1 − σ_s R_s pos_2,    (4)

where n_i, ori_i and pos_i are the corresponding components of VIP_i. Together, σ_s, R_s and T_s form the global transformation [σ_s R_s | T_s].
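A direct transcription of Eqs. (2)-(4) into code might look as follows; it reuses the hypothetical VIP container sketched in Section 3.2 and assumes the frames are column matrices, so the rotation is recovered as R_s = F1 F2^T (equivalent to Eq. (3) up to the row/column convention of the frame tuples).

```python
import numpy as np

def similarity_from_vip_pair(vip1, vip2):
    """Hypothesize the similarity that maps model 2 into model 1 from a single
    VIP correspondence, following Eqs. (2)-(4). Uses the VIP container and
    local_frame() helper sketched earlier (an assumption of this sketch)."""
    sigma_s = vip1.sigma / vip2.sigma                 # Eq. (2): relative scale

    # Frames with columns (n, ori, ori x n); R_s rotates the frame of VIP_2
    # onto the frame of VIP_1, i.e. R_s F2 = F1 (Eq. (3) up to the row/column
    # convention of the frame matrices).
    F1, F2 = vip1.local_frame(), vip2.local_frame()
    R_s = F1 @ F2.T                                   # F2 is orthonormal, so F2^-1 = F2^T

    T_s = vip1.pos - sigma_s * (R_s @ vip2.pos)       # Eq. (4): translation
    return sigma_s, R_s, T_s

def apply_similarity(sigma_s, R_s, T_s, X):
    """Map a 3D point X from the coordinate frame of model 2 to model 1."""
    return sigma_s * (R_s @ X) + T_s
```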

5.2. Hierarchical Exhaustive Hypothesis-Test (EHT) Scheme

The scale, rotation and translation of a VIP are covariant with the global 3D similarity transformation, and the local feature scale change and rotation of a correspondence are the same as the global scaling and rotation. Solving for these components separately and hierarchically is a good way to avoid uncertainties from the different parts. It is worth noting that in our experiments all possible hypotheses are tested exhaustively, since each VIP pair yields one hypothesis and thus the size of the whole sample space is linear in the number of VIP pairs. We benefit from this both in efficiency and in accuracy.

The 3D similarity estimation is done hierarchically as follows (a minimal code sketch follows below):

1) EHT1: For each VIP match, compute its local scaling σ_cur by Eq. (2); the matches whose scaling σ_i falls into (α σ_cur, σ_cur / α) vote for it. Find the match with the largest vote, re-estimate σ_best as the mean of σ_i over all inliers, and filter outliers from the putative set.

2) EHT2: For each VIP match in the filtered putative set, compute a rotation matrix R_cur by Eq. (3); the matches whose residual ‖(n_2, ori_2, ori_2 × n_2) R_cur − (n_1, ori_1, ori_1 × n_1)‖_1 is less than a threshold thr_R vote for it. Find the match with the largest vote as the best rotation, refine R_best by linearly solving a least-squares problem obtained by stacking the equations from all inliers, and again filter outliers from the putative set. Recompute σ_best on the newly filtered putative set.

3) EHT3: For each VIP match in the filtered putative set, compute the translation t_cur together with R_best and σ_best by Eq. (4). Find the match with the largest support, as in 1) and 2), using a threshold thr_t, and refine t_best as in 2). Again filter outliers from the putative set and recompute σ_best and R_best on the filtered putative set.

4) Run a non-linear optimization to refine scaling, rotation and translation using all the inliers.

After the above scheme, the similarity transformation is [σ_best R_best | t_best].
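To make the voting idea concrete, the sketch below implements the scale stage (EHT1); the rotation and translation stages follow the same vote-then-refine pattern using the residuals of Eqs. (3) and (4). The tolerance α and the structure of the code are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def eht_scale_stage(pairs, alpha=0.8):
    """First stage of the hierarchical EHT (scale voting), as a sketch.
    pairs : list of (vip1, vip2) putative correspondences.
    alpha : voting tolerance on the scale ratio (an illustrative value).
    Returns (sigma_best, filtered_pairs)."""
    scales = np.array([v1.sigma / v2.sigma for v1, v2 in pairs])   # Eq. (2)

    # Every correspondence proposes a hypothesis; count its supporters.
    votes = [np.sum((scales > alpha * s) & (scales < s / alpha)) for s in scales]
    s_hyp = scales[int(np.argmax(votes))]

    inliers = (scales > alpha * s_hyp) & (scales < s_hyp / alpha)
    sigma_best = scales[inliers].mean()                # re-estimate from inliers
    filtered = [p for p, keep in zip(pairs, inliers) if keep]
    return sigma_best, filtered

def hierarchical_eht(pairs):
    """Outline of the full scheme: scale, then rotation, then translation are
    voted on and refined in turn; the later stages follow the same pattern as
    eht_scale_stage using the residuals of Eqs. (3) and (4)."""
    sigma_best, pairs = eht_scale_stage(pairs)
    # rotation stage (EHT2) and translation stage (EHT3) would go here,
    # followed by a non-linear refinement over all surviving inliers.
    return sigma_best, pairs
```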

6. Experimental Results

This section evaluates the novel 3D model alignment technique using the proposed viewpoint-invariant features. We applied the VIP-based 3D alignment to several reconstructed models, and it demonstrated reliable surface alignment. The models in the experiments were given as images, depth maps and camera poses along with their intrinsics. The camera poses came from separate reconstructions, each in its own coordinate system.

The first scene, shown in Fig. 6, consists of two facades of a building reconstructed from two different cameras with significantly different viewing directions of about 45°. The cameras moved along a path that went around the building. The local geometry models were computed with structure from motion. One can observe reconstruction errors due to trees in front of the building. In Fig. 6, an offset is added to the second scene model to visualize the matching VIPs. The red lines connect all the inliers; rotation and scaling have already been compensated for in this visualization. The hierarchical EHT determines 122 inliers out of 2211 putative matches. The number of putative matches is high because putative matches are generated between every two sub-model units.

The second evaluation scene, shown in Figure 7, consists of two local scene models whose camera paths intersect at an angle of 90°. The overlapping region is only a very small part of the entire models, and it is viewed from very different viewpoints in the two videos. The experiments show that our 3D model alignment reliably detects the small overlapping surface and aligns the two models. Videos in the supplemental material illustrate the details of how our algorithm works.

Table 1 shows quantitative results of the hierarchical EHT. The first row gives the number of putative matches, and the second row shows how the number of inliers decreases through the three stages of the hierarchical EHT. Both the scale and the rotation verification already remove a significant portion of the outliers. For the evaluation we measure the distances between the overlapping parts of the models; the search for the closest points is done in a very small window around a point's projection. The statistics in Table 1 demonstrate the performance of our matching.

Figure 8 shows the histogram of the log of the VIP scale ratios of the putative matches of both experiments. It can be seen that the histogram of the scale ratios of all the putative matches has a distribution similar to that of the inlier set. This similarity is caused by texture repetition (e.g., the brick wall) on the model surface itself. This is in fact quite common for many scenes, and it means that our scale test still works reasonably even if there are many outliers.

Additionally, we compared our alignment with a purely SIFT-based alignment. The difficulty for SIFT matching is the significant viewpoint change. SIFT-based image matching was also conducted on the images of the overlapping surfaces in Fig. 7, where the viewing directions of the two cameras differ by about 45°. The putative match generation for the image matching is the same as for the VIP matching. Out of the 58 putative SIFT matches in this example, we were not able to recover a fundamental matrix with RANSAC. The failure of SIFT here can be explained by the large viewpoint change, which SIFT does not naturally handle. Another fact we noticed is that the descriptor distances of VIP inliers are usually smaller than the values we see for the SIFT correspondences. This observation is consistent with the viewpoint invariance of VIPs.


Figure 6. 3D model stitching of two walls.

Figure 7. 3D model stitching with very small overlap.

Figure 8. Histograms of the log of the scale ratios.

Table 1. Hierarchical EHT running details. Note that the proposed method recovers an inlier set of 122 from 2211 initial putative matches.

                              Scene #1 (Fig. 6)              Scene #2 (Fig. 7)
Putative matches              2211                           449
Inliers per stage             1197 → 542 → 122               143 → 45 → 35
Linear refinement             mean: 0.135, median: 0.135     mean: 0.0558, median: 0.0534
Non-linear refinement         mean: 0.0801, median: 0.0752   mean: 0.0136, median: 0.0113
Surface alignment error (m)   mean: 0.806, median: 0.332     mean: 0.392, median: 0.141

7. Summary and Conclusions

The paper addresses the important problem of the alignment of 3D scenes. Our alignment is based on viewpoint-invariant patches (VIP), a novel feature description, which allows scene alignment from a single correspondence. For the matching of VIPs we introduced an exhaustive hypothesis test which exploits the fact that the different components of the similarity transformation can be evaluated independently. Through this hierarchical hypothesis-test scheme we were able to overcome the problems posed by the high amount of uncertainty in the translational alignment of any standard matching. We evaluated the proposed matching using the novel VIPs on a variety of scenes. The presented evaluation shows that the novel features advance 3D alignment techniques for video.

References

[1] A. Akbarzadeh et al. Towards urban 3D reconstruction from video. In Proceedings of 3DPVT, 2006.

[2] A. W. Fitzgibbon and A. Zisserman. Automatic camera recovery for closed or open image sequences. In Proceedings of the European Conference on Computer Vision, pages 311-326, June 1998.

[3] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433-449, May 1999.

[4] L. Liu, I. Stamos, G. Yu, G. Wolberg, and S. Zokai. Multiview geometry for texture mapping 2D images onto 3D range data. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2:2293-2300, 2006.

[5] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[6] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 2005.

[7] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, pages 835-846, 2006.

[8] I. Stamos and M. Leordeanu. Automated feature-based registration of urban scenes of large scale. In Proceedings of Computer Vision and Pattern Recognition, 2003.

[9] J. V. Wyngaerd and L. J. V. Gool. Automatic crude patch registration: Toward automatic 3D model building. Computer Vision and Image Understanding, 87(1-3):8-26, 2002.

[10] J. V. Wyngaerd and L. J. V. Gool. Combining texture and shape for automatic crude patch registration. In 3DIM, pages 179-186, 2003.

[11] J. V. Wyngaerd, L. J. V. Gool, R. Koch, and M. Proesmans. Invariant-based registration of surface patches. In ICCV, pages 301-306, 1999.

[12] W.-Y. Zhao, D. Nistér, and S. C. Hsu. Alignment of continuous video onto 3D point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 2005.
