Bachelor Thesis
A Qualitative Analysis of Two Automated Registration Algorithms in a Real World Scenario Using Point Clouds from the Kinect
Jacob Kjær, s072421
June 6, 2011
1 Introduction
In fall 2010 Microsoft released the Kinect. It is a piece of hardware which incorporates a structured light projector, an infrared camera, a color camera, and various other peripherals such as an array of microphones and an accelerometer meant for adjusting a built-in pivot motor. In this report, the two cameras are put to use.
A structured light camera is able to measure the depth of a given point in an image relative to its viewpoint. Since the x and y position of any point in the image plane is implicitly given, knowing its depth gives us its three-dimensional projective coordinates. On such coordinates one can perform a reverse perspective projection in order to extract the 3-dimensional Cartesian coordinates of the point relative to the viewpoint of the camera. Performing such a transformation on all points in the image plane results in a set of points which we call a point cloud. A point cloud can be reprojected and explored from any viewpoint, independently of how it was generated.
Given that the Kinect also has a normal color camera, this can be used in conjunction with the structured light camera to extract a colored point cloud of a given scene.
1.1 The Problem
Now, having a point cloud of a scene is nice. But exploring it from different viewpoints, one will discover large holes in the objects in the scene caused by occlusion. And what about that which lies behind the camera? A single point cloud captured by the Kinect can tell us nothing more about the scene than the surface which is visible from its viewpoint.
Point cloud registration is the act of matching such point clouds captured from different viewpoints with each other so that they form a more complete representation of a given scene. It can be done by hand, but that is time-consuming and tedious. However, automated solutions have been proposed.
The aim of this project is to analyze a real world scenario in which two implementations of such automatic registration algorithms are used to align point clouds captured by the Kinect. In what situations can the Kinect be used, and how well do the algorithms perform with regard to speed and precision?
2 Retrieving and Processing Data from the Kinect
While the Kinect was initially released as a gaming peripheral for the Xbox 360 console, a set of targeted drivers were released around New Year 2011 as part of the OpenNI project, along with a very useful library for interfacing with the device and a collection of high-level computer vision functions.
OpenNI provides everything needed for extracting a colored point cloud from the Kinect. It can extract corresponding depth and color maps, calibrate them against each other, and perform the reverse perspective projection to Cartesian coordinates as described in the introduction. Calibration of the color and depth map against each other is necessary because the source cameras for these maps are physically offset from each other on a horizontal axis, and because the cameras have different fields of view, and therefore different view frusta. The points on the depth map must be re-projected to the viewpoint of the color map for good correspondence. As written earlier, OpenNI can do that.
OpenNI also abstracts all internal scales for depth and position to metric values. This is useful as it means that the device can be exploited for measuring object sizes and lengths in a scene.
3 Qualitative Analysis of the Input Data
Through OpenNI the Kinect provides both color (24-bit RGB) and depth (format description follows) images in 640×480 resolution at a speed of 30 Hz. This gives a theoretical upper limit of 640 × 480 = 307200 points in a point cloud. In practice, a scene with optimal capturing conditions will result in a cloud of around 265000 points. The color images are about as good as those of a decent webcam, and Bayer noise is noticeable.
3.1 Scanning Surfaces with Structured Light
TODO: The three paragraphs here do not quite fit together.
The Kinect uses its infrared camera along with its structured light projector to estimate the surface of a scene as seen from the camera viewpoint. The projector projects an infrared pattern onto the scene which is captured by the infrared
Figure 1: A Kinect-like setup with a single point projector
camera. The IR camera's viewpoint is offset on a horizontal axis relative to the projector.
To illustrate how a structured light camera works, see figure 1 for a Kinect-like setup in which a projector is projecting just a single point (a very simple pattern) into space along the red line. It is captured by the camera once it hits a surface. There are three planes in the illustration: a reference plane, a plane closer to the camera than the reference plane, and a more distant plane. When the point hits the close plane it will appear further to the right in the image than if it were at the reference plane. Likewise, when the point is projected onto an object on a plane which is more distant than the reference plane, the point will appear further to the left.
The Kinect has an image reference of what the pattern looks like from the camera's viewpoint when all points on the surface are at a certain, known distance. By comparing the horizontal position of a point in the captured image to its corresponding horizontal position in the reference image, a binocular disparity can be extracted, which in turn can be used to calculate the depth of the pixel. The Kinect itself actually does not calculate the depth, but returns a disparity for the host system to handle. While OpenNI abstracts this away for the developer, libfreenect makes these 11-bit values available.
3.2 Precision Across the Depth Map
According to Nicolas Burrus, who was among the first to publish information about the Kinect from his own experiments, the depth z of a point can be calculated in meters from the raw disparity d of the point (as provided by the Kinect hardware) using the following equation:
z = 1.0 / (d · (−0.0030711016) + 3.3309495161)
This equation is peculiar; d is an 11-bit integer which ranges from 0 to 2047. z changes sign from positive to negative when d is around 1084, so any values beyond that are useless for depth measurement. I carried out tests with the Kinect pointing straight at a wall and found that it was unable to reliably measure depth values below 50 cm (figure 2). These facts mean that only values of about 434 to 1084 are actually usable to represent depth values. This is 650 unique disparity values, which is not a lot. Of these, all values up to 759 represent a depth of less than 1.0 m, meaning that half of all useful disparities output by the Kinect are in the range 50 cm to 1 m. And it falls off exponentially; only 16 values in the range of disparities correspond to depths between 4 and 5 meters.
The reason for this is likely that as objects get further away from the viewer, the effect of binocular disparity lessens, and thereby also the precision of the depth measurement, so the engineers behind the hardware probably did not see much benefit in leaving any useful depth information in these ranges. So not only will objects which are far away from the viewer consist of fewer points because they take up a smaller area of the picture, but they will also be more coarse. The lesson here is that for any object more distant than 2.5-3.0 m, one cannot expect a faithful representation of its surface, and that any delicate details must be captured in a range of 60 cm to 1.5 m. Whether the hardware actually is precise at higher ranges is irrelevant, because the format in which it returns its measurements is of too low fidelity.
The gym ball in figure 3 was shot at two distances, approximately 2.5 m and 80 cm. The difference in precision is very noticeable - at a distance the ball is hardly discernible, while at close range it is round and smooth.
3.3 Artefacts from Structured Light
Because the camera and projector are offset by design, not all points which are visible from the camera can always be illuminated by the projector. As a result, holes may appear in the depth map where the projected pattern has been occluded (see figure 4).
In general, infrared structured light will produce many errors in environments consisting of anything but diffuse surfaces, or in environments with other sources of infrared light. An example is the light of the sun, which created a hole in the test object (a gym ball) in figure 4 due to too much light being reflected back.
3.4 Inconsistencies Across Multiple Clouds
TODO: Describe how changes in lighting and viewpoint can have a very dramatic effect on how the scene looks from the camera.
Reference to figure 6.
3.5 OpenNI/Kinect Stock Calibrations
For the purpose of capturing point clouds, and computer vision in general, cameras are easiest to work with when they capture pictures in the way that an
Figure 2: Empirical test of the minimum supported depth of the hardware. Left: At over 50 cm, the wall is captured by the Kinect (as shown on the screen). Right: At less than 50 cm, it is not. While these distances are approximate, one should operate the Kinect well beyond them for the best results.
Figure 3: Depth-map fidelity at far (left) and near (right) distances for the same round object. Point clouds shown from the side.
Figure 4: Example of projector shadowing and local IR overexposure caused by sunlight.
Figure 5: Edge of a point cloud warping towards the viewer. This is supposed to be a straight wall. The green line was added post-capture to give an indication of where the edges of the wall should be.
Figure 6: Two clouds taken from slightly different positions with a very short delay, and then registered onto each other. Notice how the brightness of the door and walls has changed noticeably just from a very minor adjustment.
idealized pinhole camera would. Without going into too many formal details, it means that the center of the picture corresponds to what is directly straight in front of the camera, that all straight lines are perfectly straight, and that there are no other lens-caused distortions (such as fish-eye effects). Most 3D computer games use this camera model.
The ROS team empirically measured how much their Kinect cameras diverged from the pinhole camera model and found the difference to be very small. This is good.
Inspecting the point clouds provided by the Kinect through OpenNI at close range, OpenNI's calibration of the depth and color map is very decent out of the box. For distant objects, a near-perfect calibration is hard to acquire. Part of this can be attributed to the lowered precision of the depth measurement at longer distances, and is therefore hard to do anything about client-side.
Finally, the corners of any point cloud will show signs of a slight warping (see figure 5). This cannot be solved with calibration, as even hand-calibrated clouds have this issue, so it is more likely due to a flaw in the IR projector or IR camera lens causing the edges to be distorted. As the IR data is processed inside the Kinect hardware, correcting these distortions properly would require a large effort.
For the software that goes with this project, it was decided that the quality of the stock OpenNI calibration is not likely to be a major source of error, and that it therefore is good enough.
4 The ICP Registration Algorithm
The ICP algorithm is a very common registration algorithm devised in the early 90's. It goes as follows, where C_m denotes a point cloud which is to be registered and moved onto another point cloud with a static position, C_s:
1. For all points p_m in C_m, find the closest neighbouring point p_s in C_s.
2. Find the rigid transformation, an orthogonal rotation R and a translation t, which minimizes the squared distances between the neighbouring pairs (enumerated with i):
   min_{R,t} ∑_i ||(R p_mi + t) − p_si||²
3. Apply the transformation to C_m.
4. Repeat until the algorithm has converged.
The translation part t of the rigid transformation is defined as the vector which will translate the center of mass of the pair-matched points in C_m to the center of mass of the corresponding points in C_s:
t = (1/n) ∑_{i=1}^{n} (p_si − p_mi)
Where n denotes the number of matched pairs found in step 1 of the algorithm, and i is used to enumerate these point pair entries, so that (p_mi, p_si) denotes the i'th pair. Notice that some points of C_s may be the closest neighbour for several points in C_m, meaning that they will be weighted more in the center of mass calculations. This is intentional.
Finding R is described in [GELbook]. The following description of the procedure is an expanded paraphrase of their instructions:
It can be shown that R can be found via singular value decomposition of the following matrix:
H = ∑_{i=1}^{n} p_mi p_si^T − (1/n) (∑_{i=1}^{n} p_mi) (∑_{i=1}^{n} p_si)^T
TODO: Write about multidimensional ICP.
5 The Feature-based Registration Algorithm
The feature-based registration algorithm is similar to ICP in that it searches for a set of matched pairs in the moving and the stationary point cloud, and then finds the rigid transformation which minimizes the sum of squared distances between these pairs. However, the algorithm is only run once, and the method of finding matching pairs of points in the two clouds is very different.
It goes as follows:
1. Detect invariant features in the color images from which C_m and C_s obtain their color information.
2. Match the two sets of features detected in the source images to each other.
3. Extract a set of matched pairs in 3D space by corresponding the location of each feature with the point in the point cloud which it maps to.
4. Find the rigid transformation which minimizes the squared distance between the matched pairs. One can use the same procedure for this as described for ICP.
Detection and matching of features are big problems in themselves, but the freely available library OpenCV offers ready-made solutions which can be used for that.
6 Implementation
The two registration algorithms were implemented in C++ along with a user interface for the purpose of testing their performance.
TODO: Describe how ICP is implemented (dimensions, templates, KD-Tree from GEL), how feature-based registration is implemented (OpenCV, dynamic detectors), how point clouds are stored and processed, how race conditions have been avoided, and how the UI works.
7 Precision and Performance
8 Conclusion
9 References