
Study and Implementation
of Stereo Vision Systems
for Robotic Applications

Lazaros Nalpantidis

Thesis submitted for the degree of Doctor of Philosophy

Department of Production and Management Engineering
Democritus University of Thrace, Greece

Xanthi, September 2010


Title: Study and Implementation of Stereo Vision Systems for Robotic Applications

Author: Lazaros Nalpantidis

Thesis submitted for the degree of Doctor of Philosophy
to the
Production and Management Engineering Department
Democritus University of Thrace, Greece

Advising Committee:
Chairman: Assistant Professor Antonios Gasteratos, P.&M.E. Dept., DUTH
Member: Professor Vassilios Tourassis, P.&M.E. Dept., DUTH
Member: Associate Professor Dimitrios Koulouriotis, P.&M.E. Dept., DUTH

Xanthi, Greece
September 2010


Dedicated to my wonderful parents
Georgios and Sofia,
and to my own "Penelope"
Efi.


Summary - Contribution to the State of the Art

Stereo vision has been chosen by natural selection as the most common way to estimate the depth of objects: a pair of two-dimensional images suffices to retrieve the third dimension of the observed scene. The method is of great importance not only to living creatures but to sophisticated machine systems as well. During the last years robotics has made significant progress, and the state of the art now aims at achieving autonomous behaviors. For robots to be able to move and act autonomously, accurate representations of their environments are required. Both of these fields, stereo vision and autonomous robotic behaviors, lie at the center of this PhD thesis. The issue of robots using machine stereo vision is not a new one. The number and significance of the researchers involved, as well as the publishing rate of relevant scientific papers, indicate an issue that is interesting and still open to solutions and fresh ideas rather than a banal, solved one.

The motivation for this PhD thesis has been the observation that stereo vision and autonomous robots are usually combined in a simplistic manner, as two independent technologies used simultaneously. This situation is owed to the fact that the two technologies have evolved independently and within different scientific communities: stereo vision has mainly evolved within the field of computer vision, whereas autonomous robots are a branch of the robotics and mechatronics field. Methods proposed within the frame of computer vision are generally not satisfactory for robotic applications, since an autonomous robot places strict constraints on the required computation speed and the available computational resources. Moreover, their inefficiency is commonly owed to factors related to the environments and the conditions of operation. As a result, the algorithms used, in this case the stereo vision algorithms, should take these factors into consideration during their development, and the required compromises have to retain the functionality of the integrated system.

The objective of this PhD thesis is the development of stereo vision systems customized for use in autonomous robots. Initially, a literature survey was conducted concerning stereo vision algorithms and corresponding robotic applications. The survey revealed the state of the art in the specific field and pointed out issues that had not yet been answered in a satisfactory manner. Afterwards, novel stereo vision algorithms were developed, which satisfy the demands posed by robotic systems and propose solutions to the open issues indicated by the literature survey. Finally, systems have been developed that embody the proposed algorithms and treat open issues of robotic applications.

Within this dissertation, various computational tools and ideas originating from different scientific fields have been used for the first time and combined in a novel way. Biologically and psychologically inspired methods have been employed, such as the logarithmic response law (Weber-Fechner law) and the gestalt laws of perceptual organization (proximity, similarity and continuity). Furthermore, sophisticated computational methods, such as 2D and 3D cellular automata and fuzzy inference systems, have been used for computer vision applications. Additionally, ideas from the field of video coding have been incorporated into stereo vision applications. The resulting methods have been applied to basic computer vision depth extraction tasks and even to advanced autonomous robotic behaviors.

In more detail, the possibility of implementing effective, hardware-implementable stereo correspondence algorithms has been investigated. Specifically, an algorithm is presented that combines rapid execution and a simple, straightforward structure with high-quality results. These features render it an ideal candidate for hardware implementation and for real-time applications. The algorithm utilizes Gaussian aggregation weights and 3D cellular automata in order to achieve high-quality results. This algorithm formed the basis of a multi-view stereo vision system, whose final depth map is produced by a certainty assessment procedure. Moreover, a new hierarchical correspondence algorithm is presented, inspired by motion estimation techniques originally used in video encoding. The algorithm performs a 2D correspondence search using a similar hierarchical search pattern, and the intermediate results are refined by 3D cellular automata. This algorithm can process uncalibrated and non-rectified stereo image pairs while maintaining the computational load within reasonable levels. It is well known that non-ideal environmental conditions, such as differences in illumination depending on the viewpoint, heavily affect the performance of stereo algorithms. In this PhD thesis a new illumination-invariant pixel dissimilarity measure is presented that can substitute for the established intensity-based ones. The proposed measure can be adopted by almost any existing stereo algorithm, enhancing it with robustness. The algorithm using the proposed dissimilarity measure outperformed all the other examined algorithms, exhibiting tolerance to illumination differences and robust behavior. Moreover, a novel stereo correspondence algorithm is presented that incorporates many biologically and psychologically inspired features into an adaptive weighted sum of absolute differences framework. In addition to ideas already exploited, such as the utilization of color information and the gestalt laws of proximity and similarity, new ones have been adopted: the algorithm introduces the use of circular support regions, the gestalt law of continuity, and the psychophysically-based logarithmic response law. All the aforementioned perceptual tools act complementarily inside a straightforward computational algorithm.
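To make the flavor of such aggregation-based correspondence concrete, the following Python sketch computes a disparity map from a rectified grayscale pair using a sum of absolute differences cost aggregated with a fixed 2D Gaussian mask, followed by winner-takes-all selection. It is a minimal illustration of the aggregation principle only, not the thesis implementation; all function names and parameter values are illustrative.

```python
import numpy as np

def gaussian_mask(radius, sigma):
    """2D Gaussian mask used as fixed aggregation weights."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def weighted_sad_disparity(left, right, max_disp=16, radius=3, sigma=1.5):
    """Winner-takes-all disparity from a Gaussian-weighted SAD cost.

    left, right: rectified grayscale images as 2D float arrays.
    Returns a left-image disparity map (borders remain 0).
    """
    h, w = left.shape
    weights = gaussian_mask(radius, sigma)
    cost = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        # Raw matching cost: |left(y, x) - right(y, x - d)|
        ad = np.abs(left[:, d:] - right[:, :w - d])
        for y in range(radius, h - radius):
            for x in range(radius, ad.shape[1] - radius):
                window = ad[y - radius:y + radius + 1, x - radius:x + radius + 1]
                # Aggregate the cost over the support window, weighted by the mask
                cost[y, x + d, d] = np.sum(weights * window)
    # Winner-takes-all disparity selection
    return np.argmin(cost, axis=2)
```

In the algorithms presented in the thesis, the intermediate cost volume is additionally refined (e.g., by 3D cellular automata) before the final selection; a real-time implementation would also replace the explicit loops with separable filtering or a hardware pipeline.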

Furthermore, stereo correspondence algorithms have been exploited as the basis of more advanced robotic behaviors. Vision-based obstacle avoidance algorithms for autonomous mobile robots are presented. These algorithms avoid, as much as possible, computationally complex processes, and the only sensor required is a stereo camera. The algorithms consist of two building blocks. The first one is a stereo algorithm able to provide reliable depth maps of the scenery at frame rates suitable for a robot to move autonomously. The second building block is either a simple decision-making algorithm or a fuzzy logic-based one, which analyzes the depth maps and deduces the most appropriate direction for the robot to avoid any existing obstacles. Finally, a visual Simultaneous Localization and Mapping (SLAM) algorithm suitable for indoor applications is proposed. The algorithm focuses on computational effectiveness, and the only sensor used is a stereo camera placed onboard a moving robot. The algorithm processes the acquired images, calculating the depth of the scenery, detecting occupied areas and progressively building a map of the environment. The stereo vision-based SLAM algorithm embodies a custom-tailored stereo correspondence algorithm, the robust scale and rotation invariant "Speeded Up Robust Features" (SURF) feature detection and matching method, a computationally effective v-disparity image calculation scheme, a novel map-merging module, as well as a sophisticated cellular automata-based enhancement stage.
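A minimal sketch of the simple (threshold-based) decision-making block may clarify its role: the disparity map is divided into three vertical windows, and the least obstructed direction is chosen, falling back to a backward move when every window is blocked. The window layout, threshold values and function name below are assumptions made for illustration, not the parameters used in the thesis.

```python
import numpy as np

def avoidance_direction(disparity, obstacle_disp=40, clear_ratio=0.1):
    """Threshold-based steering decision from a disparity map.

    High disparity means a nearby object. Each of the three windows is
    considered blocked if the fraction of near pixels exceeds clear_ratio.
    (Both thresholds are illustrative assumptions.)
    """
    h, w = disparity.shape
    windows = {
        "left": disparity[:, : w // 3],
        "forward": disparity[:, w // 3 : 2 * w // 3],
        "right": disparity[:, 2 * w // 3 :],
    }
    # Fraction of pixels in each window closer than the obstacle threshold
    near = {name: np.mean(win > obstacle_disp) for name, win in windows.items()}
    if near["forward"] < clear_ratio:
        return "forward"                            # path ahead is clear
    open_dirs = {k: v for k, v in near.items() if v < clear_ratio}
    if open_dirs:
        return min(open_dirs, key=open_dirs.get)    # least obstructed side
    return "backward"                               # everything blocked: retreat
```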



Preface

"It was as if God had decided to put to the test every capacity for surprise and was keeping the inhabitants of Macondo in a permanent alternation between excitement and disappointment, doubt and revelation, to such an extreme that no one knew for certain where the limits of reality lay."

Gabriel Garcia Marquez
"One Hundred Years of Solitude"

This quote from Gabriel Garcia Marquez could serve as a brief summary of the impressions left by this (and maybe many other) PhD theses. I could not have had the slightest idea of how expressive these few lines would sound to me when I first read them. The alternation of emotions, from sunny moments of private "glory" and excitement to dark moments of frustration and doubt, is the predominant impression left by a long road of effort. Thankfully, much less than "one hundred years of solitude" was required for this work to be concluded! Testing my strength and persistence was the personal reward from this work, which I keep for myself. Any scientific knowledge gained, I share with the readers of this thesis.

The subject of this thesis is stereo vision systems for robotic applications. Although I had some experience in designing analog imaging sensors, it was my advisor who encouraged me to get involved with this specific topic. The importance of vision systems is indisputable in fields such as image processing, computer vision and robotics. After all, this thesis is being published shortly after the latest (2009) Nobel prize in Physics was awarded to "the masters of light", one half of it to Willard S. Boyle and George E. Smith "for the invention of an imaging semiconductor circuit - the CCD sensor". Using and exploiting the possibilities of such sensors forms the basis of the fields this thesis deals with. Even if we are too small to achieve the glory of such giants, we can always stand on their shoulders and present our own efforts.

The workflow of these years is to a large extent mirrored in the structure of this thesis, which is organized in five chapters. The first chapter is introductory: it presents basic concepts that are used later in the thesis and provides a first contact with the issue of stereo vision. The second chapter contains a literature survey covering stereo correspondence algorithms, their hardware implementations and robotic applications of stereo vision; it also presents some open issues of robotics-oriented stereo vision, as they emerged from the literature analysis. The third chapter presents the novel stereo correspondence algorithms that were developed within this dissertation, as well as experimental and comparative results. The fourth chapter presents robotic applications of stereo vision systems and the corresponding experimental results. Finally, the fifth chapter presents the conclusions reached during the course of this dissertation, discusses the results and describes further future directions of this work.

The completion of this thesis would not have been feasible without the help and support of many others. First of all, I would like to thank my advisor and chair of my advising committee, Assistant Professor Antonios Gasteratos. Antonios has been both an academic and a personal tutor. Our relationship, which started with my doctoral studies, has grown into one of respect, trust and, finally, friendship. Antonios has encouraged and guided my endeavors. His support, both scientific and moral, was influential and motivational for me. Antonios trusted me to be his colleague in various funded research projects for which he was the scientific responsible. More specifically, he enrolled me in the following projects, all funded by the European Commission: "Vision and Chemiresistor Equipped Web-connected Finding Robots (View-Finder), FP6-IST-2006-045541"; "Autonomous Collaborative Robots to Swing and Work in Everyday EnviRonment (ACROBOTER), FP6-IST-2006-045530"; and "Innovative Novel First Responders Applications (INFRA), FP7-ICT-SEC-2007-1-225272".

I would also like to express my gratitude to the other two members of my advising committee, Vice-Rector Professor Vassilios Tourassis and Associate Professor Dimitrios Koulouriotis, for their interest in the progress of my efforts and their help whenever an issue concerning my doctoral studies arose.

A special position among the people I have met and worked with since my arrival in Xanthi belongs to Assistant Professor Georgios Ch. Sirakoulis. Georgios has been truly supportive since my first steps and gave me his insight whenever I asked for it. I had the chance to work with him, and I have learned many things from him at a scientific, academic and moral level. His deep scientific knowledge and his kind and polite manner during our frequent conversations have been rather influential for me.

I also owe many thanks to the people I worked with in the same laboratory during my years in Xanthi. Dimitrios Chrysostomou, my closest and always helpful friend in Xanthi, Rigas Kouskouridas, Nikolaos Kyriakoulis and I shared the same worries and dreams and became valuable colleagues and close friends. The same holds for the younger members of our laboratory, my good friends Vasileios Belagiannis and Ioannis Kostavelis.

I kept this last place for my family. They were always there for me, in a quiet and gentle manner. Their support and love have played the most important role for me. A lot of changes have happened during these years, good and bad emotional situations, each one having something to teach. I owe much respect and gratitude to the memory of my uncle Eleftherios, whose encouragement, trust and enlightened advice have motivated and guided me towards where I stand right now. I would also like to thank my other half, my beloved fiancee Efi, whose love, understanding and support have always been crucial for me. These last words I dedicate to my wonderful parents, whom I love and respect endlessly: Georgios and Sofia.

Xanthi, September 2010
Lazaros Nalpantidis


2.4.3 Uncalibrated Stereo Images
2.4.4 Non-ideal Lighting Conditions
2.4.5 Biologically Inspired Methods

3 Stereo Correspondence Algorithms
3.1 Stereo Correspondence Algorithm with Enhanced Disparity Selection
3.1.1 Algorithm Description
3.1.2 Experimental Results
3.1.3 Discussion
3.2 Quad-view Stereo Correspondence Algorithm
3.2.1 Algorithm Description
3.2.2 Experimental Results and Applications
3.2.3 Discussion
3.3 Hierarchical Stereo Correspondence Algorithm for Uncalibrated Images
3.3.1 Algorithm Description
3.3.2 Experimental Results
3.3.3 Discussion
3.4 Biologically and Psychophysically Inspired Stereo Correspondence Algorithm
3.4.1 Novel Concepts
3.4.2 Algorithm Description
3.4.3 Experimental Results
3.4.4 Discussion
3.5 Illumination-Invariant Dissimilarity Measure and Stereo Correspondence Algorithm
3.5.1 Description of Illumination-Invariant Dissimilarity Measure
3.5.2 Algorithm Description
3.5.3 Experimental Results
3.5.4 Discussion

4 Robotic Applications of Stereo Vision
4.1 Stereo Vision-based Obstacle Avoidance Algorithm
4.1.1 Threshold Algorithm Description
4.1.2 Experimental Results for the Threshold Algorithm
4.1.3 Fuzzy Algorithm Description
4.1.4 Experimental Results for the Fuzzy Algorithm
4.1.5 Discussion
4.2 Stereo Vision-based SLAM
4.2.1 Stereo Vision Algorithm
4.2.2 Camera's Motion Estimation
4.2.3 Local Map Generation
4.2.4 Global Map Generation
4.2.5 Experimental Results
4.2.6 Discussion

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

References
Abbreviations
Thesis Publications


List of Figures

1.1 RGB colorspace
1.2 HSL colorspace
1.3 HSV colorspace
1.4 CIELab colorspace
1.5 Views of a 3D volume and of a 3D neighborhood defined in it
1.6 Images whose parts are differentiated from the wholesome pattern, explainable through Gestalt theory
1.7 Rectification of a stereo pair. The two images Il, Ir of an object d (a) are replaced by two pictures Il,rect, Ir,rect that lie on a common plane P (b)
1.8 Geometry of epipolar lines, where C1 and C2 are the left and right camera lens centers, respectively. (a) Point P1 in one image plane may have arisen from any of the points in the line C1P1, and may appear in the alternate image plane at any point on the epipolar line E2; (b) In the case of non-rectified images, point P1 may have arisen from any of the points inside the block B
1.9 Human eye's (left) and a typical camera's (right) color sensitivities
1.10 DSI containing matching costs for every pixel of the image and for all its potential disparity values
1.11 General structure of a stereo correspondence algorithm
2.1 Categorization of stereo vision algorithms
2.2 Left image of the stereo pair (left) and ground truth (right) for the Tsukuba (a), Sawtooth (b), Map (c), Venus (d), Cones (e) and Teddy (f) stereo pairs
2.3 Diagrammatic representation of the local methods' categorization
2.4 Diagrammatic representation of the global methods' categorization
2.5 An ASIC chip (a) and an FPGA development board (b)
2.6 Generalized block diagram of a hardware implementable stereo correspondence algorithm
2.7 Mobile robots in real environments
2.8 A real-life stereo pair suffering from different illumination
3.1 Block diagram of the presented stereo correspondence algorithm
3.2 2D Gaussian mask producing the weight for the pixel summation
3.3 Results for the Middlebury data sets. From left to right: the Tsukuba, Venus, Teddy and Cones images. From top to bottom: the reference (left) images (a), the provided ground truth disparity maps (b), the disparity maps calculated by the presented method (c), maps of signed disparity error (d), and maps of pixels with absolute computed disparity error bigger than 1 (e)
3.4 Self-recorded scenes. (a) outdoor scene, (b) indoor scene. From left to right: left image, right image, calculated disparity map
3.5 (a) The quad-camera configuration and (b) the results (up-left) and scene capturing (right) using the quad-camera configuration
3.6 Algorithm's steps and results for the Tsukuba data set. (column 1) the reference image (up-left), (column 2) the three target images (up-right, down-left, down-right), (column 3) the certainty maps for the horizontal, vertical and diagonal pairs, (column 4) the computed disparity map for each stereo pair, (column 5) the fused (top) and the ground truth (bottom) disparity maps
3.7 Results of the presented fusion system (left), the computationally equivalent simple stereo algorithm (middle) and the preliminary simple stereo algorithm applied on the horizontal image pair (right). From top to bottom: the computed disparity maps, pixels with absolute computed disparity error bigger than 1 and maps of signed disparity error
3.8 Algorithm's steps and results for a synthetic room scene. (column 1) The reference image (up-left), (column 2) the three target images (up-right, down-left, down-right), (column 3) the certainty maps for the horizontal, vertical and diagonal pairs, (column 4) the computed disparity map for each stereo pair, (column 5) the final fused depth map
3.9 Application results obtained using the calculated depth maps. (a) View of the reconstructed Tsukuba scene and (b) obstacle detection in the virtual room scene
3.10 Quadruple, double and single pixel sample matching algorithm
3.11 General scheme of the presented hierarchical matching disparity algorithm. The search block is enlarged for viewing purposes
3.12 Block diagram of the hierarchical disparity search algorithm
3.13 (a), (b) The uncalibrated, diagonally captured input images and the resulting disparity maps of the presented algorithm for (c) the quadruple, (d) double and (e) single pixel estimation respectively. The results of (f) Ogale & Aloimonos (2007) and (g) Yoon & Kweon (2006a) for the same input images
3.14 From left to right: the left and right 10% distorted input images and the calculated final disparity map for the (from top to bottom:) Tsukuba, Venus, Teddy and Cones image sets respectively
3.15 (from left to right:) The left and right distorted input images and the calculated final disparity maps for various percentages of the induced lens distortion
3.16 The NMSE for the Tsukuba image pair for various distortion percentages
3.17 (a), (b) The self-captured input images of an alley, and the resulting disparity maps for (c) the quadruple, (d) double and (e) single pixel estimation respectively
3.18 (a), (b) The self-captured input images of a building, and the resulting disparity maps for (c) the quadruple, (d) double and (e) single pixel estimation respectively
3.19 (a), (b) The self-captured input images of a corner, and the resulting disparity maps for (c) the quadruple, (d) double and (e) single pixel estimation respectively
3.20 Perceived intensity response according to the Weber-Fechner law
3.21 Block diagram of the algorithm's structure
3.22 Results for the Middlebury data sets. From left to right: the Tsukuba, Venus, Teddy and Cones images. From top to bottom: the reference (left) images, the provided ground truth disparity maps, the disparity maps calculated by the presented method, maps of signed disparity error and maps of pixels with absolute computed disparity error bigger than 1
3.23 Results for new data sets. From top to bottom: Aloe, Baby3, Bowling2, Cloth1, Cloth3, Cloth4 and Flowerpots. From left to right: the reference (left) image of the stereo pair, the provided ground truth, the disparity map computed by the presented algorithm and error map
3.24 Views of the HSL color space representation
3.25 Block diagram of the utilized stereo correspondence algorithm
3.26 Results for the Middlebury data sets. From top to bottom: the Tsukuba, Venus, Teddy and Cones image sets. From left to right: the reference (left) input images, the right input images, the disparity maps calculated by the presented LCDM-based method, maps of pixels with absolute computed disparity error bigger than 1 and maps of signed disparity error
3.27 Left input images (a), right input images with altered luminosity (b) and calculated disparity maps for the presented (c), its RGB-based AD version (d), the ZNCC stereo (e) and the Ogale-Aloimonos (f) algorithms for various lightness conditions
3.28 Percentage of erroneously calculated pixels for the presented, its RGB-based AD version, the ZNCC and the Ogale-Aloimonos stereo algorithms for various lightness conditions
3.29 From left to right: left input images with constant luminosity, right input images with luminosity grading from -50% to +50% along the horizontal direction, and calculated disparity maps for the standard image sets using the presented LCDM (c), the AD-based variant algorithm (d), the AD-based variant algorithm with histogram equalization (e), the AD-based variant algorithm with Retinex enhancement (f), the AD-based variant algorithm with enhanced pictures according to Vonikakis et al. (2008) (g)
3.30 Percentage of pixels whose absolute disparity error is greater than 1 for standard image sets calculated using the presented LCDM, the AD-based variant algorithm, the AD-based variant algorithm with histogram equalization, the AD-based variant algorithm with Retinex enhancement, the AD-based variant algorithm with enhanced pictures according to Vonikakis et al. (2008)
3.31 From left to right: left input images with constant luminosity, right input images with luminosity grading from -50% to +50% along the horizontal direction, and calculated disparity maps for the standard image sets using the presented LCDM (c), the ZNCC algorithm (d), the ZNCC algorithm with histogram equalization (e), the ZNCC algorithm with Retinex enhancement (f), the ZNCC algorithm with enhanced pictures according to Vonikakis et al. (2008) (g)
3.32 Percentage of pixels whose absolute disparity error is greater than 1 for standard image sets calculated using the presented LCDM, the ZNCC algorithm, the ZNCC algorithm with histogram equalization, the ZNCC algorithm with Retinex enhancement, the ZNCC algorithm with enhanced pictures according to Vonikakis et al. (2008)
3.33 From left to right: left input images with constant luminosity, right input images with luminosity grading from -50% to +50% along the horizontal direction, and calculated disparity maps for the standard image sets using the presented LCDM (c), the Ogale-Aloimonos algorithm (d), the Ogale-Aloimonos algorithm with histogram equalization (e), the Ogale-Aloimonos algorithm with Retinex enhancement (f), the Ogale-Aloimonos algorithm with enhanced pictures according to Vonikakis et al. (2008) (g)
3.34 Percentage of pixels whose absolute disparity error is greater than 1 for standard image sets calculated using the presented LCDM, the Ogale-Aloimonos algorithm, the Ogale-Aloimonos algorithm with histogram equalization, the Ogale-Aloimonos algorithm with Retinex enhancement, the Ogale-Aloimonos algorithm with enhanced pictures according to Vonikakis et al. (2008)
3.35 Various self-recorded outdoor input image pairs and the resulting disparity maps. From left to right: the left and right input images and the disparity maps calculated with: the presented LCDM-based algorithm, the RGB AD-based algorithm applied on the raw images, the RGB AD-based algorithm applied on the histogram equalized images, the RGB AD-based algorithm applied on the Retinex enhanced images, the RGB AD-based algorithm applied on the images enhanced according to Vonikakis et al. (2008)
4.1 Flow chart of the implemented threshold-based obstacle avoidance algorithm
4.2 Image enhancement steps of the presented stereo algorithm
4.3 Depth map's division in three windows
4.4 A sample outdoor route and the algorithm's outputs
4.5 Percentage of the algorithm's correct decisions
4.6 Percentage of certainty for the algorithm's decisions
4.7 (a) Stereo camera equipped mobile robotic platform and (b) floor plan of the robot's environment
4.8 Fuzzy membership functions
4.9 Test images and disparity maps where the algorithm chose to move forward
4.10 Test images and disparity maps where the algorithm chose to move left
4.11 Test images and disparity maps where the algorithm chose to move right
4.12 Test images and disparity map where the algorithm chose to move backwards
4.13 Outline of the presented SLAM algorithm
4.14 Reference image (a) of an indoor scene and sparse disparity map (b) obtained with the presented stereo correspondence algorithm
4.15 Depth vs. camera's motion estimation
4.16 Environment's maps for the scene of Figure 4.14 obtained with the presented algorithm
4.17 V-disparity images for the image of Figure 4.14(a) and the corresponding disparity map of Figure 4.14(b)
4.18 Features detected and matched using SURF for various consecutive images of the used dataset
4.19 Experimental results after processing 1 (first row), 2 (second row), 6 (third row), and 10 (fourth row) image pairs of the scene


List of Tables

2.1 Characteristics of the most common stereo image sets
2.2 Characteristics of local algorithms
2.3 Characteristics of global algorithms that use global optimization
2.4 Characteristics of global algorithms that use DP
2.5 Characteristics of the algorithms that cannot be clearly assigned to any category
2.6 Characteristics of the algorithms that produce sparse output
2.7 FPGA implementations' characteristics
3.1 Percentage of pixels whose absolute disparity error is greater than 1 in various regions of the images
3.2 Calculated NMSE for various versions of the presented algorithm
3.3 Percentage of pixels whose absolute disparity error is greater than 1 in various regions for the Tsukuba pairs
3.4 Calculated NMSE for the presented algorithm for various pairs with constant distortion 10%
3.5 Calculated NMSE for the presented algorithm for the Tsukuba pair with various distortion percentages
3.6 Variation of the presented algorithm's results for the Tsukuba image set when excluding one of the new concepts
3.7 Evaluation of various ASW and local algorithms
3.8 Percentage of pixels whose absolute disparity error is greater than 1 for standard image sets using the presented LCDM-based algorithm
4.1 Results for the cases where the algorithm chose to move forward
4.2 Results for the cases where the algorithm chose to move left
4.3 Results for the cases where the algorithm chose to move right
4.4 Results for the cases where the algorithm chose to move backwards


Chapter 1

Introduction

Vision begins with light being captured by a sensor, i.e. the eye for living creatures or the camera sensor for machines. Stereo vision involves the simultaneous use of two such sensors and leads to the perception of depth. This fact was first realized about two centuries ago by Sir Charles Wheatstone, who stated: "...the mind perceives an object of three dimensions by means of the two dissimilar pictures projected by it on the two retinae..." (Wheatstone 1838). The successful operation of stereo vision systems relies on the contribution of various basic concepts. This chapter serves as an introduction to the most essential of these concepts and to the mechanisms of robotic stereo vision.

1.1 Basic Concepts

Stereo vision algorithms involve the use of more basic tools in order to provide accurate depth results. Many of those tools model aspects of the operation of the Human Visual System (HVS). The most essential of them concern color representation and perceptual processing.

1.1.1 Color Models

Contemporary cameras are able to provide reliable color information about the depicted objects and scenes. Color models, also referred to as colorspaces, provide coordinate systems that allow each color to be specified as a point within the coordinate system in a standard manner (Gonzalez & Woods 1992).

Color cameras commonly use a combination of red-, green-, and blue-sensitive photoreceptors in order to capture color information in each of their pixels, in accordance with the way the HVS performs the same task. As a result, the directly available color information is coded in the popular RGB colorspace. However, many machine vision applications require different colorspaces in order to facilitate the demanded operations. Consequently, HSL, HSV, CIELab as well as numerous other color models have been developed and are being used.


The RGB color model

The Red-Green-Blue (RGB) color model considers Red, Green and Blue to be the primary spectral components. The colorspace is represented as a cube on a Cartesian coordinate system, as shown in Figure 1.1. Each color is represented by a triad of values (r,g,b) or, equivalently, as a vector extending from the origin of the coordinate system. The normalized value of each variable lies within the range [0,1]. Point (0,0,0) represents black and (1,1,1) white. Thus, in the RGB color model every other color arises as a mixture of the three primary colors.

Fig. 1.1 RGB colorspace

The HSL color model

The representation of the Hue-Saturation-Luminosity/Lightness (HSL) color model, often also called Hue-Luminosity/Lightness-Saturation (HLS), is a double cone, as shown in Figure 1.2. In this colorspace H stands for hue and determines the human impression of which color (red, green, blue, etc.) is depicted. Each color is represented by an angular value ranging between 0 and 360 degrees (0 being red, 120 green and 240 blue). S stands for saturation and determines how vivid or gray the particular color seems. Its value ranges from 0 for gray to 1 for fully saturated (pure) colors. The L channel of the HSL colorspace stands for the luminosity and determines the intensity of a specific color. It ranges from 0 for completely dark colors (black) to 1 for fully illuminated colors (white).

The transition from the RGB colorspace, which is the usual output of contemporary cameras, to HSL is straightforward and does not involve any complicated mathematical computations (Gonzalez & Woods 1992), as shown in Eqs. 1.1-1.3 for M = max(R, G, B) and m = min(R, G, B):

$$H = \begin{cases} 0 & \text{if } M = m,\\ \left(60^{\circ} \times \dfrac{G-B}{M-m} + 360^{\circ}\right) \bmod 360^{\circ} & \text{if } M = R,\\ 60^{\circ} \times \dfrac{B-R}{M-m} + 120^{\circ} & \text{if } M = G,\\ 60^{\circ} \times \dfrac{R-G}{M-m} + 240^{\circ} & \text{if } M = B. \end{cases} \tag{1.1}$$

$$L = \frac{M+m}{2} \tag{1.2}$$
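In code, the conversion is equally direct. The following Python sketch implements Eqs. 1.1 and 1.2 for a single normalized RGB triplet; since the saturation equation (Eq. 1.3) falls on a page not reproduced here, the standard HSL saturation definition is assumed in its place.

```python
def rgb_to_hsl(r, g, b):
    """Convert normalized RGB (each in [0, 1]) to HSL per Eqs. 1.1-1.2.

    The saturation formula is the standard one; the thesis' Eq. 1.3
    is not reproduced above, so it is assumed here.
    """
    M, m = max(r, g, b), min(r, g, b)
    # Hue (Eq. 1.1): angle on the color circle, in degrees
    if M == m:
        h = 0.0                # achromatic: hue is undefined, set to 0
    elif M == r:
        h = (60.0 * (g - b) / (M - m) + 360.0) % 360.0
    elif M == g:
        h = 60.0 * (b - r) / (M - m) + 120.0
    else:  # M == b
        h = 60.0 * (r - g) / (M - m) + 240.0
    # Lightness (Eq. 1.2): mid-range of the extrema
    l = (M + m) / 2.0
    # Saturation (standard definition, assumed for Eq. 1.3)
    if M == m:
        s = 0.0
    elif l <= 0.5:
        s = (M - m) / (M + m)
    else:
        s = (M - m) / (2.0 - M - m)
    return h, s, l

# Example: pure red maps to hue 0, full saturation, mid lightness
assert rgb_to_hsl(1.0, 0.0, 0.0) == (0.0, 1.0, 0.5)
```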


Fig. 1.3 HSV colorspace

On the other hand, the value of b* represents how blue or yellow the color is. Negative values of b* indicate blue, while positive values of b* indicate yellow.

Fig. 1.4 CIELab colorspace

1.1.2 Cellular Automata

Cellular Automata (CA) were first introduced by von Neumann, who was thinking of imitating the behavior of a human brain in order to build a machine able to solve very complex problems (Von Neumann 1966). His ambitious project was to show that complex phenomena can, in principle, be reduced to the dynamics of many identical, very simple primitives, capable of interacting and maintaining their identity (Chopard & Droz 1998). Following a suggestion by Ulam (Ulam 1952), von Neumann adopted a fully discrete approach, in which space, time and even the dynamical variables were defined to be discrete. CA thus comprise a very effective computational tool for simulating physical systems and solving scientific problems, because they can capture the essential features of systems whose global behavior arises from the collective effect of simple components that interact locally (Feynman 1982, Wolfram 1986).

In CA analysis, physical processes and systems are described by a cell array and a local rule, which defines the new state of a cell depending on the states of its neighbors. All cells can work in parallel, due to the fact that each cell can independently update its own state. Therefore, CA models are massively parallel and comprise ideal candidates for hardware implementation (Mardiris et al. 2008, Kotoulas, Gasteratos, Sirakoulis, Georgoulas & Andreadis 2005, Sirakoulis et al. 2003). CA can easily handle complicated boundary and initial conditions (Von Neumann 1966, Wolfram 1986). Using a more formal definition, a CA system requires:

1. a regular lattice of cells covering a portion of a d-dimensional space;

2. a set C(r, t) = {C1(r, t), C2(r, t), ..., Cm(r, t)} of variables attached to each position r of the lattice, giving the local state of each cell at time t = 0, 1, 2, ...;

3. a rule R = {R1, R2, ..., Rm} which specifies the time evolution of the states C(r, t) in the following way:

   Cj(r, t + 1) = Rj{C(r, t), C(r + δ1, t), C(r + δ2, t), ..., C(r + δq, t)}

where r + δk designate the cells belonging to a given neighborhood of cell r.
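As an illustration of this definition, the following minimal Python sketch performs one synchronous update of a one-dimensional CA; the periodic boundary conditions and the use of Wolfram's elementary rule 110 are illustrative assumptions, not part of the definition above.

```python
import numpy as np

def ca_step(state, rule, radius=1):
    """One synchronous CA update: every cell computes C(r, t+1) from the
    states of its neighborhood {r - radius, ..., r + radius} at time t.
    Periodic boundaries are an illustrative assumption."""
    padded = np.pad(state, radius, mode="wrap")
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * radius + 1)
    return np.array([rule(w) for w in windows], dtype=state.dtype)

def rule110(neigh):
    """Wolfram's elementary rule 110: the new state is the bit of the
    number 110 indexed by the (left, centre, right) neighborhood pattern."""
    idx = 4 * neigh[0] + 2 * neigh[1] + neigh[2]
    return (110 >> idx) & 1

# One evolution step of a small binary lattice:
state = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0])
state = ca_step(state, rule110)
```

Because each cell's new state depends only on its local neighborhood, the list comprehension could be replaced by fully parallel hardware, which is exactly the property that makes CA attractive for hardware implementation.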

Furthermore, the CA approach is consistent with the modern notion of unified space-time: in computer science, space corresponds to memory and time to the processing unit. In CA, memory (the CA cell state) and the processing unit (the CA local rule) are inseparably related to a CA cell (Nalpantidis et al. 2008a).

While CA are usually considered as tools applicable to two-dimensional structures, three-dimensional structures, such as the one shown in Figure 1.5(a), can also be processed. 3D CA include rules that involve some or all of the three directions of such structures. As a result, the neighborhoods affecting the state of each cell are also 3D. Figure 1.5(b) shows such a 3 × 3 × 3 3D neighborhood.

(a) A 3D volume (x, y, z) (b) A 3×3×3 neighborhood

Fig. 1.5 Views of a 3D volume and of a 3D neighborhood defined in it



1.1.3 Gestalt Laws of Perceptual Organization

Gestalt is a movement of psychology that deals with perceptual organization (Kohler 1969). Gestalt psychology examines the relationships that bond individual elements so as to form a group (Forsyth & Ponce 2002). As a consequence, a pattern emerges instead of separate parts. This pattern generally has completely different characteristics from its parts, as shown in Figure 1.6.

(a) (b) (c)

Fig. 1.6 Images whose parts are differentiated from the whole pattern, explainable through Gestalt theory

Some of the gestalt rules by which elements tend to be associated together and interpreted as a group are the following (Sternberg 2002):

• Proximity: elements that are close to each other.
• Similarity: elements similar in an attribute.
• Continuity: elements that could belong to a smooth larger feature.
• Common fate: elements that exhibit similar behavior.
• Closure: elements that could provide closed curves.
• Parallelism: elements that seem to be parallel.
• Symmetry: elements that exhibit a larger symmetry.

Gestalt laws have proven to be precious tools in interpreting the way humans perceive their environment through vision (Scholl 2001). While all the laws are valuable for understanding the context of an image, basic image processing tasks can be restricted to using the most basic ones. In order to express an image processing task through the prism of gestalt theory, pixels should be considered as the elements, and the correlation degree between them should be treated as the bonding relationship of the elements.

1.2 Stereoscopic Vision

Calculating the distance of various points, or any other primitive, in a scene relative to the position of a camera is one of the important tasks of a computer vision system. The most common method for extracting depth information from intensity images is by means of a pair of synchronized camera signals, acquired by a stereo rig. The point-by-point matching between the two images from the stereo setup (also known as the stereo correspondence problem) derives the depth images, or the so-called disparity maps (Faugeras 1993).

The estimation of the disparity map between two images of the same scene is a long-standing issue for the machine vision community (Marr & Poggio 1976). Stereoscopic vision is based on the principle, first utilized by nature, that two spatially differentiated views of the same scene provide enough information so as to perceive the depth of the portrayed objects. Thus, the importance of stereo correspondence is apparent in the fields of machine vision (Jain et al. 1995), computer vision (Forsyth & Ponce 2002), virtual reality, robot navigation (Metta et al. 2004), simultaneous localization and mapping (Murray & Little 2000, Murray & Jennings 1997), depth measurements (Manzotti et al. 2001) and 3D environment reconstruction (Jain et al. 1995).

1.2.1 Image Rectification

In the general case, the image planes of the two capturing cameras do not belong to the same plane. While stereo algorithms can deal with such cases, the demanded calculations are considerably simplified if the stereo image pair has been rectified. The process of rectification, as shown in Figure 1.7, involves the replacement of the initial image pair Il, Ir by another, projectively equivalent pair Il,rect, Ir,rect (Forsyth & Ponce 2002, Faugeras 1993). The initial images are reprojected onto a common plane P that is parallel to the baseline B joining the optical centers of the initial images.
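In practice, rectification is typically carried out with standard library routines once the cameras have been calibrated. The following is a minimal sketch using OpenCV; the inputs K1, d1, K2, d2, R, T (intrinsics, distortion coefficients and relative pose from a prior stereo calibration) are assumed to be available and are not defined here.

```python
import cv2

def rectify_pair(img_l, img_r, K1, d1, K2, d2, R, T):
    """Reproject both images onto a common plane parallel to the baseline.
    K1, d1, K2, d2, R, T are assumed to come from a prior calibration."""
    size = (img_l.shape[1], img_l.shape[0])          # (width, height)
    R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, d1, K2, d2,
                                                      size, R, T)
    m1l, m2l = cv2.initUndistortRectifyMap(K1, d1, R1, P1, size, cv2.CV_32FC1)
    m1r, m2r = cv2.initUndistortRectifyMap(K2, d2, R2, P2, size, cv2.CV_32FC1)
    return (cv2.remap(img_l, m1l, m2l, cv2.INTER_LINEAR),
            cv2.remap(img_r, m1r, m2r, cv2.INTER_LINEAR))
```

After this step, corresponding points lie on the same image row, which is what allows the one-dimensional search described next.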

1.2.2 Epipolar Geometry

Epipolar geometry provides tools to solve the stereo correspondence problem, i.e. to recognize the same feature in both images. If no rectification is performed, the matching procedure involves searching within two-dimensional regions of the target image, as shown in Figure 1.8(b). However, this matching can be done as a one-dimensional search if accurately rectified stereo pairs are assumed, in which horizontal scan lines reside on the same epipolar line, as shown in Figure 1.8(a). A point P1 in one image plane may have arisen from any of the points in the line C1P1, and may appear in the alternate image plane at any point on the epipolar line E2 (Jain et al. 1995). Thus, the search is theoretically reduced to a scan line, since corresponding pair points reside on the same epipolar line. The difference of the horizontal coordinates of these points is the disparity value. The disparity map consists of all the disparity values of the image.
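The reduction of the search to a single scan line can be illustrated with the following sketch, which matches a small window around a left-image pixel against candidate positions on the same row of the right image; border handling is omitted and the window size is an arbitrary choice.

```python
import numpy as np

def scanline_disparity(left, right, y, x, max_disp, half=2):
    """Search along row y only (the epipolar line of a rectified pair):
    the horizontal offset d of the best-matching window is the disparity.
    Assumes grayscale images and x >= max_disp + half (no border checks)."""
    ref = left[y, x - half:x + half + 1].astype(np.float32)
    costs = [np.abs(ref - right[y, x - d - half:x - d + half + 1]
                    .astype(np.float32)).sum()
             for d in range(max_disp)]
    return int(np.argmin(costs))
```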

1.2.3 Pixel Correlation

Detecting conjugate pairs in stereo images is a challenging research problem known as the correspondence problem, i.e. to find for each pixel in the left image the corresponding pixel in the right one (Barnard & Thompson 1980). To determine that two pixels form a conjugate pair, it is necessary to measure the similarity of these pixels. A pixel to be matched without any ambiguity should be distinctly different from its surrounding pixels. Several algorithms have been proposed in



Fig. 1.9 Human eye's (left) and a typical camera's (right) color sensitivities

Aggregation is traditionally considered as time-consuming. However, Gong and his colleagues (Gong et al. 2007) study the performance of various aggregation approaches suitable for real-time methods.

According to (Hua et al. 2005), using color information instead of gray values during stereo matching significantly improves the accuracy. Recently, the use of the CIELab color space has been proved to yield impressive results (Yoon & Kweon 2006a). However, vision sensors produce color images in the RGB color space, due to their structure. This fact is generally in accordance with the way the human visual system (HVS) perceives colors, as shown in Figure 1.9.

The conversion from RGB to CIELab or similar color spaces demands non-linear transformations and, as a result, is computationally demanding. The use of the RGB color space's chromatic components is the simplest solution. Thus, the absolute differences for each channel of the RGB color space are taken into consideration. However, there are at least two possible methodologies for combining the three color channels. The International Telecommunications Union (ITU), in Recommendation BT.601-6, suggests that luminance (or intensity) information, represented as Y in color spaces such as YCbCr, YUV and XYZ, can be calculated as a weighted linear combination of the available RGB components. The weights for the red, green and blue chromatic channels are 0.299, 0.587 and 0.114, respectively. These values reflect photometric considerations and were derived from measurements of the response of the HVS to color stimuli. This equation is used in grayscale conversion by NTSC and JPEG. According to this, the aforementioned linear combination of the absolute differences (AD) calculated for each RGB channel becomes:

AD = 0.299 ADR + 0.587 ADG + 0.114 ADB    (1.9)

where AD denotes the total absolute luminance (or intensity) difference of two pixels in the two images, and ADR, ADG, ADB denote the absolute differences calculated for the red, green and blue chromatic channels only, respectively. In spite of being representative of the way the HVS accounts for each chromatic channel, this methodology is not the most credible one. The methodology most preferred in the literature (Scharstein & Szeliski 2002, Muhlmann et al. 2002) indicates that the same weight should be assigned to each one of the three chromatic channels, since each one contains the same amount of information. Thus, the total AD is a simple summation of the AD for each specific channel:

AD = ADR + ADG + ADB    (1.10)



This simpler treatment presents better performance than the more sophisticated one, as indicated by the conducted tests.
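A small sketch of the two alternatives, with illustrative names, could look as follows; Eq. 1.9 corresponds to weighted=True and Eq. 1.10 to the default equal-weight summation.

```python
import numpy as np

def absolute_difference(p_left, p_right, weighted=False):
    """Total absolute difference of two RGB pixels.

    weighted=True applies the ITU BT.601 luminance weights (Eq. 1.9);
    weighted=False sums the three channels equally (Eq. 1.10)."""
    ad = np.abs(np.asarray(p_left, dtype=np.float32)
                - np.asarray(p_right, dtype=np.float32))
    if weighted:
        return float(0.299 * ad[0] + 0.587 * ad[1] + 0.114 * ad[2])
    return float(ad.sum())
```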

1.2.4 Structure of Stereo Correspondence Algorithms

The majority of the reported stereo correspondence algorithms can be described using more or less the same structural set (Scharstein & Szeliski 2002). The basic building blocks are:

1. Computation of a matching cost function for every pixel in both input images.
2. Aggregation of the computed matching cost inside a support region for every pixel in each image.
3. Finding the optimum disparity value for every pixel of one image.
4. Refinement of the resulting disparity map.

Every stereo correspondence algorithm makes use of a matching cost function so as to establish correspondence between two pixels, as discussed in Section 1.2.3. The results of the matching cost computation comprise the Disparity Space Image (DSI). The DSI is a 3D matrix containing the computed matching costs for every pixel and for all its potential disparity values (Muhlmann et al. 2002). The structure of a DSI is illustrated in Figure 1.10.

Fig. 1.10 DSI containing matching costs for every pixel of the image and for all its potential disparity values

Usually, the matching costs are aggregated over support regions. These regions can be 2D or even 3D (Zitnick & Kanade 2000, Brockers et al. 2005) ones within the DSI cube. The selection of the appropriate disparity value for each pixel is performed afterwards. It can be a simple Winner-Takes-All (WTA) process or a more sophisticated one. In many cases this is an iterative process, as depicted in Figure 1.11. An additional disparity refinement step is frequently adopted. It is usually intended to interpolate the calculated disparity values, to give sub-pixel accuracy, or to assign values to pixels that were not calculated. The general structure of the majority of stereo correspondence algorithms is shown in Figure 1.11.



Fig. 1.11 General structure of a stereo correspondence algorithm
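The four building blocks can be condensed into a few lines for the simplest local case. The sketch below (an illustration, not one of the algorithms reviewed later) computes a SAD-based DSI, aggregates it with a square box filter and selects disparities by WTA; rectified grayscale inputs are assumed and the refinement step is omitted.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sad_wta_disparity(left, right, max_disp, window=5):
    """Matching cost (absolute differences), aggregation (box filter over a
    square support region) and WTA selection over the DSI disparity axis."""
    h, w = left.shape
    dsi = np.full((h, w, max_disp), np.inf, dtype=np.float32)
    for d in range(max_disp):
        diff = np.abs(left[:, d:].astype(np.float32)
                      - right[:, :w - d].astype(np.float32))
        dsi[:, d:, d] = uniform_filter(diff, size=window)
    return np.argmin(dsi, axis=2)   # dense disparity map, one value per pixel
```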

1.2.5 Applications of Stereo Vision in Robotics

Autonomous robots' behavior greatly depends on the accuracy of their decision-making algorithms. Reliable depth estimation is commonly needed in numerous autonomous behaviors; autonomous navigation (Hariyama et al. 2000), localization and mapping are just a few of them (Murray & Little 2000, Sim & Little 2009).

Vision-based solutions are becoming more and more attractive due to their decreasing cost, as well as their inherent affinity with human visual mechanisms. In the case of stereo vision-based navigation, the accuracy and the refresh rate of the computed disparity maps are the cornerstones of success (Iocchi & Konolige 1998, Schreer 1998). However, robotic applications place strict requirements on the demanded speed and accuracy of vision depth-computing algorithms. Depth estimation using stereo vision, i.e. the stereo correspondence problem, is known to be very computationally demanding. The computation of dense and accurate depth images, i.e. disparity maps, at frame rates suitable for robotic applications is an open problem for the scientific community. Most of the attempts to confront the demand for accuracy focus on the development of sophisticated stereo correspondence algorithms, which usually increase the computational load exponentially. On the other hand, the need for real-time frame rates inevitably imposes compromises concerning the quality of the results. However, the reliability of the results is crucial for autonomous robotic applications and proper stereo algorithms are required.


Chapter 2

State of the Art in Stereo Vision

Stereo vision is a flourishing field, attracting the attention of many researchers (Forsyth & Ponce 2002, Hartley & Zisserman 2004). New approaches are presented frequently. Such an expanding volume of work makes it difficult for those interested to keep up with it. An up-to-date survey of the stereo vision matching algorithms and corresponding applications would be useful both for those already engaged in the field, giving them a brief overview of the advances accomplished, and for newcomers, allowing for a quick introduction to the state of the art. The stereo correspondence algorithms reviewed in this Chapter follow the taxonomy diagrammatically given in Figure 2.1.

2.1 Stereo Correspondence Algorithms

Since the excellent taxonomy presented by Scharstein and Szeliski (Scharstein & Szeliski 2002) and the interesting work of Sunyoto, Mark and Gavrila (Sunyoto et al. 2004), many new stereo matching, i.e. stereo correspondence, algorithms have been proposed (Yoon & Kweon 2006a, Klaus et al. 2006). The latest trends in the field mainly pursue real-time execution speeds, as well as decent accuracy. Stereo correspondence algorithms can be grouped into those producing sparse output and those giving a dense result. Feature-based methods stem from human vision studies and are based on matching segments or edges between two images, thus resulting in a sparse output. This disadvantage is counterbalanced by the accuracy and speed obtained. However, contemporary applications demand more and more dense output. In order to categorize and evaluate the dense stereo correspondence algorithms a framework has been proposed (Scharstein & Szeliski 2002). According to this, dense matching algorithms are classified into local and global ones. Local methods (area-based) trade accuracy for speed. They are also referred to as window-based methods, because disparity computation at a given point depends only on intensity values within a finite support window. Global methods (energy-based), on the other hand, are time-consuming but very accurate. Their goal is to minimize a global cost function, which combines data and smoothness terms, taking into account the whole image. Of course, there are many other methods (Liu et al. 2006) that are not strictly included in either of these two broad classes. The issue of stereo matching has recruited a variety of computational tools. Advanced computational intelligence techniques are not uncommon and present interesting and promising results (Binaghi et al. 2004, Kotoulas, Gasteratos, Sirakoulis, Georgoulas & Andreadis 2005).

Fig. 2.1 Categorization of stereo vision algorithms.

Contemporary research in stereo matching algorithms, in accordance with the ideas of Maimone and Shafer (Maimone & Shafer 1996), is dominated by the test bench available at the site maintained by Scharstein and Szeliski (Scharstein & Szeliski 2010). As numerous methods have been proposed since then, this Section aspires to review the most recent ones, i.e. mainly those published during and after 2004. Most of the results presented in the rest of this Section are based on the image sets (Scharstein & Szeliski 2002, 2003) and the test provided there. The most common image sets are presented in Figure 2.2. Table 2.1 summarizes their size, as well as the number of disparity levels. Experimental results based on these image sets are given, where available. The preferred metric adopted in order to depict the quality of the resulting disparity maps is the percentage of pixels whose absolute disparity error is greater than 1 in the non-occluded areas of the image. This metric, considered the most representative of the result's quality, was used so as to make comparison easier. Other metrics, like the error rate and the root mean square error, are also employed. The speed with which the algorithms process input image pairs is expressed in frames per second (fps). This metric, of course, depends heavily on the computational platform used and the kind of implementation. Inevitably, speed results are not directly comparable.



Fig. 2.2 Left image of the stereo pair (left) and ground truth (right) for the Tsukuba (a), Sawtooth (b), Map (c), Venus (d), Cones (e) and Teddy (f) stereo pairs.

Table 2.1 Characteristics of the most common stereo image sets

                   Tsukuba    Map        Sawtooth   Venus      Cones      Teddy
Size in pixels     384 × 288  284 × 216  434 × 380  434 × 383  450 × 375  450 × 375
Disparity levels   16         30         20         20         60         60

2.1.1 Dense Disparity Algorithms

Methods that produce dense disparity maps gain popularity as the available computational power grows. Moreover, contemporary applications benefit from, and consequently demand, dense depth information. Therefore, during the latest years, efforts in this direction have been reported much more frequently than towards sparse results.



Dense disparity stereo matching algorithms can be divided into two general classes, according to the way they assign disparities to pixels. Firstly, there are algorithms that decide the disparity of each pixel according to the information provided by its local, neighboring pixels. There are, however, other algorithms which assign disparity values to each pixel depending on information derived from the whole image. Consequently, the former are called local methods, while the latter are called global ones.

Local Methods

Local methods are usually fast and can at the same time produce decent results. Several new methods have been presented. In Figure 2.3 a Venn diagram presents the main characteristics of the local methods discussed below. Under the term color usage we have grouped the methods that take advantage of the chromatic information of the image pair; any algorithm can process color images, but not every one can exploit that information in a more beneficial way. Furthermore, in Figure 2.3, NCC stands for the use of normalized cross correlation and SAD for the use of the sum of absolute differences as the matching cost function. As expected, the use of SAD as the matching cost is far more widespread than any other.

Fig. 2.3 Diagrammatic representation of the local methods' categorization

Legend:
1. Kotoulas, Gasteratos, Sirakoulis, Georgoulas & Andreadis (2005)
2. Yoon & Kweon (2006a)
3. Yoon & Kweon (2006b)
4. Gu et al. (2008)
5. Di Stefano et al. (2004)
6. Zach et al. (2004)
7. Ogale & Aloimonos (2005b)
8. Yoon et al. (2005)
9. Muhlmann et al. (2002)
10. Hosni et al. (2009)
11. Mordohai & Medioni (2006)
12. Binaghi et al. (2004)

Muhlmann and his colleagues (Muhlmann et al. 2002) describe a method that uses the SAD correlation measure for RGB color images. It achieves high speed and reasonable quality. It makes use of the left-to-right consistency and uniqueness constraints and applies a fast median filter to the results. It can achieve 20 fps for a 160×120 pixel image size, making this method suitable for real-time applications. The PC platform is Linux on a dual-processor 800 MHz Pentium III system with 512 MB of RAM.
with 512 MB <strong>of</strong> RAM.



Another fast area-based stereo matching algorithm, which uses the SAD as error function, is presented in (Di Stefano et al. 2004). Based on the uniqueness constraint, it rejects previous matches as soon as better ones are detected. In contrast to bidirectional matching algorithms, this one performs only one matching phase, while having similar results. The results obtained are tested for reliability and sub-pixel refined. It produces dense disparity maps in real time using an Intel Pentium III processor running at 800 MHz. The algorithm achieves a speed of 39.59 fps for 320×240 pixels and 16 disparity levels, and the root mean square error for the standard Tsukuba pair is 5.77%.

On the contrary, Ogale and Aloimonos in (Ogale & Aloimonos 2005b) take into consideration the shape of the depicted objects and demonstrate the importance of vertical and horizontal slanted surfaces. The authors propose the replacement of the standard uniqueness constraint, referring to pixels, with a uniqueness constraint referring to line segments along a scanline, so the method performs interval matching instead of pixel matching. The slants of the surfaces are computed along a scanline, a stretching factor is then obtained and the matching is performed based on the absolute intensity difference. The objective is to achieve minimum segmentation. The experimental results indicate 1.77%, 0.61%, 3.00% and 7.63% error percentages for the Tsukuba, Sawtooth, Venus and Map stereo pairs, respectively. The execution speed of the algorithm varies from 1 to 0.2 fps on a 2.4 GHz processor.

Another method that presents almost real-time performance is reported in (Yoon et al. 2005). It makes use of a refined implementation of the SAD method and a left-right consistency check. The errors in the problematic regions are reduced using differently sized correlation windows. Finally, a median filter is used in order to interpolate the results. The algorithm is able to process 7 fps for 320×240 pixel images and 32 disparity levels. These results are obtained using an Intel Pentium 4 processor at 2.66 GHz.

A window-based method for correspondence search that uses varying support-weights is presented in (Yoon & Kweon 2006a). The support-weights of the pixels in a given support window are adjusted based on color similarity and geometric proximity, in order to reduce the image ambiguity. The difference between pixel colors is measured in the CIELab color space, because the distance of two points in this space is analogous to the stimulus perceived by the human eye. The running time for the Tsukuba image pair with a 35×35 pixel support window is about 0.016 fps on an AMD 2700+ processor. The error ratio is 1.29%, 0.97%, 0.99% and 1.13% for the Tsukuba, Sawtooth, Venus and Map image sets, respectively. These figures can be further improved through a left-right consistency check.

The same authors propose, in (Yoon & Kweon 2006b), a pre-processing step for correspondence search in the presence of specular highlights. For the given input images, specular-free two-band images are generated. The similarity between pixels of these input-image representations can be measured using various correspondence search methods, such as the simple SAD-based method, the adaptive support-weights method (Yoon & Kweon 2006c) and the dynamic programming (DP) method. This pre-processing step can be performed in real time and compensates satisfactorily for specular reflections.

An extension of the previous works can be found in (Gu et al. 2008). Disparity is first approximated by an adaptive support-weight (ASW) and a rank transform method, and then a compact disparity calibration approach is employed to refine the initial disparity, so that an accurate result can be acquired.

The work of (Hosni et al. 2009) proposes a novel support aggregation approach for stereo matching. To derive support weights, the geodesic distances from all pixels of the support window to the window's center point are computed. Based on the concept of connectivity, the ASW algorithm proves to be effective for obtaining improved segmentation results.

Binaghi et al. (Binaghi et al. 2004), on the other hand, have chosen to use the zero-mean normalized cross correlation (ZNCC) as the matching cost. This method integrates a neural network (NN) model, which uses the least-mean-square delta rule for training. The NN decides on the proper window shape and size for each support region. The results obtained are satisfactory, but the 0.024 fps running speed reported for the common image sets, on a Windows platform with a 300 MHz processor, renders this method unsuitable for real-time applications.

Based on the same matching cost function, a more complex area-based method is proposed in (Mordohai & Medioni 2006). A perceptual organization framework, considering both binocular and monocular cues, is utilized. An initial matching is performed by a combination of NCC techniques. The correct matches are selected for each pixel using tensor voting. Matches are then grouped into smooth surfaces. Disparities for the unmatched pixels are assigned so as to ensure smoothness in terms of both surface orientation and color. The percentage of unoccluded pixels whose absolute disparity error is greater than 1 is 3.79, 1.23, 9.76 and 4.38 for the Tsukuba, Venus, Teddy and Cones image sets, respectively. The execution speed reported is about 0.002 fps for the Tsukuba image pair with 20 disparity levels, running on an Intel Pentium 4 processor at 2.8 GHz.

There are, of course, more hardware-oriented proposals as well. Many of them take advantage of contemporary powerful graphics machines to achieve enhanced results in terms of processing time and data volume. A hierarchical disparity estimation algorithm implemented on a programmable 3D graphics processing unit (GPU) is reported in (Zach et al. 2004). This method can process either rectified or uncalibrated image pairs. Bidirectional matching is utilized in conjunction with a locally aggregated sum of absolute intensity differences. This implementation, on an ATI Radeon 9700 Pro, can achieve up to 50 fps for 256×256 pixel input images.

Moreover, the use of CA is exploited in (Kotoulas, Gasteratos, Sirakoulis, Georgoulas & Andreadis 2005). This work presents an architecture for the real-time extraction of disparity maps. It is capable of processing 1 Megapixel image pairs at more than 40 fps. The core of the algorithm relies on matching the pixels of each scan-line using a one-dimensional window and the SAD matching cost, as described in (Kotoulas, Georgoulas, Gasteratos, Sirakoulis & Andreadis 2005). This method involves a pre-processing mean filtering step and a post-processing CA-based filtering one, both of which can be easily implemented in hardware (Nalpantidis et al. 2007, 2008b).

The main features of the discussed local algorithms are summarized in Table 2.2.

Table 2.2 Characteristics of local algorithms (matching cost; features; speed; image size; disparity levels; computational platform)

• Muhlmann et al. (2002): SAD; color usage, occlusion handling, left-right consistency, uniqueness constraints; 20 fps; 160×120; –; Intel Pentium III 800 MHz with 512 MB RAM
• Di Stefano et al. (2004): SAD; occlusion handling, uniqueness constraint; 39.59 fps; 320×240; 16; Intel Pentium III 800 MHz
• Ogale & Aloimonos (2005b): SAD; occlusion handling, interval uniqueness constraint; 1 fps; 384×288; 16; 2.4 GHz processor
• Yoon et al. (2005): SAD; occlusion handling, left-right consistency check, variable windows; 7 fps; 320×240; 32; Intel Pentium 4 2.66 GHz
• Yoon & Kweon (2006a): SAD; color usage, varying support-weights; 0.016 fps; 384×288; 16; AMD 2700+
• Yoon & Kweon (2006b): SAD; color usage, varying support-weights, specular reflection compensation; 0.016 fps; 384×288; 16; AMD 2700+
• Gu et al. (2008): SAD; color usage, varying support-weights; –; –; –; –
• Salmen et al. (2009): SAD; color usage, varying support-weights, occlusion detection; –; –; –; –
• Binaghi et al. (2004): ZNCC; varying windows based on neural networks; 0.024 fps; 284×216; 30; 300 MHz processor
• Mordohai & Medioni (2006): NCC; color usage, occlusion handling, tensor voting; 0.002 fps; 384×288; 20; Intel Pentium 4 2.8 GHz
• Zach et al. (2004): SAD; occlusion handling, implemented on GPU, bidirectional matching; 50 fps; 256×256; 88; ATI Radeon 9700 Pro
• Kotoulas, Gasteratos, Sirakoulis, Georgoulas & Andreadis (2005): SAD; cellular automata; 40 fps; 1000×1000; –; –

Global Methods

Contrary to local methods, global ones produce very accurate results. Their goal is to find the optimum disparity function d = d(x, y) which minimizes a global cost function E that combines data and smoothness terms:

E(d) = Edata(d) + λEsmooth(d)    (2.1)

where Edata takes into consideration the (x, y) pixel's value throughout the image, Esmooth provides the algorithm's smoothness assumptions and λ is a weight factor. The main disadvantage of the global methods is that they are more time-consuming and computationally demanding. The source of these characteristics is the iterative refinement approach that they employ.
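As a minimal illustration of Eq. 2.1, the following sketch evaluates the energy of a candidate disparity map; the L1 penalty on neighboring disparities is only one of many possible smoothness terms, and the names are illustrative.

```python
import numpy as np

def global_energy(disp, cost_volume, lam=1.0):
    """E(d) = Edata(d) + lambda * Esmooth(d) for an integer disparity map.

    cost_volume[y, x, d] holds the matching cost of assigning disparity d
    to pixel (x, y); the smoothness term is an L1 penalty on neighboring
    disparities (an illustrative choice)."""
    h, w = disp.shape
    ys, xs = np.indices((h, w))
    e_data = cost_volume[ys, xs, disp].sum()            # data term over the whole image
    e_smooth = (np.abs(np.diff(disp, axis=0)).sum()     # vertical neighbors
                + np.abs(np.diff(disp, axis=1)).sum())  # horizontal neighbors
    return e_data + lam * e_smooth
```

Global methods differ mainly in how they search for the disparity map that minimizes such a function, since exhaustive search over all maps is intractable.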



Global methods can be roughly divided into those performing a global energy minimization and those pursuing the minimum for independent scanlines using DP.

In Figure 2.4 the main characteristics of the global algorithms discussed below are presented. It is clear that recently published works preferably utilize global optimization rather than DP. This observation is not a surprising one, taking into consideration the fact that under the term global optimization there are actually quite a few different methods. Additionally, DP tends to produce inferior, thus less impressive, results. Therefore, applications without running speed constraints preferably utilize global optimization methods.

Fig. 2.4 Diagrammatic representation of the global methods' categorization

Legend:
1. Yang et al. (2006)
2. Veksler (2006)
3. Yu et al. (2007)
4. Yang et al. (2010)
5. Klaus et al. (2006)
6. Yoon & Kweon (2006c)
7. Gutierrez & Marroquin (2004)
8. Kim & Sohn (2005)
9. Ogale & Aloimonos (2005a)
10. Brockers et al. (2005)
11. Hirschmuller (2005)
12. Hirschmuller (2006)
13. Strecha et al. (2006)
14. Bleyer & Gelautz (2005)
15. Hong & Chen (2004)
16. Brockers (2009)
17. Zitnick et al. (2004)
18. Bleyer & Gelautz (2005)
19. Sun et al. (2005)
20. Yang et al. (2009)
21. Zitnick & Kang (2007)
22. Lei et al. (2006)
23. Wang et al. (2006)
24. Torra & Criminisi (2004)
25. Kim & Sohn (2005)
26. Veksler (2005)
27. Salmen et al. (2009)



Global Optimization

The algorithms that perform global optimization take the whole image into consideration in order to determine the disparity of every single pixel. An increasing portion of the global optimization methodologies involves segmentation of the input images according to their colors.

The algorithm presented in (Bleyer & Gelautz 2005) uses color segmentation. Each segment is described by a planar model and assigned to a layer using a mean-shift-based clustering algorithm. A global cost function is used that takes into account the summed-up absolute differences, the discontinuities between segments and the occlusions. The assignment of segments to layers is iteratively updated until the cost function improves no more. The experimental results indicate that the percentage of unoccluded pixels whose absolute disparity error is greater than 1 is 1.53, 0.16 and 0.22 for the Tsukuba, Venus and Sawtooth image sets, respectively.

The stereo matching algorithm proposed in (Hong & Chen 2004) makes use of color segmentation in conjunction with the graph cuts method. The reference image is divided into non-overlapping segments using the mean-shift color segmentation algorithm. Thus, a set of planes in the disparity space is generated. The goal of minimizing an energy function is pursued in the segment rather than the pixel domain. A disparity plane is fitted to each segment using the graph cuts method. This algorithm presents good performance in textureless and occluded regions, as well as at disparity discontinuities. The running speed reported is 0.33 fps for a 384×288 pixel image pair when tested on a 2.4 GHz Pentium 4 PC. The percentage of badly matched pixels for the Tsukuba, Sawtooth, Venus and Map image sets is found to be 1.23, 0.30, 0.08 and 1.49, respectively.

Brockers in (Brockers 2009) applies the concept of color-dependent adaptive support weights to the definition of local support areas in cooperative stereo methods, in order to improve the accuracy of depth estimation at object borders. The dissimilarity measure used is the ZNCC, and the algorithm detects occlusions and provides sub-pixel precision. The algorithm was coded in standard non-optimized C++ code using full float precision and run on a 2.4 GHz Intel Core2Duo T7700. The calculation speed for the Tsukuba scene with 100 iterations was 0.05 fps using a single core and 0.09 fps using both cores.

The ultimate goal of the work described in (Zitnick et al. 2004) is to render dynamic scenes with interactive viewpoint control, produced by a few cameras. A suitable color segmentation-based algorithm is developed and implemented on a programmable ATI 9800 PRO GPU. Disparities within segments must vary smoothly, each image is treated equally, occlusions are modeled explicitly and consistency between disparity maps is enforced, resulting in higher quality depth maps. The results for each pixel are refined in conjunction with the others.

Another method that uses the concept of image color segmentation is reported in (Bleyer & Gelautz 2005). An initial disparity map is calculated using an adapting window technique, the segments are iteratively combined into larger layers and the assignment of segments to layers is optimized using a global cost function. The quality of the disparity map is measured by warping the reference image to the second view, comparing it with the real image and calculating the color dissimilarity. For the 384×288 pixel Tsukuba and the 434×383 pixel Venus test sets, the algorithm produces results at a 0.05 fps rate. For the 450×375 pixel Teddy image pair, the running speed decreases to 0.01 fps due to the increased scene complexity. Running speeds refer to an Intel Pentium 4 2.0 GHz processor. The root mean square error obtained is 0.73 for the Tsukuba, 0.31 for the Venus and 1.07 for the Teddy image pair.

Moreover, Sun and his colleagues in (Sun et al. 2005) presented a method which treats the two images of a stereo pair symmetrically, within an energy minimization framework that can also embody color segmentation as a soft constraint. This method enforces that the occlusions in the reference image are consistent with the disparities found for the other image. Belief propagation iteratively refines the results. Moreover, the results for the version of the algorithm that incorporates segmentation are better. The percentage of pixels with a disparity error larger than 1 is 0.97, 0.19, 0.16 and 0.16 for the Tsukuba, Sawtooth, Venus and Map image sets, respectively. The running speed for the aforementioned data sets is about 0.02 fps, tested on a 2.8 GHz Pentium 4 processor.

Color segmentation is utilized in (Klaus et al. 2006) as well. The matching cost used here is a self-adapting dissimilarity measure that takes into account the sum of absolute intensity differences, as well as a gradient-based measure. Disparity planes are extracted using a technique insensitive to outliers. Disparity plane labeling is performed using belief propagation. The execution speed varies between 0.07 and 0.04 fps on a 2.21 GHz AMD Athlon 64 processor. The results indicate 1.13, 0.10, 4.22 and 2.48 percent of badly matched pixels in non-occluded areas for the Tsukuba, Venus, Teddy and Cones image sets, respectively.

Finally, one more algorithm that utilizes energy minimization, color segmentation, plane fitting and repeated application of hierarchical belief propagation is presented in (Yang et al. 2009). This algorithm takes into account a color-weighted correlation measure. Discontinuities and occlusions are properly handled. The percentage of pixels with a disparity error larger than 1 is 0.88, 0.14, 3.55 and 2.90 for the Tsukuba, Venus, Teddy and Cones image sets, respectively.

In (Yoon & Kweon 2006c) two new symmetric cost functions for global stereo methods are proposed: a symmetric data cost function for the likelihood, as well as a symmetric discontinuity cost function for the prior in the Markov random field (MRF) model for stereo. Both the reference image and the target image are taken into account to improve performance, without modeling half-occluded pixels explicitly and without using color segmentation. The use of both of the proposed symmetric cost functions in conjunction with a belief propagation based stereo method is evaluated. Experimental results for standard test bed images show that the performance of the belief propagation based stereo method is greatly improved by the combined use of the proposed symmetric cost functions. The percentage of badly matched pixels in the non-occluded areas was found to be 1.07, 0.69, 0.64 and 1.06 for the Tsukuba, Sawtooth, Venus and Map image sets, respectively.

A method based on Bayesian estimation theory, with a prior MRF model for the assigned disparities, is described in (Gutierrez & Marroquin 2004). The continuity, coherence and occlusion constraints, as well as the adjacency principle, are taken into account. The optimal estimator is computed using a Gauss-Markov random field model for the corresponding posterior marginals, which results in a diffusion process in the probability space. The results are accurate, but the algorithm is not suitable for real-time applications, since it needs a few minutes to process a 256×255 stereo pair with up to 32 disparity levels on an Intel Pentium III running at 450 MHz.

On the other hand, Strecha and his colleagues in (Strecha et al. 2006) treat every pixel of the input images as generated either by a process responsible for the pixels visible from the reference camera, which obey the constant brightness assumption, or by an outlier process, responsible for the pixels that cannot be corresponded. Depth and visibility are jointly modeled as a hidden MRF, and the spatial correlations of both are explicitly accounted for by defining a suitable Gibbs prior distribution. An expectation maximization (EM) algorithm keeps track of which points of the scene are visible in which images, and accounts for visibility configurations. The percentages of pixels with a disparity error larger than 1 are 2.57, 1.72, 6.86 and 4.64 for the Tsukuba, Venus, Teddy and Cones image sets, respectively.



Moreover, a stereo method specifically designed for image-based rendering is described in (Zitnick & Kang 2007). This algorithm uses over-segmentation of the input images and computes matching values over entire segments rather than single pixels. Color-based segmentation preserves object boundaries. The depths of the segments for each image are computed using loopy belief propagation within an MRF framework. Occlusions are also considered. The percentage of badly matched pixels in the unoccluded regions is 1.69, 0.50, 6.74 and 3.19 for the Tsukuba, Venus, Teddy and Cones image sets, respectively. The aforementioned results refer to a 2.8 GHz PC platform.

In (Hirschmuller 2005) an algorithm based on a hierarchical calculation of a mutual-information-based matching cost is proposed. Its goal is to minimize a proper global energy function, not by iterative refinements but by aggregating matching costs for each pixel from all directions. The final disparity map is sub-pixel accurate and occlusions are detected. The processing speed for the Teddy image set is 0.77 fps. The error in unoccluded regions is found to be less than 3% for all the standard image sets. Calculations are made on an Intel Xeon processor running at 2.8 GHz.

An enhanced version of the previous method is proposed by the same author in (Hirschmuller 2006). Mutual information is once again used as the cost function. The extensions applied to it result in intensity-consistent disparity selection for untextured areas and discontinuity-preserving interpolation for filling holes in the disparity maps. It successfully treats complex shapes and uses planar models for untextured areas. A bidirectional consistency check, sub-pixel estimation, as well as interpolation of invalid disparities are performed. The experimental results indicate that the percentages of badly matched pixels in unoccluded regions are 2.61, 0.25, 5.14 and 2.77 for the Tsukuba, Venus, Teddy and Cones image sets, respectively, with 64 disparity levels searched each time. However, the reported running speed on a 2.8 GHz PC is less than 1 fps.

The work done by Kim and Sohn (Kim & Sohn 2005) introduces a two-stage algorithm consisting of hierarchical dense disparity estimation and vector field regularization. The dense disparity estimation is accomplished by a region-dividing technique that uses a Canny edge detector and a simple SAD function. The results are refined by regularizing the vector fields by means of minimizing an energy function. The root mean square error obtained by this method is 0.9278 and 0.9094 for the Tsukuba and Sawtooth image pairs, respectively. The running speed is 0.15 fps and 0.105 fps, respectively, on a Pentium 4 PC running Windows XP.

An uncommon measure is used by Ogale and Aloimonos in (Ogale & Aloimonos 2005a). This work describes an algorithm which is focused on achieving contrast-invariant stereo matching. It relies on multiple spatial frequency channels for local matching. The measure for this stage is the deviation of the phase difference from zero. The global solution is found by a fast, non-iterative left-right diffusion process. Occlusions are found by enforcing the uniqueness constraint. The algorithm is able to handle significant changes in contrast between the two images and can handle noise in one of the frequency channels. The Matlab implementation of the algorithm is capable of processing the Middlebury image pairs at a 0.5 to 0.25 fps rate, on a 2 GHz computer platform.

The method described in (Brockers et al. 2005) uses a cost relaxation approach. A similarity measurement is obtained as a preliminary stage of the relaxation process. Relaxation is an iterative process that minimizes a global cost function while taking into account the continuity constraint and the expected similarity of neighboring pixels. The support regions are 3D within the disparity space volume and have Gaussian weights. The disparity is available at any time of the iterative refinement phase, having of course diminished accuracy for few iteration cycles. This feature makes the method suitable for time-critical applications. The percentages of badly matched pixels in unoccluded regions are found to be 4.76, 1.41, 8.18 and 3.91 for the Tsukuba, Venus, Teddy and Cones image sets, respectively.



In (Yu et al. 2007) the feasibility of applying compression techniques to the messages of the belief propagation algorithm, in order to improve its efficiency, is studied. A compression scheme called envelope point transform is proposed. Experimental results on dense stereo reconstruction have shown that envelope point transform-based belief propagation can achieve a compression factor of 8 or more without significant loss of depth accuracy.

The work of (Yang et al. 2010) considers the problem of stereo matching using loopy belief propagation. The algorithm hierarchically reduces the disparity search range. By fixing the number of disparity levels at the original resolution, this method solves the message updating problem in a time linear in the number of pixels contained in the image and requires constant memory space. Specifically, for an 800×600 image with 300 disparity levels the message updating method achieves an execution speed of 0.67 fps and requires little memory. The platform used was a 2.5 GHz Intel Core 2 Duo processor.
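The core operation of such message-passing schemes is the message update. The following sketch shows one min-sum update with a truncated linear smoothness term, computed naively in O(D^2) for clarity, whereas efficient implementations like the cited one use linear-time updates; all names and parameter values are illustrative.

import numpy as np

def send_message(data_cost, incoming, smooth_lambda=1.0, trunc=2.0):
    """One min-sum belief-propagation message for a single pixel.
    data_cost: (D,) unary matching cost of the sending pixel; incoming:
    list of (D,) messages from its neighbours other than the recipient."""
    h = data_cost + sum(incoming)            # aggregate belief at the sender
    d_levels = len(data_cost)
    msg = np.empty(d_levels)
    for dq in range(d_levels):               # candidate label of the recipient
        steps = np.abs(np.arange(d_levels) - dq)
        penalty = smooth_lambda * np.minimum(steps, trunc)
        msg[dq] = np.min(h + penalty)        # best sender label for this dq
    return msg - msg.min()                   # normalise to avoid drift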

Another algorithm that generates high quality results in real time is reported in (Yang et al. 2006). It is based on the minimization of a global energy function comprising a data and a smoothness term. The hierarchical belief propagation iteratively optimizes the smoothness term, achieving fast convergence by removing redundant computations. In order to accomplish real-time operation the authors take advantage of the parallelism of graphics hardware (GPU). Experimental results indicate a processing speed of 16 fps for 320×240 pixel self-recorded images with 16 disparity levels. The percentages of bad matching pixels in unoccluded regions for the Tsukuba, Venus, Teddy and Cones image sets are found to be 1.49, 0.77, 8.72 and 4.61, respectively. The computer used is a 3 GHz PC and the GPU is an NVIDIA GeForce 7900 GTX graphics card with 512 MB of video memory.

The work of Veksler in (Veksler 2006) indicates that the computational cost of the graph cuts stereo correspondence technique can be efficiently decreased by using the results of a simple local stereo algorithm to limit the disparity search range. The idea is to analyze and exploit the failures of local correspondence algorithms. This method can accelerate the processing by a factor of 2.8, compared to the sole use of graph cuts, while the resulting energy is worse only by an average of 1.7%. These figures resulted from an analysis performed on a large dataset of 32 stereo pairs using a Pentium 4 at 2.6 GHz. This is a considerable improvement in efficiency gained for a small price in accuracy, and it moves the graph-cuts based algorithms closer to real-time implementation. The running speeds are 0.77, 0.38, 0.16, 0.17, 0.53 and 1.04 fps for the Tsukuba, Venus, Teddy, Cones, Sawtooth and Map image sets, respectively, while the corresponding error percentages are found to be 2.22, 1.39, 12.8, 8.87, 1.18 and 0.51.

The main features of the discussed algorithms that utilize global optimization are summarized in Table 2.3.

Table 2.3 Characteristics of global algorithms that use global optimization

Author | Method | Features | Speed (fps) | Image Size | Disparity levels | Computational platform
Bleyer & Gelautz (2005) | Global cost function | Color segmentation, occlusion handling | – | – | – | –
Hong & Chen (2004) | Graph cuts | Color segmentation, occlusion handling | 0.33 | 384×288 | 16 | Intel Pentium 4 2.4 GHz
Brockers (2009) | Global cost function | Varying support-weights, cooperative optimization, occlusion detection | 0.09 | 384×288 | 16 | 2.4 GHz Intel Core2Duo T7700
Zitnick et al. (2004) | Global cost function | Color segmentation, occlusion handling, GPU utilization | – | – | – | ATI 9800 PRO GPU
Bleyer & Gelautz (2005) | Global cost function | Color segmentation, occlusion handling | 0.05 | 384×288 | 16 | Intel Pentium 4 2.0 GHz
Sun et al. (2005) | Belief propagation | Symmetrical treatment, color segmentation | 0.02 | 384×288 | 16 | Intel Pentium 4 2.8 GHz
Klaus et al. (2006) | Belief propagation | Color segmentation, occlusion handling | 0.07 | 384×288 | 16 | AMD Athlon 64 2.21 GHz
Yang et al. (2009) | Hierarchical belief propagation | Color segmentation, occlusion handling | – | – | – | –
Yoon & Kweon (2006c) | Belief propagation | Color segmentation, symmetrical cost functions, occlusion handling | – | – | – | –
Gutierrez & Marroquin (2004) | Gauss-Markov random field | Continuity, coherence, adjacency | – | – | – | –
Strecha et al. (2006) | Hidden Markov random field | Occlusion handling, EM algorithm | – | – | – | –
Zitnick & Kang (2007) | Belief propagation within a MRF framework | Color segmentation, occlusion handling | – | – | – | 2.8 GHz
Hirschmuller (2005) | Global cost function | Occlusion handling, mutual information | 0.77 | 450×375 | 60 | Intel Xeon 2.8 GHz
Hirschmuller (2006) | Global cost function | Occlusion handling, mutual information, bidirectional match | <1 | – | 64 | 2.8 GHz PC
Kim & Sohn (2005) | Vector field regularization | Occlusion handling, Canny edge detector | 0.15 | – | – | Intel Pentium 4
Ogale & Aloimonos (2005a) | Left-right diffusion process | Occlusion handling, phase-based matching | 0.5–0.25 | – | – | 2 GHz PC
Brockers et al. (2005) | Cost relaxation | Occlusion handling, 3D support regions | – | – | – | –
Yang et al. (2010) | Hierarchical belief propagation | Hierarchical reduction of the disparity range | 0.67 | 800×600 | 300 | 2.5 GHz Intel Core 2 Duo
Yang et al. (2006) | Hierarchical belief propagation | GPU utilization | 16 | 320×240 | 16 | 3 GHz PC, NVIDIA GeForce 7900 GTX



Dynamic Programming

Many researchers develop stereo correspondence algorithms based on DP. This methodology is a fair trade-off between the complexity of the computations needed and the quality of the results obtained. In every aspect, DP stands between the local algorithms and the global optimization ones. However, its computational complexity still renders it a less preferable option for hardware implementation.
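For reference, the classic scanline formulation can be sketched as follows: each left pixel is either matched to a right pixel or skipped at a fixed occlusion penalty, and the optimal alignment is recovered by backtracking. The sketch assumes rectified grayscale rows, and the occlusion cost value is an arbitrary placeholder.

import numpy as np

def dp_scanline(left_row, right_row, occ_cost=20.0):
    """Dynamic-programming matching of one rectified scanline pair.
    Returns a per-pixel disparity for the left row, with -1 marking pixels
    left unmatched (occluded) by the optimal alignment."""
    n, m = len(left_row), len(right_row)
    cost = np.zeros((n + 1, m + 1))
    cost[:, 0] = np.arange(n + 1) * occ_cost
    cost[0, :] = np.arange(m + 1) * occ_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1, j - 1] + abs(float(left_row[i - 1]) - float(right_row[j - 1]))
            cost[i, j] = min(match, cost[i - 1, j] + occ_cost, cost[i, j - 1] + occ_cost)
    disparity = -np.ones(n, dtype=int)
    i, j = n, m                                # backtrack the optimal path
    while i > 0 and j > 0:
        match = cost[i - 1, j - 1] + abs(float(left_row[i - 1]) - float(right_row[j - 1]))
        if cost[i, j] == match:
            disparity[i - 1] = (i - 1) - (j - 1)   # matched: store disparity
            i, j = i - 1, j - 1
        elif cost[i, j] == cost[i - 1, j] + occ_cost:
            i -= 1                             # left pixel skipped (occluded)
        else:
            j -= 1                             # right pixel skipped
    return disparity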

The work of Torra and Criminisi (Torra & Criminisi 2004) presents a unified framework that allows the fusion of any partial knowledge about disparities, such as matched features and known surfaces within the scene. It combines the results from corner, edge and dense stereo matching algorithms to impose constraints that act as guide points for the standard DP method. The result is a fully automatic dense stereo system with up to four times faster running speed and greater accuracy compared to the results obtained by the sole use of DP.

Moreover, a generalized ground control points (GGCP) scheme is introduced in (Kim et al. 2005). One or more candidates for the true disparity of each pixel are assigned by local matching using oriented spatial filters. Afterwards, a two-pass DP technique that performs optimization both along and between the scanlines is applied. The result is the reduction of false matches, as well as of the typical inter-scanline inconsistency problem. The percentage of bad matched pixels in unoccluded regions is 1.53, 0.61, 0.94 and 0.706 for the Tsukuba, Sawtooth, Venus and Map image sets, respectively. The running speeds, tested on a Pentium 4 at 2.4 GHz, vary from 0.23 fps for the Tsukuba set with 15 disparity levels down to 0.08 fps for the Sawtooth set with 21 disparity levels.

Salmen et al. in (Salmen et al. 2009) present a refined DP stereo processing algorithm. The concept of multi-path backtracking is introduced in order to exploit the information gained from DP more effectively. All parameters of the algorithm are automatically tuned offline by an evolutionary algorithm. The number of incorrect disparities is reduced by 40% compared to the DP reference implementation, while the overall complexity increases only slightly. The processing speed is 5 fps for the Tsukuba image set, 2.5 fps for Venus, 1.25 fps for Teddy and 1.25 fps for Cones, on a standard desktop PC with a 1.8 GHz processor.

Wang et al. in (Wang et al. 2006) present a stereo algorithm that combines high quality results with real-time performance. DP is used in conjunction with an adaptive aggregation step. The per-pixel matching costs are aggregated in the vertical direction only, resulting in improved inter-scanline consistency and sharp object boundaries. This work exploits the color and distance proximity-based weight assignment for the pixels inside a fixed support window, as reported in (Yoon & Kweon 2006a). The real-time performance is achieved thanks to the parallel use of the CPU and the GPU of a computer. This implementation can process 320×240 pixel images with 16 disparity levels at 43.5 fps and 640×480 pixel images with 16 disparity levels at 9.9 fps. The test system is a 3.0 GHz PC with an ATI Radeon XL1800 GPU.
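The weight assignment borrowed from (Yoon & Kweon 2006a) can be sketched as follows: a pixel inside the support window contributes more the closer its color and position are to the window's central pixel. The CIELab input and the gamma values below are indicative assumptions.

import numpy as np

def support_weights(window_lab, center_lab, gamma_c=5.0, gamma_p=17.5):
    """Color- and distance-based support weights for one (k, k, 3) window
    of CIELab values around a central pixel with color center_lab."""
    k = window_lab.shape[0]
    yy, xx = np.mgrid[0:k, 0:k]
    c = k // 2
    dist = np.sqrt((yy - c) ** 2 + (xx - c) ** 2)              # spatial proximity
    dcol = np.linalg.norm(window_lab - center_lab, axis=2)     # color similarity
    return np.exp(-dcol / gamma_c - dist / gamma_p)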

On the contrary, the algorithm proposed in (Veksler 2005) applies the DP method not to individual scanlines but to a tree structure. Thus, the minimization procedure accounts for all the pixels of the image, compensating for the known streaking effect. The reported running speed is a couple of frames per second for the tested image pairs, so real-time implementations are feasible. At the same time, the quality of the results is comparable to that of the time-consuming global methods. The reported percentages of bad matched pixels are 1.77, 1.44, 1.21 and 1.45 for the tested Tsukuba, Sawtooth, Venus and Map image sets, respectively.

In (Lei et al. 2006) the pixel-tree approach of the previous work is replaced by a region-tree one. First of all, the image is color-segmented using the mean-shift algorithm. During the stereo matching, a corresponding energy function defined on such a region-tree structure is optimized using the DP technique. Occlusions are handled by compensating for border occlusions and by applying cross checking. The obtained results indicate that the percentage of bad matched pixels in unoccluded regions is 1.39, 0.22, 7.42 and 6.31 for the Tsukuba, Venus, Teddy and Cones image sets, respectively. The running speed, on a 1.4 GHz Intel Pentium M processor, ranges from 0.1 fps for the Tsukuba dataset with 16 disparity levels down to 0.04 fps for the Cones dataset with 60 disparity levels.

The main features of the discussed global algorithms that utilize DP are summarized in Table 2.4.

Table 2.4 Characteristics of global algorithms that use DP

Author | Method | Features | Speed (fps) | Image Size | Disparity levels | Computational platform
Torra & Criminisi (2004) | DP | Occlusion handling, prior feature matching | – | – | – | –
Kim et al. (2005) | DP | Occlusion handling, prior disparity candidate assignment, two-pass inter-scanline optimization | 0.23 | 384×288 | 16 | Intel Pentium 4 2.4 GHz
Salmen et al. (2009) | DP | Multi-path backtracking | 5 | 384×288 | 16 | 1.8 GHz
Wang et al. (2006) | DP | Color usage, inter-scanline consistency, adaptive aggregation, parallel usage of CPU and GPU | 43.5 | 320×240 | 16 | 3.0 GHz CPU, ATI Radeon XL1800 GPU
Veksler (2005) | DP | Applied to pixel-tree structure | ∼2 | – | – | –
Lei et al. (2006) | DP | Occlusion handling, color usage, applied to region-tree structure | 0.1 | 384×288 | 16 | 1.4 GHz Intel Pentium M

Other Methods

There are, of course, other methods producing dense disparity maps which can be placed in neither of the previous categories. The methods discussed below use either wavelet-based techniques or combinations of various techniques.

Such a method, based on the continuous wavelet transform (CWT), is found in (Huang & Dubois 2004). It makes use of the redundant information that results from the CWT. Using 1D orthogonal and biorthogonal wavelets, as well as a 2D orthogonal wavelet, the maximum matching rate obtained is 88.22% for the Tsukuba pair. Upsampling the pixels in the horizontal direction by a factor of two, through zero insertion, further decreases the noise, and a matching rate of 84.91% is reported.

Another work (Liu et al. 2006) presents an algorithm based on non-uniform rational B-splines (NURBS) curves. The curves replace the edges extracted with a wavelet-based method. The NURBS are projective invariant and thus reduce false matches due to distortion and image noise. Stereo matching is then performed by estimating the similarity between projections of curves of one image and curves of the other image. A 96.5% matching rate for a self-recorded image pair is reported for this method.

Finally, a different way of confronting the stereo matching issue is proposed in (De Cubber et al. 2008). The authors, acknowledging that there is no all-satisfying method, investigate the possibility of fusing the results from spatially differentiated (stereo vision) scenery images with those from temporally differentiated (structure from motion) ones. This method takes advantage of both methods' merits, improving the overall performance.

The main features of the discussed algorithms that cannot be clearly assigned to any of the aforementioned categories are summarized in Table 2.5.

Table 2.5 Characteristics of the algorithms that cannot be clearly assigned to any category

Author | Method | Features
Huang & Dubois (2004) | Continuous wavelet transform | 1D orthogonal and biorthogonal wavelets; 2D orthogonal wavelet
Liu et al. (2006) | Wavelet-based | Non-uniform rational B-splines curves
De Cubber et al. (2008) | Intensity-based | Stereo vision and structure from motion fusion

2.1.2 Sparse Disparity Algorithms

Algorithms resulting in sparse, or semi-dense, disparity maps tend to be less attractive, as most of the contemporary applications require dense disparity information. However, they are very useful when fast depth estimation is required and, at the same time, detail in the whole picture is not so important. This type of algorithm tends to focus on the main features of the images, leaving occluded and poorly textured areas unmatched. Consequently, high processing speeds and accurate, but of limited density, results are achieved. Very interesting ideas flourish in this direction but, since contemporary interest is directed towards dense disparity maps, only a few indicative algorithms are discussed here.

Veksler in (Veksler 2002) presents an algorithm that detects and matches dense features between the left and right images of a stereo pair, producing a semi-dense disparity map. A dense feature is a connected set of pixels in the left image and a corresponding set of pixels in the right image, such that the intensity edges on the boundary of these sets are stronger than their matching error. All these are computed during the stereo matching process. The algorithm runs at 1 fps with 14 disparity levels for the Tsukuba pair, producing 66% density and 0.06% average error in the non-occluded regions.

Another method developed by Veksler (Veksler 2003) is based on the same basic concepts as the former one. The main difference is that this one uses the graph cuts algorithm for the dense feature extraction. As a consequence, this algorithm produces semi-dense results with significant accuracy in areas where features are detected. The results are significantly better in terms of density and error percentage, but require longer running times. For the Tsukuba pair it achieves a density of up to 75%, the total error in the non-occluded regions is 0.36% and the running speed is 0.17 fps. For the Sawtooth pair the corresponding results are 87%, 0.54% and 0.08 fps. All the results are obtained on a Pentium III PC running at 600 MHz.

On the other hand, Gong and Yang in their paper (Gong & Yang 2005a) propose a DP algorithm, called reliability-based dynamic programming (RDP), that uses a different measure to evaluate the reliabilities of matches. According to this, the reliability of a proposed match is the cost difference between the globally best disparity assignment that includes the match and the best one that does not include it. The interscanline consistency problem, common to DP algorithms, is reduced through a reliability thresholding process. The result is a semi-dense unambiguous disparity map with 76% density, 0.32% error rate and 16 fps for the Tsukuba and 72% density, 0.23% error rate and 7 fps for the Sawtooth image pair. Accordingly, the results for the Venus and Map pairs are 73%, 0.18%, 6.4 fps and 86%, 0.7%, 12.8 fps, respectively. The reported execution speeds, tested on a 2 GHz Pentium 4 PC, are thus encouraging for real-time operation, provided that a semi-dense disparity map is acceptable.
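A simplified, per-pixel proxy of this reliability idea, not the actual RDP measure, which is defined over whole DP assignments, can be sketched as follows: a disparity is kept only when its aggregated cost wins over the runner-up by more than a threshold; the names and the threshold are illustrative.

import numpy as np

def semi_dense(costs, threshold):
    """Semi-dense disparity map by reliability thresholding. costs has
    shape (h, w, D); pixels whose best cost does not beat the second-best
    one by more than threshold are marked unmatched (-1)."""
    part = np.partition(costs, 1, axis=2)       # two smallest costs first
    reliability = part[..., 1] - part[..., 0]   # margin of the best match
    best = costs.argmin(axis=2)
    return np.where(reliability > threshold, best, -1)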

A similar, near-real-time stereo matching technique is presented in (Gong & Yang 2005b) by the same authors, which is also based on the RDP algorithm. This algorithm can generate semi-dense disparity maps. Two orthogonal RDP passes are used to search for reliable disparities along both horizontal and vertical scanlines. Hence, the interscanline consistency is explicitly enforced. It takes advantage of the computational power of programmable graphics hardware, which further improves speed. The algorithm is tested on an Intel Pentium 4 computer running at 3 GHz with a programmable ATI Radeon 9800 XT GPU equipped with 256 MB video memory. It results in an 85% dense disparity map with 0.3% error rate at 23.8 fps for the Tsukuba pair, 93% density and 0.24% error rate at 12.3 fps for the Sawtooth pair, 86% density and 0.21% error rate at 9.2 fps for the Venus pair, and 88% density and 0.05% error rate at 20.8 fps for the Map image pair. If needed, the method can also be used to generate denser disparity maps at the expense of execution speed.

The main features of the discussed algorithms that produce sparse output are summarized in Table 2.6.

Table 2.6 Characteristics of the algorithms that produce sparse output

Author | Method | Density (%) | Speed (fps) | Image Size | Disparity levels | Computational platform
Veksler (2002) | Local | 66 | 1 | 384×288 | 14 | –
Veksler (2003) | Graph cuts | 75 | 0.17 | 384×288 | 16 | Intel Pentium III 600 MHz
Gong & Yang (2005a) | RDP | 76 | 16 | 384×288 | 16 | Intel Pentium 4 2 GHz
Gong & Yang (2005b) | RDP | 85 | 23.8 | 384×288 | 16 | Intel Pentium 4 3 GHz CPU, ATI Radeon 9800 XT GPU

2.2 Hardware Implementations of Stereo Correspondence Algorithms

While the aforementioned categorization involves stereo matching algorithms in general, in practice it is valuable for software implemented algorithms only. Software implementations make use of general purpose personal computers (PC) and usually result in considerably long running times. However, this is not an option when the objective is the development of autonomous robotic platforms, simultaneous localization and mapping (SLAM) or virtual reality (VR) systems. Such tasks require efficient real-time performance and demand dedicated hardware and, consequently, specially developed and optimized algorithms. Only a small subset of the already proposed algorithms is suitable for hardware implementation. Hardware implemented algorithms are characterized by their theoretical algorithm as well as by the implementation itself. There are two broad classes of hardware implementations: the field-programmable gate array (FPGA) and the application-specific integrated circuit (ASIC) based ones. Figure 2.5(a) depicts an ASIC chip and 2.5(b) an FPGA development board. Each one can execute stereo vision algorithms without the necessity of a PC, saving volume, weight and consumed energy. However, the evolution of FPGAs has made them an appealing choice due to their small prototyping times, their flexibility and their good performance.


Fig. 2.5 An ASIC chip (a) and an FPGA development board (b)



There are many applications, as mentioned above, that demand extraction of the disparity map from image pairs in real time. Moreover, most of these applications demand dense output. PCs, due to their serial-processing architecture, find it difficult to meet these requirements. This problem can be efficiently confronted by the use of dedicated hardware. In addition, the need for dedicated hardware is more evident in the case of autonomous units, where the existence of a PC is not a convenient solution. Hardware implementations can accelerate the performance of stereo vision systems. They are able to provide the parallelism that is commonly useful in image processing and vision algorithms. In particular, regular and simple structures such as CA or basic filtering modules can be easily and efficiently implemented in hardware. By processing several parts of the data in parallel and performing specific calculations, their overall performance is considerably better compared to software solutions running on serial general purpose processors.

The hardware implementation of global algorithms is neither an appealing nor an easy option. As stated above, global methods are time and computation demanding because of their iterative nature. This is also the reason that prevents them from being implemented with parallel structures. On the contrary, global algorithms require intricate, rather than simple and straightforward, implementations. DP, though, is inherently the simplest of the global approaches.

In contrast, local methods can greatly benefit from the use of such parallel and straightforward structures. Parallelism and simplicity are key factors, available in dedicated hardware implementations, that can reduce the required running times. There are several works that describe local methods implemented in hardware. What most of them have in common is that they implement a rather simple algorithm and make extensive use of computation concurrency. Performance is refined by custom choices during the hardware architecture development phase. A generalized block diagram of a hardware implementable stereo correspondence algorithm is shown in Figure 2.6.

Fig. 2.6 Generalized block diagram of a hardware implementable stereo correspondence algorithm.

Hardware implementation involves using either FPGAs or ASICs. Digital signal processor (DSP) based solutions have also been reported in the past (Faugeras et al. 1993); however, they are not reported as frequently, due to their inherent difficulty in parallel processing. A survey of the recent bibliography confirms that FPGA implementations are preferable. That is because the time required for fabrication and testing of ASIC implementations is considerably long and its cost is high. Moreover, there is almost no flexibility for future improvements and modifications. On the other hand, FPGAs provide rapid prototyping times, are far less expensive and can be easily adapted to new specifications. In this way FPGAs combine the best parts of hardware solutions with those of software ones.



2.2.1 FPGA Implementations

All the hardware implementations examined in this Section can achieve real-time operation. The use of FPGAs is now the most convenient and reasonable choice for hardware development. They are cheap and perform remarkably well. The available resources of the devices are constantly growing, allowing more complex algorithms to be implemented. The variety of available electronic design automation (EDA) tools and the absence of a fabrication stage make the prototyping times very short. Another advantage is that the resulting hardware implementation remains open for further upgrades. Thus, FPGA implementations are very flexible and fault tolerant.

In the rest of this Section, FPGA implemented methods based on SAD, DP and Local Weighted Phase-Correlation (LWPC) are presented. Table 2.7 summarizes the main characteristics of the works discussed below; the table is populated according to the available data. It is evident that the simplest and most straightforward method of all, i.e. SAD, is the most preferred one.

FPGA Implementations based on SAD

As expected, when it comes to hardware implementations, SAD-based methods are the most preferred ones. SAD calculation requires simple computational modules, as it involves only summations and absolute value calculations.
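In software terms, the whole fixed-window SAD pipeline fits in a few lines; the hardware designs discussed below essentially unroll these loops into parallel logic. The function here is a generic illustration with assumed parameter values, not any of the cited designs.

import numpy as np

def box_sum(img, w):
    """Sum of every (2w+1)x(2w+1) window, via a summed-area table."""
    size = 2 * w + 1
    pad = np.pad(img, w, mode='edge')
    s = np.pad(pad.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    return (s[size:, size:] - s[:-size, size:]
            - s[size:, :-size] + s[:-size, :-size])

def sad_disparity(left, right, max_d=16, w=3):
    """Fixed-window SAD stereo for grayscale float images: shift the right
    image by each candidate disparity, box-filter the absolute differences
    and pick the disparity with the lowest aggregated cost."""
    h, wd = left.shape
    costs = np.full((h, wd, max_d), np.inf)
    for d in range(max_d):
        diff = np.abs(left[:, d:] - right[:, :wd - d])
        costs[:, d:, d] = box_sum(diff, w)
    return costs.argmin(axis=2)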

The FPGA based architecture presented in (Arias-Estrada & Xicotencatl 2001) is able to produce dense disparity maps in real time. The architecture implements a local algorithm based on SAD aggregated over fixed windows. Input data are processed in parallel on a single chip. An extension of the basic architecture is also proposed in order to compute disparity maps from more than two images. This method can process 320×240 pixel images with 16 disparity levels at speeds higher than 71 fps. The device utilization is 4.2K slices, equivalent to 69K gates.

The system developed in (Jia et al. 2003) is able to compute dense disparity maps in real time using the SAD method over fixed windows. The whole algorithm, including radial distortion correction, Laplacian of Gaussian (LoG) filtering, correspondence finding and disparity map computation, is implemented on a single FPGA. The system can process 640×480 pixel images with 64 disparity levels and 8 bit depth precision at 30 fps, and 320×240 pixel images at 50 fps.

The SAD algorithm aggregated over fixed windows is also the option utilized in (Miyajima & Maruyama 2003). This stereo vision system is implemented on a single FPGA with plenty of external memory. It supports camera calibration and left-right consistency checking. The performance is 20 fps for 640×480 pixel images and 80 fps for 320×240 ones. The number of disparity levels for these results is 200 and the device utilization is 54%. Changing the number of disparity levels only changes the circuit size, not the performance.
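The left-right consistency check used here can be sketched as follows, assuming two integer disparity maps computed with the roles of the two images swapped; the one-pixel tolerance is a typical but assumed choice.

import numpy as np

def lr_consistency(disp_left, disp_right, tol=1):
    """Cross-check two disparity maps: a left-image disparity is kept only
    if the right-image map, sampled at the matched position, agrees with it
    within tol; rejected pixels are marked with -1."""
    h, w = disp_left.shape
    xs = np.arange(w)
    valid = np.zeros((h, w), dtype=bool)
    for y in range(h):
        xr = xs - disp_left[y]                    # matched right-image column
        inside = (xr >= 0) & (xr < w)
        valid[y, inside] = np.abs(
            disp_left[y, inside] - disp_right[y, xr[inside]]) <= tol
    return np.where(valid, disp_left, -1)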

A simplified version of the adaptive window aggregation method in conjunction with SAD is used in (Chonghun et al. 2004). It can process images of size up to 1024×1024 pixels with 32 disparity levels at 47 fps. The resources needed are 3.4K slices, i.e. 10% of the utilized FPGA area.

Another simple implementation of the SAD method with fixed windows is proposed in (Yi et al. 2004). The effect of various window shapes is investigated. The results indicate that 270×270 pixel images with 27 disparity levels can be processed at 30 fps, achieving 90% correct matches. The reported FPGA utilization is in any case less than 46K slices, equivalent to 8M gates.



Table 2.7 FPGA implementations' characteristics

Author | Matching Cost | Aggregation | Image Size | Disparity Levels | Window Size | Speed (fps) | Device
Arias-Estrada & Xicotencatl (2001) | SAD | fixed window | 320×240 | 16 | 7×7 | 71 | Xilinx Virtex XCV800HQ240-6
Jia et al. (2003) | SAD | fixed window | 640×480 | 64 | 9×9 | 30 | –
Miyajima & Maruyama (2003) | SAD | fixed window | 640×480 | 200 | 7×7 | 20 | Xilinx Virtex-II
Chonghun et al. (2004) | SAD | adaptive window | 1024×1024 | 32 | 16×16 (max) | 47 | Xilinx Virtex-II 6000
Yi et al. (2004) | SAD | fixed window | 270×270 | 27 | 9×9 | 30 | Xilinx Virtex-II XC2V8000
Lee et al. (2005) | SAD | fixed window | 640×480 | 64 | 32×32 | 30 | Xilinx Virtex-II XC2V8000
Hariyama, Kobayashi, Sasaki & Kameyama (2005) | SAD | adaptive window | 64×64 | 64 | 8×8 (max) | 30 | Altera APEX20KE
Georgoulas et al. (2008) | SAD | adaptive window | 640×480 | 80 | 7×7 (max) | 275 | Altera Stratix II EP2S180F1020C3
Kalomiros & Lygouras (2008) | SAD | fixed window | 320×240 | 32 | 3×3 | 14 | Altera Cyclone II EP2C35F672C6
Kalomiros & Lygouras (2009) | SAD | fixed window | 640×480 | 64 | 3×3 | 162 | Altera Cyclone II 2C35
Jeong & Park (2004) | DP | single line | 1280×1000 | 208 | – | 15 | Xilinx Virtex-II XC2V8000
Park & Jeong (2007) | DP | 2 lines | 320×240 | 128 | – | 30 | Xilinx Virtex-II pro-100
Kalomiros & Lygouras (2009) | DP | 2 lines | 640×480 | 65 | 3×3 | 81 | Altera Cyclone II 2C35
Darabiha et al. (2006) | LWPC | – | 256×360 | 20 | – | 30 | 4× Xilinx Virtex2000E
Masrani & MacLean (2006) | LWPC | – | 480×640 | 128 | – | 30 | 4× Altera Stratix S80



The same core algorithm as in (Yi et al. 2004) is used in the work reported in (Lee et al. 2005). The shape of the aggregating window is found to play a significant role in this implementation. Using rectangular windows instead of square ones reduces the resource usage to 50%, i.e. less than 10K slices, while at the same time preserving the same output quality. The proposed system can process 640×480 pixel images with 64 disparity levels at 30 fps and 320×240 pixel images with 64 disparity levels at 115 fps.

On the other hand, a slightly more complex implementation than the previous ones is proposed in (Hariyama, Kobayashi, Sasaki & Kameyama 2005). It is based on SAD using adaptively sized windows. The proposed method iteratively refines the matching results by hierarchically reducing the window size. The results obtained by the proposed method are 10% better than those of the fixed-window method. The architecture is fully parallel and, as a result, all the pixels and all the windows are processed simultaneously. The speed for 64×64 pixel images with 8 bit grayscale precision and 64 disparity levels is 30 fps. The resource consumption is 42.5K logic elements, i.e. 82% of the utilized device.

SAD aggregated using adaptive windows is the core of the work presented in (Georgoulas et al. 2008). A hardware based CA parallel-pipelined design is realized on a single FPGA device. The achieved speed is nearly 275 fps for 640×480 pixel image pairs with a disparity range of 80 pixels. The presented hardware-based algorithm provides very high processing speed at the expense of accuracy. The device utilization is 149K gates, that is 83% of the available resources.

The work of (Kalomiros & Lygouras 2008) implements a SAD algorithm on an FPGA board featuring external memory and a Nios II embedded processor clocked at 100 MHz. The implementation produces dense 8-bit disparity maps of 320×240 pixels with 32 disparity levels at a speed of 14 fps. The essential resources are about 16K logic elements, whereas, by migrating to more complex devices, the design can easily grow to support better results.

Finally, the same authors in (Kalomiros & Lygouras 2009) present an improved SAD-based algorithm with a fixed 3×3 aggregation window and a hardware median enhancement filter. The presented system can process 640×480 images with 64 disparity levels at 162 fps. The implementation requires 32K logic elements, equivalent to about 63K gates.

FPGA Implementations based on DP

The use of DP is an alternative as well. The implementation presented in (Jeong & Park 2004) uses the DP search method on a trellis solution space. It copes with the case of vergent cameras, i.e. cameras whose optical axes intersect. The images received from a pair of cameras are rectified using linear interpolation and then the disparity is calculated. The architecture has the form of a linear systolic array using simple processing elements. The design is canonical and simple to implement in parallel. The implementation requires 208 processing elements. The resulting system can process 1280×1000 pixel images with up to 208 disparity levels at 15 fps.

An extension of the previous method is presented in (Park & Jeong 2007). The main difference is that data from the previous line are incorporated so as to enforce better inter-scanline consistency. The running speed is 30 fps for 320×240 pixel images with 128 disparity levels. The number of utilized processing elements is 128. The percentage of pixels with a disparity error larger than 1 in the unoccluded areas is 2.63, 0.91, 3.44 and 1.88 for the Tsukuba, Map, Venus and Sawtooth image sets, respectively.



Finally, the work of (Kalomiros & Lygouras 2009) presents a custom parallelized DP algorithm as well. Once again, a fixed 3×3 aggregation window and a hardware median enhancement filter are used. Moreover, interscanline support is utilized. The presented system can process 640×480 images with 65 disparity levels at 81 fps. The implementation requires 270K logic elements, equivalent to about 1.6M gates.

FPGA Implementations based on Phase Methods

Moreover, phase-based techniques can be implemented in hardware as well. The algorithm implemented in (Darabiha et al. 2006) is called Local Weighted Phase-Correlation (LWPC). The hardware implementation of the algorithm turns out to be more than 300 times faster than the software one. The platform used is the Transmogrifier-3A (TM-3A), containing four Xilinx Virtex2000E FPGAs connected via a 98 bit bus. A description of the programmable hardware platform, the base stereo vision algorithm and the design of the hardware can be found in the paper. 66.6K look-up tables (LUT) and 83K flip-flops (FF) are required. This implementation can produce dense disparity maps of 256×360 pixel image pairs with 20 disparity levels and 8 bit sub-pixel accuracy at a rate of 30 fps.

The same LWPC method is used in (Masrani & MacLean 2006). The platform used is the Transmogrifier-4, containing four Altera Stratix S80 FPGAs. The system performs rectification and left-right consistency checking to improve the accuracy of the results. The speed for 640×480 pixel images with 128 disparity levels is 30 fps. The hardware resources demanded are roughly the same as in (Darabiha et al. 2006), thanks to the reuse of the available temporal information of the input video sequence.

2.2.2 ASIC Implementations

On the other hand, ASIC implementation is an option as well, but a more expensive one, except in the case of mass production. The prototyping times are considerably longer and the result is highly process-dependent. Any further changes are difficult and, additionally, time and money consuming. The performance supremacy of ASICs does not, in most cases, justify choosing them. These are the main reasons that make recent ASIC implementation publications rare, in contrast to the FPGA-based ones.

Published works concerning ASIC implementations of stereo matching algorithms (Hariyama et al. 2000, Hariyama, Sasaki & Kameyama 2005) are restricted to the use of SAD. The reported architectures make extensive use of parallelism and seem promising; however, they lack undisputed experimental results.

2.3 Robotic Applications of Stereo Vision

Stereo vision is a tested, useful and popular tool for inferring the depth of a scene with only passive optical sensors. Robotics, on the other hand, evolves rapidly and demands methods that can serve autonomous behaviors, such as obstacle avoidance and SLAM. Within this context, stereo correspondence algorithms need to provide accurate depth maps at real-time frame-rates, confronting, at the same time, any difficulties imposed by the robots' environments. This Section provides an overview of the state of the art regarding vision-based obstacle avoidance and SLAM robotic applications.

2.3.1 Obstacle Avoidance Applications

A wide range of sensors and various methods have been proposed in the relevant literature, as far as obstacle avoidance techniques are concerned. Some interesting details about the developed sensor systems and the proposed detection and avoidance algorithms can be found in (Borenstein & Koren 1990) and (Ohya et al. 1998). Moravec has proposed the Certainty Grid method in (Moravec 1987) and Borenstein (Borenstein & Koren 1991) has proposed the Virtual Force Field method for robot obstacle avoidance. Later, the Elastic Strips method was proposed in (Khatib 1996, 1999), treating the trajectory of the robot as an elastic material in order to avoid obstacles. Moreover, (Kyung Hyun et al. 2008) present a modified Elastic Strip method for mobile robots operating in uncertain environments. Reviews of popular obstacle avoidance algorithms, covering them in more detail, can be found in (Manz et al. 1993) and (Kunchev et al. 2006). Finally, the concept of using fuzzy logic for obstacle avoidance purposes was covered by (Reignier 1994), but only at a theoretical level.

The obstacle avoidance systems found in the literature involve the use of one of, or a combination of, ultrasonic, laser, infrared (IR) or vision sensors (Siegwart & Nourbakhsh 2004). The use of ultrasonic, laser and IR sensors is well-studied and the depth measurements are quite accurate and easily available. However, such sensors suffer either from achieving only low refresh rates (Vandorpe et al. 1996) or from being extremely expensive. On the other hand, vision sensors combine high frame rates and appealing prices.

Stereo vision is often used in vision-based methods, instead of monocular sensors, due to the simpler calculations involved in the depth estimation. Regarding stereo vision systems, one of the most popular methods for obstacle avoidance is the initial estimation of the so-called v-disparity image. This method is applied in order to confront the noise in low quality disparity images (Labayrade et al. 2002, Zhao et al. 2007, Soquet et al. 2007). However, if detailed and noise-free disparity maps were available, less complicated methods could be used instead.
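The v-disparity image itself is straightforward to form, which partly explains its popularity; a minimal sketch follows, assuming an integer disparity map with negative values marking invalid pixels. In the result, the ground plane appears as a slanted line and obstacles as near-vertical segments.

import numpy as np

def v_disparity(disp, max_d):
    """Row y of the output is the histogram of the disparity values found
    along image row y, so the output size is (height, max_d)."""
    h = disp.shape[0]
    out = np.zeros((h, max_d), dtype=int)
    for y in range(h):
        d = disp[y]
        d = d[(d >= 0) & (d < max_d)]      # ignore invalid disparities
        out[y] = np.bincount(d, minlength=max_d)
    return out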

2.3.2 Simultaneous Localization and Mapping Applications

The SLAM problem is that of estimating a robot's position and of progressively building a map of its environment. The issue has been in the focus of the robotics research community for over two decades (Dissanayake et al. 2001). The difficulties of solving the SLAM problem arise from the finite precision of the sensors and actuators of the robot, given real-life situations such as the ones shown in Figure 2.7.

The computational load of SLAM is another problematic aspect. Much effort has been devoted to reducing the demanded computations (Bailey & Durrant-Whyte 2006, Huang et al. 2008). Vision has been used for navigation purposes since the early days of autonomous robotics (Jung 1994).




Fig. 2.7 Mobile robots in real environments

However, recently, vision-based mapping and measurement methods, either stereo or monocular (Lemaire et al. 2007), have become increasingly popular as the cost of cameras decreases. A review of the advances in the field of vision-based SLAM can be found in (Chen et al. 2007).

The success of the solely vision-based SLAM algorithms is, to a large extent, owed to the development of robust feature detection and description methods, such as the scale-invariant feature transform (SIFT) (Lowe 2004) and the speeded-up robust features (SURF) (Bay et al. 2008). A quantitative evaluation of feature extractors for use in vision-based SLAM algorithms can be found in (Klippenstein & Zhang 2007).

Apart from the feature extraction process, the majority of state of the art SLAM algorithms also rely on some kind of progressive probabilistic framework (Durrant-Whyte & Bailey 2006). Davison's work treats the SLAM problem using a robot equipped with an active stereo head, operating in unknown environments. Features are extracted and an extended Kalman filter (EKF) is used (Davison & Kita 2001, Davison & Murray 2002, Davison et al. 2003, Davison 2003, 2007). A real-time EKF implementation able to significantly reduce the computational requirements is presented in (Guivant & Nebot 2001). Finally, Holmes in (Holmes et al. 2009) presents a square root unscented Kalman filter for performing video-rate visual SLAM using a single camera, while keeping the algorithm's complexity low.

On the other hand, an alternative method called FastSLAM has been proposed (Montemerlo 2003, Montemerlo & Thrun 2007, Stentz et al. 2003). This algorithm recursively estimates the full posterior distribution over the robot pose and landmark locations, and scales logarithmically with the number of landmarks in the map.

Furthermore, the use of particle filters has been reported (Moreno et al. 2009). The work of (Sim et al. 2007, Sim & Little 2009) uses a stereo camera to collect data and a Rao-Blackwellised particle filter to solve the SLAM problem. The use of efficient data structures and a hybrid map representation provides precise robot localization and maps of the environment at high frame rates.

Recently, real-time solutions have been the focus of research. The combination of stereo vision and an inertial measurement unit is used in (Zhu et al. 2007), while bundle adjustment is utilized in (Nister et al. 2006). Finally, highly efficient, solely stereo-based methods have also been reported (Agrawal & Konolige 2008, Mei et al. 2009).



2.4 Open Issues of Stereo Vision for Robotic Applications

Despite the fact that stereo vision has been widely used for various robotic applications, common issues related to outdoor exploration have not yet been addressed in a satisfactory manner. Outdoor robotics places strict constraints on the used algorithms (Konolige et al. 2006, Soquet et al. 2007). The lighting conditions in outdoor environments are far from ideal. A stereo camera, which acquires two displaced views of the same scene, is very sensitive to such conditions (Hogue et al. 2007, Klancar et al. 2004). Moreover, the rough terrain and the bounces it causes to a moving robot often decalibrate the cameras of a stereo acquisition array. Autonomous operation demands high processing frame-rates. On the other hand, a robotic platform can provide only limited computational resources, power and payload capacity for the many different onboard applications. These facts differentiate the priorities of stereo vision algorithms intended for use in outdoor operating robots from those listed in (Scharstein & Szeliski 2010). The algorithms listed in the aforementioned site compete on the basis of their accuracy on four perfectly lighted, calibrated and rectified image sets, without any timing or computational constraints. The rest of this Section discusses the state of the art concerning open issues of stereo vision methods when applied to robotic applications.

2.4.1 Simplicity of Computations

The performance of stereo vision algorithms greatly affects the relevant autonomous robotic behaviors. As discussed previously, stereo correspondence algorithms can be coarsely divided into local and global ones. Dense local stereo correspondence methods calculate depth for almost every pixel of the scenery, taking into consideration only a small neighborhood of pixels each time (Scharstein & Szeliski 2002). On the other hand, global methods are significantly more accurate but at the same time more computationally demanding, as they account for the whole image (Torr & Criminisi 2004). Since the most urgent constraint in autonomous robotics is real-time operation, such applications usually utilize the computationally simpler local algorithms (Labayrade et al. 2002, Soquet et al. 2007, Kelly & Stentz 1998, Zhao et al. 2007, Konolige et al. 2006, Agrawal et al. 2007).

Implementing stereo algorithms in hardware can dramatically improve their efficiency. The allure of hardware implementations is that they easily outperform the algorithms executed on a computer. The achieved frame-rates are generally higher. The power consumed by a dedicated hardware platform, e.g. an ASIC or FPGA, is considerably lower than that of a common microprocessor. Moreover, the computational power of the robot’s onboard PCs is left intact. However, the hardware implementation of the already presented algorithms, as already discussed, is not always straightforward. In general, robotics requires computationally simple and easy-to-implement stereo vision algorithms that will provide accurate and reliable results.

2.4.2 Multi-view Stereo Vision

Early work focused on developing stereo algorithms mostly for binocular camera configurations. However, redundancy can lead to more accurate and reliable depth estimations. More recently, due to the significant boost of the available computational power, vision systems using multiple cameras have become increasingly feasible and practical. The transition from binocular to multi-ocular systems has the advantage of potentially increasing the stability and accuracy of depth calculations.

The continuous price reduction of vision sensors has allowed the development of multiple-camera arrays ready for use in many applications. For instance, Yang et al. (Ruigang et al. 2002) used a five-camera system for real-time rendering using modern graphics hardware, while Schirmacher et al. (Schirmacher et al. 2001) increased the number of cameras and built up a six-camera system for on-the-fly processing of generalized Lumigraphs. Moreover, developers of camera arrays have expanded their systems so as to use tens of cameras, such as the MIT distributed light field camera (Yang et al. 2002) and the Stanford multi-camera array (Wilburn et al. 2002). These systems use 64 and 128 cameras, respectively.

Most of the aforementioned camera arrays are utilized for real-time image rendering. On the other hand, a research area that could also benefit from the use of multiple camera arrays is the so-called cooperative stereo vision, i.e., multiple stereo pairs being considered to improve the overall depth estimation results. To this end, Zitnick (Zitnick & Kanade 2000) presented an algorithm for binocular occlusion detection and Mingxiang (Mingxiang & Yunde 2006) expanded it to trinocular stereo.

2.4.3 Uncalibrated Stereo Images

The two alternatives for efficiently estimating disparity are either to precisely align the stereo camera rig and then perform the demanded rectification (leading to simple scanline searches), or to have an arbitrary stereo camera setup and avoid any calibration (performing searches throughout blocks). Accurately aligned stereo devices are very expensive, as they demand calibration of a series of factors at micrometer scale (Gasteratos & Sandini 2002). On the other hand, non-ideal stereo configurations usually produce inferior results, as they fail to satisfy the epipolar constraint.

The issue of processing uncalibrated images, common to applications where the sensory system is not explicitly specified, is an open one. The plethora of computations most commonly requires the massive parallelization found in custom-tailored hardware implementations. Contemporary powerful graphics machines are able to achieve better results in terms of processing time and data volume. Stereo vision algorithms able to process uncalibrated input images (Zach et al. 2004, Jeong & Park 2004, Masrani & MacLean 2006, Park & Jeong 2007) have also been discussed previously in this Chapter.

2.4.4 Non-ideal Lighting Conditions

The vast majority of stereo correspondence algorithms use some kind of intensity-based metric as the basis of their dissimilarity measure function (Scharstein & Szeliski 2002). The most common methods, due to their simplicity and real-time performance, are the SAD and the SSD. The correctness of their results rests on the assumption that the same feature in the two stereo images should ideally have the same intensity. However, this assumption is often not valid. Even if the gains of the two cameras are perfectly tuned, so as to result in the same intensity for the same features in both images, the fact that the two cameras shoot from a different pose might result in different intensities for the same point, due to shading. Moreover, in real environments, which is the case for robotic applications, the illumination is not ideal (Klancar et al. 2004, Hogue et al. 2007). This fact leads to large variations of the intensity values for the same features between the two images of a stereo pair. Such a situation is shown in the stereo pair of Figure 2.8. The sun in the left image is hidden by a bush, whereas in the right image it directly faces the camera. Thus, the non-ideal illumination causes failures during an intensity-based correspondence procedure.

Fig. 2.8 A real-life stereo pair suffering from different illumination: (a) left image, (b) right image
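For illustration, the following minimal Python/NumPy sketch of a window-based SAD cost (an assumption of this rewrite, not code from the thesis) makes the sensitivity explicit: any brightness change in one image alters every cost value, so the best-matching disparity can shift even though the scene geometry is unchanged.

import numpy as np

def sad_cost(left, right, y, x, d, w=5):
    # SAD between the (2w+1)x(2w+1) window centered at (y, x) in the left
    # image and the window shifted by disparity d in the right image.
    lw = left[y - w:y + w + 1, x - w:x + w + 1].astype(np.float64)
    rw = right[y - w:y + w + 1, x - d - w:x - d + w + 1].astype(np.float64)
    return np.abs(lw - rw).sum()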

2.4.5 Biologically Inspired Methods

The success of the HVS in obtaining depth information from two 2D images still remains a goal to be accomplished by machine vision. Incorporating procedures and features of the HVS into artificial stereo-equipped systems could improve their performance. The key concept behind this transfer of know-how from nature to science is identifying, understanding and expressing the basic principles of natural stereoscopic vision, aiming to improve the state of the art in machine vision. These principles are mainly involved in the aggregation step that most existing algorithms employ.

The HVS has been studied by many branches of the scientific community. Physics has expressed color information through color spaces, while biology has investigated the response and the physiology of the eyes. Psychophysics has studied the relationship between changes in individual stimuli and the perceived intensity, which is applicable to vision as well as to all the other modalities. On the other hand, the gestalt school of psychology suggested grouping as the key for interpreting human vision.

Often, biological and psychological findings are incorporated in the expression of correlation functions. Real life is the ultimate resource for finding the right solutions in many fields of robotics, computer science and electronics (Mead 1990, Shimonomura et al. 2008, Berthouze & Metta 2005). The natural selection process is a strict judge that favors the more effective solutions for each problem. Of course, our understanding of the solutions that emerged from natural selection comes mainly from the sciences of biology, psychology and neuroscience. Applying ideas borrowed from these sciences to technological problems can lead to very effective results. Consequently, further blending of biological and psychological findings with computer vision indicates a promising direction towards simple and accurate computer vision algorithms.


Chapter 3

Stereo Correspondence Algorithms

This Chapter presents the new stereo correspondence algorithms developed within this thesis. Each of the presented algorithms aims to confront some of the open issues identified in the previous Chapter.

3.1 Stereo Correspondence Algorithm with Enhanced Disparity Selection

In this Section an effective, hardware-oriented stereo correspondence algorithm, able to produce dense disparity maps of improved fidelity, is presented. The presented algorithm combines rapid execution and a simple, straightforward structure with comparably high quality of results. These features render it an ideal candidate for hardware implementation and for real-time applications. The algorithm utilizes the AD as matching cost and aggregates the results inside support windows, assigning Gaussian-distributed weights to the support pixels, based on their Euclidean distance. The resulting DSI is further refined by CA acting in all three dimensions of the DSI.

The main merit of the presented algorithm is its simplicity, rendering it an ideal choice for real-time operation and hardware implementation. Its structural elements are summarized as:

1. AD is utilized as the matching cost function, since it is the simplest one, involving no multiplications.

2. The aggregation step is a 2D process performed inside fixed-size square support windows upon a slice of the DSI. The pixels inside each support window are assigned a Gaussian-distributed weight during aggregation. The weight of each pixel is a Gaussian function of its Euclidean distance from the central pixel of the current window (see the sketch after this list).

3. The resulting aggregated values of the DSI are further refined by applying CA. The CA are used inside the 3D DSI, and not as a 2D post-processing disparity map filter (Kotoulas, Gasteratos, Sirakoulis, Georgoulas & Andreadis 2005).

4. Finally, the best disparity value for each pixel is decided by a WTA selection step.
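As a rough illustration of steps 1, 2 and 4, the following Python/NumPy sketch (an assumption of this rewrite, not the thesis’ implementation; the Gaussian’s sigma is an illustrative value) builds the AD-based DSI, aggregates it inside an 11×11 Gaussian distance-weighted window, and selects disparities by WTA:

import numpy as np
from scipy.ndimage import convolve

def gaussian_mask(w=5, sigma=2.0):
    # Weight of each support pixel: a Gaussian function of its Euclidean
    # distance from the window center.
    ax = np.arange(-w, w + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2))

def disparity_map(left, right, dmax, w=5, sigma=2.0):
    h, width = left.shape
    mask = gaussian_mask(w, sigma)
    dsi = np.full((h, width, dmax + 1), np.inf)
    for d in range(dmax + 1):
        ad = np.abs(left[:, d:] - right[:, :width - d])    # AD matching cost
        dsi[:, d:, d] = convolve(ad, mask, mode="nearest") # weighted aggregation
    return dsi.argmin(axis=2)                              # WTA selection

Since the same mask is applied at every disparity level, leaving it unnormalized does not change the WTA outcome.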

This algorithm was not intended to achieve excellence of results but to provide a simple-to-implement, fast-to-execute, yet credible stereo correspondence methodology. In this way, the presented algorithm can be executed in real-time and be easily implemented in hardware, as demanded by many applications.


Fig. 3.2 2D Gaussian mask producing the weight for the pixel summation

The resulting aggregated values of the DSI are further refined by applying CA. All cells can work in parallel and, as a result, the used CA can be easily implemented in hardware. Two CA transition rules are applied to the DSI. The values of the parameters they use were determined, after extensive testing, so as to perform best. The first rule attempts to resolve disparity ambiguities. It checks for excessive consistency of results along the disparity axis d and, if necessary, corrects on the perpendicular (i, j) plane. The second rule is employed in order to smoothen the results and at the same time preserve the details. It checks and acts on constant-disparity planes. The two rules can be expressed as:

1. If at least one of the two cells lying on either side of a cell along the disparity axis d differs from the central cell by less than half of its value, then the central cell’s value is further aggregated within its 3×3, constant-disparity neighborhood.

First CA rule (the pseudocode, made concrete in Python):

if (abs(DSI[i, j, d] - DSI[i, j, d - 1]) < 0.5 * DSI[i, j, d]
        or abs(DSI[i, j, d] - DSI[i, j, d + 1]) < 0.5 * DSI[i, j, d]):
    # Average the cost over the 3x3 constant-disparity neighborhood.
    DSI[i, j, d] = DSI[i - 1:i + 2, j - 1:j + 2, d].sum() / 9.0

2. If at least 7 cells in the 3×3 neighborhood differ from the central cell by less than half of the central cell’s value, then the central cell’s value is scaled down by a factor of 1.3, as dictated by exhaustive testing.

Second CA rule (the pseudocode, made concrete in Python):

count = 0
for m in (-1, 0, 1):
    for n in (-1, 0, 1):
        if (m, n) != (0, 0) and \
           abs(DSI[i + m, j + n, d] - DSI[i, j, d]) < 0.5 * DSI[i, j, d]:
            count += 1       # this neighbor agrees with the central cell
if count >= 7:
    DSI[i, j, d] /= 1.3      # scale the central cell down
The two rules are applied once. Their outcome comprises the enhanced DSI from which the optimum disparity map is chosen by a simple, non-iterative WTA final step.

In the last stage the best disparity value for each pixel is decided by a WTA selection procedure. For each image pixel with coordinates (i, j), the smallest value along the d axis is searched for, and its position is declared to be the pixel’s disparity value. That is:

D(i, j) = arg min_d DSI(i, j, d)    (3.2)

3.1.2 Experimental Results

The algorithm was applied to standard image sets, as well as to self-recorded real-life ones, in order to be evaluated. Results are presented in terms of calculated images and quantitative metrics.

Standard Image Sets

The standard image sets used were the four stereo images (Scharstein & Szeliski 2002, 2003) provided, along with their corresponding ground truth disparity maps, by Scharstein and Szeliski through their web site (Scharstein & Szeliski 2010). Figure 3.3 depicts the reference (left) images (a), the provided ground truth disparity maps (b), the disparity maps calculated by the presented method (c), maps of signed disparity error where the middle (50%) gray tone equals zero error (d), and maps of pixels with absolute computed disparity error bigger than 1 shown in black (e). The percentages of pixels whose absolute disparity error is greater than 1 in the non-occluded regions, in all regions, and in regions near discontinuities and occlusions are presented in Table 3.1. The presented algorithm leaves uncalculated a frame around the image whose width is equal to the aggregation window width, i.e. 11 pixels. Thus, the results of Table 3.1 slightly underestimate the performance of the presented algorithm, except for the case of the Tsukuba image set, where the ground truth itself ignores that frame as well.

Table 3.1 Percentage of pixels whose absolute disparity error is greater than 1 in various regions of the images

Pair      Non-occluded (%)   All (%)   Discontinuities (%)
Tsukuba   10.3               12.3      23.5
Venus      8.86              10.2      35.8
Teddy     24.5               31.5      35.2
Cones     20.6               28.8      31.1
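The bad-pixel percentages of Table 3.1, as well as the NMSE reported in Table 3.2 below, can be computed along the lines of the following sketch; the NumPy interface and, in particular, the NMSE normalization are assumptions of this rewrite, since the thesis does not spell the formula out:

import numpy as np

def bad_pixel_percentage(disp, gt, mask, thresh=1.0):
    # Middlebury-style measure: share of pixels inside `mask` whose
    # absolute disparity error exceeds `thresh` (here 1 pixel).
    err = np.abs(disp[mask] - gt[mask])
    return 100.0 * (err > thresh).mean()

def nmse(disp, gt):
    # One common normalization (assumed here): mean squared error divided
    # by the mean squared ground-truth disparity.
    return ((disp - gt) ** 2).mean() / (gt ** 2).mean()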

Table 3.2, on the other hand, presents the Normalized Mean Square Error (NMSE) for the calculated disparity maps of the four image sets, excluding the 11-pixel-wide frame. The NMSE is calculated both for a simplified version of the presented algorithm, which makes no use of CA, and for the complete version of the algorithm. The addition of CA substantially improves the quality, as shown by the last column.


Fig. 3.3 Results for the Middlebury data sets. From left to right: the Tsukuba, Venus, Teddy and Cones images. From top to bottom: the reference (left) images (a), the provided ground truth disparity maps (b), the disparity maps calculated by the presented method (c), maps of signed disparity error (d), and maps of pixels with absolute computed disparity error bigger than 1 (e)

Self-recorded Image Sets

The presented algorithm was also applied to two self-recorded real-life stereo pairs. The pairs were captured using a Point Grey Research Bumblebee2 stereo camera system and their size is 512×384 pixels. Two scenes were captured, one outdoor and one indoor. Performance on everyday scenes, even if generally ignored in favor of synthetic datasets, is very important for a system that aspires to be used in robotic applications. The two stereo pairs, along with the calculated disparity maps, are presented in Figure 3.4.

Table 3.2 Calculated NMSE for various versions of the presented algorithm

Data Set   NMSE without CA   NMSE with CA   Improvement (%)
Tsukuba    0.0627            0.0593          5.42
Venus      0.0545            0.0447         17.98
Teddy      0.1149            0.1108          3.57
Cones      0.0809            0.0768          5.07

Fig. 3.4 Self-recorded scenes: (a) outdoor scene, (b) indoor scene. From left to right: left image, right image, calculated disparity map

The results are acceptable, considering that the main merit of the presented algorithm is its simplicity and effectiveness, in conjunction with its ability to be easily implemented in hardware and to run in real-time. A comparison of the aforementioned results to those of other methods listed in (Scharstein & Szeliski 2010) shows that they are comparable to those of the corresponding simple-structured algorithms.

3.1.3 Discussion

The presented algorithm exhibits satisfactory performance despite its simple structure. Gaussian-weighted aggregation and CA refinement inside the DSI have been proven to comprise an effective computational combination. Disparity maps of standard image sets, as well as of self-recorded ones, were calculated. The data show that the presented algorithm is a step in the right direction for a hardware-implementable, real-time solution. However, the quality of the results could be further improved by refining the applied CA rules. The possibilities concerning the nature and the number of the applied CA rules are practically endless, and the chosen ones, although effective, are only one of those possibilities. The presented algorithm’s ability to calculate disparity maps of real-life scenes is highly valued. Finally, it can be concluded that the algorithm’s serial flow and low complexity, combined with the presented satisfactory results, render it an appealing candidate for hardware implementation. Thus, depth calculation could be performed efficiently in real-time by autonomous robotic systems.

3.2 Quad-view Stereo Correspondence Algorithm

This Section proposes a quad-camera based system able to calculate, fast and accurately, a single depth map of a scenery. The four cameras are placed on the corners of a square. Thus, three differently oriented stereo pairs result when considering a single reference image (namely a horizontal, a vertical and a diagonal pair). The presented system applies a slightly modified version of the stereo correspondence algorithm presented in the previous Section to each stereo pair. This way, the computational load is kept within reasonable limits. A reliability measure is used in order to validate each point of the resulting disparity maps. Finally, the three disparity maps are fused together according to their reliabilities. The maximum reliability is chosen for every pixel. The final output of the presented system is a highly reliable depth map which can be used for higher-level robotic behaviors.
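As a rough sketch of this fusion rule (an assumption of this rewrite; the certainty measure itself is the one the thesis defines via Equation 3.3, which is not reproduced here), keeping for every pixel the disparity of the most reliable pair can be written as:

import numpy as np

def fuse_disparities(disps, certs):
    # disps, certs: three HxW maps each (horizontal, vertical, diagonal).
    # For every pixel keep the disparity whose certainty is maximal.
    disps = np.stack(disps)         # shape (3, H, W)
    certs = np.stack(certs)
    best = certs.argmax(axis=0)     # index of the most certain pair per pixel
    return np.take_along_axis(disps, best[None], axis=0)[0]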

3.2.1 Algorithm Description

The presented system is a combination of sensory hardware and a custom-tailored software algorithm. The hardware configuration, i.e. the four cameras’ formation, produces three stereo image pairs. Each pair is submitted to the simple and rapid stereo correspondence algorithm, resulting, thus, in a disparity map. For each disparity map a certainty map is calculated, indicating each pixel’s reliability. Finally, the three disparity maps are fused according to their certainties for each pixel. The outcome is a single disparity map which incorporates the best parts of the disparity maps that produced it. The combined hardware and software system is able to produce accurate dense depth maps at frame rates suitable for autonomous robotic applications.

Hardware Sensory System

The sensory configuration of the presented system consists of four identical cameras. The four cameras are placed so that their optical axes have parallel orientations and their principal points are co-planar, residing on the corners of the same square, as shown in Figure 3.5(a). The images captured by the upper-left camera are considered as the reference images of each tetrad. Each one of the other three cameras produces images to be corresponded to the reference images. Thus, for each tetrad of images three differently oriented stereo pairs result, i.e. a horizontal, a vertical and a diagonal one. The concept, as well as the result, of such a group of cameras is presented in Figure 3.5(b).

Fig. 3.5 (a) The quad-camera configuration and (b) the results (up-left) and scene capturing (right) using the quad-camera configuration

Software Architecture

The presented algorithm consists of two processing steps. The first one is the stereo correspondence algorithm that is applied to each image pair. Then, during a fusion step, the results for all the stereo pairs are merged.

Stereo correspondence algorithm

The presented system utilizes a custom-tailored, simple, rapidly executed stereo correspondence algorithm applied to each stereo pair. Stereo disparity is computed using a three-stage local stereo correspondence algorithm. The algorithm utilized is a slightly modified version of the algorithm presented in Section 3.1. The only difference from the aforementioned stereo correspondence algorithm has to do with the dimensions of the chosen aggregation window. Noise suppression is very important for stereo algorithms that are intended to be applied to outdoor scenes. Outdoor images, which is often the case for autonomous navigation tasks, usually suffer from noise induced by a variety of causes, e.g. lighting differences and reflections. The aggregation window dimensions used in the presented algorithm are bigger, i.e. 13 × 13 pixels. This choice is a compromise between real-time execution speed and sufficient noise cancellation. Overall, the used stereo correspondence algorithm combines low computational complexity with sophisticated data processing. Consequently, it is able to produce dense disparity maps of good quality at frame rates suitable for robotic applications.


Fig. 3.6 Algorithm’s steps and results for the Tsukuba data set. (column 1) the reference image (up-left), (column 2) the three target images (up-right, down-left, down-right), (column 3) the certainty maps for the horizontal, vertical and diagonal pair, (column 4) the computed disparity map for each stereo pair, (column 5) the fused (top) and the ground truth (bottom) disparity maps

Figure 3.7 shows the experimental results of the presented quad-camera algorithm (left), the computationally equivalent simple stereo algorithm (middle) and the utilized single stereo algorithm applied on the horizontal stereo pair (right). The first row shows the calculated disparity maps. The second row shows the maps of pixels with absolute computed disparity error bigger than 1, shown in black. Finally, the third row presents maps of signed disparity error, where the middle (50%) gray tone equals zero error. It is obvious that the simple stereo algorithm, shown in the rightmost column, suffers from noise. The usual confrontation of this issue is to enlarge the utilized 13 × 13 pixel aggregation window during the respective stage. However, window enlargement generally leads to loss of detail and coarse results, as shown in the middle column. This version of the algorithm utilizes a 23 × 23 pixel aggregation window, which results in triple the computational load. Obviously, both of these treatments lack the result quality of the presented method. The final result of the presented algorithm requires roughly the same computational power as the algorithm in the middle column. The outcome is that the presented quad-camera algorithm achieves better results than its computationally equivalent simple two-camera stereo counterpart and than the simple initial stereo algorithm.

The percentages of pixels whose absolute disparity error is greater than 1 in the non-occluded regions, in all regions, and in regions near discontinuities and occlusions are presented in Table 3.3. The presented percentages refer to the three initially computed stereo pairs (namely the horizontal, vertical and diagonal pair), the final fused result of the presented system and, finally, the computationally equivalent two-camera stereo correspondence algorithm.

As shown in Table 3.3, there are cases where the results of the fusion process are marginally worse than those of an initial step. However, the image pair direction that provides the optimum results, and should be considered the most reliable and useful, cannot be anticipated in advance. Moreover, the optimum direction is arbitrary and, therefore, there is little chance that it coincides with any of the three directions available in the presented system throughout the whole scene. However, the goal of the fusion system is to identify the best disparity value for every pixel. Thus, the results will be roughly as accurate as, or occasionally even more accurate than, the best initial results. On the other hand, the final disparity map is, in any case, far more reliable than the initial ones, since it has gone through a validation procedure, guaranteed by Equation 3.3.

Fig. 3.7 Results of the presented fusion system (left), the computationally equivalent simple stereo algorithm (middle) and the preliminary simple stereo algorithm applied on the horizontal image pair (right). From top to bottom: the computed disparity maps, pixels with absolute computed disparity error bigger than 1, and maps of signed disparity error

Table 3.3 Percentage of pixels whose absolute disparity error is greater than 1 in various regions for the Tsukuba pairs

Pair         Non-occluded (%)   All (%)   Discontinuities (%)
Horizontal   16.2               18.1      29.9
Vertical     12.5               13.8      35.1
Diagonal     10.7               12.4      32.3
Presented    10.8               12.6      31.5
Equivalent   15.8               17.6      33.9

The presented algorithm has also been applied to a virtual scenery. A virtual quad-camera system was inserted into the virtual room shown in the first two columns of Figure 3.8 and the demanded tetrad of images was captured. The room scene was chosen as it is a complex and demanding one, having both regions with fine details and low-textured ones. Moreover, the repetitive pattern of the books in the background is a challenging element for stereo correspondence algorithms. Figure 3.8 depicts the reference, i.e. up-left, image in the first column and the three target images, i.e. up-right, down-left and down-right, in the second column. The third and fourth columns show the certainty and disparity maps calculated for the image pairs consisting of the single reference image and the corresponding target ones. Finally, the fifth column of the figure shows the fused final disparity map.

Fig. 3.8 Algorithm’s steps and results for a synthetic room scene. (column 1) The reference image (up-left), (column 2) the three target images (up-right, down-left, down-right), (column 3) the certainty maps for the horizontal, vertical and diagonal pair, (column 4) the computed disparity map for each stereo pair, (column 5) the final fused depth map

The availability of reliable depth maps is the cornerstone of many computer vision, as well as robotic, applications. Figure 3.9(a) shows a screenshot of the 3D-reconstructed Tsukuba scene. The depth map of Figure 3.6, obtained using the presented method, was utilized in order to add the third dimension’s information to the reference image. Thus, a 3D model of the scene was reconstructed and a computer user can virtually navigate around the scene. On the other hand, Figure 3.9(b) shows an obstacle detection application based on the availability of a reliable depth map. Stereo vision can be used by autonomous robotic platforms in order to reliably detect obstacles within their movement range and move accordingly. The previously obtained depth map of Figure 3.8 was used for the calculation of the v-disparity image. Using the Hough transform, the floor plane was calculated and the obstacles were detected. This result is useful for any path-planning algorithm.
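For readers unfamiliar with the representation, the v-disparity image can be built by histogramming each row of the disparity map, as in the sketch below (an assumption of this rewrite, not the thesis implementation); in such an image the ground plane maps to a slanted line that the Hough transform can recover, while obstacles appear as near-vertical segments.

import numpy as np

def v_disparity(disp, dmax):
    # disp: integer-valued HxW disparity map. Entry (v, d) of the
    # v-disparity image counts how many pixels of image row v have disparity d.
    h = disp.shape[0]
    vdisp = np.zeros((h, dmax + 1), dtype=np.int32)
    for v in range(h):
        row = disp[v].astype(np.int64).clip(0, dmax)
        vdisp[v] = np.bincount(row, minlength=dmax + 1)
    return vdisp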


Fig. 3.9 Application results obtained using the calculated depth maps. (a) View of the reconstructed Tsukuba scene and (b) obstacle detection in the virtual room scene

3.2.3 Discussion

A depth-computing system aimed at autonomous robotic applications has been presented. The system utilizes a square formation of four identical cameras capturing the same scene. Selecting one of the images of each tetrad as reference, three image pairs result. Each pair is processed by a simple and rapid custom stereo correspondence algorithm, which results in an initial disparity map as well as in a certainty map. A fusion process evaluates the three initial disparity maps according to their certainty and produces the final combined disparity map.

Autonomous robotic applications demand reliable depth estimations obtained at real-time frame rates, while only limited computational resources are available. The presented system substitutes a special sensor configuration for computational complexity. However, the demanded configuration can be achieved easily and cost-efficiently. The presented results exhibit a fair compromise between the objectives of low computational complexity and result reliability.

The accuracy of local algorithms in various regions of a scene is strongly correlated with the orientation of the depicted objects in that particular region relative to the orientation of the correspondence search procedure. That is, depth discontinuities are more discriminable when they are oriented vertically to the correspondence search direction. This conclusion follows from the inherent way local algorithms operate and can be confirmed by the preliminary disparity maps presented in the fourth column of Figure 3.6 and Figure 3.8. The presented system has the advantage of being able to adapt to various object orientations. The result is that the final fused disparity map is at least as accurate as the most accurate of the initial disparity maps and, at the same time, much more reliable than any of them. Moreover, the structure of the presented software architecture is ideal for execution on the nowadays widely available quad-core processors. Each one of the identical but separate stereo correspondence searches can be assigned to a core, while the fourth core supervises the whole procedure.


3.3 Hierarchical Stereo Correspondence Algorithm for Uncalibrated Images

In motion estimation, the sub-pixel matching technique involves searching sub-sample positions as well as integer-sample positions between the image pairs, choosing the one that gives the best match. Based on this idea, the presented algorithm proposes an estimation method which performs a 2D correspondence search using a hierarchical search pattern. The intermediate results are refined by 3D CA. The disparity value is then defined using the horizontal distance of the matching position. Therefore, the presented algorithm can process uncalibrated and non-rectified stereo image pairs, maintaining the computational load within reasonable levels.

This stereo vision algorithm is inspired by recent motion estimation techniques. The presented algorithm has been adapted to the demands of contemporary outdoor robotic applications. It is based on a fast-executing SAD core for correspondence search in both the vertical and the horizontal directions of the input images. The results of this core are enhanced using sophisticated computational techniques; Gaussian-weighted aggregation and 3D CA rules are used, similarly to Section 3.1. The hierarchical iteration of the basic stereo algorithm is achieved using a fuzzy scaling technique (Amanatiadis et al. 2008). The aforementioned characteristics provide improved quality of results, while remaining easy to implement in hardware. As a result, the presented algorithm is able to cope with uncalibrated input images.

The presented scheme is block-matching based and does not perform scanline pixel matching. As a result, it requires neither camera calibration nor image rectification. However, it is clear that block matching approaches require more computational resources, since the number of pixels to be considered is greatly increased. In order to address this problem, the presented algorithm is a variation of a motion estimation algorithm (Yin et al. 2003) which is used for JVT/H.264 video coding (Wiegand et al. 2003). The adaptation of compression motion estimation algorithms into disparity estimation schemes can be effective both in accuracy and in complexity terms, since compression algorithms also attempt to achieve complexity reduction while maintaining coding efficiency. On the other hand, CA have been employed as an intelligent and efficient way to refine and enhance the stereo algorithm’s intermediate results.

3.3.1 Algorithm Description

The algorithm presented in this Section is an extension of the algorithm found in Section 3.1. The original algorithm has been extended so as to perform a two-dimensional matching search instead of a one-dimensional one. The search is performed hierarchically in three steps, incrementally improving the match precision, while avoiding the computational load of a full two-dimensional search scheme.

Stereo Correspondence Algorithm

The presented system utilizes a simple, rapidly executed stereo correspondence algorithm applied to each stereo pair. The matching cost function utilized is the AD, performed in both dimensions of the image.


Fig. 3.10 Quadruple, double and single pixel sample matching algorithm

Fig. 3.11 General scheme of the presented hierarchical matching disparity algorithm. The search block is enlarged for viewing purposes

The general scheme of the presented hierarchical matching disparity algorithm between a stereo image pair is shown in Figure 3.11. Each of the intermediate disparity maps of the first two steps is used as the initial condition for the succeeding, refining correspondence search.

In order to perform the hierarchical disparity search, three different versions of the input images are employed and the stereo correspondence algorithm is applied to each of these three pairs. The quadruple search step is performed as a normal pixel-by-pixel search on a quarter-size version of the input images. That is, each of the initial images has been down-sampled to 25% of its initial dimensions. The quadruple search is performed by applying the stereo correspondence algorithm in (D/4) × (D/4) search regions on the down-sized image pair (D being the maximum expected horizontal disparity value in the original image pair). The choice of the maximum searched disparity D/4 is reasonable, as the search is performed on a 1/4 version of the original images.

The window value 2w + 1 used in this stage is 9, i.e. w = 4. Once the best match is obtained for each pixel, another correspondence search is performed in 3 × 3 search regions on a half-size version of the initial image pair. Thus, the double pixel search is performed on a 50% down-sampled version of the input images, with the window dimension 2w + 1 being 15, i.e. w = 7. Finally, the single pixel matching is performed in 3 × 3 regions on the original input pair. The window value 2w + 1 used in this final stage is 23, i.e. w = 11. The block diagram of the presented algorithm is shown in Figure 3.12. The choice of 3 × 3 search regions for the last two steps of the hierarchical pattern can be explained as follows. The first stage is expected to find the best match for each pixel. As the next stage uses another version of the same image with double dimensions, the initially matched pixel could have been mapped to any pixel of a 3 × 3 neighborhood in the bigger version of the image.
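The coarse-to-fine flow can be summarized by the sketch below. It is illustrative only: correspondence_search is a hypothetical stand-in for the Section 3.1 algorithm extended to 2D search regions, and the nearest-neighbor resize merely stands in for the fuzzy scaling technique of (Amanatiadis et al. 2008).

import numpy as np

def resize(img, factor):
    # Nearest-neighbor rescaling (a simple stand-in for fuzzy scaling).
    h, w = img.shape
    ys = (np.arange(int(h * factor)) / factor).astype(int).clip(0, h - 1)
    xs = (np.arange(int(w * factor)) / factor).astype(int).clip(0, w - 1)
    return img[np.ix_(ys, xs)]

def hierarchical_disparity(left, right, D, correspondence_search):
    # Quadruple step: quarter-size images, (D/4) x (D/4) region, w = 4.
    d4 = correspondence_search(resize(left, 0.25), resize(right, 0.25),
                               region=(D // 4, D // 4), w=4, init=None)
    # Double step: half-size images, 3 x 3 region around the upscaled
    # (and therefore doubled) disparity guess, w = 7.
    d2 = correspondence_search(resize(left, 0.5), resize(right, 0.5),
                               region=(3, 3), w=7, init=2 * resize(d4, 2.0))
    # Single step: full-size images, 3 x 3 region, w = 11.
    return correspondence_search(left, right,
                                 region=(3, 3), w=11, init=2 * resize(d2, 2.0))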

Fig. 3.12 Block diagram of the hierarchical disparity search algorithm

From the block diagram it is obvious that up-scaling and down-scaling play a critical role in the whole hierarchical process. These two image transformations are realized by interpolation algorithms. Image interpolation can be described as the process of using known data to estimate values at unknown locations. The interpolated value f(x) at coordinate x in a space of dimension q can be expressed as a linear combination of samples f_k evaluated at integer coordinates k = (k1, k2, ..., kq) ∈ Z^q.
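As a concrete, deliberately simple instance of such a linear combination for q = 2, bilinear interpolation weighs the four surrounding integer-coordinate samples; this example is illustrative only and is not the fuzzy scaling technique actually employed:

import numpy as np

def bilinear(img, y, x):
    # Interpolated value at real-valued (y, x) as a linear combination of
    # the four neighboring integer-coordinate samples.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] +
            (1 - dy) * dx * img[y0, x0 + 1] +
            dy * (1 - dx) * img[y0 + 1, x0] +
            dy * dx * img[y0 + 1, x0 + 1])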


The rival algorithms that have been tested are those presented in (Ogale & Aloimonos 2007) and (Yoon & Kweon 2006a). Their results for the same distorted image set are shown in Figure 3.13(f) and Figure 3.13(g), respectively. The first of them proposes a compositional approach to unify many early visual modules, such as segmentation, shape and depth estimation, occlusion detection and local signal processing. The second one features an adaptive support weight aggregation scheme based on pixels’ color similarity and geometric proximity. Both rival algorithms are state-of-the-art local ones, since global algorithms, even though more accurate, are generally not suitable for real-time robotic applications. The results obtained by the presented algorithm are clearly better than those of the other two algorithms. This is due to the lack of calibration of the two input images, which can be handled by the presented algorithm but not by the other two.

The final result of the presented algorithm, Figure 3.13(e), has the same dimensions as the input images, while the previous ones have half and a quarter of their dimensions, respectively. A full search algorithm would require D × D calculations for every pixel. On the other hand, the presented algorithm performs only (D/4) × (D/4) + 3 × 3 + 3 × 3 calculations. Considering D = 32, it can be found that the presented algorithm is 15.7 times less computationally demanding.

Additionally, the presented algorithm has been applied to four commonly used image sets. Once again, the image sets were manually distorted with the use of special commercial software, simulating the radial distortion of an optical lens. The induced distortion was 10% for all four image pairs, as well as for their given ground truth disparity maps. The tested distorted image pairs, as well as the calculated disparity maps, are shown in Figure 3.14.

The results shown in Figure 3.14 were compared with the respective ground truth disparity maps, which had been distorted to the same degree as the input images, i.e. 10%. For each distorted image set, the NMSE was calculated as a quantitative measure of the algorithm’s behavior. Moreover, the presented algorithm was applied to the original, undistorted versions of the image sets, and the NMSE was once more calculated. A typical stereo correspondence algorithm would have been able to cope with the undistorted images, but it would have failed to process the distorted ones; the variation of performance would have been significant, and always in favor of the undistorted image pairs. In Table 3.4 the calculated NMSE for the presented algorithm is given, when applied to the distorted and the original versions of the four image sets. The last column presents the percentage of variation, where positive values indicate better results on the original image sets, while negative values indicate better results on the distorted image sets. It is evident that the presented algorithm is not affected by the presence of non-calibration effects in the processed images.

Table 3.4 Calculated NMSE for the presented algorithm for various pairs with constant distortion 10%

Pair      NMSE (Distorted)   NMSE (Original)   Variation (%)
Tsukuba   0.0712             0.0781            -0.097
Venus     0.0491             0.0461            +0.061
Teddy     0.1098             0.0976            +0.111
Cones     0.0500             0.0519            -0.038

The manually induced lens distortion percentage, i.e. 10%, was chosen as a typical value. However, the performance of the presented algorithm was also tested for various values of induced lens distortion. Seven versions of the Tsukuba image set were prepared and tested. In Figure 3.15 the two distorted input images, as well as the calculated disparity maps, are shown for various percentages of distortion. The calculated NMSE for each version is given in Table 3.5 and these results can be visually assessed in Figure 3.16. It can be deduced that the presented algorithm exhibits stable behavior over a large range of distortion values.

Fig. 3.13 (a), (b) The uncalibrated, diagonally captured input images and the resulting disparity maps of the presented algorithm for (c) the quadruple, (d) double and (e) single pixel estimation respectively. The result of (f) Ogale & Aloimonos (2007) and (g) Yoon & Kweon (2006a) for the same input images


Fig. 3.14 From left to right: the left and right 10% distorted input images and the calculated final disparity map for (from top to bottom) the Tsukuba, Venus, Teddy and Cones image sets respectively


Fig. 3.15 (from left to right) The left and right distorted input images and the calculated final disparity maps for various percentages of induced lens distortion: (a) 0%, (b) 2.5%, (c) 5%, (d) 7.5%, (e) 10%, (f) 12.5%, (g) 15%


Table 3.5 Calculated NMSE for the presented algorithm for the Tsukuba pair with various distortion percentages

Distortion (%)   NMSE
 0.0             0.0781
 2.5             0.0712
 5.0             0.0708
 7.5             0.0663
10.0             0.0712
12.5             0.0723
15.0             0.0761

Fig. 3.16 The NMSE for the Tsukuba image pair for various distortion percentages

Self-captured Image Sets

Furthermore, the algorithm has been applied to the self-captured image pairs shown in Figure 3.17 - Figure 3.19. The used pairs suffer from typical outdoor environment issues. Apart from being shot from cameras displaced both horizontally and vertically, with non-parallel directions, they involve textureless areas and difficult lighting conditions. Moreover, examination of Figures 3.18(a), 3.18(b) and Figures 3.19(a), 3.19(b) reveals that the different positions of the cameras result in lighting and chromatic differences.

3.3.3 Discussion<br />

The disparity estimation technique presented is able to process input images from uncalibrated stereo cameras and at the same time retain low computational complexity. The hierarchical search scheme is based on the JVT/H.264 motion estimation algorithm, initially developed for video coding. The presented algorithm searches for stereo correspondences inside D × D search blocks, requiring, however, significantly fewer computations than a typical full search.
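As an illustration of the general coarse-to-fine principle only (the block size, pyramid depth and the +/-1 refinement rule below are illustrative choices, not the exact parameters of the presented algorithm), consider:

import numpy as np

def sad(a, b):
    # Sum of absolute differences between two equally sized blocks.
    return float(np.abs(a.astype(np.float64) - b.astype(np.float64)).sum())

def hierarchical_disparity(left, right, block=8, levels=3, max_d=32):
    # A full search is performed only at the coarsest pyramid level;
    # each finer level merely refines the doubled coarse estimate.
    pl, pr = [left], [right]
    for _ in range(levels - 1):
        pl.append(pl[-1][::2, ::2])
        pr.append(pr[-1][::2, ::2])
    disp = None
    for lvl in range(levels - 1, -1, -1):
        L, R = pl[lvl], pr[lvl]
        gh, gw = L.shape[0] // block, L.shape[1] // block
        new = np.zeros((gh, gw), dtype=int)
        for by in range(gh):
            for bx in range(gw):
                y, x = by * block, bx * block
                ref = L[y:y + block, x:x + block]
                if disp is None:   # coarsest level: exhaustive search
                    cand = range(max_d // 2 ** lvl + 1)
                else:              # finer levels: +/-1 around the estimate
                    c = 2 * int(disp[min(by // 2, disp.shape[0] - 1),
                                     min(bx // 2, disp.shape[1] - 1)])
                    cand = [max(c - 1, 0), c, c + 1]
                costs = [(sad(ref, R[y:y + block, x - d:x - d + block]), d)
                         for d in cand if x - d >= 0]
                new[by, bx] = min(costs)[1] if costs else 0
        disp = new
    return disp   # one disparity value per block at the finest level

Searching only a handful of candidates per block at the finer levels is what keeps the cost well below that of a full D × D search.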

Sophisticated methods and techniques, such as Gaussian weighted aggregation and 3D CA refinement rules, have been applied to a hierarchical process.




Fig. 3.17 (a), (b) The self-captured input images of an alley, and the resulting disparity maps for (c) the quadruple, (d) double and (e) single pixel estimation respectively


Fig. 3.18 (a), (b) The self-captured input images of a building, and the resulting disparity maps for (c) the quadruple, (d) double and (e) single pixel estimation respectively




Fig. 3.19 (a), (b) The self-captured input images of a corner, and the resulting disparity maps for (c) the quadruple, (d) double and (e) single pixel estimation respectively

The presented algorithm's performance is retained practically unaffected by spatial displacements and lens distortions in the input images, as was qualitatively and quantitatively indicated. Moreover, the ability to tolerate poorly calibrated or even uncalibrated input images, in conjunction with its speed and the presented result quality, shows that this algorithm can cope with the demanding issue of autonomous outdoor navigation.

3.4 Biologically and Psychophysically Inspired Stereo Correspondence Algorithm

A more advanced stereo correspondence algorithm has been developed that incorporates many biologically and psychologically inspired features into an adaptive weighted SAD framework in order to determine the correct depth of the scenery. In addition to ideas already found in the relevant literature, such as the utilization of color information and the gestalt laws of proximity and similarity, new ones have been adopted. The presented algorithm introduces the use of circular support regions, the gestalt law of continuity, as well as the psychophysically-based logarithmic response law. All the aforementioned perceptual tools act complementarily inside a straightforward computational algorithm applicable to robotic applications. The results of the algorithm have been evaluated and compared to those of similar algorithms.



3.4.1 Novel Concepts

Circular Windows

The search for pixel correspondences between the two images of a stereo image pair is usually treated by comparing the surrounding regions of the examined pixels, rather than the examined pixels alone. The choice of those support windows, as discussed in the previous Chapter, plays an important role in the accuracy of the results. The support windows may vary in shape or dimensions and can be either of a fixed size or of an adaptive one. However, the use of fixed-size square or rectangular regions is the most common choice.

The adaptive support weights (ASW) aggregation method, as presented in (Yoon & Kweon 2006a), makes use of fixed-size square windows of comparatively large size. However, the biological model of stereo vision seems to be better approximated by using circularly shaped windows (Bharath & Petrou 2008). Aggregation inside circular windows is also preferable since the contribution of the neighboring pixels becomes perfectly isotropic, i.e. the same number of pixels contributes in any direction on the image plane. This fact makes the aggregation results produced by circular windows more reliable than those of any other window shape.
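In practice, a circular support window can be realized as a boolean mask over a square patch; a minimal sketch follows (the 55-pixel diameter matches the window size used later in this Section):

import numpy as np

def circular_mask(diameter):
    # True for pixels inside the inscribed circle of a diameter x diameter patch.
    r = diameter / 2.0
    yy, xx = np.mgrid[:diameter, :diameter]
    return (yy - r + 0.5) ** 2 + (xx - r + 0.5) ** 2 <= r ** 2

mask = circular_mask(55)   # pixels outside the circle are excluded from aggregation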

Gestalt laws

Aggregation is a crucial stage of almost every stereo algorithm. Assigning the right significance weight to each pixel during aggregation is a difficult decision, for which gestalt theory, as discussed in Section 1.1.3, can provide an answer. Within this context, three basic gestalt laws receive the following interpretation:

• Proximity (or equivalently Distance): The closer two pixels are, the more correlated to each other they are.
• Intensity similarity (or equivalently Intensity dissimilarity): The more similar the colors of two pixels are, the more correlated they are.
• Continuity (or equivalently Discontinuity): The more similar the depth of two pixels is, the more probable it is that they belong to the same larger feature and thus the more correlated they are.

Thus, gestalt theory can be used in order to determine the degree to which two pixels are correlated; the sketch below illustrates the three cues.
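For illustration, the three cues can be quantified per pixel pair along the following lines (the function and variable names are illustrative, not the thesis' notation; the precise metrics follow in the remainder of this Section):

import numpy as np

def gestalt_stimuli(p, q, color, disp):
    # p, q: (row, col) coordinates of the two compared pixels.
    # color: H x W x 3 image; disp: current H x W disparity estimate.
    proximity = float(np.hypot(p[0] - q[0], p[1] - q[1]))      # distance on the image plane
    dissimilarity = float(np.abs(color[p].astype(np.float64)
                                 - color[q].astype(np.float64)).sum())  # color difference
    discontinuity = abs(float(disp[p]) - float(disp[q]))       # depth difference
    return proximity, dissimilarity, discontinuity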

Psychophysically-based Weight Assignment

The remaining question is exactly how much one pixel should contribute to another, correlated one during the aggregation process. In other words, it is necessary to establish an appropriate mapping between correlation degree and contribution. It has been well known since the 19th century that the HVS interprets physical stimuli in a psychological, non-linear manner rather than in an absolute, linear one. This psychophysical relationship has been investigated in depth and many explanatory theories have been expressed (Pinoli & Debayle 2007). The Weber-Fechner law is one of those theories and is widely accepted. It indicates a logarithmic relation between the subjective perceived intensity and the objective stimulus intensity.



The mathematical expression of this psychophysical law can be derived by considering that the change of perception is proportional to the relative change of the causing stimulus:

dp = -k dS/S (3.8)

where dp is the differential change in perceived intensity, dS is the differential increase in the stimulus' intensity, S is the stimulus' intensity at the instant and k is a positive constant determined by the nature of the stimulus. However, stimuli whose growth produces decreasing perception intensity, e.g. the distance, dissimilarity and discontinuity used in the presented algorithm, can be described by assuming that the proportionality constant is negative.

Integration of the last equation results in

p = -k ln S + C (3.9)

where C is the constant of integration. Assuming zero perceived intensity, the value of C can be found:

C = k ln S0 (3.10)

where S0 is the stimulus' value that results in zero perception, under which no stimulus change is noticeable. Combining the above formulas it can be derived that

p = -k ln(S/S0) (3.11)

Figure 3.20 presents the response obtained by such a function.

Fig. 3.20 Perceived intensity response according to the Weber-Fechner law
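One plausible reading of this law as a weight-assignment rule is sketched below; k and the zero-perception stimulus S0 are free parameters here, and the clipping anticipates the truncation discussed later in this Section. All of this is illustrative, not the exact formulation of the presented algorithm.

import math

def fechner_weight(stimulus, s_zero, truncation, k=1.0):
    # p = -k * ln(S / S0): zero at the zero-perception stimulus S0,
    # growing logarithmically as the stimulus shrinks. Clipping the
    # stimulus at the truncation value bounds the weight and avoids
    # the divergence as S -> 0.
    s = min(max(float(stimulus), truncation), s_zero)
    return k * math.log(s_zero / s)

The per-cue weights obtained this way can then be combined, for instance multiplicatively in the spirit of the ASW framework, into a single aggregation weight.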



is 1/w, and for dissimilarity and discontinuity it is 1/255, assuming 255 levels for each chromatic channel of an RGB image. The coarsest of them, i.e. 1/w, was adopted as the truncation value. Any value of any of the aforementioned metrics smaller than this truncation value is considered equal to it. This way the problem of obtaining infinite weighting factors is bypassed.

As already discussed, this algorithm proposes three new extensions to the ASW framework, i.e. the addition of the gestalt law of continuity, the logarithmic response to stimuli and the use of circular support windows. In order to evaluate the contribution of each extension separately, the presented overall algorithm was modified so as to exclude one extension at a time. The presented overall algorithm and the resulting three truncated ones were applied to the Tsukuba image set. The percentages of erroneously calculated pixels in various image regions were computed, as well as the variation of each truncated implementation's results with respect to those of the complete algorithm. The results and the respective variations are shown in Table 3.6. The version of the algorithm that does not involve the logarithmic response uses an exponential function instead; as a result, it accounts for each gestalt law in a manner similar to the one described in (Yoon & Kweon 2006a) and followed by (Gu et al. 2008). The version of the algorithm that excludes the use of the 55-pixel-diameter circular window utilizes a 48 × 48 square window instead. Thus, the covered area is the same in terms of pixel population and the processing load is kept constant. The results shown in Table 3.6 indicate that the omission of any of the three extensions leads to increased error percentages.

Table 3.6 Variation of the presented algorithm's results for the Tsukuba image set when excluding one of the new concepts

                   nonocc              all                 disc
                   error   variation   error   variation   error   variation
presented          3.62    --          5.52    --          14.6    --
no continuity      5.19    +43.37%     7.17    +29.89%     21.7    +48.63%
no log. response   8.89    +145.58%    10.5    +90.22%     36.1    +147.26%
no circ. window    3.79    +4.70%      5.62    +1.81%      15.8    +8.22%

The performance of the overall algorithm was evaluated using the standard online test bench hosted by the University of Middlebury (Scharstein & Szeliski 2010). This test provides a common evaluation data set and allows an objective comparison of the various stereo algorithms' results. The standard image sets used were the four stereo image pairs (Scharstein & Szeliski 2002, 2003) provided, along with their corresponding ground truth disparity maps, by Scharstein and Szeliski. Figure 3.22 depicts the reference (left) images, the provided ground truth disparity maps, the disparity maps calculated by the presented method, maps of signed disparity error where the middle gray tone equals zero error, and maps of pixels with absolute computed disparity error bigger than 1 shown in black.
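The error percentages reported throughout this evaluation follow the Middlebury convention of counting pixels whose absolute disparity error exceeds 1; a minimal sketch of the metric:

import numpy as np

def bad_pixel_percentage(disp, ground_truth, region_mask=None, threshold=1.0):
    # Percentage of pixels whose absolute disparity error exceeds the
    # threshold; region_mask optionally restricts the count to one of
    # the evaluated regions (nonocc, all, disc).
    bad = np.abs(disp.astype(np.float64) - ground_truth.astype(np.float64)) > threshold
    if region_mask is not None:
        bad = bad[region_mask]
    return 100.0 * bad.mean()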

The Middlebury results table, available at (Scharstein & Szeliski 2010), presents the results of the submitted algorithms without taking into consideration any complexity or execution speed factors. Moreover, the results of global and local algorithms are directly compared. The presented algorithm can successfully cope with finely detailed images like the Cones data set. On the other hand, problems occur, as in most local algorithms, with large textureless areas like those in the Venus data set. However, most of the higher-ranked entries in the Middlebury results table involve some kind of global optimization, and a direct comparison is neither fair nor does it lead to useful conclusions.



Fig. 3.22 Results for the Middlebury data sets. From left to right: the Tsukuba, Venus, Teddy and Cones images. From top to bottom: the reference (left) images, the provided ground truth disparity maps, the disparity maps calculated by the presented method, maps of signed disparity error and maps of pixels with absolute computed disparity error bigger than 1

Only a small fraction of the listed algorithms are purely local, and the performance of the presented algorithm is above the average when compared to other local stereo algorithms. The presented algorithm stands in-between algorithms providing excellent results at the expense of computational load and algorithms providing poor results in favor of computational simplicity.

The algorithm was also applied to some new image sets (Scharstein & Pal 2007, Hirschmuller & Scharstein 2007). The four previously discussed image sets have been in the focus of research for quite a long period of time. Consequently, various algorithms have been presented that produce impressive results for those image sets. However, this fact does not necessarily imply that those algorithms' results will be equally impressive for different image sets. There is a number of factors, i.e. structure, complexity, detail, illumination etc., for each image set that differentiates the results of the same algorithm applied on them. Thus the need for more image sets, other than the typical ones, is apparent. In Figure 3.23 the results of the presented algorithm applied on 7 new image sets are presented.



Table 3.7 Evaluation of various ASW and local algorithms

                 Tsukuba               Venus                 Teddy                 Cones
                 nonocc  all   disc    nonocc  all   disc    nonocc  all   disc    nonocc  all   disc
AdaptDispCalib   1.19    1.42  6.15    0.23    0.34  2.50    7.80    13.6  17.3    3.62    9.33  9.72
AdaptWeight      1.38    1.85  6.90    0.71    1.19  6.13    7.88    13.3  18.6    3.97    9.79  8.26
RealTimeGPU      2.05    4.22  10.6    1.92    2.98  20.3    7.23    14.4  17.6    6.41    13.7  16.5
presented        3.62    5.52  14.6    3.15    4.20  20.4    11.5    18.2  23.2    4.93    13.0  11.7
PhaseBased       4.26    6.53  15.4    6.71    8.16  26.4    14.5    23.1  25.5    10.8    20.5  21.2
SSD+MF           5.23    7.07  24.1    3.74    5.16  11.9    16.5    24.8  32.9    10.6    19.8  26.3
PhaseDiff        4.89    7.11  16.3    8.34    9.76  26.0    20.0    28.0  29.0    19.8    28.5  27.5

The image sets were once again obtained from the Middlebury web site (Scharstein & Szeliski 2010) and, as shown in Figure 3.23, they are, from top to bottom: Aloe, Baby3, Bowling2, Cloth1, Cloth3, Cloth4 and Flowerpots. Each row of this figure shows, from left to right: the left image of the stereo pair, the provided ground truth, the disparity map computed by the presented algorithm and an error map. The error maps denote in black those pixels whose computed disparity value differs by more than 1 from the ground truth. The outcome of these results is that the presented algorithm exhibits a good behavior for a variety of stereo image pairs.

Another interesting point is the comparison of the presented algorithm to similar ones. Both the ASW-based algorithms and the traditional local ones are considered in this comparison. The results are presented in Table 3.7. The numbers represent the percentage of pixels whose absolute disparity error is greater than 1. The three columns for each data set represent the percentages for the pixels in non-occluded areas, for all pixels, and for pixels near depth discontinuities and occluded regions, respectively. As for the ASW-based algorithms, there are three of them listed in the Middlebury results table apart from the presented one. The method called [AdaptWeight] (Yoon & Kweon 2006a) is the core that all three other rival ASW-based algorithms share. Its structure is similar to that of the presented algorithm apart from its demand for changing the input images' color space from RGB to CIELab. The [RealTimeGPU] method (Wang et al. 2006) is a global algorithm, employing dynamic programming for disparity selection. Finally, [AdaptDispCalib] (Gu et al. 2008) is not a typical local single-stage algorithm. It employs more computational stages than all of the previous algorithms and has a rather complex structure. ASW-based algorithms generally produce more accurate results than the other non-global ones. The comparison of the presented algorithm to other non-global ones, such as [PhaseBased] (El-Etriby et al. 2007), [SSD+MF] (Scharstein & Szeliski 2002) and [PhaseDiff] (El-Etriby et al. 2006), shows its superiority. All these algorithms involve no iterative procedures, in contrast to global algorithms, and have a straightforward structure. This common characteristic allows them to be evaluated and directly compared.

Besides having a simple structure, the presented algorithm has the merit of employing only two user-defined parameters, as discussed earlier in this Section. Other than the almost inevitable choice of the window's size, no empirically defined parameters are used, in contrast to the other ASW-based methods presented. Those methods involve the a priori definition of various parameters that significantly change the algorithms' behavior.



Fig. 3.23 Results for the new data sets. From top to bottom: Aloe, Baby3, Bowling2, Cloth1, Cloth3, Cloth4 and Flowerpots. From left to right: the reference (left) image of the stereo pair, the provided ground truth, the disparity map computed by the presented algorithm and the error map

3.4.4 Discussion

A novel local stereo correspondence algorithm, applicable to robotic applications, was presented. It makes use of the AD as matching function and the ASW aggregation technique for matching the images' regions correctly. Many new features inspired by biology, psychology and psychophysics are incorporated in this algorithm. It comprises a context within which the gestalt laws, the



Weber-Fechner law and the HVS's physiology findings can coexist and act complementarily in a simple manner. In accordance with stereoscopic vision in nature, no iterative procedures are involved. The simple structure of the algorithm was also dictated by the need for rapid execution in robotic applications. Nevertheless, the presented algorithm exhibits remarkably accurate results.

3.5 Illumination-Invariant Dissimilarity Measure and Stereo Correspondence Algorithm

Many robotic and machine-vision applications rely on the accurate results of stereo correspondence algorithms. However, difficult environmental conditions, such as differentiations in illumination depending on the viewpoint, heavily affect the stereo algorithms' performance. This Section presents a new illumination-invariant dissimilarity measure intended to substitute the established intensity-based ones. The presented measure can be adopted by almost any of the existing stereo algorithms, enhancing them with its robust features. The performance of the dissimilarity measure is validated through experimentation with a new ASW stereo correspondence algorithm. Experimental results for a variety of lighting conditions are gathered and compared to those of intensity-based algorithms. The algorithm using the presented dissimilarity measure outperforms all the other examined algorithms, exhibiting tolerance to illumination differentiations and robust behavior.

3.5.1 Description of Illumination-Invariant Dissimilarity Measure

The HSL colorspace inherently expresses the lightness of a color and demarcates it from the color's qualitative characteristics. That is, an object will result in the same values of H and S regardless of the environment's illumination conditions. According to this assumption, the presented dissimilarity measure disregards the values of the L channel in order to calculate the dissimilarity of two colors. The omission of the vertical (L) axis from the colorspace representation leads to a 2D circular disk, defined only by H and S, as shown in Figure 3.24(b).

The transition from the 3D colorspace representation to the 2D one can be conceived as a floor plan projection of the double cone, when observed along the vertical (L) axis. Thus, any color can be described as a planar vector with its initial point at the disc's center. As a consequence, each color Pk can be described as a polar vector or, equivalently, as a complex number with modulus equal to Sk and argument equal to Hk. That is, a color in the luminosity-indifferent colorspace representation can be described as:

Pk = Sk e^(iHk) (3.22)

As a result, the difference, or equivalently the luminosity-compensated dissimilarity measure (LCDM), of two colors P1 and P2, shown with a dashed line in Figure 3.24(b), can be calculated as the difference of the two complex numbers.
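A minimal sketch of the measure, under the assumption that the scalar dissimilarity is taken as the modulus of this complex difference; Python's colorsys returns hue in [0, 1), so it is scaled to radians here:

import colorsys
import math

def lcdm(rgb1, rgb2):
    # Luminosity-compensated dissimilarity of two RGB colors with
    # components in [0, 1]: each color is mapped to S * exp(i * H) on
    # the hue-saturation disc and the L channel is discarded.
    def to_complex(rgb):
        h, l, s = colorsys.rgb_to_hls(*rgb)   # note: colorsys uses the HLS ordering
        angle = 2.0 * math.pi * h
        return complex(s * math.cos(angle), s * math.sin(angle))
    return abs(to_complex(rgb1) - to_complex(rgb2))

print(lcdm((0.8, 0.2, 0.2), (0.4, 0.1, 0.1)))   # ~0: same hue and saturation, different lightness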



3.5.3 Experimental Results

The presented method and, where needed, its AD variant, the ZNCC algorithm's implementation (Corke 2005) and the Ogale and Aloimonos algorithm (Ogale & Aloimonos 2005a,b, 2007) (available for download from (Ogale 2009)) were applied to various stereo image pairs, in order to evaluate their behavior. However, it is not the algorithms' performance that is being considered, but rather the behavior of the used dissimilarity measures (LCDM, AD, ZNCC, phase differences). Within this scope, various image pairs and various lighting non-uniformities were tested.

Standard Image Sets

The performance of the presented algorithm was again evaluated using the standard online test bench hosted by the University of Middlebury (Scharstein & Szeliski 2010). The standard image sets used were the four stereo image pairs (Scharstein & Szeliski 2002, 2003) provided, along with their corresponding ground truth disparity maps, by Scharstein and Szeliski. However, the listed benchmark images have been acquired under perfect lighting conditions and there are no significant variations of luminosity between the left and the right images. Figure 3.26 depicts, from left to right, the reference (left) input images of the stereo pair, the right input images of the stereo pair, the disparity maps as calculated by the presented LCDM-based method, maps of the pixels (shown in black) whose absolute computed disparity error is bigger than 1, and maps of signed disparity error (where the 50% gray tone indicates null error).

The results summarized in Figure 3.26 are quantified in Table 3.8, which presents the percentage of pixels whose absolute disparity error is greater than 1 in the non-occluded regions, in all the images' regions, and in the regions near discontinuities or the occluded ones. These results show the performance of the presented LCDM-based algorithm when applied to standard image sets captured under ideal lighting conditions.

Table 3.8 Percentage of pixels whose absolute disparity error is greater than 1 for standard image sets using the presented LCDM-based algorithm

Image Pair   nonocc   all    disc
Tsukuba      5.98     7.84   22.2
Venus        14.5     15.4   35.9
Teddy        20.8     27.3   38.3
Cones        8.90     17.2   20.0

Standard Image Sets with Altered Illumination

The presented LCDM intentionally excludes some of the pixels' information, contrary to AD, in order to be able to cope with non-symmetrical lighting conditions. As a result, the presented method is expected to perform somewhat worse when applied to ideally lighted stereo pairs. However, deviating from this ideal situation is expected to favor the use of the LCDM.



Fig. 3.26 Results for the Middlebury data sets. From top to bottom: the Tsukuba, Venus, Teddy and Cones image sets. From left to right: the reference (left) input images, the right input images, the disparity maps calculated by the presented LCDM-based method, maps of pixels with absolute computed disparity error bigger than 1 and maps of signed disparity error

The ZNCC algorithm was computed for a 15 × 15 window size, as this value was found to suppress noise better. Finally, the Ogale and Aloimonos algorithm was computed using the default settings. In order to test and compare the performance of the four algorithms, they were applied to a series of stereo pairs. Each pair consisted of the same, original reference (left) image of the Tsukuba image set and a differently illuminated version of the original right image of the pair. The right image of the Tsukuba pair was manually processed with specialized software and its luminosity was altered. The amount of alteration ranged from -25% to +25% in 5% increments. All four stereo algorithms were applied to each one of the resulting stereo pairs. The input images as well as the results of the four algorithms are shown in Figure 3.27.

Column (c) of Figure 3.27 shows the disparity maps computed by the presented LCDM-based algorithm, column (d) shows the disparity maps computed by the RGB-based AD version of the algorithm, column (e) shows the disparity maps computed by the ZNCC stereo algorithm and, finally, column (f) shows the disparity maps computed by the Ogale-Aloimonos algorithm. It can be seen that for ideal lighting conditions (0% difference in luminosity) the Ogale-Aloimonos algorithm produces the best results and that the presented LCDM algorithm produces slightly inferior results compared to its AD counterpart. However, the quality of the LCDM-based algorithm's results remains practically the same for every tested lighting condition, contrary to the Ogale-Aloimonos algorithm and the AD version of the algorithm. The algorithm of Ogale and Aloimonos may be able to cope with contrast variations but is not as successful against lightness differences.



Fig. 3.27 Left input images (a), right input images with altered luminosity (b) and calculated disparity maps for the presented algorithm (c), its RGB-based AD version (d), the ZNCC stereo algorithm (e) and the Ogale-Aloimonos algorithm (f) for various lightness conditions



On the other hand, the ZNCC stereo algorithm's precision is less dependent on the lighting conditions than that of the AD-based algorithm, but still more dependent than that of the presented algorithm. Moreover, the ZNCC algorithm always produces bigger error rates, especially for the discontinuity regions. The results shown in Figure 3.27 are quantified in Figure 3.28. Figure 3.28(a) shows that the performance of the algorithm that uses the presented LCDM is left practically unaffected by any difference in the input images' luminosity. On the contrary, the accuracy of the RGB-based AD version of the algorithm, shown in Figure 3.28(b), deteriorates linearly with the lighting non-uniformity. The ZNCC stereo algorithm, shown in Figure 3.28(c), stands between the two others in terms of results' constancy, but its error percentage is generally higher than that of the presented algorithm. Finally, the algorithm presented by Ogale and Aloimonos, shown in Figure 3.28(d), is the most accurate of all for ideal conditions but fails to compensate for lightness mismatches.

Fig. 3.28 Percentage of erroneously calculated pixels for (a) the presented algorithm, (b) its RGB-based AD version, (c) the ZNCC stereo algorithm and (d) the Ogale-Aloimonos stereo algorithm for various lightness conditions

Standard Image Sets with Variably Altered Illumination

The previously presented results were obtained for image pairs consisting of images with different illumination each. Despite the different illumination between the images of each pair, each single image was



uniformly over- or under-lighted. In this case, a luminosity normalization pre-processing step applied to each image might have assisted the rival stereo algorithms in obtaining results similar to those of the presented method. However, real-life conditions may result in differently illuminated areas within the same image. To this end, various color constancy methods were utilized in order to provide the rival stereo algorithms with images suitable for successful matching even if the original images suffered from illumination non-uniformities. The tested methods were histogram equalization, the patented Retinex algorithm (Jobson et al. 1997) and an HVS-inspired algorithm presented by Vonikakis (Vonikakis et al. 2008). Histogram equalization remaps an image's histogram in order to improve its visual quality. The image is first converted to the HSL color space and the algorithm is applied to the luminance channel. The luminance channel's values are transformed with respect to a reference image, so that the histogram of the output image approximately matches the reference image's histogram. The transformation T minimizes the difference

|c1(T(L)) - c0(L)| (3.29)

where c0 is the cumulative histogram of the image and c1 is the cumulative sum of the desired histogram for all the values L. This minimization is subject to the constraints that T must be monotonic and that c1(T(L)) cannot overshoot c0(L) by more than half the distance between the histogram counts at L. The transformation T maps the luminance values to their new ones.
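A compact sketch of such a histogram-specification step (a standard formulation under the stated constraints, not necessarily the exact implementation used here):

import numpy as np

def match_histogram(lum, ref_lum, levels=256):
    # Build the cumulative histograms c0 (input) and c1 (reference) and
    # derive a monotonic lookup table T so that the cumulative histogram
    # of T(lum) approximately matches that of ref_lum.
    c0 = np.cumsum(np.histogram(lum, bins=levels, range=(0, levels))[0]).astype(np.float64)
    c1 = np.cumsum(np.histogram(ref_lum, bins=levels, range=(0, levels))[0]).astype(np.float64)
    c0 /= c0[-1]
    c1 /= c1[-1]
    T = np.searchsorted(c1, c0).clip(0, levels - 1)   # monotonic by construction
    return T[lum.astype(int).clip(0, levels - 1)]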

Retinex is a NASA and TruView Imaging Co. patented image enhancement algorithm (Jobson et al. 1997). It corrects under-exposed areas of an image without affecting correctly exposed areas and restores rich colors. Both pictures of each image pair were processed with this algorithm using the default values of its parameters. The use of the default parameter values for all the processed pictures ensured that the tested approach would be general, i.e. not specially optimized for a certain image pair. Finally, the HVS-inspired algorithm, presented by Vonikakis in (Vonikakis et al. 2008), performs spatially modulated tone mapping. That is, the method performs image enhancement by lightening the tones in the under-exposed regions while darkening the tones in the over-exposed ones, without affecting the correctly exposed ones. The tone mapping function is inspired by the shunting characteristics of the center-surround cells of the HVS. The images of all the tested pairs were processed with this algorithm using the same values for the parameters.

The algorithms tested in this Section are the presented LCDM-based algorithm, the AD-based RGB variant algorithm, the ZNCC algorithm and, finally, the stereo algorithm presented by Ogale and Aloimonos in (Ogale & Aloimonos 2005a,b, 2007), as implemented in (Ogale 2009). The LCDM-based presented algorithm is compared to each one of the three others. Moreover, the three other algorithms are considered when applied to the original input images, as well as to the images processed by histogram equalization, Retinex and the Vonikakis et al. method.

As a test example, consider an image whose left end is darker than its right end, with the luminosity varying continuously across the picture. This scenario was tested using the four standard image sets of the University of Middlebury. The left image of each pair was left intact, while the right image was processed with specialized software in order to apply a luminosity gradient across the horizontal direction. The gradient ranges linearly from -50% of the original luminosity at the left end of the image to +50% at the right. Figure 3.29 shows the intact left input images in column (a) and the illumination-graded right images in column (b). The disparity maps calculated by the algorithm using the presented LCDM are shown in column (c), the disparity maps calculated by its AD-based variant are shown in column (d), the disparity maps calculated by the AD-based variant algorithm using histogram equalized input images are shown in column (e), the disparity maps calculated by the AD-based variant algorithm using Retinex enhanced input images are shown in



column (f), and finally the disparity maps calculated by the AD-based variant algorithm using input images enhanced by the Vonikakis et al. algorithm's implementation, given by the authors in (Vonikakis 2009), are shown in column (g).

Fig. 3.29 From left to right: left input images with constant luminosity, right input images with luminosity grading from -50% to +50% along the horizontal direction, and calculated disparity maps for the standard image sets using the presented LCDM (c), the AD-based variant algorithm (d), the AD-based variant algorithm with histogram equalization (e), the AD-based variant algorithm with Retinex enhancement (f) and the AD-based variant algorithm with pictures enhanced according to Vonikakis et al. (2008) (g)

The tested stereo algorithms and the corresponding disparity maps presented in Figure 3.29 result in the histograms shown in Figure 3.30. As shown in Figure 3.29 and Figure 3.30, the presented algorithm produces better results than any of the other tested compound algorithms. The pre-processing tone mapping techniques obviously failed to globally compensate for the lighting differentiations. Moreover, a direct comparison of Figure 3.29 and Figure 3.26 shows that the algorithm using the presented LCDM retains the same quality of results regardless of the lighting conditions. Consequently, it can be derived that the presented algorithm compensates for different lighting conditions, exhibiting robust behavior.

Next, the presented algorithm is compared to the ZNCC algorithm. Figure 3.31 shows the intact left input images in column (a) and the illumination-graded right images in column (b). Once again, the disparity maps calculated by the algorithm using the presented LCDM are shown in column (c), the disparity maps calculated by the ZNCC algorithm for a 15 × 15 pixels window are shown in column (d), the disparity maps calculated by the ZNCC algorithm using histogram equalized input images are shown in column (e), the disparity maps calculated by the ZNCC algorithm using Retinex enhanced input images are shown in column (f), and finally the disparity maps calculated by the ZNCC algorithm using input images enhanced by the method of Vonikakis et al. are shown in column (g).



!""#"!$%"&%'()*%<br />

*!<br />

)!<br />

(!<br />

'!<br />

&!<br />

%!<br />

$!<br />

#!<br />

"!<br />

!<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

869: ;9 ?!@A-0B?!;9 C2>=32D!;9 173=.0.=,!2>!<br />

0B?!;9<br />

3737EE<br />

0BB<br />

4=,E!F!7EEB<br />

Fig. 3.30 Percentage of pixels whose absolute disparity error is greater than 1 for standard image sets calculated using the presented LCDM, the AD-based variant algorithm, the AD-based variant algorithm with histogram equalization, the AD-based variant algorithm with Retinex enhancement and the AD-based variant algorithm with enhanced pictures according to Vonikakis et al. (2008)


The tested stereo algorithms and the corresponding disparity maps presented in Figure 3.31 result in the histograms shown in Figure 3.32. The results shown in Figure 3.31 and Figure 3.32 indicate that the ZNCC algorithm can effectively compensate for illumination differentiations. However, the window dimensions that resulted in effective noise suppression, i.e. 15 × 15 pixels, could not preserve the fine details of the scenes and as a result produced higher error rates, compared to the presented algorithm, in depth discontinuity regions. Moreover, the ZNCC's computation is obligatorily time-consuming, which makes it a poor candidate for robotic applications.

obligatorily time-consuming <strong>and</strong> would not be a good c<strong>and</strong>idate <strong>for</strong> robotic applications.<br />

Finally, the presented algorithm is compared to the Ogale-Aloimonos algorithm. Figure 3.33<br />

shows the intact left input images in column (a) <strong>and</strong> the illumination-graded right images in column<br />

(b). Again, the disparity maps calculated by the algorithm using the presented LCDM are shown<br />

in column (c), the disparity maps calculated by the Ogale <strong>and</strong> Aloimonos algorithm are shown in<br />

column (d), the disparity maps calculated by the Ogale <strong>and</strong> Aloimonos algorithm using histogram<br />

equalized input images are shown in column (e), the disparity maps calculated by the Ogale <strong>and</strong><br />

Aloimonos algorithm using Retinex enhanced input images are shown in column (f), <strong>and</strong> finally the<br />

disparity maps calculated by the Ogale <strong>and</strong> Aloimonos algorithm using input images enhanced by<br />

the method <strong>of</strong> Vonikakis et al. are shown in column (g).<br />

The tested stereo algorithms and the corresponding disparity maps presented in Figure 3.33 result in the histograms shown in Figure 3.34. The results shown in Figure 3.33 and Figure 3.34 show that the Ogale and Aloimonos algorithm can preserve details better than any other tested algorithm for ideal lighting conditions. However, deviations from the ideal conditions result in significantly worse results compared to the presented LCDM-based algorithm.


Fig. 3.31 From left to right: left input images with constant luminosity, right input images with luminosity grading from -50% to +50% along the horizontal direction, and calculated disparity maps for the standard image sets using the presented LCDM (c), the ZNCC algorithm (d), the ZNCC algorithm with histogram equalization (e), the ZNCC algorithm with Retinex enhancement (f) and the ZNCC algorithm with enhanced pictures according to Vonikakis et al. (2008) (g)

!""#"!$%"&%'()*%<br />

'!<br />

&!<br />

%!<br />

$!<br />

#!<br />

"!<br />

!<br />

()*+*,-<br />

./0*)<br />

(/112<br />

340/)<br />

()*+*,-<br />

./0*)<br />

(/112<br />

340/)<br />

()*+*,-<br />

./0*)<br />

(/112<br />

340/)<br />

5367 8933 :;)?*-@=!<br />

8933<br />

()*+*,-<br />

./0*)<br />

(/112<br />

340/)<br />

()*+*,-<br />

./0*)<br />

(/112<br />

340/)<br />

A/



Fig. 3.33 From left to right: left input images with constant luminosity, right input images with luminosity grading from -50% to +50% along the horizontal direction, and calculated disparity maps for the standard image sets using the presented LCDM (c), the Ogale-Aloimonos algorithm (d), the Ogale-Aloimonos algorithm with histogram equalization (e), the Ogale-Aloimonos algorithm with Retinex enhancement (f) and the Ogale-Aloimonos algorithm with enhanced pictures according to Vonikakis et al. (2008) (g)

!""#"!$%"&%'()*%<br />

"!!<br />

*!<br />

)!<br />

(!<br />

'!<br />

&!<br />

%!<br />

$!<br />

#!<br />

"!<br />

!<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

869: ;=7?@737,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

A?,BC!DE-0=C!<br />

;=7?@737,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

+,-.-/0<br />

123-,<br />

+2445<br />

6732,<br />

F2B?32G!;=7?@737, 0=C!;=7?@737,<br />

3737HH<br />

0==<br />

4?,H!I!7HH=<br />

Fig. 3.34 Percentage of pixels whose absolute disparity error is greater than 1 for standard image sets calculated using the presented LCDM, the Ogale-Aloimonos algorithm, the Ogale-Aloimonos algorithm with histogram equalization, the Ogale-Aloimonos algorithm with Retinex enhancement and the Ogale-Aloimonos algorithm with enhanced pictures according to Vonikakis et al. (2008)


Non-synthetic Self-recorded Image Sets

The previous Sections have shown that the presented algorithm produces better results than the other examined ones. The two developed ASW-based algorithms, using the LCDM and the AD respectively, have demonstrated superior characteristics, such as preservation of details, some degree of tolerance to lightness differences and adjustable computational load. The LCDM-based and the AD-based algorithms were applied to the extreme case of the self-recorded image pair previously shown in Figure 2.8, as well as to various other self-captured real-life image pairs exhibiting difficult lighting conditions. Examination of the input images reveals the different lighting conditions among each couple's images. In Figure 3.35 the two images of each stereo pair are shown in the first two columns, the respective disparity maps calculated by the presented LCDM-based algorithm are shown in the third column and the four next columns show the disparity maps calculated by the AD-based variant algorithm, the AD-based variant algorithm with histogram equalization, the AD-based variant algorithm with Retinex enhancement and the AD-based variant algorithm with pictures enhanced according to (Vonikakis et al. 2008).

Fig. 3.35 Various self-recorded outdoor input image pairs ((a) Campus, (b) Building, (c) Standing man, (d) Park) and the resulting disparity maps. From left to right: the left and right input images and the disparity maps calculated with: the presented LCDM-based algorithm, the RGB AD-based algorithm applied on the raw images, the RGB AD-based algorithm applied on the histogram equalized images, the RGB AD-based algorithm applied on the Retinex enhanced images and the RGB AD-based algorithm applied on the images enhanced according to Vonikakis et al. (2008)

The input images of Figure 3.35 exhibit large variations and can be considered as extreme cases of various lighting difficulties. They challenge every stereo algorithm, but at the same time they have to be confronted, since they can be found in environments where robots have to operate. While the produced disparity maps are not absolutely accurate, the presented LCDM dissimilarity measure



can compensate for the illumination non-uniformities to a large extent, generally outperforming the AD-based methods. Summarizing the results presented in this and the previous Sections, it can be deduced that, although not always the best, the presented algorithm exhibits a robust and trustworthy behavior in all the examined image sets. On the other hand, other algorithms may produce good results in some image sets but not consistently in all of them. As a result, the presented algorithm, and in particular the presented LCDM dissimilarity measure, can effectively and regularly face difficult lighting situations.

3.5.4 Discussion

A new illumination-invariant dissimilarity measure, the luminosity-compensated dissimilarity measure (LCDM), has been presented. The motivation behind the presented dissimilarity measure and stereo algorithm was the set of problems occurring when using stereo image processing on robots tested in actual outdoor environments. Such environments do not guarantee uniform illumination conditions, regardless of the camera position. As a consequence, the same feature may exhibit different intensity values in different images. The new measure can substitute the traditional RGB intensity-based dissimilarity measures (e.g. AD or SD) in almost any of the available stereo algorithms. Using the HSL colorspace and being calculated on the Hue-Saturation plane, the presented LCDM is able to compensate for lighting differentiations and provide robust and reliable results. As many robotic and machine vision applications rely on the accuracy of stereo algorithms' results, the presented measure could be an ideal choice.

The presented LCDM was tested on various image pairs and was compared to the simple AD measure. In order to obtain reliable, non-biased results for comparison, two identical state-of-the-art stereo algorithms were developed and tested. These stereo algorithms used a gestalt-based ASW aggregation scheme. The only difference was that the first version used the presented LCDM while the second used the AD as the dissimilarity measure. Tests with various image sets and different lighting conditions have shown that the presented LCDM exhibits good and, moreover, robust and reliable behavior.

Moreover, the presented algorithm was tested against the ZNCC algorithm and the algorithm of Ogale and Aloimonos, which are both able to confront lightness non-uniformities. Regarding the ZNCC algorithm, the window dimensions that resulted in effective noise suppression, i.e. 15 × 15 pixels, could not preserve the fine details of the scene and as a result produced higher error rates, compared to the presented algorithm, near depth discontinuity regions. Moreover, the ZNCC algorithm is structurally different from algorithms based on the other dissimilarity measures, i.e. LCDM, AD, SD. While ZNCC computes the dissimilarity of a whole support region at once, the others compute the dissimilarity of single pixel pairs and then aggregate the results in a separate step. Consequently, the ZNCC algorithm is inevitably computationally intensive and thus inappropriate for robotics applications, which should be reasonably fast, as the update rate is critical. On the other hand, the other measures can be adopted in different aggregation schemes, thus resulting in schemes of desirable computational load. This last feature is highly desirable in robotic applications.
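For reference, a sketch of the ZNCC of two support windows makes the structural difference plain: the whole region enters a single normalized correlation, so the cost cannot be decomposed into per-pixel dissimilarities that are aggregated afterwards.

import numpy as np

def zncc(window_a, window_b):
    # Zero-mean normalized cross-correlation of two equally sized
    # windows; values close to 1 indicate a good match.
    a = window_a.astype(np.float64) - window_a.mean()
    b = window_b.astype(np.float64) - window_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0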

The algorithm of Ogale and Aloimonos proposes a compositional approach that unifies many early visual modules. As a result, this method can robustly process images with contrast mismatches, among others. Its results are remarkably accurate for conditions that do not significantly deviate from



the ideal ones. However, even if this method can process contrast differentiations, it does not exhibit the same behavior for luminosity differentiations.

In conclusion, the presented dissimilarity measure is able to compensate for luminosity non-uniformities and at the same time preserve the details of the scene. Additionally, it can be computed at very high frame rates, as its computational load is very small compared to that of other measures such as the ZNCC. The presented measure can be embodied as the first stage of a stereo algorithm whose speed and complexity are subject to the demands. The MATLAB implementation of the presented stereo algorithm that is based on the presented LCDM measure is not fast enough to achieve the real- or near real-time frame rates demanded by robotic applications. However, a C++ version could be reasonably fast. Finally, considering all the aforementioned features combined, it can be concluded that the presented dissimilarity measure and the resulting stereo vision algorithm are ideal candidates for robotic applications.


Chapter 4

Robotic Applications of Stereo Vision

Robotics often replicates human modalities in order to achieve autonomous behaviors (Russell et al. 2004). Above all senses, vision is the most important one to humans. Moreover, we have structured our environments based on this fact. Thus, it comes naturally that autonomous robotics can greatly benefit from employing vision methods (Santini et al. 2009).

In this Chapter robotic applications based on stereo vision systems are presented. The stereo<br />

correspondence algorithms that <strong>for</strong>m the bases <strong>of</strong> these applications are some <strong>of</strong> the ones presented<br />

in the previous Chapter. The robotic applications covered involve new, computationally efficient<br />

obstacle avoidance <strong>and</strong> SLAM algorithms.<br />

4.1 Stereo Vision-based Obstacle Avoidance Algorithm

In order to achieve reliable obstacle-avoiding behavior, many popular methods involve the use of artificial stereo vision systems. As affirmed by its biomimetic origin (Gutmann et al. 2005, Sabe et al. 2004), stereoscopic vision can be effectively used to derive the depth map of a scene. The two versions of the vision-based obstacle avoidance algorithm presented in this Section provide efficient solutions that use a minimum of sensors and avoid, as much as possible, computationally complex processes. The only sensor required is a stereo camera. First, a simple modular algorithm is presented. It employs a stereo algorithm, which is essentially the same stereo algorithm covered previously in Section 3.1, and a threshold-based decision making algorithm that analyzes the depth maps and deduces the most appropriate direction for the robot to avoid any existing obstacles. Then, an improved version of the algorithm is presented, using a fuzzy decision making algorithm instead. The presented methodologies are tested on sequences of self-captured outdoor images and their results are evaluated.

The contribution of the two versions of the developed algorithm is to provide lightweight approaches for obstacle avoidance with the sole use of a stereoscopic camera. The use of only one sensor, and specifically of a stereoscopic camera, diminishes the complexity of the system and allows for easy integration with other vision tasks, such as object recognition or tracking.


4.1.1 Threshold Algorithm Description

The presented vision-based obstacle avoidance algorithm is intended to be used in autonomous mobile robotics. The development of an efficient, solely vision-based method for mobile robot navigation is still an active research topic. Towards this direction, the first step is to avoid any obstacles through vision. However, systems placed on robots have to conform to the restrictions imposed by them. Autonomous robot navigation requires almost real-time frame rates from the responsible algorithms. Furthermore, computing resources are strictly limited onboard a robot. Thus, the omission of popular obstacle detection techniques such as the v-disparity, which require Hough transformations, would be highly appreciated. Instead, simple and efficient solutions are demanded.

The developed algorithm is based only on a stereo camera. The core of the presented approach can be divided into two separate and independent algorithms:

• The stereo vision algorithm. It retrieves information about the environment from a stereo camera and produces a depth image, i.e. disparity map, of the scenery.

• The threshold-based decision making algorithm. It analyzes the data of the previous algorithm and decides the best direction, i.e. forward, right or left, for the robot to move in order to avoid any existing obstacles.

The modularity of the system allows easy modification and debugging, and ensures the adaptability of the overall algorithm. Figure 4.1 presents the flow chart of the implemented algorithm.

Stereo Vision

The stereo correspondence algorithm upon which the presented obstacle avoidance algorithm is based is essentially the one covered in Section 3.1. However, there are a number of differences:

• In contrast to the previously mentioned stereo algorithm, which directly uses the camera's images, this version uses an enhanced version of the captured images as input. The initially captured images are processed in order to extract the edges in the depicted scene. The utilized edge detecting method is the Laplacian of Gaussian (LoG), using a zero threshold. This choice produces the maximum possible edges. The LoG edge detection method smoothens the initial images with a Gaussian filter in order to suppress any possible noise. Then a Laplacian kernel is applied that marks regions of significant intensity change. Actually, the combined LoG filter is applied at once and the zero crossings are found. The extracted edges are afterwards superimposed on the initial images. The steps of the aforementioned process are shown in Figure 4.2, and a code sketch is given after this list. The outcome of this procedure is a new version of the original images having more striking features and textured surfaces, which facilitate the subsequent stereo matching procedure.

• The matching cost function utilized is the truncated AD. The AD are truncated if they exceed 4% of the maximum intensity value. Truncation suppresses the influence of noise in the final result. This is very important for stereo algorithms that are intended to be applied to outdoor scenes. Outdoor pairs usually suffer from noise induced by a variety of reasons, e.g. lighting differences and reflections.

• Moreover, the use of CA is absent in this version of the stereo correspondence algorithm for simplicity reasons.
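To make the enhancement and matching-cost steps concrete, a minimal C++ sketch follows, assuming 8-bit grayscale images stored row-major; the 5 × 5 kernel, sigma = 1.0 and the painting of edges as white pixels are illustrative assumptions rather than the thesis's exact parameters.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Image = std::vector<double>;  // row-major, width * height

// Build a (2*r+1)x(2*r+1) Laplacian-of-Gaussian kernel with scale sigma.
std::vector<double> logKernel(int r, double sigma) {
    const double pi = 3.14159265358979323846;
    std::vector<double> k((2 * r + 1) * (2 * r + 1));
    double s2 = sigma * sigma;
    for (int y = -r; y <= r; ++y)
        for (int x = -r; x <= r; ++x) {
            double q = (x * x + y * y) / (2.0 * s2);
            k[(y + r) * (2 * r + 1) + (x + r)] =
                -(1.0 - q) * std::exp(-q) / (pi * s2 * s2);
        }
    return k;
}

// Plain convolution; border pixels are left at zero for brevity.
Image convolve(const Image& in, int w, int h,
               const std::vector<double>& k, int r) {
    Image out(in.size(), 0.0);
    for (int y = r; y < h - r; ++y)
        for (int x = r; x < w - r; ++x) {
            double acc = 0.0;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx)
                    acc += in[(y + dy) * w + (x + dx)] *
                           k[(dy + r) * (2 * r + 1) + (dx + r)];
            out[y * w + x] = acc;
        }
    return out;
}

// Superimpose the zero crossings of the LoG response (zero threshold)
// on the captured image by painting them white.
void enhance(std::vector<uint8_t>& img, int w, int h) {
    Image in(img.begin(), img.end());
    Image resp = convolve(in, w, h, logKernel(2, 1.0), 2);
    for (int y = 0; y < h - 1; ++y)
        for (int x = 0; x < w - 1; ++x)
            if (resp[y * w + x] * resp[y * w + x + 1] < 0.0 ||
                resp[y * w + x] * resp[(y + 1) * w + x] < 0.0)
                img[y * w + x] = 255;
}

// Truncated absolute difference: the AD is capped at 4% of the
// maximum intensity value, suppressing the influence of noise.
inline double truncatedAD(uint8_t a, uint8_t b) {
    const double cap = 0.04 * 255.0;
    return std::min(static_cast<double>(std::abs(int(a) - int(b))), cap);
}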



Fig. 4.1 Flow chart of the implemented threshold-based obstacle avoidance algorithm

Fig. 4.2 Image enhancement steps of the presented stereo algorithm
!



As mentioned before, the resulting disparity maps are equivalent to depth maps of the depicted scene and can be used directly for the subsequent obstacle analysis.

Threshold Direction Decision Algorithm

The previously calculated disparity map is used to extract useful information about the navigation of a robot. Contrary to many implementations that involve complex calculations upon the disparity map, the presented decision making algorithm involves only simple summations and threshold checks. This is feasible due to the absence of significant noise in the produced disparity map. The goal of the developed algorithm is to detect any existing obstacles in front of the robot and to safely avoid them, by steering the robot left or right, or by moving it forward.

In order to achieve that, the developed method divides the disparity map into three windows, as in Figure 4.3. The division of the disparity map excludes the boundary regions, in this case a peripheral frame of 20 pixels width, because the disparity calculation in such regions is often problematic.

Fig. 4.3 Depth map's division in three windows

In the central window, the pixels p whose disparity value D(p) is greater than a defined threshold value T are enumerated. Then, the enumeration result is examined. If it is smaller than a predefined rate r of all the central window's pixels, this means that there are no obstacles detected exactly in front of the robot and in close distance, and thus the robot can move forward. On the other hand, if this enumeration's result exceeds the predefined rate, the algorithm examines the other two windows and chooses the one with the smaller average disparity value. In this way the window with the fewest obstacles will be selected. The pseudocode of the implemented simple decision making algorithm follows:

Threshold-based Decision Making Pseudocode

for all the pixels p of the central window {
    if D(p) > T {
        counter++ }
    numC++ }
if counter < r% of numC {
    GO STRAIGHT }
else {
    for all the pixels p of the left window {
        sumL += D(p)
        numL++ }
    for all the pixels p of the right window {
        sumR += D(p)
        numR++ }
    avL = sumL / numL
    avR = sumR / numR
    if avL < avR {
        GO LEFT }
    else {
        GO RIGHT } }
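For illustration, a minimal C++ rendering of this decision rule follows, assuming a row-major disparity map; the 20-pixel border exclusion and the three equal windows follow the description above, while the function name and return values are illustrative.

#include <cstddef>
#include <string>
#include <vector>

std::string decideDirection(const std::vector<double>& disp, int w, int h,
                            double T, double r /* fraction, e.g. 0.1 for 10% */) {
    const int b = 20;                       // excluded peripheral frame
    const int winW = (w - 2 * b) / 3;       // width of each of the 3 windows
    // Count "near" pixels in the central window.
    std::size_t counter = 0, numC = 0;
    for (int y = b; y < h - b; ++y)
        for (int x = b + winW; x < b + 2 * winW; ++x) {
            if (disp[y * w + x] > T) ++counter;
            ++numC;
        }
    if (counter < r * numC) return "straight";  // nothing close in front
    // Otherwise compare the average disparity of the two side windows.
    double sumL = 0, sumR = 0; std::size_t numL = 0, numR = 0;
    for (int y = b; y < h - b; ++y)
        for (int x = b; x < b + winW; ++x) {
            sumL += disp[y * w + x];            ++numL;
            sumR += disp[y * w + x + 2 * winW]; ++numR;
        }
    return (sumL / numL < sumR / numR) ? "left" : "right";
}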


Fig. 4.4 A sample outdoor route and the algorithm's outputs

A measure of the algorithm's certainty about each decision would be meaningful only in the cases of the left or right decisions. As these two decisions are based on the same heuristic, they can be directly compared. The certainty cert of a direction's decision which yields an average disparity avD1, over the other direction which yields avD2 > avD1, is calculated as:

cert = (avD2 − avD1) / avD2    (4.1)

For instance, avD1 = 30 and avD2 = 40 would give cert = 25%.

The results for the left and right decisions of the algorithm are shown in Figure 4.6. For each decision, the pair's indicating number as well as the algorithm's decision is given. The certainty ranges from 0% for no certainty at all to 100% for absolute certainty. The bigger the area defined by the resulting points, the bigger the algorithm's overall certainty. However, big values of certainty are not always achievable. In the extreme case when both the left and the right direction are fully traversable, the certainty measure would become 0%. Despite this fact, the certainty is useful. Observing the correlation between false decisions and certainty values, a threshold could be decided, below which the algorithm should reconsider its decision.



Fig. 4.5 Percentage of the algorithm's correct decisions

Fig. 4.6 Percentage of certainty for the algorithm's decisions

4.1.3 Fuzzy Algorithm Description

An improved version of the threshold-based obstacle avoidance algorithm has been developed afterwards. Again, the algorithm requires only one stereo camera as input and consists of two independent modules. The first module is exactly the same stereo correspondence algorithm used in the threshold algorithm. The second module is a fuzzy decision making algorithm that analyzes the depth maps.

• The stereo vision algorithm. It retrieves information about the environment from a stereo camera and produces a depth image, i.e. disparity map, of the scene.

• The fuzzy decision making algorithm. It analyzes the data of the previous algorithm and decides the best direction for the robot to move so as to avoid any existing obstacles, based on a simple fuzzy inference system (FIS).

The presented method processes each pair of stereoscopic images and indicates an obstacle-avoiding direction of movement for a robot, such as the one shown in Figure 4.7(a). First, the stereo image pair is given as input to a stereo vision algorithm and a depth map of the scene is obtained. This depth map is thereafter used as input of the fuzzy obstacle analysis and direction decision module. This fuzzy module indicates the proper direction of movement. The direction of movement ranges from −30° to +30°, considering 0° as the current direction of the robot. This angle range is dictated by the used stereo camera, i.e. a Bumblebee2 stereo camera manufactured by Point Grey Research, having a 60° horizontal field of view (HFoV). Furthermore, in cases when the scene is full of obstacles, or the depth map is too noisy to safely conclude a direction, a "move backwards" signal is foreseen. Figure 4.7(b) presents the mobile robot, shown as the "R" in the center, and the possible positions after the application of the presented algorithm, shown by the bold regions of the outer circle.

Fig. 4.7 (a) Stereo camera equipped mobile robotic platform and (b) floor plan of the robot's environment

This method also divides each disparity map into three equal windows, in exactly the same way that has been shown in Figure 4.3. However, this algorithm treats all three windows identically. In each window w, the pixels p whose disparity value D(p) is greater than a defined threshold value T are enumerated. The enumeration results are normalized by the window's pixel population and then examined. The more traversable the corresponding direction is, the smaller the enumeration result should be. Thus, the traversability of the left, central and right window, respectively TRAV_L, TRAV_C and TRAV_R, is assessed.
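A minimal sketch of the traversability estimation for a single window follows, assuming a row-major disparity map; note that, as defined above, smaller values correspond to more traversable directions.

#include <vector>

// Fraction of pixels in the window [x0, x1) x [y0, y1) whose disparity
// exceeds T, i.e. which are closer than the safety threshold.
double traversability(const std::vector<double>& disp, int w,
                      int x0, int x1, int y0, int y1, double T) {
    int near = 0, total = 0;
    for (int y = y0; y < y1; ++y)
        for (int x = x0; x < x1; ++x) {
            if (disp[y * w + x] > T) ++near;  // pixel is "near"
            ++total;
        }
    return static_cast<double>(near) / total; // normalised enumeration result
}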

The results of the traversability estimation for the three windows, i.e. the left, central, and right one, are used as the three input values of a FIS that decides the proper direction of movement for the robot. The outputs of the FIS are the angle of the direction that the robot should follow and an indicator that the robot should move backwards. Figure 4.8 shows the membership functions (MF) for the three inputs (all having the identical MF shown in Figure 4.8(a)) and the two outputs (Figures 4.8(b) and 4.8(c)).
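Since the exact membership functions and rule base are those of Figure 4.8 and the thesis's rule table, the heavily simplified Mamdani-style sketch below only illustrates the mechanics of such a FIS: triangular membership functions, max-min rule evaluation and centroid defuzzification of the steering angle. The shapes and the three rules are illustrative assumptions; the "move backwards" output can be produced analogously from the degree to which all three directions are blocked.

#include <algorithm>

// Triangular membership function with feet at a and c and peak at b.
double tri(double x, double a, double b, double c) {
    if (x <= a || x >= c) return 0.0;
    return x < b ? (x - a) / (b - a) : (c - x) / (c - b);
}

// FIS inputs: traversability values in [0, 1]; smaller = freer direction.
struct Inputs { double left, central, right; };

// Defuzzified steering angle in [-30, +30] degrees (negative = left),
// computed as the centroid of the aggregated output fuzzy set.
double steeringAngle(const Inputs& in) {
    // Illustrative rule strengths: how "free" each direction is.
    double freeL = tri(in.left,    -0.5, 0.0, 0.7);
    double freeC = tri(in.central, -0.5, 0.0, 0.7);
    double freeR = tri(in.right,   -0.5, 0.0, 0.7);
    // Output MFs peaking at -30 (left), 0 (forward), +30 (right) degrees;
    // each rule clips its output MF (min), the clipped sets are merged (max).
    double num = 0, den = 0;
    for (double ang = -30; ang <= 30; ang += 0.5) {
        double mu = std::max({ std::min(freeL, tri(ang, -45, -30,  0)),
                               std::min(freeC, tri(ang, -30,   0, 30)),
                               std::min(freeR, tri(ang,   0,  30, 45)) });
        num += mu * ang;  den += mu;
    }
    return den > 0 ? num / den : 0.0;  // centroid defuzzification
}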


Fig. 4.8 Fuzzy membership functions: (a) input MF, traversability of the left/central/right window; (b) output MF, direction angle; (c) output MF, "move backwards" indicator


Fig. 4.9 Test images and disparity maps, (a)–(l), where the algorithm chose to move forward

As shown by the value of the parameter "move backwards" for the pairs of Figures 4.10(c) and 4.11(d), there is a relatively high tendency to adopt the option to move backwards in these cases. However, since the threshold value had been set to 0.65, the robot chose to steer left and right respectively. On the other hand, Figure 4.12 and Table 4.4 show a scene and the respective FIS variables where the robot actually decided to move backwards. A different threshold value for the "move backwards" parameter would have led to different behaviors for the last three cases.
backwards" parameter would had lead to other behaviors <strong>for</strong> the last three cases.



Table 4.1 Results for the cases where the algorithm chose to move forward

Figure   Left (%)   Central (%)   Right (%)   Angle (deg)   "move backwards"
(columns 2–4 are the FIS inputs; columns 5–6 its outputs)
4.9(a) 0.03 1.68 1.14 0.0 0.50
4.9(b) 0.03 0.02 0.80 0.0 0.50
4.9(c) 19.82 9.82 5.87 0.0 0.50
4.9(d) 54.05 5.65 5.23 0.0 0.50
4.9(e) 4.95 6.00 3.91 0.0 0.50
4.9(f) 5.20 8.16 9.34 0.0 0.50
4.9(g) 20.71 15.21 10.52 0.0 0.50
4.9(h) 18.72 11.65 15.38 0.0 0.50
4.9(i) 13.06 14.60 9.01 0.0 0.50
4.9(j) 71.42 16.05 4.83 0.0 0.50
4.9(k) 9.84 8.44 8.08 0.0 0.50
4.9(l) 0.85 0.13 3.81 0.0 0.50

Fig. 4.10 Test images and disparity maps where the algorithm chose to move left


Table 4.2 Results for the cases where the algorithm chose to move left

Figure   Left (%)   Central (%)   Right (%)   Angle (deg)   "move backwards"
(columns 2–4 are the FIS inputs; columns 5–6 its outputs)
4.10(a) 8.75 89.62 44.53 -17.1 0.50
4.10(b) 9.71 37.54 43.81 -16.6 0.50
4.10(c) 22.86 38.60 43.37 -12.0 0.56
4.10(d) 3.94 73.83 82.03 -19.0 0.50

Fig. 4.11 Test images and disparity maps where the algorithm chose to move right

Table 4.3 Results for the cases where the algorithm chose to move right

Figure   Left (%)   Central (%)   Right (%)   Angle (deg)   "move backwards"
(columns 2–4 are the FIS inputs; columns 5–6 its outputs)
4.11(a) 81.45 66.78 7.62 +18.6 0.50
4.11(b) 83.79 85.95 14.94 +19.3 0.50
4.11(c) 67.27 80.19 9.28 +18.6 0.50
4.11(d) 67.83 67.51 36.54 +9.1 0.61


Fig. 4.12 Test images and disparity map where the algorithm chose to move backwards

Table 4.4 Results for the cases where the algorithm chose to move backwards

Figure   Left (%)   Central (%)   Right (%)   Angle (deg)   "move backwards"
(columns 2–4 are the FIS inputs; columns 5–6 its outputs)
4.12 70.94 87.21 73.51 -2.6 0.69

4.1.5 Discussion

For mobile robots to move towards human-like behaviors, autonomous navigation is an essential milestone. Obstacle avoidance, using a minimum of sensory and processing resources, is the first step in this direction. In this Section two versions of a vision-based obstacle avoidance algorithm for autonomous mobile robots have been presented. The presented algorithms require only one sensor, i.e. a stereo camera, and a low amount of involved computations. The algorithms' structure consists of a specially developed and optimized stereo algorithm that produces noise-free depth maps, and a computationally simple decision making algorithm. The decision making algorithm avoids complex calculations and transformations. Consider as an example the case of the popular v-disparity implementation, where a Hough transformation is needed in order to compensate for the low quality disparity maps. On the other hand, algorithms simpler than the presented direction-deciding ones fail to yield correct results. In this case, consider an algorithm where the three windows of Fig. 4.3 are treated equally and the smallest average disparity is sought. This methodology is doomed to fail in the case, among many others, where only a thin obstacle is close to the robot and other obstacles are in medium range. Such a naive algorithm would choose the direction towards the close thin obstacle, avoiding the medium-range obstacles.

Firstly, the threshold-based version of the algorithm has been presented and tested on sequences of self-captured outdoor images. Its performance has been presented and discussed. The presented algorithm managed to successfully avoid obstacles in the vast majority of the tested image pairs. Despite its simple calculations, both during the disparity map generation and the decision making, the algorithm exhibited promising behavior.

Then, an improved fuzzy-based version of the obstacle avoidance algorithm has been covered. The presented method is based on the same custom developed stereo algorithm and a simple but effective fuzzy obstacle analysis and direction decision module. The robot executing the presented algorithm has effectively detected and avoided any obstacles using only stereo vision as input. The behavior of the method has been validated on real outdoor data sets of various scenes. The algorithm exhibits robust behavior and is able to ensure collision-free autonomous mobility to robots. Moreover, the trajectory of the robot's overall movement is smooth, resembling that of living creatures, due to the fuzzy system's continuous range of output values.

The simple structure of the presented algorithms and the absence of heavy computational payload are characteristics highly desirable in autonomous robotics. The real-time collision-free navigation of autonomous robotic platforms is the first step towards the accomplishment of more complex activities, e.g. path planning and mapping of an area. Consequently, the presented algorithms are suitable for autonomous robotic applications and are able to provide real-time obstacle avoidance behavior, based solely on stereo vision.

4.2 Stereo Vision-based SLAM

A visual SLAM algorithm suitable for indoor applications has been developed. The algorithm is focused on computational effectiveness. The only sensor used is a stereo camera placed onboard a moving robot. The algorithm processes the acquired images, calculating the depth of the scenery, detecting occupied areas and progressively building a map of the environment. The stereo vision-based SLAM algorithm embodies a custom-tailored stereo correspondence algorithm, the robust scale- and rotation-invariant feature detection and matching SURF method, a computationally effective v-disparity image calculation scheme, a novel map-merging module, as well as a sophisticated CA-based enhancement stage. The presented algorithm is suitable for autonomously mapping and measuring indoor areas using robots.

The presented SLAM approach adopts a simple solution that avoids complex update strategies in favor of a computationally efficient one. Emphasis has been given to the development of custom-tailored, non-iterative solutions for each step of the presented algorithm's execution. The specially developed stereo correspondence algorithm is a rapidly executed local SAD algorithm embodying Gaussian weighted aggregation and a double validation scheme based on a certainty estimation criterion, a bidirectional consistency check and sub-pixel accuracy. Concerning the camera's motion estimation, the SURF feature detector and matcher (Bay et al. 2008) has been utilized as the first step of an efficient estimation method. This estimation is further refined afterwards during a sophisticated map merging procedure and sharpened up by CA. The presented algorithm progressively builds a map of the environment, based entirely on stereo vision information. The produced maps indicate the occupied and free regions of the explored environment. The outline of the presented algorithm is summarized in Figure 4.13, and a sketch of the bidirectional consistency check follows.
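As an illustration of the validation stage, the sketch below implements a bidirectional (left-right) consistency check, assuming two row-major disparity maps computed with the left and the right image as reference respectively; the tolerance of one disparity level and the invalid-pixel sentinel are assumptions.

#include <cmath>
#include <vector>

void consistencyCheck(std::vector<double>& dispL,
                      const std::vector<double>& dispR,
                      int w, int h, double invalid = -1.0) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            double d = dispL[y * w + x];
            int xr = x - static_cast<int>(std::lround(d)); // matching pixel in right image
            // Reject the match if the right-referenced map disagrees.
            if (xr < 0 || xr >= w || std::abs(dispR[y * w + xr] - d) > 1.0)
                dispL[y * w + x] = invalid;
        }
}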

-*)+"&.,/)#&<br />

0#*%?:&<br />

.,/)#&<br />

@?A#$&<br />

!"#$#%&<br />

'()%$*"+,&<br />

Fig. 4.13 Outline <strong>of</strong> the presented SLAM algorithm<br />

0%1/(&2/3&<br />

4#5#$/6%5&<br />

8/,#$/9:&&<br />

2%6%5&<br />

;:6,/6%5&<br />

4(%7/(&<br />

2/3&<br />

4#5#$/6%5&<br />

=$#>*%?:&<br />

2/3&<br />

@?A#$&<br />

8/,#$/9:&<br />

2%6%5&<br />

4(%7/(&2/3&


Fig. 4.14 Reference image (a) of an indoor scene and sparse disparity map (b) obtained with the presented stereo correspondence algorithm

4.2.2 Camera's Motion Estimation

While the depth of the depicted objects is obtained by examining the two images of each stereo image pair, the motion of the camera is estimated by correlating the reference images from two consecutive image pairs, as shown in Figure 4.15.

Fig. 4.15 Depth vs. camera's motion estimation


Feature Detection and Matching

Feature detection and matching has become a very attractive and useful field for many computer vision applications. Among the variety of possible detectors and descriptors, this work has embodied SURF, as described in (Bay et al. 2008). SURF is a scale and rotation invariant detector and descriptor. It has the advantages of achieving high repeatability, distinctiveness and robustness. However, the most attractive feature of SURF is its computational efficiency, which allows very fast computation times. Preliminary experiments have confirmed the accuracy and effectiveness of SURF for the examined situations. SURF is given two consecutive reference images as input and provides as output a list containing the coordinates of N matched features in the two images.
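A common closed-form way to turn such a list of matched features into a motion estimate is a 2D least-squares (Procrustes) fit. The sketch below assumes the matched features have already been projected onto the floor plane using their stereo depth, and illustrates the principle rather than the thesis's exact estimation method.

#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };
struct Motion { double theta, tx, ty; };  // curr ~= R(theta) * prev + t

Motion estimateMotion(const std::vector<Pt>& prev, const std::vector<Pt>& curr) {
    const std::size_t n = prev.size();   // assumes n == curr.size() and n > 0
    Pt cp{0, 0}, cq{0, 0};
    for (std::size_t i = 0; i < n; ++i) {            // centroids
        cp.x += prev[i].x / n;  cp.y += prev[i].y / n;
        cq.x += curr[i].x / n;  cq.y += curr[i].y / n;
    }
    double s = 0, c = 0;                             // cross terms
    for (std::size_t i = 0; i < n; ++i) {
        double px = prev[i].x - cp.x, py = prev[i].y - cp.y;
        double qx = curr[i].x - cq.x, qy = curr[i].y - cq.y;
        c += px * qx + py * qy;
        s += px * qy - py * qx;
    }
    double theta = std::atan2(s, c);                 // optimal rotation angle
    double tx = cq.x - (std::cos(theta) * cp.x - std::sin(theta) * cp.y);
    double ty = cq.y - (std::sin(theta) * cp.x + std::cos(theta) * cp.y);
    return {theta, tx, ty};
}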

Fig. 4.16 Environment's maps for the scene of Figure 4.14 obtained with the presented algorithm: (a) local map, (b) initial global map, (c) updated global map, (d) CA enhanced global map



4.2.3 Local Map Generation

A local 2D map is computed from each stereo image pair. Figure 4.14(a) presents the reference image of a test image pair. Using the sparse disparity map obtained (see Figure 4.14(b)), a reliable v-disparity image can be computed (Labayrade et al. 2002, Zhao et al. 2007), as shown in Figure 4.17(a). The terrain is modeled in the v-disparity image by a linear equation. The parameters of this linear equation can be found using the Hough transform (De Cubber et al. 2009), if the camera-environment system's geometry is unknown. However, if the geometry of the system is constant and known (which is the case for a camera firmly mounted on a robot exploring a flat, e.g. indoor, environment), the two parameters can be easily computed beforehand and used in all the image pairs during the exploration. A tolerance region on either side of the terrain's linear segment is considered, and any point outside this region is considered an "obstacle". The linear segment denoting the terrain and the tolerance region, overlaid on the v-disparity image, are shown in Figure 4.17(b).
4.17(b).<br />

Fig. 4.17 V-disparity images for the image of Figure 4.14(a) and the corresponding disparity map of Figure 4.14(b): (a) calculated v-disparity image; (b) v-disparity image with the terrain modeled by the continuous line and the tolerance region shown between the two dashed lines

For each pixel corresponding to an "obstacle" the local coordinates are computed. The local map, e.g. the one shown in Figure 4.16(a), is an occupancy grid of the environment consisting of all the points corresponding to "obstacles".

4.2.4 Global Map Generation

The motion estimation technique gives the relative translation T and rotation R required to superimpose the new local map, Figure 4.16(a), to the global map accumulated up to that point, Figure 4.16(b). However, the situation of perfectly matched features that result in exactly precise T and R


Figure 4.18 shows the features detected and matched between various consecutive images of the used dataset, using the SURF algorithm. It can be seen that there are some faulty matches. However, the presented algorithm is not significantly affected by such cases, as shown by the accuracy of the results in Figure 4.19.

Fig. 4.18 Features detected and matched using SURF for various consecutive images of the used dataset

In Figure 4.19 the first column (a) presents the reference images of the first, second, sixth, and tenth image pair of the tested image series. The differences in the illumination conditions are evident, especially in the image of the third row. The second column, Figure 4.19(b), presents the sparse disparity maps computed with the used stereo algorithm. One can observe that very few falsely matched pixels have been produced, while the overall coverage of the scene is more than enough to detect any obstacles in it. The third column, Figure 4.19(c), shows the computed local maps, i.e. the occupancy grids of the obstacles detected in the corresponding disparity maps. The fourth column, Figure 4.19(d), shows the global maps of the environment containing the accumulated local maps up to that point. In the first image pair, the local and the global maps are identical, since no prior knowledge about the environment existed. The gradual superimposition of further local maps is remarkably accurate and results in clear arrangements of "obstacle" points. The final column, Figure 4.19(e), shows the global maps after the presented CA enhancement. This procedure makes the sparse information of the global maps continuous and clearer.

Fig. 4.19 Experimental results after processing 1 (first row), 2 (second row), 6 (third row), and 10 (fourth row) image pairs of the scene: (a) reference images, (b) sparse disparity maps, (c) local maps, (d) updated global maps, (e) CA enhanced global maps
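The thesis's exact CA transition rule is not reproduced in this excerpt; as an illustration, the sketch below applies one majority-vote iteration to the binary occupancy grid, filling small gaps between "obstacle" cells and removing isolated noise points.

#include <vector>

// One CA step on a binary occupancy grid (0 = free, 1 = obstacle).
std::vector<int> caStep(const std::vector<int>& grid, int w, int h) {
    std::vector<int> next(grid);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            int occupied = 0;                        // occupied 8-neighbours
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx || dy) occupied += grid[(y + dy) * w + (x + dx)];
            if (occupied >= 5)      next[y * w + x] = 1; // fill dense areas
            else if (occupied <= 1) next[y * w + x] = 0; // remove isolated noise
        }
    return next;
}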

As shown by the lower-right image of Figure 4.19, the result of the algorithm with only ten input image pairs is a clear and reliable representation of the obstacles found in the hallway. The walls, the closed door straight ahead of the camera, and some open doors are all clearly visible.


4.2.6 Discussion

The presented stereo vision-based SLAM algorithm incorporates new methodologies for the generation and the superimposition of the partial maps. The sole use of one sensor, i.e. a stereo camera, and the substitution of computationally demanding procedures are indicative of the algorithm's focus on computational simplicity. The presented experimental results reveal the overall algorithm's accuracy.


Chapter 5

Conclusion and Future Work

5.1 Conclusion

This thesis was motivated by the observation that, even though stereo vision and autonomous robots are often used together, little care is taken to ensure that each component takes the requirements of the other fully into consideration. Consequently, the objective of this thesis was to develop sophisticated stereo vision systems suitable for use in autonomous robots.

Towards this end, first, the relevant literature was surveyed. Stereo vision algorithms and respective robotic applications were covered. This survey pointed out the weaknesses of the available stereo correspondence algorithms when used by robots in real working environments. The compilation of a list containing the major open issues of robotic stereo vision, as given in Section 2.4, is a first step towards confronting these issues. Up to now, little care had been given to developing custom stereo vision algorithms for robotic applications, and simple or even simplistic approaches were often adopted by robotics researchers. As a result, the survey indicated that the most suitable and realistic way to proceed was to develop new custom-made stereo correspondence algorithms.

The next step after recognizing the weaknesses of the current systems was to propose new and efficient ways to deal with those weaknesses. The solutions proposed within this dissertation make use of various sophisticated, interdisciplinary computational tools. Biologically and psychologically inspired methods have been used, such as the logarithmic response law (Weber-Fechner law) and the gestalt laws of perceptual organization (proximity, similarity and continuity). Furthermore, sophisticated computational methods have been used, such as 2D and 3D cellular automata and fuzzy inference systems for robotic vision applications. Additionally, ideas from the field of video coding have been incorporated in stereo vision applications. The experimental results obtained by the proposed algorithms show that efficient robotic stereo vision is achievable with the use of carefully selected computational tools.

What all of the developed stereo correspondence algorithms have in common is that they embody non-iterative computational stages. The proposed algorithms varied from simple local stereo algorithms to sophisticated ASW-based ones. Many open issues detected by the literature survey were addressed by these algorithms. The goals of simple computational schemes, efficient exploitation of input redundancy, tolerance to non-calibrated input, tolerance to difficult lighting conditions and use of biologically inspired concepts have been pursued. The proposed algorithms exhibited advantages over other methods when the objective was use in robotic applications. Moreover, the online test-bench of (Scharstein & Szeliski 2010) has been used to compare the results of the developed algorithms to those of others. The algorithms listed in that site include ones that are computationally intensive and not suitable for robotic applications. Nevertheless, the most advanced of the algorithms proposed within this thesis (Nalpantidis & Gasteratos 2010a,b) are indexed in adequate places in the website's evaluation list.

Finally, the knowledge gained by the development of new stereo correspondence algorithms was exploited in developing stereo vision systems for robotic applications. The depth estimations provided by stereo vision algorithms were further analyzed so that robots can safely navigate in their environments. Obstacle avoidance and SLAM applications have been developed. Once again, the developed systems focused on avoiding complex computational schemes. Instead, new, efficient methods have been proposed so as to keep the computational load low. The experimental results show that the objective set at the beginning of this thesis has been met. The presented systems respect the restrictions set by their host robotic platforms and still achieve accurate results.

Using only vision sensors for robotic navigation purposes is an appealing solution. Vision provides enough information which, if accurate and reliable, can diminish the need for additional sensor input. As a result, the complexity of the system can be significantly reduced. However, solely vision-based autonomous robotic applications demand highly effective vision algorithms. The reliability of vision algorithms and their successful operation under difficult conditions are necessary conditions that have to be met. The success of vision-based robotic applications depends on their underlying vision algorithms.

5.2 Future Work

This thesis has accomplished its initially set objective. However, the course of this work has revealed various other appealing research directions. Even better and more efficient stereo vision systems can be developed. The field of stereo vision has not reached a state of saturation over the last decades and is not expected to do so in the next few years. The knowledge gained through this thesis can provide a stable basis upon which even better results can be achieved.

One possible future research direction has to do with incorporating the latest neuroscience findings in robotic vision algorithms and, beyond that, in robotic vision-based inference systems. Neuroscience has made tremendous progress during the last years, but its findings are neither completely decoded nor adapted to help solve the open problems of the robotic vision community. The use of the HVS's mechanisms in robotic vision issues is a very interesting and challenging prospect, both from a scientific and a technological point of view. The analog nature of brain stimuli and the vast processing power demanded for their processing are both within the reach of contemporary technology. Filling the missing link between these two aspects, i.e. neuromorphic sensors and vision algorithms, requires working in both directions. The pursuit of this endeavor can possibly produce results that will further advance the robotic vision field.

On the other hand, using the already developed stereo vision algorithms and methods as a basis for achieving further and more advanced autonomous behaviors is another interesting research direction. More precisely, problems such as SLAM, human-machine interaction, as well as scene analysis and understanding are still open to a large extent. The need for robust, autonomous capabilities of robotic assistants in defense, security and civil protection applications makes applied research in this area rather interesting and appealing. Additionally, robots are widely expected to play an increasingly important role in many aspects of our lives. For robots to seamlessly adapt to our anthropocentric environments, cognitive capabilities are required, and effective vision systems play an essential role.

Hardware implementation of the presented stereo vision algorithms is also an appealing research direction. The algorithms developed within this thesis focused on the simplicity of the used computational tools and adopted non-iterative schemes. These attributes make the hardware implementation of those algorithms feasible. An implementation in FPGA would provide very rapid execution times and small power consumption, and would avoid the extensive usage of the PC located onboard the host robotic platform.

To sum up, stereo vision is rapidly evolving so as to cover the demands posed by autonomous robots. Numerous and more reliable vision-based applications are expected to emerge as this technology matures. As a result, the axes along which future work on stereo vision is expected to be deployed mainly lie at the application level.


References

Agrawal, M. & Konolige, K. (2008), 'FrameSLAM: From bundle adjustment to real-time visual mapping', IEEE Transactions on Robotics 24(5).

Agrawal, M., Konolige, K. & Bolles, R. (2007), Localization and mapping for autonomous navigation in outdoor terrains: A stereo vision approach, in 'IEEE Workshop on Applications of Computer Vision', Austin, Texas, USA.

Amanatiadis, A., Andreadis, I. & Konstantinidis, K. (2008), 'Design and implementation of a fuzzy area-based image-scaling technique', IEEE Transactions on Instrumentation and Measurement 57(8), 1504–1513.

Arias-Estrada, M. & Xicotencatl, J. M. (2001), Multiple stereo matching using an extended architecture, in 'International Conference on Field-Programmable Logic and Applications', Vol. 2147 of Lecture Notes in Computer Science, Springer-Verlag, pp. 203–212.

Bailey, T. & Durrant-Whyte, H. (2006), 'Simultaneous localization and mapping (SLAM): Part II', IEEE Robotics & Automation Magazine 13(3), 108–117.

Barnard, S. T. & Thompson, W. B. (1980), 'Disparity analysis of images', IEEE Transactions on Pattern Analysis and Machine Intelligence 2(4), 333–340.

Bay, H., Ess, A., Tuytelaars, T. & Van Gool, L. (2008), 'Speeded-up robust features (SURF)', Computer Vision and Image Understanding 110, 346–359.

Berthouze, L. & Metta, G. (2005), 'Epigenetic robotics: modelling cognitive development in robotic systems', Cognitive Systems Research 6(3), 189–192.

Bharath, A. & Petrou, M. (2008), Next Generation Artificial Vision Systems: Reverse Engineering the Human Visual System, Artech House, USA.

Binaghi, E., Gallo, I., Marino, G. & Raspanti, M. (2004), 'Neural adaptive stereo matching', Pattern Recognition Letters 25(15), 1743–1758.

Bleyer, M. & Gelautz, M. (2005), 'A layered stereo matching algorithm using image segmentation and global visibility constraints', ISPRS Journal of Photogrammetry and Remote Sensing 59(3), 128–150.

Borenstein, J. & Koren, Y. (1990), 'Real-time obstacle avoidance for fast mobile robots in cluttered environments', IEEE Transactions on Systems, Man, and Cybernetics 19(5), 1179–1187.


Borenstein, J. & Koren, Y. (1991), 'The vector field histogram - fast obstacle avoidance for mobile robots', IEEE Transactions on Robotics and Automation 7(3), 278–288.

Brockers, R. (2009), Cooperative stereo matching with color-based adaptive local support, in 'International Conference on Computer Analysis of Images and Patterns', Springer-Verlag, Berlin, Heidelberg, pp. 1019–1027.

Brockers, R., Hund, M. & Mertsching, B. (2005), Stereo vision using cost-relaxation with 3D support regions, in 'Image and Vision Computing New Zealand', pp. 96–101.

Chen, Z., Samarabandu, J. & Rodrigo, R. (2007), 'Recent advances in simultaneous localization and map-building using computer vision', Advanced Robotics 21(3), 233–265.

Chonghun, R., Taehyun, H., Sungsik, K. & Jaeseok, K. (2004), Symmetrical dense disparity estimation: algorithms and FPGAs implementation, in 'IEEE International Symposium on Consumer Electronics', pp. 452–456.

Chopard, B. & Droz, M. (1998), Cellular Automata Modeling of Physical Systems, Cambridge University Press.

Corke, P. (2005), 'Machine vision toolbox', IEEE Robotics and Automation Magazine 12(4), 16–25.

Darabiha, A., Maclean, J. W. & Rose, J. (2006), 'Reconfigurable hardware implementation of a phase-correlation stereo algorithm', Machine Vision and Applications 17(2), 116–132.

Davison, A. (2007), Vision-based SLAM in real-time, in 'Pattern Recognition and Image Analysis', Vol. 1 of Lecture Notes in Computer Science, Springer Berlin/Heidelberg, pp. 9–12.

Davison, A. J. (2003), Real-time simultaneous localisation and mapping with a single camera, in 'IEEE International Conference on Computer Vision', Vol. 2, pp. 1403–1410.

Davison, A. J. & Kita, N. (2001), 3D simultaneous localisation and map-building using active vision for a robot moving on undulating terrain, in 'IEEE Conference on Computer Vision and Pattern Recognition', Vol. 1, IEEE Computer Society Press, pp. 384–391.

Davison, A. J., Mayol, W. W. & Murray, D. W. (2003), Real-time localisation and mapping with wearable active vision, in 'IEEE International Symposium on Mixed and Augmented Reality', IEEE Computer Society Press, pp. 18–27.

Davison, A. & Murray, D. (2002), 'Simultaneous localization and map-building using active vision', IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 865–880.

De Cubber, G., Doroftei, D., Nalpantidis, L., Sirakoulis, G. C. & Gasteratos, A. (2009), Stereo-based terrain traversability analysis for robot navigation, in 'IARP/EURON Workshop on Robotics for Risky Interventions and Environmental Surveillance', Brussels, Belgium.

De Cubber, G., Nalpantidis, L., Sirakoulis, G. C. & Gasteratos, A. (2008), Intelligent robots need intelligent vision: Visual 3D perception, in 'IARP/EURON Workshop on Robotics for Risky Interventions and Environmental Surveillance', Benicàssim, Spain.

Di Stefano, L., Marchionni, M. & Mattoccia, S. (2004), 'A fast area-based stereo matching algorithm', Image and Vision Computing 22(12), 983–1005.

Dissanayake, G., Newman, P., Durrant-Whyte, H., Clark, S. & Csorba, M. (2001), 'A solution to the simultaneous localisation and map building (SLAM) problem', IEEE Transactions on Robotics and Automation 17(2), 229–241.

Durrant-Whyte, H. & Bailey, T. (2006), 'Simultaneous localisation and mapping (SLAM): Part I, the essential algorithms', IEEE Robotics and Automation Magazine 13(2).

El-Etriby, S., Al-Hamadi, A. & Michaelis, B. (2006), 'Dense depth map reconstruction by phase difference-based algorithm under influence of perspective distortion', Machine Graphics and Vision International Journal 15(3), 349–361.



El-Etriby, S., Al-Hamadi, A. & Michaelis, B. (2007), Dense stereo correspondence with slanted surface using phase-based algorithm, in 'IEEE International Symposium on Industrial Electronics', Vigo, Spain, pp. 1807–1813.

Faugeras, O. (1993), Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, Cambridge, MA.

Faugeras, O., Hotz, B., Mathieu, H., Vieville, T., Zhang, Z., Fua, P., Theron, E., Moll, L., Berry, G., Vuillemin, J., Bertin, P. & Proy, C. (1993), Real time correlation based stereo: algorithm implementations and applications, Technical Report RR-2013, INRIA.

Feynman, R. (1982), 'Simulating physics with computers', International Journal of Theoretical Physics 21(6), 467–488.

Forsyth, D. A. & Ponce, J. (2002), Computer Vision: A Modern Approach, Prentice Hall, Upper Saddle River, NJ, USA.

Gasteratos, A. & Sandini, G. (2002), Factors Affecting the Accuracy of an Active Vision Head, Vol. 2308 of Lecture Notes in Computer Science, Springer-Verlag, Berlin-Heidelberg, pp. 413–422.

Georgoulas, C., Kotoulas, L., Sirakoulis, G. C., Andreadis, I. & Gasteratos, A. (2008), 'Real-time disparity map computation module', Journal of Microprocessors and Microsystems 32(3), 159–170.

Gong, M., Yang, R., Wang, L. & Gong, M. (2007), 'A performance study on different cost aggregation approaches used in real-time stereo matching', International Journal of Computer Vision 75(2), 283–296.

Gong, M. & Yang, Y.-H. (2005a), 'Fast unambiguous stereo matching using reliability-based dynamic programming', IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6), 998–1003.

Gong, M. & Yang, Y.-H. (2005b), Near real-time reliable stereo matching using programmable graphics hardware, in 'IEEE Computer Society Conference on Computer Vision and Pattern Recognition', Vol. 1, pp. 924–931.

Gonzalez, R. C. & Woods, R. E. (1992), Digital Image Processing, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Gu, Z., Su, X., Liu, Y. & Zhang, Q. (2008), 'Local stereo matching with adaptive support-weight, rank transform and disparity calibration', Pattern Recognition Letters 29(9), 1230–1235.

Guivant, J. & Nebot, E. (2001), 'Optimization of the simultaneous localization and map building algorithm for real time implementation', IEEE Transactions on Robotics and Automation 17(3), 242–257.

Gutierrez, S. & Marroquin, J. L. (2004), 'Robust approach for disparity estimation in stereo vision', Image and Vision Computing 22(3), 183–195.

Gutmann, J.-S., Fukuchi, M. & Fujita, M. (2005), A floor and obstacle height map for 3D navigation of a humanoid robot, in 'IEEE International Conference on Robotics and Automation', pp. 1066–1071.

Hariyama, M., Kobayashi, Y., Sasaki, H. & Kameyama, M. (2005), 'FPGA implementation of a stereo matching processor based on window-parallel-and-pixel-parallel architecture', IEICE Transactions on Fundamentals of Electronics, Communications and Computer Science 88(12), 3516–3522.

Hariyama, M., Sasaki, H. & Kameyama, M. (2005), 'Architecture of a stereo matching VLSI processor based on hierarchically parallel memory access', IEICE Transactions on Information and Systems E88-D(7), 1486–1491.



Hariyama, M., Takeuchi, T. & Kameyama, M. (2000), Reliable stereo matching for highly-safe intelligent vehicles and its VLSI implementation, in ‘IEEE Intelligent Vehicles Symposium’, pp. 128–133.
Hartley, R. & Zisserman, A. (2004), Multiple View Geometry in Computer Vision, second edn, Cambridge University Press.
Hirschmuller, H. (2005), Accurate and efficient stereo processing by semi-global matching and mutual information, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 807–814.
Hirschmuller, H. (2006), Stereo vision in structured environments by consistent semi-global matching, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 2386–2393.
Hirschmuller, H. & Scharstein, D. (2007), Evaluation of cost functions for stereo matching, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Minneapolis, Minnesota, USA.
Hogue, A., German, A. & Jenkin, M. (2007), Underwater environment reconstruction using stereo and inertial data, in ‘IEEE International Conference on Systems, Man and Cybernetics’, Montreal, Canada, pp. 2372–2377.
Holmes, S. A., Klein, G. & Murray, D. W. (2009), ‘An O(N²) square root unscented Kalman filter for visual simultaneous localization and mapping’, IEEE Transactions on Pattern Analysis and Machine Intelligence 31(7), 1251–1263.
Hong, L. & Chen, G. (2004), Segment-based stereo matching using graph cuts, in ‘IEEE Conference on Computer Vision and Pattern Recognition’, Vol. 1, pp. 74–81.
Hosni, A., Bleyer, M., Gelautz, M. & Rhemann, C. (2009), Local stereo matching using geodesic support weights, in ‘IEEE International Conference on Image Processing’, pp. 2093–2096.
Hua, X., Yokomichi, M. & Kono, M. (2005), Stereo correspondence using color based on competitive-cooperative neural networks, in ‘International Conference on Parallel and Distributed Computing Applications and Technologies’, Dalian, China, pp. 856–860.
Huang, S., Wang, Z. & Dissanayake, G. (2008), ‘Sparse local submap joining filter for building large-scale maps’, IEEE Transactions on Robotics 24(5), 1121–1130.
Huang, X. & Dubois, E. (2004), Dense disparity estimation based on the continuous wavelet transform, in ‘Canadian Conference on Electrical and Computer Engineering’, Vol. 1, pp. 465–468.
Iocchi, L. & Konolige, K. (1998), A multiresolution stereo vision system for mobile robots, in ‘Italian AI Association Workshop on New Trends in Robotics Research’.
Jain, R., Kasturi, R. & Schunck, B. G. (1995), Machine Vision, McGraw-Hill, New York, USA.
Jeong, H. & Park, S. (2004), Generalized trellis stereo matching with systolic array, in ‘International Symposium on Parallel and Distributed Processing and Applications’, Vol. 3358, Springer-Verlag, pp. 263–267.
Jia, Y., Xu, Y., Liu, W., Yang, C., Zhu, Y., Zhang, X. & An, L. (2003), A miniature stereo vision machine for real-time dense depth mapping, in ‘International Conference on Computer Vision Systems’, Vol. 2626 of Lecture Notes in Computer Science, pp. 268–277.
Jobson, D. J., ur Rahman, Z. & Woodell, G. A. (1997), ‘A multiscale retinex for bridging the gap between color images and the human observation of scenes’, IEEE Transactions on Image Processing 6(7), 965–976.
Jung, H. (1994), ‘Visual navigation for a mobile robot using landmarks’, Advanced Robotics 9(4), 429–442.



Kalomiros, J. A. & Lygouras, J. (2008), ‘Hardware implementation of a stereo co-processor in a medium-scale field programmable gate array’, IET Computers and Digital Techniques 2(5), 336–346.
Kalomiros, J. & Lygouras, J. (2009), ‘Comparative study of local SAD and dynamic programming for stereo processing using dedicated hardware’, EURASIP Journal on Advances in Signal Processing 2009, 1–18.
Kelly, A. & Stentz, A. (1998), Stereo vision enhancements for low-cost outdoor autonomous vehicles, in ‘International Conference on Robotics and Automation, Workshop WS-7, Navigation of Outdoor Autonomous Vehicles’.
Khatib, O. (1996), ‘Motion coordination and reactive control of autonomous multi-manipulator system’, Journal of Robotic Systems 15(4), 300–319.
Khatib, O. (1999), ‘Robots in human environments: basic autonomous capabilities’, International Journal of Robotics Research 18(7), 684–696.
Kim, H. & Sohn, K. (2005), ‘3D reconstruction from stereo images for interactions between real and virtual objects’, Signal Processing: Image Communication 20(1), 61–75.
Kim, J. C., Lee, K. M., Choi, B. T. & Lee, S. U. (2005), A dense stereo matching using two-pass dynamic programming with generalized ground control points, in ‘IEEE Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 1075–1082.
Klancar, G., Kristan, M. & Karba, R. (2004), ‘Wide-angle camera distortions and non-uniform illumination in mobile robot tracking’, Journal of Robotics and Autonomous Systems 46, 125–133.
Klaus, A., Sormann, M. & Karner, K. (2006), Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure, in ‘18th International Conference on Pattern Recognition’, Vol. 3, Hong Kong, China, pp. 15–18.
Klippenstein, J. & Zhang, H. (2007), Quantitative evaluation of feature extractors for visual SLAM, in ‘Fourth Canadian Conference on Computer and Robot Vision’, pp. 157–164.
Kohler, W. (1969), The task of Gestalt psychology, Princeton University Press, Princeton, NJ.
Konolige, K., Agrawal, M., Bolles, R. C., Cowan, C., Fischler, M. & Gerkey, B. P. (2006), Outdoor mapping and navigation using stereo vision, in ‘International Symposium on Experimental Robotics’, Vol. 39, Springer, Brazil, pp. 179–190.
Kotoulas, L., Gasteratos, A., Sirakoulis, G. C., Georgoulas, C. & Andreadis, I. (2005), Enhancement of fast acquired disparity maps using a 1-d cellular automaton filter, in ‘IASTED International Conference on Visualization, Imaging and Image Processing’, Benidorm, Spain, pp. 355–359.
Kotoulas, L., Georgoulas, C., Gasteratos, A., Sirakoulis, G. C. & Andreadis, I. (2005), A novel three stage technique for accurate disparity maps, in ‘EOS Conference on Industrial Imaging and Machine Vision’, Munich, Germany, pp. 13–14.
Kunchev, V., Jain, L., Ivancevic, V. & Finn, A. (2006), Path planning and obstacle avoidance for autonomous mobile robots: A review, in ‘International Conference on Knowledge-Based and Intelligent Information and Engineering Systems’, Vol. 4252 of Lecture Notes in Computer Science, Springer-Verlag, pp. 537–544.
Kyung Hyun, C., Minh Ngoc, N. & M. Asif Ali, R. (2008), ‘A real time collision avoidance algorithm for mobile robot based on elastic force’, International Journal of Mechanical, Industrial and Aerospace Engineering 2(4), 230–233.
Labayrade, R., Aubert, D. & Tarel, J.-P. (2002), Real time obstacle detection in stereovision on non flat road geometry through V-disparity representation, in ‘IEEE Intelligent Vehicle Symposium’, Vol. 2, Versailles, France, pp. 646–651.



Lee, S., Yi, J. & Kim, J. (2005), Real-time stereo vision on a reconfigurable system, in ‘International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation’, Vol. 3553 of Lecture Notes in Computer Science, Springer, pp. 299–307.
Lei, C., Selzer, J. & Yang, Y.-H. (2006), ‘Region-tree based stereo using dynamic programming optimization’, IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2, 2378–2385.
Lemaire, T., Berger, C., Jung, I. & Lacroix, S. (2007), ‘Vision-based SLAM: Stereo and monocular approaches’, International Journal of Computer Vision 74(3), 343–364.
Liu, C., Pei, W., Niyokindi, S., Song, J. & Wang, L. (2006), ‘Micro stereo matching based on wavelet transform and projective invariance’, Measurement Science and Technology 17(3), 565–571.
Lowe, D. G. (2004), ‘Distinctive image features from scale-invariant keypoints’, International Journal of Computer Vision 60(2), 91–110.
Maimone, M. W. & Shafer, S. A. (1996), A taxonomy for stereo computer vision experiments, in ‘ECCV Workshop on Performance Characteristics of Vision Algorithms’, pp. 59–79.
Manz, A., Liscano, R. & Green, D. (1993), A comparison of realtime obstacle avoidance methods for mobile robots, in ‘Experimental Robotics II’, Springer-Verlag, pp. 299–316.
Manzotti, R., Gasteratos, A., Metta, G. & Sandini, G. (2001), ‘Disparity estimation on log-polar images and vergence control’, Computer Vision and Image Understanding 83(2), 97–117.
Mardiris, V., Sirakoulis, G. C., Mizas, C., Karafyllidis, I. & Thanailakis, A. (2008), ‘A CAD system for modeling and simulation of computer networks using cellular automata’, IEEE Transactions on Systems, Man, and Cybernetics, Part C 38(2), 253–264.
Marr, D. & Poggio, T. (1976), ‘Cooperative computation of stereo disparity’, Science 194(4262), 283–287.
Masrani, D. K. & MacLean, W. J. (2006), A real-time large disparity range stereo-system using FPGAs, in ‘IEEE International Conference on Computer Vision Systems’, Vol. 3852, pp. 13–20.
Mayoral, R., Lera, G. & Perez-Ilzarbe, M. J. (2006), ‘Evaluation of correspondence errors for stereo’, Image and Vision Computing 24(12), 1288–1300.
Mead, C. (1990), ‘Neuromorphic electronic systems’, Proceedings of the IEEE 78(10), 1629–1636.
Mei, C., Sibley, G., Cummins, M., Newman, P. & Reid, I. (2009), A constant time efficient stereo SLAM system, in ‘British Machine Vision Conference’.
Metta, G., Gasteratos, A. & Sandini, G. (2004), ‘Learning to track colored objects with log-polar vision’, Mechatronics 14(9), 989–1006.
Mingxiang, L. & Yunde, J. (2006), ‘Trinocular cooperative stereo vision and occlusion detection’, IEEE International Conference on Robotics and Biomimetics, pp. 1129–1133.
Miyajima, Y. & Maruyama, T. (2003), A real-time stereo vision system with FPGA, in ‘International Conference on Field-Programmable Logic and Applications’, Vol. 2778 of Lecture Notes in Computer Science, Springer, pp. 448–457.
Montemerlo, M. (2003), FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem with Unknown Data Association, PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.
Montemerlo, M. & Thrun, S. (2007), FastSLAM: A Scalable Method for the Simultaneous Localization and Mapping Problem in Robotics, Springer.
Moravec, H. (1987), Certainty grids for mobile robots, in ‘NASA/JPL Space Telerobotics Workshop’, Vol. 3, pp. 307–312.
Mordohai, P. & Medioni, G. G. (2006), ‘Stereo using monocular cues within the tensor voting framework’, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(6), 968–982.



Moreno, F., Blanco, J. & Gonzalez, J. (2009), ‘Stereo vision specific models for particle filter-based SLAM’, Robotics and Autonomous Systems 57(9), 955–970.
Muhlmann, K., Maier, D., Hesser, J. & Manner, R. (2002), ‘Calculating dense disparity maps from color stereo images, an efficient implementation’, International Journal of Computer Vision 47(1-3), 79–88.
Murray, D. & Jennings, C. (1997), Stereo vision based mapping and navigation for mobile robots, in ‘IEEE International Conference on Robotics and Automation’, Vol. 2, pp. 1694–1699.
Murray, D. & Little, J. J. (2000), ‘Using real-time stereo vision for mobile robot navigation’, Autonomous Robots 8(2), 161–171.
Nalpantidis, L. & Gasteratos, A. (2010a), ‘Biologically and psychophysically inspired adaptive support weights algorithm for stereo correspondence’, Robotics and Autonomous Systems 58, 457–464.
Nalpantidis, L. & Gasteratos, A. (2010b), ‘Stereo vision for robotic applications in the presence of non-ideal lighting conditions’, Image and Vision Computing 28, 940–951.
Nalpantidis, L. & Kostavelis, I. (2009), ‘http://robotics.pme.duth.gr/reposit/stereoroutes.zip’. Group of Robotics and Cognitive Systems.
Nalpantidis, L., Kostavelis, I. & Gasteratos, A. (2009), Stereovision-based algorithm for obstacle avoidance, in ‘International Conference on Intelligent Robotics and Applications’, Vol. 5928 of Lecture Notes in Computer Science, Springer-Verlag, Singapore, pp. 195–204.
Nalpantidis, L., Sirakoulis, G. C. & Gasteratos, A. (2007), Review of stereo matching algorithms for 3D vision, in ‘16th International Symposium on Measurement and Control in Robotics’, Warsaw, Poland, pp. 116–124.
Nalpantidis, L., Sirakoulis, G. C. & Gasteratos, A. (2008a), A dense stereo correspondence algorithm for hardware implementation with enhanced disparity selection, in ‘5th Hellenic Conference on Artificial Intelligence’, Vol. 5138 of Lecture Notes in Computer Science, Springer-Verlag, Syros, Greece, pp. 365–370.
Nalpantidis, L., Sirakoulis, G. C. & Gasteratos, A. (2008b), ‘Review of stereo vision algorithms: from software to hardware’, International Journal of Optomechatronics 2(4), 435–462.
Nister, D., Naroditsky, O. & Bergen, J. R. (2006), ‘Visual odometry for ground vehicle applications’, Journal of Field Robotics 23(1), 3–20.
Ogale, A. S. (2009), ‘http://www.cs.umd.edu/users/ogale/download/code.html’.
Ogale, A. S. & Aloimonos, Y. (2005a), Robust contrast invariant stereo correspondence, in ‘IEEE International Conference on Robotics and Automation’, pp. 819–824.
Ogale, A. S. & Aloimonos, Y. (2005b), ‘Shape and the stereo correspondence problem’, International Journal of Computer Vision 65(3), 147–162.
Ogale, A. S. & Aloimonos, Y. (2007), ‘A roadmap to the integration of early visual modules’, International Journal of Computer Vision 72(1), 9–25.
Ohya, A., Kosaka, A. & Kak, A. (1998), ‘Vision-based navigation of mobile robot with obstacle avoidance by single camera vision and ultrasonic sensing’, IEEE Transactions on Robotics and Automation 14(6), 969–978.
Park, S. & Jeong, H. (2007), Real-time stereo vision FPGA chip with low error rate, in ‘International Conference on Multimedia and Ubiquitous Engineering’, pp. 751–756.
Pinoli, J. C. & Debayle, J. (2007), ‘Logarithmic adaptive neighborhood image processing (LANIP): Introduction, connections to human brightness perception, and application issues’, EURASIP Journal on Advances in Signal Processing 2007(1), 114–135.



Reignier, P. (1994), ‘Fuzzy logic techniques for mobile robot obstacle avoidance’, Robotics and Autonomous Systems 12(3-4), 143–153.
Ruigang, Y., Welch, G. & Bishop, G. (2002), ‘Real-time consensus-based scene reconstruction using commodity graphics hardware’, 10th Pacific Conference on Computer Graphics and Applications, pp. 225–234.
Russell, R. A., Taylor, G., Kleeman, L. & Purnamadjaja, A. H. (2004), ‘Multi-sensory synergies in humanoid robotics’, International Journal of Humanoid Robotics 1(2), 289–314.
Sabe, K., Fukuchi, M., Gutmann, J.-S., Ohashi, T., Kawamoto, K. & Yoshigahara, T. (2004), Obstacle avoidance and path planning for humanoid robots using stereo vision, in ‘IEEE International Conference on Robotics and Automation’, Vol. 1, pp. 592–597.
Salmen, J., Schlipsing, M., Edelbrunner, J., Hegemann, S. & Luke, S. (2009), Real-time stereo vision: Making more out of dynamic programming, in ‘International Conference on Computer Analysis of Images and Patterns’, pp. 1096–1103.
Santini, F., Nambisan, R. & Rucci, M. (2009), ‘Active 3D vision through gaze relocation in a humanoid robot’, International Journal of Humanoid Robotics 6(3), 481–503.
Scharstein, D. & Pal, C. (2007), Learning conditional random fields for stereo, in ‘IEEE Conference on Computer Vision and Pattern Recognition’, pp. 1–8.
Scharstein, D. & Szeliski, R. (2002), ‘A taxonomy and evaluation of dense two-frame stereo correspondence algorithms’, International Journal of Computer Vision 47(1-3), 7–42.
Scharstein, D. & Szeliski, R. (2003), High-accuracy stereo depth maps using structured light, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 1, pp. 195–202.
Scharstein, D. & Szeliski, R. (2010), ‘http://vision.middlebury.edu/stereo/’.
Schirmacher, H., Li, M. & Seidel, H.-P. (2001), On-the-fly processing of generalized lumigraphs, in ‘EUROGRAPHICS’, pp. 165–173.
Scholl, B. J. (2001), ‘Objects and attention: the state of the art’, Cognition 80(1-2), 1–46.
Schreer, O. (1998), Stereo vision-based navigation in unknown indoor environment, in ‘5th European Conference on Computer Vision’, Vol. 1, pp. 203–217.
Shimonomura, K., Kushima, T. & Yagi, T. (2008), ‘Binocular robot vision emulating disparity computation in the primary visual cortex’, Neural Networks 21(2-3), 331–340.
Siciliano, B., Sciavicco, L., Villani, L. & Oriolo, G. (2008), Robotics: Modelling, Planning and Control, Springer Publishing Company, Incorporated.
Siegwart, R. & Nourbakhsh, I. R. (2004), Introduction to Autonomous Mobile Robots, MIT Press, Massachusetts.
Sim, R., Elinas, P. & Little, J. (2007), ‘A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM’, International Journal of Computer Vision 74(3), 303–318.
Sim, R. & Little, J. J. (2009), ‘Autonomous vision-based robotic exploration and mapping using hybrid maps and particle filters’, Image and Vision Computing 27(1-2), 167–177. Canadian Robotic Vision 2005 and 2006.
Sirakoulis, G. C., Karafyllidis, I. & Thanailakis, A. (2003), ‘A CAD system for the construction and VLSI implementation of cellular automata algorithms using VHDL’, Microprocessors and Microsystems 27(8), 381–396.
Soquet, N., Aubert, D. & Hautiere, N. (2007), Road segmentation supervised by an extended V-disparity algorithm for autonomous navigation, in ‘IEEE Intelligent Vehicles Symposium’, Istanbul, Turkey, pp. 160–165.



Stentz, A., Fox, D. & Montemerlo, M. (2003), FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association, in ‘AAAI National Conference on Artificial Intelligence’, AAAI, pp. 593–598.
Sternberg, R. J. (2002), Cognitive Psychology, Wadsworth Publishing.
Strecha, C., Fransens, R. & Van Gool, L. J. (2006), Combined depth and outlier estimation in multiview stereo, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 2394–2401.
Sun, J., Li, Y., Kang, S. B. & Shum, H.-Y. (2005), Symmetric stereo matching for occlusion handling, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 399–406.
Sunyoto, H., van der Mark, W. & Gavrila, D. M. (2004), A comparative study of fast dense stereo vision algorithms, in ‘IEEE Intelligent Vehicles Symposium’, pp. 319–324.
Thevenaz, P., Blu, T. & Unser, M. (2000), ‘Interpolation revisited’, IEEE Transactions on Medical Imaging 19(7), 739–758.
Torr, P. H. S. & Criminisi, A. (2004), ‘Dense stereo using pivoted dynamic programming’, Image and Vision Computing 22(10), 795–806.
Ulam, S. (1952), Random processes and transformations, in ‘International Congress on Mathematics’, Vol. 2, Cambridge, USA, pp. 264–275.
Vandorpe, J., Van Brussel, H. & Xu, H. (1996), Exact dynamic map building for a mobile robot using geometrical primitives produced by a 2D range finder, in ‘IEEE International Conference on Robotics and Automation’, Minneapolis, USA, pp. 901–908.
Veksler, O. (2002), ‘Dense features for semi-dense stereo correspondence’, International Journal of Computer Vision 47(1-3), 247–260.
Veksler, O. (2003), Extracting dense features for visual correspondence with graph cuts, in ‘IEEE Computer Vision and Pattern Recognition’, Vol. 1, pp. 689–694.
Veksler, O. (2005), Stereo correspondence by dynamic programming on a tree, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 384–390.
Veksler, O. (2006), Reducing search space for stereo correspondence with graph cuts, in ‘British Machine Vision Conference’, Vol. 2, pp. 709–718.
Von Neumann, J. (1966), Theory of Self-Reproducing Automata, University of Illinois Press, Urbana, Illinois.
Vonikakis, V. (2009), ‘http://electronics.ee.duth.gr/vonikakis.htm’.
Vonikakis, V., Andreadis, I. & Gasteratos, A. (2008), ‘Fast centre-surround contrast modification’, IET Image Processing 2(1), 19–34.
Wang, L., Liao, M., Gong, M., Yang, R. & Nister, D. (2006), High-quality real-time stereo using adaptive cost aggregation and dynamic programming, in ‘Third International Symposium on 3D Data Processing, Visualization, and Transmission’, pp. 798–805.
Wheatstone, C. (1838), ‘Contributions to the physiology of vision. Part the first: On some remarkable, and hitherto unobserved, phenomena of binocular vision’, Philosophical Transactions of the Royal Society of London, pp. 371–394.
Wiegand, T., Sullivan, G., Bjøntegaard, G. & Luthra, A. (2003), ‘Overview of the H.264/AVC video coding standard’, IEEE Transactions on Circuits and Systems for Video Technology 13(7), 560–576.
Wilburn, B., Smulski, M., Lee, K. & Horowitz, M. A. (2002), The light field video camera, in ‘Media Processors’, pp. 29–36.
Wolfram, S. (1986), Theory and Applications of Cellular Automata, World Scientific, Singapore.



Yang, J. C., Everett, M., Buehler, C. & McMillan, L. (2002), A real-time distributed light field camera, in ‘Eurographics Workshop on Rendering’, pp. 77–86.
Yang, Q., Wang, L. & Ahuja, N. (2010), A constant-space belief propagation algorithm for stereo matching, in ‘IEEE Conference on Computer Vision and Pattern Recognition’.
Yang, Q., Wang, L. & Yang, R. (2006), Real-time global stereo matching using hierarchical belief propagation, in ‘British Machine Vision Conference’, Vol. 3, pp. 989–998.
Yang, Q., Wang, L., Yang, R., Stewenius, H. & Nister, D. (2009), ‘Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling’, IEEE Transactions on Pattern Analysis and Machine Intelligence 31(3), 492–504.
Yi, J., Kim, J., Li, L., Morris, J., Lee, G. & Leclercq, P. (2004), Real-time three dimensional vision, in ‘Asia-Pacific Conference on Advances in Computer Systems Architecture’, Vol. 3189 of Lecture Notes in Computer Science, Springer, pp. 309–320.
Yin, P., Tourapis, H., Tourapis, A. & Boyce, J. (2003), Fast mode decision and motion estimation for JVT/H.264, in ‘IEEE International Conference on Image Processing’, Vol. 3, pp. 853–856.
Yoon, K.-J. & Kweon, I. S. (2006a), ‘Adaptive support-weight approach for correspondence search’, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 650–656.
Yoon, K.-J. & Kweon, I. S. (2006b), Correspondence search in the presence of specular highlights using specular-free two-band images, in ‘7th Asian Conference on Computer Vision’, Vol. 3852, Springer, Hyderabad, India, pp. 761–770.
Yoon, K.-J. & Kweon, I. S. (2006c), Stereo matching with symmetric cost functions, in ‘IEEE Computer Society Conference on Computer Vision and Pattern Recognition’, Vol. 2, pp. 2371–2377.
Yoon, S., Park, S.-K., Kang, S. & Kwak, Y. K. (2005), ‘Fast correlation-based stereo matching with the reduction of systematic errors’, Pattern Recognition Letters 26(14), 2221–2231.
Yu, T., Lin, R.-S., Super, B. & Tang, B. (2007), ‘Efficient message representations for belief propagation’, IEEE International Conference on Computer Vision, pp. 1–8.
Zach, C., Karner, K. & Bischof, H. (2004), Hierarchical disparity estimation with programmable 3D hardware, in ‘International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision’, pp. 275–282.
Zhao, J., Katupitiya, J. & Ward, J. (2007), Global correlation based ground plane estimation using V-disparity image, in ‘IEEE International Conference on Robotics and Automation’, Rome, Italy, pp. 529–534.
Zhu, Z., Oskiper, T., Samarasekera, S., Kumar, R. & Sawhney, H. S. (2007), ‘Ten-fold improvement in visual odometry using landmark matching’, IEEE International Conference on Computer Vision, pp. 1–8.
Zitnick, C. L. & Kanade, T. (2000), ‘A cooperative algorithm for stereo matching and occlusion detection’, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(7), 675–684.
Zitnick, C. L. & Kang, S. (2007), ‘Stereo for image-based rendering using image over-segmentation’, International Journal of Computer Vision 75(1), 49–65.
Zitnick, C. L., Kang, S. B., Uyttendaele, M., Winder, S. & Szeliski, R. (2004), ‘High-quality video view interpolation using a layered representation’, ACM Transactions on Graphics 23(3), 600–608.


Abbreviations

AD Absolute Differences
ASIC Application-Specific Integrated Circuit
ASW Adaptive Support Weight
CA Cellular Automaton
CIE Commission Internationale de l'Éclairage (International Commission on Illumination)
CPU Central Processing Unit
CWT Continuous Wavelet Transform
DP Dynamic Programming
DSI Disparity Space Image
DSP Digital Signal Processor
EDA Electronic Design Automation
EKF Extended Kalman Filter
EM Expectation Maximization
FIS Fuzzy Inference System
FPGA Field-Programmable Gate Array
GGCP Generalized Ground Control Points
GPU Graphics Processing Unit
HFoV Horizontal Field of View
HSL Hue Saturation Luminosity/Lightness (Color model)
HSV Hue Saturation Value (Color model)
HVS Human Visual System
IR Infrared
LCDM Luminosity-Compensated Dissimilarity Measure
LoG Laplacian of Gaussian
LWPC Local Weighted Phase-Correlation
MF Membership Function
MRF Markov Random Field
NCC Normalized Cross Correlation
NMSE Normalized Mean Square Error
NN Neural Network
NURBS Non-Uniform Rational B-Splines
PC Personal Computer




RDP Reliability-based Dynamic Programming
RGB Red Green Blue (Color model)
SAD Sum of Absolute Differences
SD Squared Differences
SIFT Scale-Invariant Feature Transform
SLAM Simultaneous Localization and Mapping
SSD Sum of Squared Differences
SURF Speeded-Up Robust Features
VR Virtual Reality
WTA Winner Takes All
ZNCC Zero-mean Normalized Cross-Correlation
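
Several of the matching costs listed above (AD/SAD, SD/SSD, NCC/ZNCC) are simple window-based measures, and WTA denotes the rule that keeps, for each pixel, the candidate disparity with the best aggregated cost. As a minimal illustrative sketch only, not code from this thesis, the three basic costs can be written as follows, assuming NumPy and two equally sized, rectified grayscale patches:

    import numpy as np

    def sad(left, right):
        # Sum of Absolute Differences over a support window (lower is better).
        return float(np.abs(left.astype(float) - right.astype(float)).sum())

    def ssd(left, right):
        # Sum of Squared Differences over a support window (lower is better).
        d = left.astype(float) - right.astype(float)
        return float((d * d).sum())

    def zncc(left, right):
        # Zero-mean Normalized Cross-Correlation (higher is better); invariant
        # to gain and offset changes of intensity between the two windows.
        l = left.astype(float) - left.mean()
        r = right.astype(float) - right.mean()
        denom = np.sqrt((l * l).sum() * (r * r).sum())
        return float((l * r).sum() / denom) if denom > 0 else 0.0

In a local stereo algorithm such costs are evaluated over a window for every candidate disparity, and a WTA selection keeps the disparity with the lowest SAD/SSD or the highest ZNCC score.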


Thesis Publications

Journals:
1. L. Nalpantidis and A. Gasteratos. Stereovision-based fuzzy obstacle avoidance method. International Journal of Humanoid Robotics, in press.
2. L. Nalpantidis, A. Amanatiadis, G. C. Sirakoulis, and A. Gasteratos. An efficient hierarchical matching algorithm for processing uncalibrated stereo vision images and its hardware architecture. IET Image Processing, in press.
3. L. Nalpantidis and A. Gasteratos. Biologically and psychophysically inspired adaptive support weights algorithm for stereo correspondence. Robotics and Autonomous Systems, 58:457-464, 2010.
4. L. Nalpantidis and A. Gasteratos. Stereo vision for robotic applications in the presence of non-ideal lighting conditions. Image and Vision Computing, 28:940-951, 2010.
5. L. Nalpantidis, G. C. Sirakoulis, and A. Gasteratos. Review of stereo vision algorithms: from software to hardware. International Journal of Optomechatronics, 2(4):435-462, 2008.
Conferences:
1. L. Nalpantidis, G. C. Sirakoulis, A. Carbone, and A. Gasteratos. Computationally effective stereovision SLAM. In IEEE International Conference on Imaging Systems and Techniques, Thessaloniki, Greece, July 2010.
2. I. Kostavelis, L. Nalpantidis, and A. Gasteratos. Comparative presentation of real-time obstacle avoidance algorithms using solely stereo vision. In IARP/EURON International Workshop on Robotics for Risky Interventions and Environmental Surveillance-Maintenance, Sheffield, UK, January 2010.
3. L. Nalpantidis, I. Kostavelis, and A. Gasteratos. Stereovision-based algorithm for obstacle avoidance. In International Conference on Intelligent Robotics and Applications, volume 5928 of Lecture Notes in Computer Science, pages 195-204, Singapore, December 2009. Springer-Verlag.
4. L. Nalpantidis, D. Chrysostomou, and A. Gasteratos. Obtaining reliable depth maps for robotic applications with a quad-camera system. In International Conference on Intelligent Robotics and Applications, volume 5928 of Lecture Notes in Computer Science, pages 906-916, Singapore, December 2009. Springer-Verlag.




5. Y. Baudoin, D. Doroftei, G. De Cubber, S. A. Berrabah, C. Pinzon, F. Warlet, J. Gancet, E. Motard, M. Ilzkovitz, L. Nalpantidis, and A. Gasteratos. View-finder: Robotics assistance to firefighting services and crisis management. In IEEE International Workshop on Safety, Security, and Rescue Robotics, pages 1-6, Denver, Colorado, USA, November 2009.
6. I. Kostavelis, L. Nalpantidis, and A. Gasteratos. Real-time algorithm for obstacle avoidance. In Third Panhellenic Scientific Student Conference on Informatics, Corfu, Greece, September 2009.
7. L. Nalpantidis, A. Amanatiadis, G. C. Sirakoulis, N. Kyriakoulis, and A. Gasteratos. Dense disparity estimation using a hierarchical matching technique from uncalibrated stereo vision. In IEEE International Workshop on Imaging Systems and Techniques, pages 427-431, Shenzhen, China, May 2009.
8. G. De Cubber, D. Doroftei, L. Nalpantidis, G. C. Sirakoulis, and A. Gasteratos. Stereo-based terrain traversability analysis for robot navigation. In IARP/EURON Workshop on Robotics for Risky Interventions and Environmental Surveillance, Brussels, Belgium, 2009.
9. L. Nalpantidis, G. C. Sirakoulis, and A. Gasteratos. A dense stereo correspondence algorithm for hardware implementation with enhanced disparity selection. In 5th Hellenic Conference on Artificial Intelligence, volume 5138 of Lecture Notes in Computer Science, pages 365-370, Syros, Greece, 2008. Springer-Verlag.
10. G. De Cubber, L. Nalpantidis, G. C. Sirakoulis, and A. Gasteratos. Intelligent robots need intelligent vision: Visual 3D perception. In IARP/EURON Workshop on Robotics for Risky Interventions and Environmental Surveillance, Benicàssim, Spain, 2008.
11. L. Nalpantidis, G. C. Sirakoulis, and A. Gasteratos. Review of stereo matching algorithms for 3D vision. In 16th International Symposium on Measurement and Control in Robotics, pages 116-124, Warsaw, Poland, 2007.
