
LABORATOIRE DE TELECOMMUNICATIONS ET TELEDETECTION
B-1348 Louvain-la-Neuve, Belgium

MOTION ESTIMATION AND COMPENSATION
FOR VERY LOW BITRATE VIDEO CODING

Xavier MARICHAL

Thesis presented for the degree of
Docteur en Sciences Appliquées

Jury composed of:

Benoît MACQ (UCL/TELE) - Supervisor
Paul DELOGNE (UCL/TELE) - Examiner
Jean-Didier LEGAT (UCL/DICE) - Examiner
Ferran MARQUES (UPC - Barcelona) - Examiner
Thomas SIKORA (HHI - Berlin) - Examiner
Luc VAN GOOL (KUL - Leuven) - Examiner
Piotr SOBIESKI (UCL/TELE) - President

May 1998


We can easily communicate from one continent to another, yet one man still does not know how to enter into contact with another man.

Vaclav Havel


Foreword

Completing a doctoral thesis certainly requires from the student a considerable amount of personal work, as well as a certain dose of tenacity. Nevertheless, it would have been impossible for me to accomplish such an effort without the support of many people, to whom I would like to express my gratitude here.

First of all, I would like to thank Benoît Macq, my supervisor. First, for the confidence he showed in me by proposing that I undertake this work. Then, for his guidance, just firm enough to keep me from scattering my efforts, yet never encroaching on my freedom of research. I also thank the FRIA for the financial support it granted me during these three and a half years; and my sincere gratitude goes to Messrs Watteyne, Preux and Mertes, who supported my application for this grant.

I wish to thank the members of my jury, Paul Delogne, Jean-Didier Legat, Ferran Marques, Thomas Sikora, Luc Van Gool and Piotr Sobieski, for their remarks and comments, as judicious as they were precise, towards improving the final version of the text.

Carried out within the Laboratoire de Télécommunications et Télédétection, the present thesis benefited greatly from the material and human environment offered there. I particularly wish to underline the exceptional contribution of the people with whom I shared office A.176 for several years: Xavier, who played the role of mentor in my early days; Olivier, whose energy served as a catalyst; Vicent, for his ever-present smile and availability; Jean-François, whose outspokenness is matched only by his dynamism; and above all Christophe, to whom I am notably indebted for an annotated proofreading of the present manuscript. It is unfortunately impossible for me to name everyone, but I extend my warmest regards to all the other members of the Laboratory's scientific staff, without whom it would not be what it is. With a particular nod to those who shared the COMIS experience. Many thanks also to the "technical" staff for their precious help in solving numerous practical problems.

Travelling companions, parents, sisters, grandparents, family, in-laws, friends: may these few lines convey my affection and my gratitude for your support throughout all these years.

Unfairly, it always falls to the dearest ones to be thanked last. To Véronique, who had the patience to help me tirelessly correct this text, written in a language whose subtleties I do not fully master. Companion of every moment, she is the source that gives me the strength to undertake and to achieve. To Joauma who, if not always an inspiring muse, brought a formidable ray of sunshine into our lives. My most tender thanks are reserved for them both.

Xavier
8 May 1998


Abstract

Motion estimation is a key issue in the field of moving-image analysis. In the framework of video compression, it is combined with motion compensation in order to exploit the spatio-temporal correlation of image sequences along the motion trajectory. It thereby achieves one of the largest compression factors of a video coder. The research presented in this thesis is mainly concerned with improving classical motion estimation and compensation techniques in the context of very-low bitrate transmissions, thanks to content-adapted information that can be extracted from the images. This concept is consecutively, but independently, applied to the various steps of exploiting motion in a video coding scheme: estimation, transmission and compensation.

The manuscript starts with a brief overview of the state of the art in video compression, introducing the underlying concepts while putting some emphasis on very-low bitrate conditions. Motion estimation is further detailed in the following chapter, which includes a description of motion representation and of disturbing phenomena, as well as a review of the most commonly used estimation techniques. The contributions of the thesis are then presented.

First, a reliable motion estimation technique based on multiscale algorithms is introduced: it outputs segmented and more coherent motion fields that are adapted to the spatial contents of the images and can be coded more efficiently. A model for distributing the engendered computational burden is also proposed and demonstrates a linear speed-up. Secondly, the transmission of moving images is analyzed in the light of Rate-Distortion theory. Because of the very-low bitrate constraint, some spatial pre-processing is performed prior to motion estimation in order to raise the correlation between the already encoded images and the new ones. Prospects are also formulated for selective pre-processing according to content relevance. Thirdly, the reconstruction (compensation) of block-based motion fields is subjectively improved by using image warping techniques. A new corner detector is set up so as to automatically design an active mesh on the reference image, thanks to the Delaunay triangulation. Inverse kriging interpolation then allows one to determine the motion of the mesh vertices from the original vector field. The result is the suppression of blocking artefacts, while the mesh structure offers the possibility to further edit and modify the images. The present thesis thus analyses three prospects for improving the quality of existing video schemes.


Résumé

Motion estimation is of capital importance in the field of moving-image analysis. Combined with motion compensation, it makes it possible to best exploit the spatio-temporal correlation inherent in image sequences, and thereby provides the most significant compression factor of a video coder. The present thesis has mainly set out to improve classical motion estimation and compensation techniques, in a context of very-low bitrate transmission, by using signal-adapted information that can be extracted from the images themselves. This approach is applied independently to the different steps of exploiting motion information within a video coding scheme: estimation, transmission and compensation.

The text begins with an overview of the state of the art in video compression. It introduces the basic concepts and puts the emphasis on the conditions specific to very-low bitrate coding. The principle of motion estimation is then exposed in detail, including an account of the most widely used techniques. The contributions of the work are then presented.

First of all, a motion estimation technique based on multiscale tools is proposed: it yields coherent motion fields, segmented and adapted to the spatial content of the images, which allows their efficient coding. A model for distributing the computational load of the algorithm is established and results in a linear speed-up. Secondly, the transmission of image sequences is analyzed in the light of Rate-Distortion theory. In view of the very-low bitrate constraint, a spatial pre-processing of the images is performed prior to motion estimation. The goal is to increase the correlation between the images already coded and the others. Avenues are also traced for selective processing based on content. Finally, the reconstruction (compensation) of block-based motion descriptions is improved by image warping techniques. A new "corner" detection tool is developed in order to automatically build an active mesh on the reference image thanks to the Delaunay triangulation. Inverse kriging then makes it possible to attribute a motion vector to each node of the mesh by interpolating the original field. Blocking artefacts are thus suppressed. Moreover, the mesh technique offers extended possibilities for image manipulation. The present thesis thus analyses three avenues for improving the quality of current video systems.


Contents

Introduction
  Preamble
  Introduction and Thesis Outline
  Contributions of the Thesis

1 Digital Video Coding at Very-Low BitRate
  1.1 Digital Video
  1.2 Video Coding
    1.2.1 Image Compression
    1.2.2 Video Compression
  1.3 Coding at Very-Low BitRate
  1.4 Some (Very-) Low BitRate Codecs
    1.4.1 H.263
      1.4.1.1 Intra Macroblocks
      1.4.1.2 Inter Macroblocks and Residues
      1.4.1.3 Options
      1.4.1.4 Future Improvements: H.263+
    1.4.2 "COMIS", the UCL Approach
      1.4.2.1 Intra Images
      1.4.2.2 Inter Images
    1.4.3 The Emerging MPEG-4 Standard
      1.4.3.1 Developments to Be Supported
      1.4.3.2 A New Challenge for the Representation of Audio-Visual Information
      1.4.3.3 Video Coding in MPEG-4
  1.5 Discussion

2 Motion in the Framework of Video Coding
  2.1 Image Formation and Motion
    2.1.1 Apparent versus Real Motion
    2.1.2 Unsolvable Problems of Motion Estimation
  2.2 Rate Distortion Theory
  2.3 Practical Approaches to Motion Estimation
    2.3.1 Additional Constraints
      2.3.1.1 Preservation Constraint
      2.3.1.2 Coherence Constraint
    2.3.2 Estimation Methods
      2.3.2.1 Multigrid and Multiscale Optimization Methods
      2.3.2.2 Forward versus Backward Estimation
  2.4 Background Techniques for Motion Estimation
    2.4.1 Linear Regression
    2.4.2 Iterative Motion Estimation
    2.4.3 Pel-Recursive Algorithms
    2.4.4 Stochastic Estimation Relying on Markov Random Field
    2.4.5 Parametric Models of the Motion Field
    2.4.6 Within a Transform Domain
  2.5 The Block-Matching Algorithm (BMA)
    2.5.1 BMA Principle
      2.5.1.1 Search Techniques
      2.5.1.2 Advanced Possibilities
      2.5.1.3 Result
    2.5.2 Overlapped BMA
  2.6 Image Warping Techniques
    2.6.1 The Hexagonal Matching Algorithm (HMA)
    2.6.2 Adaptive Hexagonal Matching Algorithm (AHMA)
  2.7 Conclusion

3 Multiscale Block Matching Algorithm
  3.1 Adaptive Block Matching Algorithm
    3.1.1 Global Motion Estimation
    3.1.2 Change Detection
    3.1.3 Local Motion Estimation
    3.1.4 Results
  3.2 Distributed Version of the Local Motion Estimation
    3.2.1 Pseudo-Code of the Sequential Loop
    3.2.2 Model of Distribution
    3.2.3 Practical Implementation
      3.2.3.1 Data Structures
      3.2.3.2 State Transitions
    3.2.4 Experimental Results
  3.3 Conclusion

4 Image Pre-Processing for VLBR Video Coding
  4.1 Intuitive Rationale
  4.2 Rate Distortion Conditions
    4.2.1 Image Model
    4.2.2 Intra Coding of Images
    4.2.3 Inter Coding without Motion Compensation
      4.2.3.1 Usual Scheme
      4.2.3.2 Proposed Scheme
    4.2.4 Inter Coding with Motion Compensation
    4.2.5 Theoretical Conclusion
  4.3 Experimental Results
  4.4 Conclusion

5 Mesh-Based Motion Compensation
  5.1 Estimation, Transcription and Reconstruction
    5.1.1 The Proposed Reconstruction
    5.1.2 Problems to be Addressed
      5.1.2.1 Mesh Vertices
      5.1.2.2 Motion Interpolation
      5.1.2.3 Reversing the Motion Information
  5.2 Mesh Design
    5.2.1 Detecting Edges
      5.2.1.1 Enhancing Image Boundaries
      5.2.1.2 Following Half Boundaries
    5.2.2 Corner Extraction
  5.3 Motion Transcription
    5.3.1 Reversing the Sense of the Motion Information
    5.3.2 Interpolating the Motion Values
    5.3.3 Mesh Connectivity
    5.3.4 First Results
  5.4 Results
  5.5 Conclusion

Conclusion

A VLBR-like Video Sequences

B Rate Distortion Theory
  B.1 Information Theory
  B.2 Discrete Memoryless Sources and Single-Letter Distortion
  B.3 Rate Distortion Function
  B.4 Extension to Moving Pictures
    B.4.1 Note on Predictive Coding
    B.4.2 Intra Images
    B.4.3 Inter Images without Motion Compensation
    B.4.4 Inter Images with Motion Compensation

C Markov Random Fields
  C.1 Bayesian Model
  C.2 Markov Random Fields
  C.3 Gibbs Measure and Markov Fields
  C.4 System Solution

D Complements to Chapter 5
  D.1 Triangulation
    D.1.1 Definitions
    D.1.2 Delaunay Triangulation
  D.2 Pseudo-Code of the Implementation
    D.2.1 Compensation Scheme
    D.2.2 Corner Detection
    D.2.3 Inverse Kriging System

Bibliography


Introduction

Preamble

1998

Since the early ages of prehistory, mankind has slowly but inextinguishably tried to understand its environment so as to adapt to it and, whenever possible, influence it. It probably started in Mesopotamia, the "cradle of civilization", where the Sumerians (3500-3000 BC) erected the first cities and invented the wheel. Writing, agriculture and many other characteristics of our modern age were invented at that time. The well-known civilizations of Egypt and of the Indus valley pursued the initial movement of what we today mean by "progress".

Ancient Greece initiated the aspiration to understand both natural mechanisms, which found an accomplishment in the first space shuttles, and human nature. The latter profoundly affected Western philosophy through the discourse of eminent thinkers like Socrates (c. 470-c. 399 BC), whose contribution was essentially ethical in character, for instance when he asserted that "Bad men live that they may eat and drink, whereas good men eat and drink that they may live". His pupil Plato (c. 428-c. 347 BC) probably is the father of political thought, while Aristotle (384-322 BC) offers a perfect synthesis of what a man can be as both a scientist and a philosopher.

Societies have been organizing themselves, and progress was made in various technical fields, not only in the West. In 271 AD, Chinese mathematicians are reputed to have used a simple compass. The invention was not put to use in Europe for navigation before the 13th century. Printing from carved wood blocks was invented in China in the 6th century. The first book known to have been printed from wood blocks was a Chinese edition of the Diamond Sutra, a Buddhist text, dating from 868, i.e. hundreds of years before the first printed edition of the Bible by Gutenberg (c. 1400-1468).

In 1543, after 25 years of work, the Polish astronomer Nicolaus Copernicus (1473-1543) forced men and women to rethink their place within the solar system. It is only after 150 years that the theory that constitutes the Copernican revolution, reinforced by Galileo's (1564-1642) observations with his telescope, became widely accepted. Since then, the understanding of physical laws has been improved by many scientists, like Johannes Kepler (1571-1630), Sir Isaac Newton (1642-1727), Gottfried Leibniz (1646-1716), James Maxwell (1831-1879) and Albert Einstein (1879-1955). One should not forget Charles Darwin's (1809-1882) other revolution, when he claimed that man was not the center of creation.

Meanwhile, the desire among men and women for freedom had led to other revolutions in France and the USA. Another major change was the Industrial Revolution, i.e. the shift, at different times in different countries, from a traditional agriculturally based economy to an economy based on the mechanized mass production of manufactured goods in large-scale firms.

Nowadays some people already consider the end of our 2nd millennium to be that of the Information Revolution. If revolution there is, it all started with the patent of Alexander Graham Bell (1847-1922), which opened the way to telephony and finally to telecommunication networks. In the 1950s, television appeared and quickly became part of any living room: image was then accompanying sound. Besides this invention, the "Analytical machine" of Charles Babbage (1792-1871) and the invention of the transistor in 1947 resulted in the development of computers. Those came into common use in government and industry during the 1960s, while the 1980s brought small, powerful and inexpensive Personal Computers (PC) into the home. Experts have predicted that by the year 2000, the worldwide revenues of the computer industry will be second to agricultural revenues.

For a few years now, the convergence of the digitized world of computers with the worlds of audio-visual data and telecommunication has taken us to an age of multimedia and information technology. Such a trend is especially visible on the WWW, where pages (created with a computer) that one can download (across the Internet) contain more and more visual information. This evolution is quickly changing the organization of our society as it abolishes both time and distances. Data, sounds and images are now traveling around the world at the speed of light thanks to the nascent information highways. Nevertheless, technological evolution requires an evolution of mentalities. Moreover, the end-user of the proposed services is and remains the human being. A first step in this direction is the stress put by recent systems on the concept of interactivity: the user is not passive anymore but really becomes the pilot of the application.

However, one can question the content of the transmitted information and its real use. Some people already claim that the next revolution is going to be social, but we will stop the preamble here. It is up to every reader to bring answers, up to the masses to make other revolutions, and up to time to help build new worlds after new worlds.

2010

The movie was particularly long and when Joauma got out of the theater it was already dark. She had been deeply moved (or was it intrigued?) by the story and she was not paying attention to the road. When Daddy made the car turn right and stopped in front of the house, she suddenly had to get out though she did not want to. She slowly entered home and climbed the stairs. She put on her pyjamas but could not get to sleep. She opened the window, looked at the deserted street under the faint light of the lamp and listened to the silence of the night.

One has to say that this adaptation of J. Gaarder's "Sophie's World" was particularly brilliant, but for sure not easy to tackle for a twelve-year-old girl. Of course she agreed with the foreword of Goethe that says that "Whoever can not draw the lessons from 3000 years only lives from day to day", but it is not easy to discover Democritus, Socrates, the history of Athens, Plato, Aristotle, Hellenism, various aspects of the Middle Ages, the Renaissance and the baroque, Descartes, Spinoza, Locke, Hume, Berkeley, Kant, Hegel, Kierkegaard, Marx, Darwin, Freud, Einstein, the Big Bang, and all at once. And all these guys alive and kicking thanks to the magic of cinema!

She glanced at the stars as she never did before and finally went to sleep.

Heather Nova slowly started whispering "Heal" but every minute the song was becoming louder... When it changed to "Inside my Head" from Waitstate, the song was just bathing the whole room. Joauma woke up. She looked at her watch: ten o'clock. This Panasonic integrated system ("Pansense, the ultimate solution for senses entertainment" stated the ad), driven with fuzzy logic, had just perfectly memorized (understood?) the way she liked to be woken up on Fridays: a random compilation of her top ten "soft" singles with a crescendo volume.

Friday, the first day of the week-end. As the stars had announced it, the weather was just splendid. And birds were singing. The beginning of a great week-end! Unconsciously, Joauma changed the Pansense program and asked for downloading the week-end edition of "VideoBelche", an online video magazine that appears twice a week and tells her more about her country. After viewing the report, she quickly had a look at her Email. Before leaving the room, she required the system to download all possible information about the movie, and also about the actors, whose pictures she introduced into the scanner box. She asked the system to reserve a video-conferencing line for eleven o'clock and finally went down the stairs.

In the kitchen, breakfast was waiting for her and she had a large slice of bread with some compote of apples (88% apples, sugar, citric acid, ascorbic acid, 12% added sugar) and a glass of orange juice. She cleared the table and went back to her room to get dressed. You usually want to look nice while video-conferencing...

10.55. A video-conferencing line has effectively been reserved. She just dials Moma's (her grandmother's) number.

11.00. The screen-saver animation (including some ads) of her ITU H.666 communication module starts playing on the screen. A few minutes later, the image of Moma appears on the screen. "The image is not as perfect as with Antoine. For sure she only has a 609 system!", she thinks.

- "Hi Joauma! What a nice surprise!" says her grandmother.
- "Hi Moma!", she answered, "How are you doing?"
- ...

She surreptitiously pushes the record button.

She finally logs off, answers "positive" to the automatic query about the quality of transmission and switches to the video editing table. In analysis mode, she opens the recording of her conversation and selects the center of the screen. Once the tracking algorithm has performed its task, she can save the segmented head and body of her grandmother into a separate file.

She then intermingles images, past and present, reality and fiction, films and cartoons. She has her grandmother jump on Terminator's motorbike, and join a sabbath of great historical figures. Should she have her dance with Attila or Mandela? Would she have her wear M. Monroe's dress?

Joauma finally got it! This animation perfectly introduces, with her twelve-year-old humor, the compilation she achieved with all the video archives of the family. Moma's birthday present is ready. She just hopes her grandmother will like it!

Introduction and Thesis Outline


Among all the information that circulates through telecommunications networks, (moving) images occupy an ever-increasing place, with respect to both their contents and their volume. Images effectively require very important storage and transmission capacities. Digitization has opened the way to the automatic treatment of data by computers. As far as (moving) images are concerned, digitization allows easier manipulation and modification, and the possibility to extract characteristics from the images and to encode them: an appropriate analysis of the image data allows modifying the representation of images so as to more easily detect the redundancies (parts of the signal that are similar to other ones) and the irrelevancies (parts of the signal that are not perceived by the human eye). Coding algorithms thus aim at compressing the signal by reducing redundancies and suppressing irrelevancies. International video coding standards, which gather "state of the art" techniques (at the time of the standard) into a uniquely described scheme, enable providing video services using existing networks and storage facilities.

Chapter One introduces such concepts as digitization, redundancy reduction and video coding. It also clarifies the specificity of very-low bitrate coding, which is central to the present work. Very-low bitrate generally refers to a bitrate below 64 kbit/s, which allows transmissions over audio channels (e.g. for audio-visual personal communication services). Moreover, this chapter briefly presents some overall video coding schemes: the ITU H.263 standard and the COMIS scheme, which is a contribution of the present research. It ends by introducing the future ISO MPEG-4 standard, which adds a new dimension to video coding as it offers the possibility to separately encode the different objects of a scene and therefore to interact with these objects at the decoder end.

Time-varying image sequences can be compressed by independently coding each frame (intra-frame image coding) or by extending spatial coding techniques to the time dimension (e.g. 3D transform coding). However, the main characteristic of a video sequence is precisely its spatio-temporal component: most of the information in an image sequence is the result of motion. A lot of effort has therefore been put into motion analysis of video sources. The range of applications of such an analysis includes, but is not limited to, automatic tracking of targets, piloting of robots, event detection for surveillance, tridimensional reconstruction of objects, image restoration... In a video coding context, motion analysis is mainly used to reduce the inter-image redundancy: instead of coding every new frame on its own basis, references are searched for in the previously coded image. From a practical point of view, it means that one searches for parts of the new picture which are already present in the previous frame and which have just undergone some movement. Once the motion parameters have been estimated and transmitted to the decoder, their application provides a very good prediction of the new image. This technique, referred to as "motion estimation and compensation", achieves one of the largest compression factors in a video coder thanks to its radical reduction of the spatio-temporal redundancy.
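This search-and-predict loop can be sketched in a few lines of Python. The following is an illustrative toy, not the thesis's implementation: it assumes 8-bit grayscale numpy frames whose dimensions are multiples of the block size, an exhaustive search, and the sum of absolute differences (SAD) as matching criterion.

import numpy as np

def block_matching(prev, curr, block=16, search=7):
    # For each block of the new frame, find the displacement into the
    # previous frame (within +/- search pels) that minimises the SAD.
    h, w = curr.shape
    field = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            target = curr[by:by + block, bx:bx + block].astype(int)
            best, best_sad = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        sad = np.abs(target - prev[y:y + block, x:x + block]).sum()
                        if sad < best_sad:
                            best, best_sad = (dy, dx), sad
            field[by // block, bx // block] = best
    return field

def compensate(prev, field, block=16):
    # The decoder side: predict the new frame by copying each block from
    # its matched position in the reference frame.
    pred = np.empty_like(prev)
    for i, j in np.ndindex(*field.shape[:2]):
        dy, dx = field[i, j]
        pred[i * block:(i + 1) * block, j * block:(j + 1) * block] = \
            prev[i * block + dy:i * block + dy + block,
                 j * block + dx:j * block + dx + block]
    return pred

Only the vector field is transmitted; the decoder rebuilds the prediction from the previously decoded frame, and the coder only needs to send the (much cheaper) prediction residue.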

Since understanding image formation is a prerequisite for fully grasping the methods to recover motion information from images, Chapter Two starts with a short description of how images are generated and how the real tridimensional motion of objects results in motion in the bidimensional picture plane. Chapter Two simultaneously presents the phenomena that can disturb or prevent a correct motion estimation, and stresses the ill-posed nature of the problem. The chapter provides a Rate-Distortion justification of the use of motion estimation in video coders and introduces the various models and methodologies that can constitute the basis of different motion algorithms. Classical and emerging motion estimation techniques developed for image coding purposes are detailed, and the two of them (namely the Block-Matching Algorithm, BMA, and the image warping technique) which are used in the present work conclude the chapter.

More broadly, the present thesis mainly deals with motion estimation and compensation in a video coding context. While the BMA is the most widely used technique for motion estimation, because it has emerged as the one achieving the best compromise between complexity and quality, it does not at all take the content depicted in the images into account. The aim of our work is thus to explore new possibilities for helping the BMA to benefit as much as possible from the content-adapted information that can be extracted from the images. This concept is consecutively, but independently, applied to the various steps of exploiting motion in a video coding scheme: estimation, transmission and compensation.

Chapter Three first tackles the motion estimation problem. It carries on with the research achieved by M.P. Queluz, i.e. a new algorithm to estimate the motion between successive frames. The Adaptive BMA (ABMA) implements a measure of the (un)certainty of the motion analysis achieved by the BMA, so as to adapt the block size along object edges and avoid having different moving objects in the same block. On the other hand, a merge procedure, based on motion confidence measures, is applied to correctly propagate the motion vectors from blocks with reliable motion to blocks with uncertain motion. The ABMA is a multiscale BMA algorithm which uses a quad-tree structure to represent the various steps of its split-and-merge procedure. The ABMA thereby performs an adaptation of the motion field estimation to the spatial contents of the image.
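The splitting half of this procedure can be caricatured as follows. The sketch below is a strongly simplified stand-in, not the ABMA itself: the thesis's confidence measures are replaced by a crude ambiguity test on the SAD surface, and the merge procedure is omitted. A block whose best match is not clearly unique, as happens when it straddles two differently moving objects, is split into four quadrants that are estimated separately.

import numpy as np

def sad_surface(prev, curr, by, bx, size, search=7):
    # SAD of the block at (by, bx) against every candidate displacement;
    # candidates falling outside the frame get an infinite cost.
    h, w = curr.shape
    target = curr[by:by + size, bx:bx + size].astype(int)
    surf = np.full((2 * search + 1, 2 * search + 1), np.inf)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= h - size and 0 <= x <= w - size:
                surf[dy + search, dx + search] = \
                    np.abs(target - prev[y:y + size, x:x + size]).sum()
    return surf

def adaptive_estimate(prev, curr, by, bx, size, min_size=4, search=7):
    # Crude ambiguity test: if several displacements come within 10% of
    # the best SAD, the motion of this block is considered uncertain and
    # the quad-tree is split one level further.
    surf = sad_surface(prev, curr, by, bx, size, search)
    ambiguous = (surf <= 1.1 * surf.min() + 1e-9).sum() > 1
    if ambiguous and size > min_size:
        half = size // 2
        return [adaptive_estimate(prev, curr, by + oy, bx + ox, half,
                                  min_size, search)
                for oy in (0, half) for ox in (0, half)]
    dy, dx = np.unravel_index(int(surf.argmin()), surf.shape)
    return (by, bx, size, (dy - search, dx - search))  # a leaf of the quad-tree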

Although our contribution to the ABMA mainly consists in implementing and fine-tuning the scheme, we also had the opportunity to supervise a Master's thesis whose aim was to distribute the computational burden of the ABMA among several processors, because the calculation of the various confidence measures, and of several BMAs for the same block, seriously strains the BMA performances. A distributed model of the ABMA has thus been theoretically established. A practical "master-slave" version has been derived from this model and demonstrates a linear speed-up.

Chapter Four indirectly takes on the transmission of motion parameters. Since very-low bitrate channels force video coders to debase more information than only the irrelevant part of the signal, the image used as a reference for motion estimation (i.e. the previously coded one) no longer contains the information one would like to find in it, and especially the information needed to correctly predict the new image. Motion estimation is then partly noise-driven, and the result is a sparse motion field that is more difficult to encode efficiently. This analysis is detailed at the beginning of Chapter Four and leads to the hypothesis that voluntarily simplifying the contents of the images to code could be pertinent. The motion estimation would then be performed between two images with the same characteristics in terms of the accuracy of the contents description, and it is expected that the resulting motion field will be easier to encode.

This idea of pre-processing is first formalized with the help of Rate-Distortion theory: both intra-coding and pre-processing are modeled as the combination of a low-pass filter and additive white noise. Conditions for improving the coder performances are derived from this model. The influence of various types of pre-processing on coding image sequences with the ITU H.263 standard is then experimentally tested. These pre-processings are: intra-coding and Gaussian, median or morphological filtering. Prospects are also formulated for selective pre-processing according to content relevance.

The aim of Chapter Five, which focuses on compensation, is to demonstrate that it is possible to subjectively improve the result of the motion compensation stage (at the decoder) without modifying the estimation (at the encoder) or the transmission (the bitstream)¹. This is possible by taking the spatial contents of the reference image into account in order to adapt the motion information to it.

The image warping techniques that have been described at the end of Chapter Two offer such a possibility, namely to easily adapt the motion information to an irregular grid. However, their estimation phase is very effort-demanding because of its iterative nature. On the contrary, the BMA reveals itself a very efficient estimation method, while its compensation stage generates an image that suffers from so-called "blocking artifacts". We therefore suggest using an asymmetric scheme that consists of a BMA estimation and a warping compensation. The warping technique involves meshes made out of triangular patches (which ensures a direct link with the affine transform). Moreover, warping techniques (should they use triangular, quadrilateral or other structures) are very well suited for further editing and modification of images, manipulations that an increasing number of users like to achieve.

¹ Of course, if one wants to use the proposed reconstruction scheme in the coding loop, one has to modify the coder so as to include the developed tool, which in turn will modify the contents (not the structure) of the bitstream.



Once the proposed scheme is outlined, Chapter Five presents solutions to its two main steps: the automatic design of a mesh adapted to the spatial contents of the reference image, and the adaptation of the BMA vector field so as to be applicable to the mesh. The first problem is addressed by detecting object contours in the image and tracking these contours so as to extract "corners" (i.e. points of maximum curvature). The selected corners then serve as vertices of a triangular mesh generated by Delaunay triangulation. The second problem is solved by an interpolation technique known as "inverse kriging". Finally, the chapter concludes by presenting some subjective results.
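The structure of this pipeline can be sketched with standard SciPy tools. In the sketch below, the corner coordinates and the BMA field are dummies, and plain linear interpolation stands in for the inverse kriging system actually developed in the thesis.

import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import griddata

# Dummy corners detected on the reference image, as (row, col) points:
corners = np.array([[10., 12.], [100., 30.], [60., 90.],
                    [20., 70.], [110., 100.]])
mesh = Delaunay(corners)        # triangles, as index triples into `corners`

# Centres of a 16x16 block grid and a dummy BMA motion field:
centers = np.mgrid[8:120:16, 8:120:16].reshape(2, -1).T.astype(float)
vectors = np.random.default_rng(0).normal(size=(len(centers), 2))

# Attribute a motion vector to every mesh vertex by interpolating each
# component of the block-based field (linear here, kriging in the thesis);
# outside the convex hull of the block centres, fall back to the nearest block.
lin = np.column_stack([griddata(centers, vectors[:, k], corners,
                                method="linear") for k in (0, 1)])
near = np.column_stack([griddata(centers, vectors[:, k], corners,
                                 method="nearest") for k in (0, 1)])
vertex_motion = np.where(np.isnan(lin), near, lin)

print(mesh.simplices)           # the triangles of the active mesh
print(vertex_motion)            # one motion vector per mesh vertex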

Finally, a general conclusion is drawn from the obtained results. It reviews the various contributions of the present work, but also raises the question of the parts of the motion exploitation chain (estimation - transmission - compensation) in which it is useful to take the spatial contents of the images into account.

Contributions of the Thesis

The key contributions of the present research are summarized in the figure below. Some of these contributions are personal; others are the result of a close collaboration with other members of the TELE laboratory². Some are also the subject of previous publications [111, 77, 11, 73, 68, 70, 23, 69, 19, 20, 71, 75, 74, 76].

[Figure: Overview of the thesis contributions within the coding framework. At the coder, estimation (motion analysis): Chapter 3, the Adaptive Block-Matching Algorithm. On the channel, transmission: Chapter 4, a study of the impact of pre-processing. At the decoder, motion compensation: Chapter 5, mesh-based reconstruction of BMA fields. Common thread: improve the BMA with content-adapted techniques.]

² Most of the work related to the COMIS scheme has been achieved in collaboration with many other researchers in image processing at the laboratory: M.P. Queluz, B. Simon, O. Bruyndonckx, V. Warscotte, T. Delmot, C. Devleeschouwer.



With the exception of the COMIS contribution presented in Chapter One, all the contributions obey the same approach: to use content-adapted information in order to improve the performances of the reference Block-Matching Algorithm. This approach involves the following achievements. Chapter Three requires the programming and fine-tuning of the existing Adaptive Block Matching Algorithm (ABMA) in order to improve motion estimation. It also details how to distribute the computational burden of the ABMA among several processors. Chapter Four establishes a model of the behavior of very-low bitrate video coding. It proposes an analytical study of the impact of pre-processing on such a VLBR coding. Some experimentation aims at validating the developed model and at practically determining whether pre-processing offers some gain when inserted in a complete coding scheme. Chapter Five deals with the conception and development of an original scheme for motion compensation. It first involves developing a tool for corner extraction, which is achieved by improving the half-boundaries detector of Noble. The application of interpolation to motion vector fields is also tackled, so as to transpose the information they provide from a regular grid to a non-regular one. Finally, Chapter Five integrates and simulates the proposed asymmetric motion estimation and compensation process, relying on mesh-based reconstruction.


Chapter 1

Digital Video Coding at Very-Low BitRate

"Visual communication is commonly regarded as the next generation communication tool beyond the conventional voice communication (... and) to achieve more efficient visual communication and fully utilize limited channel bandwidth and storage space, video compression (or coding) that reduces data amount is necessary." (C.T. Chen)

Aware of this relevant challenge, research has been focusing on video compression for more than twenty years already. People from both industry and research teams have collaborated in groups like the International Telecommunications Union (ITU) or the Moving Picture Experts Group (MPEG) to design several international standards for the transmission and storage of digital video.

The present chapter aims at introducing the general framework of the doctoral work. It first very briefly describes the structure of a digital video signal, and expands on the generic principles of video compression according to the type of correlation which is exploited: spatial or spatio-temporal. It then highlights the specificity of Very-Low Bitrate Coding (VLBR) with regard to high-bitrate coding: VLBR is unable to transmit a visually perfect copy of the source images, as the limited channel capacity prevents the coder from sending all the information needed. Secondly, it reviews three video coders-decoders (codecs): the H.263 standard from the ITU, the scheme to be standardized in November 1998 under the acronym MPEG-4, and some trials implemented at UCL within a scheme named COMIS.
within a scheme named COMIS.



Finally, the state of the art of video coding is discussed and some new investigations in image analysis are introduced.

1.1 Digital Video

Digital images or frames consist of luminance (i.e. brightness) and chrominance (i.e. color) intensities of regularly sampled points (the picture elements¹). The sampling process is performed either on a natural scene (digital camera) or on an "analog" image (digital scanning). The spatial information (in the two-dimensional space) is characterized by the image resolution. Instead of characterizing this resolution in pel/cm or pel/inch, it is generally expressed as the product pels/line x lines, which is a measure independent of the screen size. A (digital) video sequence is a succession of (digital) images whose characteristic is the temporal resolution in terms of frames/s or images/s. This temporal domain information (the changes of image intensity along the time axis) is specific to video transmission and raises the problems addressed in the present thesis, namely motion estimation and compensation. In its Recommendation 601 [12], the ITU-R (International Telecommunication Union - Radiocommunication, formerly CCIR) has defined a way to digitize images in standard format (720x576). An extension of this definition provides one with the different resolutions which should be used according to the target application (table 1.1), which directly introduces the required bitrate if no compression is achieved.

¹ Commonly abbreviated as "pixels" or "pels".

Application      Luminance     Chrominance   Aspect   Temporal   Bitrate
                 resolution    resolution    ratio    (fr./s)    (Mbit/s)
HDTV             1920 x 1152   960 x 576     16/9     50         1800
TV (broadcast)   720 x 576     360 x 576     4/3      25         166
TV (CD rec.)     360 x 288     180 x 144     4/3      25         31
Video phone      360 x 288     180 x 144     4/3      10         12.4
Mobile video     180 x 144     90 x 72       4/3      5          1.6

Table 1.1: CCIR 601 formats for moving pictures applications

A few remarks can be made concerning table 1.1. First, the spatial resolution of the chrominance components is lower than that of the luminance. These differences result from the less important impact of colors on the Human Visual System (HVS). Secondly, a change of aspect ratio² (16/9, as for movies) accompanies the High Definition TV (HDTV) resolution. Another comment deals with the range of values of the luminance and chrominance components: digital images refer to components described by a finite number of bits, namely 8 bits for every luminance or chrominance pel. This 8-bit range allows values to go from 0 (black) to 255 (white)³.

All these characteristics help compute the required bandwidths: table 1.1 introduces the necessary bitrates in Mbit/s. If one expects to transmit video telephony across an ISDN network (64 kbit/s), one must first find a way to compress the signal by a factor greater than 194, while for mobile video transmission a compression ratio of only 25 is needed on ISDN networks. This factor rises again to 160 for mobile channels at 10 kbit/s.
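These factors follow directly from table 1.1 under the assumption of 8 bits per luminance or chrominance sample; a quick check in Python:

def raw_bitrate(luma, chroma, fps, bits=8):
    # Raw bitrate of an uncompressed format: one luminance plane plus two
    # chrominance planes, `bits` bits per sample, `fps` frames per second.
    (ly, lx), (cy, cx) = luma, chroma
    return (ly * lx + 2 * cy * cx) * bits * fps        # in bit/s

videophone = raw_bitrate((288, 360), (144, 180), 10)   # ~12.4 Mbit/s
mobile     = raw_bitrate((144, 180), (72, 90), 5)      # ~1.6 Mbit/s
print(videophone / 64e3)   # ~194 : video telephony over 64 kbit/s ISDN
print(mobile / 64e3)       # ~25  : mobile video over ISDN
print(mobile / 10e3)       # ~156, i.e. roughly the 160 quoted for 10 kbit/s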

Appendix A introduces some typical very-low bitrate sequences used to test the video compression algorithms.

1.2 Video Coding

1.2.1 Image Compression

Compression is thus needed, and it is possible thanks to the reduction of the high redundancy and of the irrelevancy present in the data of a video sequence. The redundancy relates to the correlation and the predictability inherent to the pictures. Its use for compression purposes does not involve any loss of information. The irrelevancy reduction exploits the perceptual limits of the HVS so as to avoid the transmission of invisible information. It introduces irreversible degradations.

From a practical point of view, compression involves a succession of many different steps which aim at detecting what is redundant or irrelevant and how to encode it differently. The following description proposes a closer look at some of the underlying concepts of video compression.

2 The aspect ratio is the ratio between the image width and height.

3 Generally, a color image may be described by a combination of three chromatic stimuli. Color television for instance uses the Red, Green and Blue (RGB) primary colors. Another possibility, usually used for video compression, is the Y luminance plus two difference chrominance signals, Cb and Cr, at lower spatial resolution as in table 1.1. People interested in more advanced readings concerning color treatment may consult [106].
may consult [106].


The scheme of figure 1.1 introduces the classical chain of tools used to compress moving pictures.

[Figure: input image(s) pass through transform/analysis (T), quantization (Q) and transcription (G) stages, then entropy coding (enC) and channel coding (chC); the first group forms the source coding, the last the channel coding.]

Figure 1.1: Operators classically involved in a compression process of (moving) images

Hereafter is a brief description of the aim of every step. The way the process is reversed at the decompression stage is also tackled. A more detailed overview of compression techniques may be found in [86, 59, 113].

Compression

- First, the image characteristics are analyzed: the analysis may consist in frequential waveform analysis, like wavelets [66] or matching pursuits [90], or more simply in block transforms (e.g. the Discrete Cosine Transform, DCT [2], as in JPEG [135]), or in spatial contour and texture analysis. It may also consist in an estimation of the motion field between two successive frames. This analysis step aims at detecting the redundant parts of the images and proposing another way of transmitting the same information.

- The resulting parameters are then usually quantized in order to suppress all irrelevancies of the signal. Quantization can be scalar or vector [41, 89].

- The redundancy and irrelevancy reduction of the input signal by the transform and the quantization can be more globally viewed as a transcription (or a projection) of the input signal into a decorrelated, relevant representation (or space).

- This transcription is encoded (entropy coding) and sent to the channel interface: this coding step is referred to as source coding [38], and aims at reducing the bitrate without corrupting the data. Such a reduction is possible thanks to an appropriate exploitation of the statistical properties of the signal.

- Finally, some channel coding may be introduced. It adds specific redundancies (error-correcting codes and synchronization words) in order to protect the signal in case of erroneous transmission.

Decompression

- After a decoding step (both channel and entropy decoding if necessary), the decoder recovers the transcription generated by the coder.

- Then two possibilities exist:

  - Either the transcription is merely reversed in order to obtain a description of the images. The obtained images are more or less identical to the original ones, according to the type of analysis and quantization (the channel and entropy coding-decoding phases are assumed to be lossless).

  - Or a reconstruction stage may be added: the transcription is no longer simply reversed but specifically treated in order to better take advantage of the available information. An example of such a reconstruction process is the use of overlapping functions to recover a block description. Reconstruction is different from post-processing, which tries to improve the image quality after inversion of the transcription, without taking the transcription structure into account.
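To make the chain above concrete, here is a minimal Python sketch (an illustration only, assuming an orthonormal 8×8 DCT and a uniform scalar quantizer; a real coder adds entropy coding and many refinements) of the transform, quantization and inverse steps:

    import numpy as np

    def dct_matrix(n=8):
        """Orthonormal DCT-II basis matrix."""
        k = np.arange(n)
        c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
        c[0, :] = np.sqrt(1.0 / n)
        return c

    C = dct_matrix()

    def analyze(block):             # transform step (T)
        return C @ block @ C.T

    def quantize(coef, step=16):    # scalar quantization (Q), irreversible
        return np.round(coef / step).astype(int)

    def dequantize(q, step=16):     # inverse quantization at the decoder
        return q * step

    def synthesize(coef):           # inverse transform: reverse the transcription
        return C.T @ coef @ C

    block = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 luminance block
    q = quantize(analyze(block))                       # what entropy coding would see
    rec = synthesize(dequantize(q))                    # decoded block, close but not equal
    print(np.abs(rec - block).max())                   # quantization error remains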

All these parts of a coding algorithm may vary from one implementation to another. Sometimes two or more steps are merged together, while in other cases one step is not used. However, the different steps have to be coherent if one wants the scheme to be efficient: the type of analysis made often guides the rest of the algorithm.

Up to now, two different philosophies have been used in (VLBR) video coding. The first class is the block-based one [62, 96], where pictures are divided into subblocks according to an a priori grid: every subblock may be independently coded according to its own characteristics. The other class is the segmentation-based or object-based one. Such schemes [85, 84, 34] aim to better preserve essential characteristics: high compression ratios are obtained by removing insignificant objects and by encoding textures more coarsely, while contours are considered as essential features for image description. After a detection of the different pictured regions (objects), every region is separately coded: its shape (contour) is first described and is followed by texture (or motion) information.

1.2.2 Video Compression

[Figure: original images feed an intra-images encoder and a motion estimation unit; an intra-images decoder and a frame memory close the prediction loop that drives motion compensation; intra data and motion vectors are multiplexed towards the channel coder.]

Figure 1.2: Typical video coder: intra-images, motion estimation & compensation and residues are used

Independently of these different approaches to the image contents, one can point out that both categories of (VLBR) coding schemes make an exhaustive use of the two following types of pictures:

Intra-images: When coding an image in intra mode, only the spatial correlation inherent to the picture itself is exploited. In fact, an intra-image is an isolated picture and is compressed as such (cf. the JPEG [135] algorithm for still picture compression).

Inter-images: These images are specific to video sequences in comparison with still pictures. They allow the coder to exploit the spatio-temporal correlation between the present image and the previous one(s) within a sequence. The classical tool used to achieve compression on such images is motion estimation & compensation, which allows one to obtain a prediction of the present image from the previous one(s).

- Residues: As the estimation & compensation of an inter-image is not perfect, the difference between the estimate and the original image is computed: the so-called Displaced Frame Difference (DFD) has to be sent if it is relevant enough. It can only be transmitted on its own basis, i.e. as an intra-image.

The first image of a sequence must of course be encoded as an intra-image. For the subsequent images, a scene-cut detector compares the new image with the previous one(s) to check whether some temporal correlation still exists. If so, the new image will be inter-coded. Otherwise, the coder considers that a scene cut has occurred: the new image differs so much from the previous one(s) that it is better to encode it as an intra-image. The user may also enforce intra-image coding to take place every x seconds to offer specific functions like fast retrieval, temporal scalability or resynchronization in case of a noisy channel.

According to these considerations, one can affirm that the scheme presented in figure 1.2 (from [13]) is typical of most (VLBR) video coders. The first image is intra-coded. Then the second image is motion-estimated on the basis of the first one (original or reconstructed). The motion vectors of course need to be sent to the decoder in order to allow it to behave the same way. The motion vector field is applied to the first image, thereby obtaining a prediction of image two, which is used for the DFD computation. This DFD is the residual image that also needs to be coded. The process goes on until some scene cut arises or until a defined threshold enforces a restart with an intra-image.
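The loop can be summarized as follows; this is only a sketch in which every helper (scene_cut, intra_encode, estimate_motion, compensate, residual_encode, decode_last) is a hypothetical placeholder for the tools discussed in this chapter, not an actual implementation:

    def encode_sequence(frames, intra_period):
        """Sketch of the generic VLBR coding loop of figure 1.2 (hypothetical helpers)."""
        stream, reference = [], None
        for t, frame in enumerate(frames):
            # intra-code the first image, on a scene cut, or every `intra_period` frames
            if reference is None or t % intra_period == 0 or scene_cut(frame, reference):
                stream.append(('INTRA', intra_encode(frame)))
            else:
                vectors = estimate_motion(frame, reference)    # encoder-side analysis
                prediction = compensate(reference, vectors)    # what the decoder rebuilds
                dfd = frame - prediction                       # displaced frame difference
                stream.append(('INTER', vectors, residual_encode(dfd)))
            reference = decode_last(stream)    # track the decoder's reconstructed image
        return stream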



Although the coding technique of the DFD is often similar to intra-coding, Strobach has demonstrated [120] the inefficiency of applying the same technique to intra and residual coding, because of the drastic reduction of the spatial correlation after motion compensation. This inadequacy can also be explained by the fact that the information of an intra image is uniformly spread out, while the content of residual images is located in specific areas of the images (e.g. the borders of the moving objects). For instance, Matching Pursuits [90] are designed for residual coding.

1.3 Coding at Very-Low BitRate

Very-Low BitRate transmission can be understood as transmission across bandwidths of 5 to 64 kbit/s, or sometimes up to 128 kbit/s. Some of the applications which might be addressed are: wired video phones (28.8 kbit/s modems and below), wireless video phones (under 13 kbit/s), Internet video conferencing (28.8 kbit/s modems and below), remote monitoring, tele-operation, tele-working,... Appendix A presents the typical VLBR-like video sequences that are used as test sequences in the present document.

The first video coding standards considered higher bitrates. For instance, MPEG-1 [62] is intended to ensure domestic-use quality (CD-ROM use) under 1.5 Mbit/s, and MPEG-2 [62] proposes digital TV broadcasting below 10 Mbit/s or HDTV at 60 Mbit/s. So, what constitutes the main difference between codecs working at such bitrates and VLBR ones?

The first difference was already introduced in table 1.1: in order to quickly reach low bitrates, lower spatial and temporal resolutions are used. But another major difference can be pointed out: to transmit (moving) pictures across high-bitrate channels, only the irrelevant part of the image(s) has to be suppressed, while VLBR compression schemes generally have to suppress more information. Therefore, VLBR video coding introduces more artifacts. A key issue for VLBR coding is thus an efficient management of the visual degradations that have to be accepted. The influence of the bitrate mainly arises at the residual level: very-low bitrates prevent the coder from sending a sufficient amount of residual information. The imperfections of the motion estimation & compensation phase are not entirely corrected and many artifacts remain visible.


1.4 Some (Very-) Low BitRate Codecs

It appears that new solutions have to be set up if one wants to achieve VLBR video coding: just using the same techniques as for high-bitrate coding would not compress the signal sufficiently, or would cause too many artifacts. At least the coding parameters and the tables for entropy coding should be adapted. Several codecs specifically devoted to VLBR have been designed. The present section aims at introducing three coders: two international standards, namely the existing H.263, which has been explicitly designed for VLBR, and the future MPEG-4 standard, which allows one to address VLBR although its primary goal is to provide the user with added functionalities; and an original trial from UCL, the COMIS scheme.

Only the main outline of these three algorithms is presented here. Readers interested in a more detailed presentation of these algorithms should refer to the cited bibliography. The various motion analysis tools utilized by the codecs are fully described in the next chapter.

1.4.1 H.263

The ITU-T Recommendation H.263 [96] is the very first VLBR codec able to ensure good subjective quality at low bitrates. This is why it served as an anchor in the MPEG-4 tests (cf. Section 1.4.3.3).

Figure 1.3 depicts the overall scheme of the H.263 coder. This scheme is very similar to the generic one of figure 1.2. H.263 mainly consists in a particularization of the MPEG-1 and MPEG-2 codecs: every picture is divided into 16×16-pel macroblocks (MB). The main difference is that all parameters (entropy coding tables,...) have been specifically tuned for low bitrates.

Every picture is labelled intra or inter but, for every macroblock, the coding control unit decides on the most efficient way to code it. In addition, at least once every 132 times it is coded, a macroblock shall be coded in intra mode in order to prevent looped error propagation, even if the inter mode is more efficient or if the rest of the picture is inter-coded.

The various types of macroblocks are presented in table 1.2. The specificities of every mode are detailed in the following sections.


[Figure: H.263 encoder - video in, transform (T) and quantizer (Q), with inverse quantizer and inverse transform feeding a picture memory with motion-compensated variable delay (P), all under a coding control unit (CC); the side information sent to the video multiplex coder is p (flag for INTRA/INTER), t (flag for transmitted or not), qz (quantizer indication), q (quantizing index for transform coefficients) and v (motion vector).]

Figure 1.3: H.263 encoder

1.4.1.1 Intra Macroblocks

H.263 belongs to the transform coding class of algorithms (Chapter 10 of [113]). The macroblocks to be intra-coded are linearly transformed by the Discrete Cosine Transform (DCT, [2]) that decomposes the signal into its different frequency components. Every coefficient is then separately quantized and the set of coefficients is entropy-coded. The DCT is applied to blocks of 8×8 pels, i.e. a macroblock is made out of four DCT blocks for the luminance and two DCT blocks for the chrominance component (one for Cb, one for Cr). The quantization matrix applied to the frequency-based matrix of coefficients can be regularly updated during the encoding process. There are thus three ways of transmitting an intra macroblock, as illustrated in the first three lines of table 1.2. "Stuffing" means that the macroblock remains exactly the same as in the previous frame.

Picture type   MB type    DCT coef.   Motion vector   Residual coding   Change quant.
Intra          Intra
Intra          Intra
Intra          Stuffing
Inter          Intra
Inter          Intra
Inter          Inter
Inter          Inter
Inter          Inter
Inter          Stuffing

Table 1.2: Macroblock types and included data elements in H.263

1.4.1.2 Inter Macroblocks and Residues

In inter mode, a motion vector is first searched for, so as to determine the origin of the macroblock. This is achieved by a Block Matching Algorithm (BMA, cf. Section 2.5.1). Then, according to the relevance of the result, the macroblock is either coded in intra mode or via the motion vector and (possibly) additional residues: one of the inter-picture modes of table 1.2 is selected. In the case of inter coding with residues, the DFD is coded as an intra macroblock with the DCT.
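For illustration, a full-search block matching of this kind can be sketched as follows (a generic SAD-based search over a ±7 pel window; the exact H.263 procedure differs in its details):

    import numpy as np

    def block_match(current, reference, x, y, block=16, search=7):
        """Full-search BMA: best (dy, dx) for the block at (y, x) under a SAD criterion."""
        target = current[y:y + block, x:x + block].astype(int)
        h, w = reference.shape
        best, best_vec = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = y + dy, x + dx
                if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
                    continue                  # restricted vectors: stay inside the image
                candidate = reference[ry:ry + block, rx:rx + block].astype(int)
                sad = np.abs(target - candidate).sum()
                if best is None or sad < best:
                    best, best_vec = sad, (dy, dx)
        return best_vec, best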

1.4.1.3 Options

In addition to this basic algorithm, four options can be used to improve the quality of the decoded images (at the same bitrate). Each option may be used separately or in combination with some others. The options are:

- Unrestricted motion vectors: the BMA can select vectors that locate the origin of a MB outside the reference image. The last line of pels on the image border is then reproduced (for more details, see Section 2.5.1).

- Arithmetic coding [60]: instead of Huffman codes [38], a more sophisticated entropy coding can be used.

- Advanced prediction mode: if necessary, a motion vector can be assigned to every 8×8 block. In addition, the motion reconstruction is achieved with overlapping (cf. Section 2.5.2).

- PB-frames: a PB-frame consists in two pictures being coded as one unit. A predicted (P) frame is a normal inter-coded one, while a bi-directionally predicted (B) frame is computed on the basis of both the previous frame and the next P frame (which is located in the future of B along the time axis).

1.4.1.4 Future Improvements: H.263+

Thanks to its fine tuning, H.263 already achieves very good results, combined with low complexity and fast computation. A software coder able to treat 5 frames of 144×176 pels per second and a decoder working faster than 30 frames/s are provided by Telenor Corp. at http://www.nta.no/brukere/DVC/.

Moreover, Bjontegaard [9] has made a few improvements that are to be added to the existing standard in order to define a new one: H.263+.

1.4.2 "COMIS", the UCL Approach

Initially developed by M.P. Queluz, B. Simon and B. Macq in the context of the European COST 211ter research group [112], and further refined by C. Devleeschouwer, T. Delmot, X. Marichal and B. Macq [70], the UCL has proposed an original scheme for VLBR video coding in which spatial information and temporal changes are encoded using similar tools: binary trees are used to transcribe a multiscale decomposition of the information, which explains the name of the algorithm: "COding on Multiscales Image Sequences" (COMIS). A complete description of the algorithm may be found in [65, 68].

Designed as a VLBR codec, COMIS tries to combine the cheap block-based picture description provided by the multiscale representation and a region-oriented understanding of the pictures, in order to improve the analysis and the reconstruction stages. Figure 1.4 presents the codec scheme. One can notice that a "region model" box is present in both the encoder and the decoder. Therefore, no object information needs to be transmitted: the object detection can be simultaneously generated on both sides (for instance, via a morphological watershed procedure [134]), every time the decoder receives additional picture information.

[Figure: COMIS codec - preprocessing at the coder, a "region model" box on both coder and decoder sides with an optional communication channel between them, and reconstruction with user's interaction at the decoder.]

Figure 1.4: Scheme of the COMIS codec

An original bitrate regulation scheme [19] then gives priority to the regions that are considered subjectively more important.

The following sections present the way images are coded, and the peculiarity of COMIS, which voluntarily simplifies images (both in intra and inter mode) in order to reach VLBR with a good subjective quality.

1.4.2.1 Intra Images

Figure 1.5 describes the way a picture is treated in intra mode. The following paragraphs provide explanations about its different parts. All the images presenting intermediate results will refer to a letter (A to F) of the diagram.

Tree decomposition. The objective of the spatial segmentation is to identify the regions with uniform luminance or with random textures. At first, the image is split into non-overlapping 16×16 blocks4. Starting from such large blocks, the aim is to obtain homogeneous block components by a succession of binary decisions. A binary tree ensures the correspondence with a multiscale representation of the picture (Figure 1.6).

4 Any size 2^x would be convenient: if 16×16 is used, it is because analysis of test images [128] has shown that larger blocks are always inhomogeneous in a QCIF (144×176 pixels) picture.

[Figure: intra coding chain - original image (A) goes through the tree decomposition (B); a watershed computation yields region labels (C); interest criteria yield interest labels (D); a merge procedure (E) is followed by the reconstruction (F).]

Figure 1.5: Block diagram to intra-code a picture in the COMIS coder

[Figure: binary tree of 0/1 split decisions with the corresponding variable-size block grid.]

Figure 1.6: Example of binary tree segmentation

The split process is repeated until reaching homogeneous blocks or a block of predefined minimal size. Once a block is considered homogeneous enough, the luminance of all its pixels is replaced by the mean luminance of the block (plus midtread quantization) and is attached to the appropriate leaf of the tree. A criterion (see references) is added to prevent regions with random textures from splitting. Such regions are assumed not to be perceptually important and are also described by their mean value.
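A minimal sketch of such a split procedure (for illustration; a plain variance threshold stands in for the actual homogeneity and texture criteria of the cited references):

    import numpy as np

    def split(block, tree, min_size=2, var_thresh=100.0):
        """Recursively split a block (alternating cut directions) until homogeneous;
        homogeneous leaves are replaced by their mean luminance."""
        h, w = block.shape
        if block.var() <= var_thresh or max(h, w) <= min_size:
            tree.append(0)                      # leaf: stop splitting
            block[:] = round(block.mean())      # mean value attached to the leaf
            return
        tree.append(1)                          # internal node: one binary decision
        if h >= w:                              # cut the longer side in two
            split(block[:h // 2, :], tree, min_size, var_thresh)
            split(block[h // 2:, :], tree, min_size, var_thresh)
        else:
            split(block[:, :w // 2], tree, min_size, var_thresh)
            split(block[:, w // 2:], tree, min_size, var_thresh)

    image = np.random.randint(0, 256, (144, 176)).astype(float)   # QCIF-sized luminance
    tree = []
    for by in range(0, 144, 16):                # initial non-overlapping 16x16 blocks
        for bx in range(0, 176, 16):
            split(image[by:by + 16, bx:bx + 16], tree)
    # `tree` is the binary structure to entropy-code; `image` now holds the leaf means.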



[Figure.]

Figure 1.7: Result of the tree decomposition - (left) "Foreman", (right) artificial image - (top) original A, (center, bottom) multiscale decomposition B

Figure 1.7 illustrates the principle and the result of the decomposition.


[Figure.]

Figure 1.8: Result of the tree decomposition (B) with coarser thresholds - "Foreman" (left) and an artificial image (right)

One of its advantages is the use it can make of the correlation between the luminance and chrominance components: for color images, the same tree is used to describe the three components.

If the reader has a close look at the "Foreman" decomposition (image 1.7 (center)), he or she will notice that the image is rather complicated. This complexity has to be reduced to transmit such a picture over a VLBR channel. A first way to achieve such a reduction of the complexity is to use coarser thresholds (Figure 1.8).

The visual result is not particularly pleasant. The solution adopted here is to interpret the pictures on a region basis: after a segmentation step, all regions are classified according to their subjective relevance, and the algorithm tries to keep a good quality on the interesting regions (e.g. the speaker's head,...) while coarsely coding the other regions.

Interesting regions determination. For the purpose of region selection, an interest criterion is defined [72, 20, 19]. It can automatically distinguish between what is important and what is not. Three criteria are combined so as to provide the final classification:

- the image border criterion: it is based on the fact that the most important part of an image in a video sequence is correctly centered, and that the eye precision is best in the center of the vision area. The criterion gives less priority to regions with most of their pixels located along image borders.

- the interactive criterion: it can reinforce the previous criterion or fight against it. Its principle is to eliminate regions with no pixels inside a pre-defined window of the image. This criterion is interactive as the user can easily select another "interest window".

- the face texture criterion: mainly designed for video-phone or video-conference, this criterion rejects all the regions whose chrominance components do not coincide with a set of skin samples.

These criteria are combined using the Fuzzy Logic Theory [80] and result in a classification of the image regions that is used in the next step of the algorithm (Figure 1.9 (b) depicts the final classification).
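For illustration only, the combination could take the following shape in pseudocode; the membership functions and the min/max operators below are assumptions, as the precise fuzzy rules are given in the cited references [72, 20, 19]:

    def interest(region, window, skin_samples):
        """Hypothetical fuzzy combination of the three criteria for one region
        (memberships in [0, 1]; min acts as a fuzzy AND, max as a fuzzy OR)."""
        border = 1.0 - region.fraction_of_pixels_on_border()   # image border criterion
        window_hit = 1.0 if region.overlaps(window) else 0.0   # interactive criterion
        skin = region.chrominance_similarity(skin_samples)     # face texture criterion
        return min(border, max(window_hit, skin))

    # Regions are then ranked by this score, and the best ones are protected
    # during the merge procedure described next.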

Merge procedure. The tree description of the image can be considered as a split procedure. A classical way of dealing with such split algorithms is to combine them with a merge procedure [59] that aims at homogenizing their result and correcting the split errors engendered by local traps.

The split tree, a watershed analysis of the original image and an associated region classification are the inputs of the present merge procedure. The aim of this algorithm is to homogenize the value of all the subblocks belonging to a region that has not been labeled "interesting". Figure 1.9 shows the result of the complete algorithm with reference to figure 1.5. One can notice a dark spot on the nose of the "Foreman": it results from a bad classification of the associated region. Nevertheless, this problem would be directly solved if no over-segmentation were present: classification would be much more stable if it dealt with coherent regions (for instance one region for the entire head, and not twenty).

According to the available bitrate, a different percentage of regions may be protected (Figure 1.10).

Parameters coding. The resulting parameters then have to be transmitted. Two elements are necessary to fully describe the image: the binary tree structure and the attached leaf values. The tree is entropy-coded with the M-Coder [77, 111] previously developed [67]. The leaves obtained in the analysis-transcription phase (the luminance and chrominance means) are first decorrelated by optimum linear prediction, using the nearest neighbors. The resulting prediction errors are encoded by the Universal Variable Length Coder (UVLC [64]).

[Figure.]

Figure 1.9: Split and merge procedure on "Foreman": (a) original A, (b) interesting regions determination D (dark = interesting, light = not interesting) and (c,d) merged image and associated multigrid E

[Figure.]

Figure 1.10: "Foreman": comparison between a merge with region protection (left) and a full merge (right).

Reconstruction. Describing tree-segmented images as the juxtaposition of variable-size blocks (with one single mean value associated to every block) leads to large blocking artifacts (cf. figures 1.7, 1.8 and 1.9), and leads to the conclusion that the description should not be the mere reverse operation of the transcription. Overlapping functions, which respect transitions between small and large blocks but are also spread out in low-resolution areas, are used. The step impulse response is replaced by a longer impulse response, but only locally, in order to respect transitions in the image. Low-resolution areas receive an important overlap, while high-resolution areas need to keep the interpolation functions disjoint.

The visual improvement offered by this reconstruction technique is depicted in figure 1.11.

1.4.2.2 Inter Images

The Adaptive Block-Matching Algorithm. The overall motion estimation procedure used in the COMIS codec is based on the BMA (cf. Section 2.5.1), but some substantial improvements have been brought by P. Queluz. The resulting scheme manages a hierarchical structure of block sizes. It is thus called an Adaptive Block Matching Algorithm (ABMA, [110]) and is presented in detail in Section 3.1.

[Figure.]

Figure 1.11: Overlapping reconstruction - (top) original A, (center) decoding B,E, (bottom) reconstruction F

Image pre-treatment prior to motion estimation. The idea is still to voluntarily simplify the images prior to motion estimation and coding, and the new image is first treated as an intra-image. Once arrived at point E of figure 1.5, it is used for motion estimation. The impact of such an approach is studied in Chapter 4.

To intra-treat the picture, the determination of interesting regions (cf. Section 1.4.2.1) is needed. Two criteria have been added so as to take the motion between two successive frames into account5.

- a motion criterion: based on a succession of motion estimations between the images at the highest frequency (even if the sequence is coded at 8.33 Hz, the criterion will compute three motion estimations between images at 25 Hz), it eliminates regions with no motion and no texture (the Human Visual System does not focus on these areas) as well as the highly textured regions with a very important motion (which are too noisy to be correctly perceived).

- the continuity criterion: some tracking helps the algorithm take the classification of the region in the previous image into account, in order to ensure the temporal stability of the final criterion.

5 A motion estimation is thus performed to help computing these criteria. Nevertheless, this motion estimation is not the one used later by the coding algorithm and may even use a different technique.

Motion transcription and coding. The high level of smoothness provided by the ABMA method justifies the use of contour/content coding of the motion vector field. The same kind of transcription and coding is used to take advantage of what is done in the intra mode.

1.4.3 The Emerging MPEG-4 Standard

The ISO group ISO/IEC JTC1/SC29/WG11 (MPEG) has been working on a new work item since November 1992. After a lot of procrastination, MPEG-4 finally found a suitable match at the Singapore meeting in November 1994. The first Proposal Package Description (PPD) was established, focusing on three major trends of today's world:

- the trend towards wireless communications,
- the trend towards interactive computer applications, and
- the trend towards the integration of audio-visual data into an ever-increasing number of applications.



This PPD considers that Multimedia is the convergence point of three industries, i.e. the computer, telecommunications and TV/film industries. Although focusing at first on all the very-low bitrate applications [103], MPEG-4 has decided to extend its scope so as to be a much more open and evolutive system: flexibility and extensibility are its driving forces.

The following sections highlight [57] the key points of MPEG-4 motivation(s) and principle(s). More details may be found in the Image Communication Special Issue [104] that explores the various aspects of the future standard from a video point of view.

1.4.3.1 Developments to Be Supported

In addition to the increasing need for audio-visual communications at very-low bitrate [103], some new trends concerning the use of audio-visual information have to be taken into account. These are mainly:

- The way images are produced. Computer-generated images are very commonly used. MPEG-4 addresses both through "Synthetic & Natural Hybrid Coding" (SNHC [26]).

- The way multimedia content is delivered. VLBR is needed across "Global System for Mobile communications" (GSM) networks, but the same content could also be transmitted across ADSL (Asymmetric Digital Subscriber Loop) or ATM (Asynchronous Transfer Mode) nets, at a rate of several Megabits per second. Such a large range of bitrates requires scalability. Moreover, a scalable description of the information will enable the decoder to adapt itself to the processing power of the machine.

- The consumption of multimedia is rapidly evolving as it takes an ever larger place in the way people communicate. The main changes in consumer habits are the need for interactive supports, possibilities to re-use the information, and software implementations.

1.4.3.2 A New Challenge for the Representation of Audio-Visual Information

In order to fill all these needs, MPEG-4 represents an audio-visual scene in a new way [100]: rather than considering a frame-based video, it defines the scene as a coded representation of Audio-Visual Objects (AVO). An AVO can be a Video Object Component (VOC), an audio one (AOC) or a combination of both. These AVOs obey a scenario describing their relations in space and time. The scene depicted in figure 1.12 can be interpreted as a combination of five AVOs: the background plus four elements.

[Figure.]

Figure 1.12: Audio-visual scene described in terms of AVOs

All these objects are separately encoded and the various object bitstreams are multiplexed before being transmitted (Figure 1.13). In addition to this object-based representation of the information, a scene compositor ensures the possibility of re-using objects and extends the interaction capabilities of the scheme. Of course, this compositor is standardized (MPEG-4, just like many image coding standards, is a decoding standard), unlike the segmentation stage prior to coding (see Section 1.5 for a brief discussion of this topic).

1.4.3.3 Video Coding in MPEG-4

H.263 was used as an anchor for the subjective tests established with all the proposed codecs. Finally, the tools used to encode and compress the video information in MPEG-4 are very similar to the ones of H.263 (cf. Section 1.4.1), with the slight difference that they are applied to objects whose borders should also be described.



A major innovation of MPEG-4 is to guarantee access to multimedia information from every part of the world. Therefore, the bitstream structure is specific in a twofold way: it includes coding tools and a system layer that can cope with severe channel errors, and it allows scalability at the bitstream level. This scalability ensures universal accessibility as any decoder can select the information it can decode.

The first Committee Draft (CD) has been set up (October 97, Fribourg meeting), and the Draft International Standard is planned for November 98. However, in order to be able to include more sophisticated tools that still need development and assessment (like the matching pursuits [90]), a second CD will be established in November 98 and will become a Draft International Standard (MPEG-4 version 2) in November 1999. People interested in being kept informed of the rapid evolution of the standard should refer to:

- the general MPEG web page (http://drogo.cselt.stet.it/mpeg/),
- and the video group one (http://wwwam.HHI.DE/mpeg-video/).

[Figure: MPEG-4 system - stored or local objects (coded/uncoded) are separately encoded, multiplexed, transmitted, demultiplexed, decoded and assembled by a compositor; video and audio objects, with user interaction at the compositor.]

Figure 1.13: Schematic overview of an MPEG-4 system


1.5 Discussion

This first chapter has presented different facets of video compression, with emphasis on very-low bitrates. A clear distinction has been made between intra-coding and inter-coding. Inter-coding, which tries to exploit the spatio-temporal correlation between successive images, offers the framework of the present thesis. VLBR video transmission is characterized by the impossibility for the decoder to reach a very good approximation of the original images. It is commonly admitted that the information must be deteriorated in order to reach the requested bitrates.

Three VLBR codecs have been briefly reviewed and some new trends can be emphasized. With MPEG-4, the video coding community has realized that compression is not the only aim anymore. The convergence of both the ever-increasing channel capacities and the compression performances appears to be sufficient to fill most needs. Yet, the market seems to expect new features like interaction, manipulation, software support, copyright protection,...

Is it then the end of research in video coding? Definitely not! But new requirements have to be taken into account. Among the various emerging topics, one may cite:

- Joint source and channel coding for specific transmissions in error-prone environments.

- As stated in the description of MPEG-4 (cf. Section 1.4.3), this future standard uses segmented objects but does not standardize how to obtain them. Tools to perform segmentation and tracking prior to coding will offer industrial competitors a field in which to distinguish themselves from one another.

- MPEG has decided to move beyond pure coding: the next standard to come, MPEG-7 [101], deals with image analysis in the framework of audio-visual search engines [102], which promises to raise some very interesting new challenges.

Aware of this evolution, many people have already started working on these new topics. Among them, one may cite the European COST 211ter (quater) project [1].


Chapter 2

Motion in the Framework of Video Coding

The extraction of motion information from a sequence of time-varying images has numerous applications in the field of image processing: medical image analysis, mobile robot navigation, automatic tracking of moving objects, interpretation of atmosphere observations (remote sensing), image interpolation and restoration,... as well as digital video compression. As seen in the introduction to (VLBR) video coding (cf. Chapter 1), motion estimation & compensation plays a key role in video compression as it provides the largest compression gains.

This chapter does not claim to propose a complete overview of all existing motion estimation techniques in the video coding context. This task has already been successfully achieved, with numerous results and comments, in specialized books like [127] or articles like [108, 31], in which extensive references are given. Yet, it seems important to review the most classical techniques in order to present the state of the art and to highlight the contribution put forward by the present work.

Another aim is to emphasize the distinction between the estimation and the compensation stages of a codec. Figure 2.1 briefly reminds one that estimation is performed at the encoder side in order to extract the motion parameters of the video sequence, while the decoder uses the estimated motion information during the compensation phase. A parallel can be established between the general principles of video compression (cf. Section 1.2) and motion estimation & compensation processes, which are based on three main steps:
which are based on three main steps:


1. The motion estimation: it is the analysis performed on the encoder side. Two images at time t and t-1 (or even more reference images) are compared, and the algorithm attempts to find the motion between the two frames. It results in a motion description: dense or discrete motion field, affine parameters,...

2. The transcription phase takes the result of the motion estimation and tries to describe it in the most compact representation. The aim of this step is efficient coding (so as to reach the required bitrate). This step is reversed at the decoder in order to get the initial motion estimation parameters back. It can be done with or without losses, according to the type of transcription.

3. The motion compensation takes place at the decoder (and also at the encoder, to simulate the decoder behavior). It aims at predicting the image at time t on the basis of both the motion parameters and the image at time t-1 (or more images).
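For a block-based motion description (one integer-pel vector per 16×16 block, as in the BMA of Section 2.5.1), the compensation step of point 3 can be sketched as follows (an illustration; border handling by clipping is an assumption):

    import numpy as np

    def compensate(reference, vectors, block=16):
        """Predict the frame at time t by displacing blocks of the frame at t-1."""
        h, w = reference.shape
        predicted = np.zeros_like(reference)
        for by in range(0, h, block):
            for bx in range(0, w, block):
                dy, dx = vectors[by // block, bx // block]   # vector decoded from the stream
                src_y = np.clip(by + dy, 0, h - block)       # assumption: clip at borders
                src_x = np.clip(bx + dx, 0, w - block)
                predicted[by:by + block, bx:bx + block] = \
                    reference[src_y:src_y + block, src_x:src_x + block]
        return predicted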

It seems important to take this distinction into account when evaluating motion estimation techniques in a video coding context. Both estimation and compensation have to be evaluated with regard to their computational costs. This is particularly important when one has real-time industrial applications in mind. But, while the aim of estimation and transcription is the extraction of the motion parameters and their cheap transmission (because of the (VLBR) coding context), compensation has to be evaluated on the basis of the visual quality of the compensated image.

But before evaluating some classical techniques, the first section of this chapter explains why motion computation is an estimation problem. The reason is that the motion present in the real scene rendered by the image sequence is not directly observable: only its effects on the pictured scene are observable. Consequently, it cannot be measured but only estimated. Moreover, the problem is said to be ill-posed in the sense that the available data insufficiently constrain the solution, which may be non-unique or non-existent.

The Rate-Distortion theory will demonstrate in Section 2.2 the efficiency of motion estimation & compensation for video compression, provided it is performed with a sufficient accuracy. Section 2.3 presents the hypotheses that may be used to sufficiently constrain the problem. It also expands on the various methodologies that may be adopted to solve it. Section 2.4 introduces some background techniques in motion estimation & compensation (for image coding), while the two particular techniques that are used in the present work are fully described in sections 2.5 and 2.6. Finally, Section 2.7 concludes the chapter.

[Figure: coder side - previous image(s) and the source image at time t feed the motion analysis (estimation); the resulting motion parameters are transcribed on a representation grid, coded and sent over the transmission channel; decoder side - after decoding, motion compensation applies the parameters to the previous image(s) to produce the estimated image at time t.]

Figure 2.1: Distinction between motion estimation and compensation

2.1 Image Formation and Motion

An image is a two-dimensional (2D) pattern of brightness resulting from the projection of a three-dimensional (3D) scene onto a 2D plane. This projection, which may be a perspective one or an orthographic one, brings about some loss of depth information, which engenders several problems such as aperture, occlusions,...

If one considers the way a physical scene is filmed (Figure 2.2), two coordinate systems have to cooperate: the camera and the image plane.

[Figure: perspective projection geometry - camera axes X, Y, Z with focal length f; a 3D object point R is projected to r = (x, y) on the 2D image plane along the camera axis Z.]

Figure 2.2: Perspective projection geometry: from 3D to 2D

The first is a 3D Cartesian coordinate system with its origin at the camera lens and the Z-axis corresponding to the camera axis. The second is the 2D (and space-limited) system of the image plane. Let R = (X Y Z)^T be the position vector of a point in the 3D space and r = (x y)^T be the position vector of the projected point on the image plane. Under a perspective projection, the image r of R is the intersection of the plane with a ray linking the camera lens and R. Considering the similar triangles in the projection geometry, the following relations are obtained:

$$\frac{f}{x} = \frac{Z}{X}, \qquad \frac{f}{y} = \frac{Z}{Y}, \qquad (2.1)$$

where f is the distance between the camera and the image plane, referred to as the focal length, and related to the focus of expansion [51]. The perspective projection equation for every point is thus:

$$r = \begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z} \begin{pmatrix} X \\ Y \end{pmatrix}. \qquad (2.2)$$
: (2.2)


Under orthographic projection, the ray also starts at R but is perpendicular to the image plane (parallel to the camera axis). The orthographic projection equation is:

$$r = \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} X \\ Y \end{pmatrix}. \qquad (2.3)$$

It is clear that perspective projection reduces to orthographic projection if the focal length f of the camera is much larger than the depth Z. The assumption of orthographic projection is also valid in all cases where the variation of the object shape is small or when the pictured object is small, both in comparison with the focal length f. In these situations, the depth may be considered constant and a kind of parallel projection appears, but a scaling factor is still present. The orthographic projection implies a further reduction of the information about the 3D scene compared with the perspective projection, since the depth information is totally lost.
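For illustration, equations (2.2) and (2.3) amount to the following (with arbitrary numeric values):

    def perspective(R, f):
        """Perspective projection (2.2): r = (f/Z) * (X, Y)."""
        X, Y, Z = R
        return (f * X / Z, f * Y / Z)

    def orthographic(R):
        """Orthographic projection (2.3): the depth Z is simply dropped."""
        X, Y, Z = R
        return (X, Y)

    print(perspective((2.0, 1.0, 10.0), f=1.0))   # (0.2, 0.1): scaled by f/Z
    print(orthographic((2.0, 1.0, 10.0)))         # (2.0, 1.0): depth information lost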

A more difficult question of image formation is the determination of the brightness of a particular point of the image. This very complicated problem will not be tackled here, since it involves many other data: the source of light, the angle of reflection on the object, the reflection properties of the object, the spectral sensitivity of the sensor,... [48]

Nevertheless, the fact that a 3D scene is projected onto a 2D image explains that a difference exists between the apparent and the real motion (Section 2.1.1). Moreover, this difference and some other considerations about the nature of images prevent one from solving all ambiguities of motion estimation (Section 2.1.2).
<strong>estimation</strong> (Section 2.1.2).<br />

2.1.1 Apparent versus Real Motion

In the image formation process, the 3D scene is projected by the camera onto the 2D image plane. Thereafter, the 3D motion information of the objects is also projected onto the image plane. The presence of motion manifests itself on the image plane by changes of the intensity values of the pixels along the time axis. These changes are used to recover the motion of the objects.

The 3D motion of the objects is called the real motion, in opposition to the 2D velocity field that represents the apparent motion of the objects on the image plane and is referred to as optical flow [49]. Both are different from the velocity field that would result from the projection on the image plane of the true 3D velocity field. Illumination changes, shadows and occlusions are phenomena that are interpreted as motion effects by the optical flow.

Figure 2.3: Illustration of optical flow: (a) Sphere at time $t-1$ (b) Sphere at time $t$ (c) Optical flow

Optical flow is exactly what motion estimation tries to produce as a result of its analysis. It is defined as follows (from [44], illustration in figure 2.3):

Optical Flow: Image in which the value of each pixel is the estimated projected translational velocity arising from a surface point of an object in motion relative to the camera. Some pixels may have no optical flow information because the projected velocity is not always estimable.

One of the aims of object-oriented coding [84, 85], in the framework of motion estimation & compensation, is to establish 3D models of the scene in order to overcome this problem of restricted access to the apparent motion only. As such models are out of the scope of the present work, they will not be detailed, even though they are often close to the parametric models (cf. Section 2.4.5).

2.1.2 Unsolvable Problems of Motion Estimation

Many different algorithms exist to compute 3D and 2D motion from image sequences. However, many questions remain open because of the boundary effects that make motion estimation a non-trivial problem. Hereunder are some well-known phenomena that cause trouble when one tries to estimate the true 2D motion field:


Figure 2.4: The aperture problem

Figure 2.5: The bicycle wheel: ambiguity in the correspondence problem because of aliasing

The aperture problem is illustrated on figure 2.4. Any operator that sees the moving edge through a local window A can only compute the component of motion perpendicular to the edge. It means that on figure 2.4, any of the vectors $e$, $f$ or $g$ would be suitable. The optical flow is therefore not uniquely determined by the local information in the changing image. The problem is not sufficiently constrained: it is ill-posed.

The correspondence problem, depicted on figure 2.5, prevents estimation algorithms from correctly relating the intensity values of successive frames, and results from the spatio-temporal sampling achieved during digital image acquisition. Indeed, it is not always possible to respect the Nyquist frequency [52], particularly in the case of high spatial frequencies undergoing fast motions. A typical illustration of this kind of temporal aliasing is the "wheel" (figure 2.5): if the angular velocity of the wheel is greater than $\pi/n$ times the frame rate, where $n$ is the number of spokes, there will be an ambiguity in the correspondence process and the wheel seems to turn in the opposite direction.

Figure 2.6: The optical flow is not always equal to the motion field. (a) Null optical flow during non-null motion field. (b) Reverse situation.

Because motion is estimated by establishing correspondences between successive image intensities, any noise (camera noise, quantization noise, ...) will cause additional difficulties. Moreover, illumination changes will be interpreted as motion effects and will distance the optical flow from the true motion field. Figure 2.6(a) shows a smooth sphere rotating under constant illumination: the projected image does not change, yet the true motion field is non-null. On figure 2.6(b), a fixed sphere is illuminated by a moving source: shadow changes engender an optical flow, even though the true motion field is null.

Occlusions between moving objects, such as the appearance or disappearance of object parts (because of (un)covering), create regions where the observed intensities at time $t$ do not have any correspondence in image $t-1$. And the partial loss of the depth information does not provide enough information to recover the true motion.

2.2 Rate Distortion Theory

Before eventually reviewing some techniques of motion estimation & compensation, it seems important to highlight its main goal in a video compression context, which is to reduce the amount of data necessary to transmit moving pictures across bandwidth-limited channels. The various difficulties that might arise when estimating motion have been presented in the previous section, but motion estimation & compensation is still used and regarded as the most compressing tool of a complete codec.

To justify why motion estimation is used in video coders, Tziritas and Labit [127] use Rate Distortion theory (a brief summary of which may be found in appendix B). With some additional hypotheses, this theory clearly indicates the advantages and limits of motion estimation in the video compression framework.

With the results of appendix B.4, one can obtain a function of the distortion $D$ that defines the rate $R$ required to code a picture in intra mode:

$$R = \begin{cases} \dfrac{1}{2}\log_2 \dfrac{\sigma_I^2}{D} & \text{if } 0 \le D \le \sigma_I^2, \\ 0 & \text{if } D > \sigma_I^2, \end{cases} \qquad (2.4)$$

where $\sigma_I^2$ is the input image variance.

When inter image coding is performed without motion compensation, the image $I(x, y, t-1)$ is merely used as a prediction of $I(x, y, t)$. The residual coding then requires a bitrate of:

$$R = \frac{1}{2}\log_2\left( \frac{2\sigma_I^2\left(1 - \rho\, e^{-2\pi f_0 \sqrt{u^2 + v^2}}\right)}{D} + 1 \right), \qquad (2.5)$$

where $(u, v)$ is the true displacement, $\rho$ (defined in equation (B.29) of appendix B.4) is the temporal correlation of the two images, and $f_0 \approx 0.05$. Equation (2.5) outperforms equation (2.4) only if:

$$\sqrt{u^2 + v^2} < \frac{1}{2\pi f_0}\ln(2\rho) \qquad (\rho > 0). \qquad (2.6)$$


This condition clearly points out that inter image coding without motion compensation is only worthwhile if the displacement is small in comparison to the spatial variations of the picture. It also depends on the temporal correlation: in case of a scene cut, or of large displacements, $\rho$ is either very low or null and it is better to intra code the new picture.

Inter image coding with motion compensation is achieved with an estimated motion field $(u', v')$ different from $(u, v)$. We denote by $(d_x, d_y)$ the estimation error $(u' - u, v' - v)$. Such a coding also requires the coding of the residues¹. In this case, the rate is

$$R = \int_0^{1/\sqrt{2}} f \log_2\left( \frac{2\left(1 - \Phi(f_x, f_y)\right)\Phi_I(f_x, f_y)}{D} + 1 \right) df, \qquad (2.7)$$

where $\Phi(f_x, f_y)$ is the characteristic function of the motion estimation error $(d_x, d_y)$ and $\Phi_I(f_x, f_y)$ the power spectral density of the images (more details are available in appendix B.4). By comparing equations (2.4) and (2.7), one can demonstrate that inter image coding with motion compensation is more effective than intra coding only if

$$\sigma_d < \frac{\sqrt{2} - 1}{2\pi f_0} \qquad (f_0 > 0), \qquad (2.8)$$

where $\sigma_d^2$ is the variance of the motion estimation error $(d_x, d_y)$. Motion compensation techniques are thus only interesting if their accuracy is high enough with regard to the picture content. Finally, motion compensation improves inter image prediction (equation (2.5)) if

$$\sigma_d < \sqrt{u^2 + v^2}. \qquad (2.9)$$

These two last conditions not only justify the use of motion compensation in video coders whenever the motion estimation is precise enough, but also explain why all coders use several modes of transmission. They provide criteria to decide which mode should be used according to the spatio-temporal activity of the image sequence.

¹ In this calculus, the coding cost of the motion information is neglected.
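As a purely schematic illustration, the reconstructed conditions (2.6), (2.8) and (2.9) can be turned into such a mode-decision hint. The function below is our own sketch, not part of the thesis or of any standard, and it assumes $\rho$, $f_0$ and $\sigma_d$ are already known or estimated:

```python
import math

def mode_hints(sigma_d, u, v, f0, rho):
    """Evaluate the three simplified rate-distortion criteria for one picture."""
    d = math.hypot(u, v)  # magnitude of the true displacement
    return {
        "inter beats intra (2.6)":
            rho > 0 and d < math.log(2.0 * rho) / (2.0 * math.pi * f0),
        "MC inter beats intra (2.8)":
            sigma_d < (math.sqrt(2.0) - 1.0) / (2.0 * math.pi * f0),
        "MC improves plain inter (2.9)":
            sigma_d < d,
    }

# A large displacement with good estimation accuracy: plain inter coding
# fails criterion (2.6), while motion compensation satisfies (2.8) and (2.9).
print(mode_hints(sigma_d=0.3, u=3.0, v=4.0, f0=0.05, rho=0.9))
```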

2.3 Practical Approaches to Motion Estimation

If one is now convinced of the possible impact of motion estimation & compensation for video compression purposes, it is time to reveal how this complex problem can be solved. As the problem is an ill-posed one, constraints have to be added so as to reach a unique solution. The type of constraints is a first criterion that distinguishes one technique from another. The second criterion is the methodology that is adopted to solve the problem.

Recall that in a video coding context the aim of motion estimation is to detect the motion field $M$ (the optical flow) that is present between two successive images at time $t-1$, $I(x, y, t-1)$, and $t$, $I(x, y, t)$, where $(x, y)$ defines the pixel position in the image. Motion compensation then applies this motion field $M$ to the reference image $I(x, y, t-1)$ in order to obtain a prediction $\hat{I}(x, y, t)$ of $I(x, y, t)$.

2.3.1 Additional Constraints

By venturing hypotheses about the nature of the scene objects, additional constraints can be formulated. These constraints establish an error criterion that is exploited by a minimization process. There are two main types of constraints: the preservation constraint, which assumes that the object properties in terms of reflection and luminance are kept constant, and the coherence constraint, which is bound to the notion of object cohesion.

2.3.1.1 Preservation Constraint

This constraint considers that, if a luminous point of the scene is visible at time $t-1$, it is also visible at time $t$. Moreover, it assumes that the luminance of a pixel is invariant with respect to the motion, i.e. that any temporal modification of the luminance distribution over the pixels is directly attributable to the pixels' motion. Such a hypothesis is correct when the scene illumination is constant and uniform, and when the objects' reflectance is Lambertian [133]. The preservation constraint has two classical formulations.

DFD-based formulation. The Displaced Frame Difference (DFD) expresses the difference between the luminance of the image at time $t$ and the luminance of the image at time $t + dt$ having undergone some displacement $(dx, dy)$²:

$$DFD(x, y, dx, dy, t) = I(x + dx, y + dy, t + dt) - I(x, y, t). \qquad (2.10)$$

² $(dx/dt, dy/dt) = (u, v)$ describes the optical flow at position $(x, y)$.


The preservation constraint consists in assuming that a motion vector $(dx, dy)$ that nullifies the DFD exists. If such a vector does not exist, the aim of the motion estimation is to determine the vector that minimizes the DFD. The methods that use such a minimization process are called correlation-based. To compute the value of the DFD over a given region, the criteria that are most frequently used are:

the absolute value of the DFD over all region pixels,

$$\sum_{\mathrm{region}(i,j)} \left| DFD(i, j, u, v) \right|; \qquad (2.11)$$

the squared value of the DFD over all region pixels,

$$\sum_{\mathrm{region}(i,j)} \left( DFD(i, j, u, v) \right)^2; \qquad (2.12)$$

both these criteria can be divided by the total number of pixels taken into account, and are then respectively called the Mean Absolute Error (MAE) and the Mean Square Error (MSE).

The first criterion is often used as it requires less computation than the others.
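A minimal NumPy sketch of these criteria (our notation, not the thesis'; the caller must keep the displaced block inside the frame):

```python
import numpy as np

def dfd(prev, curr, region, d):
    """DFD of (2.10) over a block region = (y0, x0, h, w) of the current
    frame, displaced by d = (dy, dx) in the previous frame."""
    y0, x0, h, w = region
    dy, dx = d
    moved = prev[y0 + dy:y0 + dy + h, x0 + dx:x0 + dx + w].astype(np.int32)
    fixed = curr[y0:y0 + h, x0:x0 + w].astype(np.int32)
    return moved - fixed

def mae(prev, curr, region, d):
    return np.abs(dfd(prev, curr, region, d)).mean()   # (2.11) / nb of pixels

def mse(prev, curr, region, d):
    return (dfd(prev, curr, region, d) ** 2).mean()    # (2.12) / nb of pixels
```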

Differential formulation. If one considers the image function $I$ continuous and differentiable, a first-order Taylor expansion provides:

$$I(x + dx, y + dy, t + dt) = I(x, y, t) + I_x(x, y, t)\,dx + I_y(x, y, t)\,dy + I_t(x, y, t)\,dt, \qquad (2.13)$$

where $I_i$ indicates the partial derivative of $I$ with respect to $i$. The combination of equations (2.10) and (2.13) leads to the optical flow equation:

$$I_x(x, y, t)\,u + I_y(x, y, t)\,v + I_t(x, y, t) = 0. \qquad (2.14)$$

The optical flow equation only allows one to compute the component of motion in the direction of the spatial gradient (cf. the aperture problem, Section 2.1.2), and requires additional hypotheses to suppress all uncertainties.


2.3.1.2 Coherence Constraint

This constraint assumes the cohesion of all the elements of a unique object. It is valid if the motion variation between the neighboring elements of an area is limited. It can be implicitly expressed in two different ways thanks to the neighborhood information: either with a region-based approach (all pixels of the region obey the same motion parameters), or by the adoption of iterative or recursive solving methods that propagate the estimates of the neighboring pixels.

The coherence constraint can also be explicitly expressed when restrictions are formulated about the nature of the motion (a priori information), or when regularization is ensured by smoothing criteria.

Chicken and Egg Problem. A remark should be made concerning the implicit formulation in terms of regions: on the one hand, segmentation is required in order to determine the various regions on which the coherence constraint should be applied. On the other hand, this segmentation should take the motion information into account so as to respect motion transitions. A chicken and egg problem arises, as motion estimation requires segmentation, which requires motion estimation. Consequently, emerging techniques try to jointly solve the two problems.

2.3.2 Estimation Methods

Estimating the motion between successive pictures generally consists in minimizing a function that expresses some of the constraints presented above. Minimization can be achieved in several ways. One can distinguish three main families of methods:

Differential methods, which are based on gradient measures. Direct differential methods aim at nullifying the gradient of the function to be minimized, while indirect differential methods converge towards a solution according to the gradient direction. Iterative (Section 2.4.2) and pel-recursive (Section 2.4.3) motion estimation algorithms are part of this class of methods.

Matching methods are based on an explicit search for the best match between two structures (one at time $t$ and another at time $t-1$). Any kind of primitive can serve as a structure: pixels, blocks of pixels, regions, segments, etc. The search for the best match generally involves trying all the solutions of the search space. Block-Matching (Section 2.5.1) and Hexagonal-Matching (Section 2.6.1) belong to this class.

Stochastic methods use random choices to drive the exploration of the parameter space. They include Bayesian estimation, Markov models (Section 2.4.4) and genetic algorithms.

These different approaches to the motion estimation problem can be variously implemented: although only one methodology is generally adopted, it may be of interest to use a hierarchy of models [92]. In addition, Section 2.3.2.1 introduces two special types of implementations that have been successfully applied to several methods: the multigrid and multiscale optimization methods. Before reviewing some techniques, Section 2.3.2.2 briefly tackles the choice of the sense of estimation.

2.3.2.1 Multigrid and Multiscale Optimization Methods

In order to avoid convergence to local minima and to speed up the convergence, the motion estimation methods described previously are sometimes coupled with multigrid or multiscale optimization techniques (figure 2.7).

Multigrid algorithms operate on a hierarchy of resolution levels (image pyramid) that are built with low-pass filtering and sub-sampling by a factor 2 in each direction. Multiscale methods are based on the original resolution of the image data but, like multigrid schemes, they produce a pyramid representation of the motion data.

To develop a multigrid or multiscale algorithm, several components must be specified:

the number of levels;

a restriction operation that maps a solution at a fine level to a coarser level;

a prolongation operation that maps from the coarse to the fine level;

a coordination scheme that specifies the number of iterations at each level of the pyramid and the sequence of prolongations and restrictions.

Figure 2.7: Multigrid and multiscale optimization

The coordination scheme which is most frequently used is a simple coarse-to-fine algorithm, where the prolongated coarse solution is used as a starting point for the next finer level. In this case, simple repetition and bilinear interpolation are the most commonly used prolongation methods. More sophisticated schemes implement a fine-to-coarse-to-fine [32] sequence in order to further refine the solution.

When multigrid methods are applied to motion estimation, low spatial frequencies are used to measure large displacements with a low accuracy. Higher-frequency information is then used to improve the accuracy of the estimation by incrementally estimating small displacements. Besides achieving a computationally efficient estimation, this also reduces the aliasing introduced by high spatial frequency components undergoing large motions.

Multigrid optimization has been successfully applied to iterative or pel-recursive approaches, as large displacements are generally not reachable with these methods. Multigrid has also been applied to the BMA, which gives a hierarchical search BMA. But of course the effectiveness of multigrid methods depends on the image content. If the image mainly contains high spatial frequencies, then, after low-pass filtering, there may be insufficient information to allow a reliable estimation. To overcome this limitation, multiscale methods can be used: the ABMA [109], presented in Section 3.1, illustrates this principle.

2.3.2.2 Forward versus Backward Estimation

To estimate the motion between two images at time $t-1$ and $t$, there are two possibilities: forward or backward motion estimation (figure 2.8). The search in image $t$ for a displaced object of image $t-1$ is a forward search, while the backward search consists in looking for an object of the present image $t$ in the previous one.

The backward sense is generally used with region matching techniques, so that the displaced regions cover the whole image surface. On the contrary, the forward sense allows both the coder and the decoder to exploit their memory so as to automatically create a partition of the image at time $t-1$. Moreover, some interpolation schemes and motion analysis tools use both senses of estimation so as to overcome prediction errors resulting from occlusions [33].


One must point out that in the H.263 [96] and MPEG [62] context, backward and forward are used with another meaning: when bidirectional coding is applied, a macroblock of image $t$ can be obtained with a forward prediction (from image $t-1$) or with a backward one (from image $t+1$).

Figure 2.8: "Forward" versus "backward" motion estimation

2.4 Background Techniques for Motion Estimation

2.4.1 Linear Regression

Linear regression uses both the preservation constraint of equation (2.14) and a translational model of displacement for determined regions. The resolution over all the region pixels is achieved by a least squares method, and provides the following solution:

$$(\hat{u}, \hat{v}) = \arg\min_{(u,v)} \sum_{(i,j) \in R} \left( I_x(x, y, t)\,u + I_y(x, y, t)\,v + I_t \right)^2, \qquad (2.15)$$


where $I_i$ denotes the partial derivative of $I$ with respect to $i$. Usually, the difference between the two frames serves as the temporal gradient, and the spatial gradients are digitally computed on the previous image. Small displacements can be measured with this method.
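A possible implementation sketch of this least-squares resolution (our code; it solves the 2×2 normal equations of (2.15) for a single region given as a boolean mask):

```python
import numpy as np

def linear_regression_flow(I_prev, I_curr, region):
    """Estimate one translational (u, v) for `region` by least squares."""
    Iy, Ix = np.gradient(I_prev.astype(np.float64))     # spatial gradients on t-1
    It = I_curr.astype(np.float64) - I_prev.astype(np.float64)  # temporal gradient
    ix, iy, it = Ix[region], Iy[region], It[region]
    # Normal equations of the two-parameter least-squares problem.
    A = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = -np.array([np.sum(ix * it), np.sum(iy * it)])
    return np.linalg.solve(A, b)                        # (u, v)
```

The 2×2 system becomes singular in flat areas, where the aperture problem leaves $(u, v)$ undetermined.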

2.4.2 Iterative Motion Estimation

A gradient-based motion estimation was proposed in [49] so as to determine the optical flow. It was one of the very first methods established to solve equation (2.14). In order to correctly constrain the problem, Horn and Schunck have added an a priori smoothness condition on the resulting optical flow: the value of the gradient module has to be as small as possible. The problem then turns into the minimization of a cost function expressed as:

$$\iint \left( (I_x u + I_y v + I_t)^2 + \lambda\,(u_x^2 + u_y^2 + v_x^2 + v_y^2) \right) dx\,dy, \qquad (2.16)$$

where $u_x, u_y, v_x$ and $v_y$ are the first partial derivatives of the two optical flow components $(u, v)$, and $\lambda$ is a (Lagrange) constant that balances the importance of the error in the motion equation against the penalty of departure from smoothness. A solution provided to this minimization problem is:

$$\hat{u} = u_m - I_x \frac{P}{D}, \qquad \hat{v} = v_m - I_y \frac{P}{D}, \qquad (2.17)$$

where $u_m$ and $v_m$ are local averages of $u$ and $v$, and

$$P = I_x u_m + I_y v_m + I_t, \qquad D = \lambda + I_x^2 + I_y^2. \qquad (2.18)$$

The final determination of the optical flow can then be based on an iterative Gauss-Seidel method, refining $(\hat{u}_i, \hat{v}_i)$ using $(\hat{u}_{i-1}, \hat{v}_{i-1})$ ($i$ being the iteration number) until a certain convergence criterion is reached.
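A compact sketch of this scheme (our implementation, using scipy.ndimage for the local averaging; `lam` stands for the constant $\lambda$ of (2.16), and a fixed iteration count replaces the convergence test):

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, lam=100.0, n_iter=100):
    Iy, Ix = np.gradient(I1.astype(np.float64))
    It = I2.astype(np.float64) - I1.astype(np.float64)
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    # 4-neighbour averaging kernel producing the local means um, vm.
    avg = np.array([[0., .25, 0.], [.25, 0., .25], [0., .25, 0.]])
    for _ in range(n_iter):
        um, vm = convolve(u, avg), convolve(v, avg)
        P = Ix * um + Iy * vm + It      # numerator of (2.18)
        D = lam + Ix ** 2 + Iy ** 2     # denominator of (2.18)
        u = um - Ix * P / D             # update (2.17)
        v = vm - Iy * P / D
    return u, v
```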

As expected, the resulting motion field contains only smooth spatial variations, which is not always correct with regard to divergent object motions, as illustrated on figure 2.9.

Moreover, as such techniques provide a dense motion field (one motion vector for every pixel), the coding cost is very high. This is the major reason why they are practically never used for coding but for analysis instead. Another way of using this type of motion estimation which avoids the extensive transmission cost consists in performing the estimation at the decoder side, between the decoded frames at time $t-2$ and time $t-1$. The computed motion field can then be applied to image $t-1$ in order to predict frame $t$ (figure 2.10). It avoids transmitting the dense motion field but forces a lot of computation to be achieved at the decoder side. Furthermore, many problems of occlusion and uncovered background can occur.

Figure 2.9: Limit of iterative motion determination: (a) Sphere at time $t$ (b) Applied motion field on $I(t-1)$ (c) Detected optical flow

Figure 2.10: Estimation performed at the decoder



2.4.3 Pel-Recursive Algorithms

Pel-recursive [91] motion estimation algorithms estimate 2D motion recursively on a pixel basis: given an initial estimate $d_i = (u_i, v_i)$ for every point, a correction is carried out according to the resulting DFD:

$$d_{i+1} = d_i + \Delta d_i, \qquad (2.19)$$

with $\Delta d_i = (\Delta u_i, \Delta v_i)$ the update term of iteration $i$. The iteration can be executed along a scan line, from line to line, or from frame to frame; the technique is then respectively denoted pel-recursive estimation with horizontal, vertical or temporal recursion. The basic assumption of this technique is that the DFD converges locally to zero when the estimated motion converges to the actual movement of the object point. The aim is thus to recursively minimize the squared value of the DFD using a steepest-descent (gradient) method:

$$d_{i+1} = d_i - \varepsilon\, DFD(x, y, u, v)\, \nabla_{d_i}\!\left( DFD(x, y, u, v) \right), \qquad (2.20)$$

di+1 = di , "DF D(x; y; u; v)rdi(DFD(x; y; u; v)) (2.20)<br />

where rdi is the gradient operator with respect to di <strong>and</strong> " a positive<br />

constant. Noting that<br />

rdi(DFD(x; y; u; v)) = rI(x , u; y , v; t , ) (2.21)<br />

where rI is the spatial image gradient, we obtain:<br />

di+1 = di , "DF D(x; y; u; v)rI(x , u; y , v; t , ): (2.22)<br />

" is a regulating parameter that achieves quick but sometimes oscillating<br />

convergence if it is high, or s<strong>low</strong> but accurate estimate if it is small. More<br />

advanced techniques use a variable " to improve both the convergence<br />

speed <strong>and</strong> the solution accuracy.<br />
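A self-contained sketch of one update step (2.22) at a single pixel (our code; integer-rounded sampling keeps it short where a real coder would interpolate, and border pixels are assumed to be avoided by the caller):

```python
import numpy as np

def pel_recursive_update(I_prev, I_curr, x, y, d, eps=1e-3):
    """One steepest-descent correction of the estimate d = (u, v)."""
    u, v = d
    xs, ys = int(round(x - u)), int(round(y - v))       # displaced position
    dfd = float(I_curr[y, x]) - float(I_prev[ys, xs])   # backward DFD
    # Central-difference spatial gradient of the previous image.
    gx = (float(I_prev[ys, xs + 1]) - float(I_prev[ys, xs - 1])) / 2.0
    gy = (float(I_prev[ys + 1, xs]) - float(I_prev[ys - 1, xs])) / 2.0
    return np.array([u - eps * dfd * gx, v - eps * dfd * gy])
```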

The evaluation of pel-recursive methods is very similar to that of iterative methods. In fact, the pel-recursive methodology has been applied to interframe coding using the scheme of figure 2.10, which implies a lot of computation. In addition to the problem of properly computing the gradient, pel-recursive methods also estimate a motion field that is generally too smooth (like iterative methods). This is why an original approach to the problem of discontinuities has been proposed by Gaidon [37]. Although it implies very heavy computation, it is worth presenting because of the novelty it introduces.


Figure 2.11: The dual lattice for field segmentation

2.4.4 Stochastic Estimation Relying on Markov Random Fields

In order to manage discontinuities, motion fields can be modeled as Markov Random Fields [40] (MRF, summary in appendix C) within a Bayesian framework. Therefore, a lattice dual to the one designed by the image sites is set up. This dual lattice has its sites located between the pixels, and represents the possible discontinuities (edges) of the field (Figure 2.11).

The sampling grid of this dual lattice has a quincunx structure. Its neighborhood can be defined according to equation (C.8), with $k_n = 1/2, 1, 2, 5/2, 4, \ldots$ for $n = 1, 2, 3, 4, 5, \ldots$ (the distance computed from the line center). Figure 2.12 shows the configurations of the dual lattice neighborhood for $n = 1$ and $2$.

Based on this lattice, Geman and Geman [99] introduced a line or discontinuity process, which allows neighboring sites to have different interpretations. The only cost lies in the introduction of the line process, which is modeled as an MRF. The elements of this MRF are either active ("on") or inactive ("off"): an active line means that a discontinuity occurs in the motion field at the line location.

Without such a line process, the energy function to be minimized is usually of the form:

$$E(u, v) = (1 - \lambda)\,E_d(u, v) + \lambda\,E_p(u, v) = (1 - \lambda) \sum_{x,y} DFD^2(x, y, u, v) + \lambda \sum_{x,y} \frac{1}{4}\left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right), \qquad (2.23)$$

which includes both an energy term ensuring conformity to the data and a stabilizing (regulating) term that smooth-constrains the solution; $\lambda$ is a (Lagrange) parameter.

Instead, in order to include the discontinuities, the function is made dependent on both the displacement $(u, v)$ and the line process $l$ (particularized here to $b, b', c, c'$, cf. figure 2.13):


Figure 2.12: Dual lattice neighborhood: (a) first-order neighborhood (b) second-order neighborhood

Figure 2.13: Particular configuration of the line process


Figure 2.14: Comparison of motion field estimation: (a) With smoothness constraint (b) With Markov Random Field model

$$\begin{aligned} E(u, v \mid l) ={} & (1 - \lambda) \sum_{x,y} DFD^2(x, y, u, v) \\ &+ \frac{\lambda}{2} \Big( \sum_{i,j} u_x^2(i,j)\,(1 - b_{i,j}) + \sum_{i,j} v_x^2(i,j)\,(1 - b'_{i,j}) + \sum_{i,j} u_y^2(i,j)\,(1 - c_{i,j}) + \sum_{i,j} v_y^2(i,j)\,(1 - c'_{i,j}) \Big) \\ &+ \mu \sum_{i,j} \left( b_{i,j} + b'_{i,j} + c_{i,j} + c'_{i,j} \right), \qquad (2.24) \end{aligned}$$

with $\mu$ the cost of introducing one discontinuity in the motion field: either no discontinuity is introduced ($b, b', c$ or $c' = 0$) and the algorithm performs as before, or a discontinuity is used ($b, b', c$ or $c' = 1$), which eliminates the smoothness constraint but incurs the additional cost $\mu$. Although the Markov model induces more computation, the result is much closer to reality, as figure 2.14 demonstrates.

2.4.5 Parametric Models of the Motion Field

All the techniques described up to now may be classified as non-parametric methods. In this section, parametric methods that explicitly describe the motion of individual pixels within a region with a small number of parameters are introduced. The problem of motion estimation is then equivalent to a problem of parameter estimation. Since all the pixels within a region can contribute to this estimation, highly reliable results may be obtained. The parametric models that are mostly used implicitly assume that objects are rigid planar surfaces undergoing 3D motion [123, 125, 124].

From Two to Twelve and More Parameters. Parametric models are often characterized by their number of parameters. Starting from the simplest model, the translation hypothesis, the position $(x', y')$ of a pixel with respect to the position $(x, y)$ of the pixel in the reference image is:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.25)$$

Using the classification of Jozawa [53], this model may be built up step by step. The integration of a unique horizontal and vertical scaling factor $C$ gives a three-parameter model:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = C \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.26)$$

One additional parameter can be included, either to separately specify the scaling factors with

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} C_x & 0 \\ 0 & C_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, \qquad (2.27)$$

or to take into account a rotation with

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = C \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.28)$$

Combining both improvements, one obtains a five-parameter model:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} C_x & 0 \\ 0 & C_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.29)$$

Finally, a distinction between the $x$- and $y$-axis rotations leads to the six-parameter affine transform:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} C_x \cos\theta_x & -C_y \sin\theta_y \\ C_x \sin\theta_x & C_y \cos\theta_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} a_5 \\ a_6 \end{pmatrix}. \qquad (2.30)$$

The affine transform results from the orthographic projection of the motion of a planar surface. Under perspective projection, an eight-parameter perspective transform is built:

$$x' = \frac{a_1 + a_2 x + a_3 y}{1 + a_7 x + a_8 y}, \qquad y' = \frac{a_4 + a_5 x + a_6 y}{1 + a_7 x + a_8 y}. \qquad (2.31)$$

Another commonly used transform is the bilinear transform:

$$x' = a_1 x + a_2 y + a_3 xy + a_4, \qquad y' = a_5 x + a_6 y + a_7 xy + a_8. \qquad (2.32)$$

Higher-level models also take into account acceleration effects. Sanson [117] for instance proposes a twelve-parameter model:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a^x_x & a^x_y \\ a^y_x & a^y_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} b^x_{x^2} & b^x_{xy} & b^x_{y^2} \\ b^y_{x^2} & b^y_{xy} & b^y_{y^2} \end{pmatrix} \begin{pmatrix} x^2 \\ xy \\ y^2 \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.33)$$

Because of the presence of numerous moving and possibly overlapping objects in the scene, the above parametric models do not hold, in general, throughout the whole image plane. A solution to this problem is provided by assuming that every object is characterized by its own motion parameters. This leads to the "chicken-and-egg" combined segmentation & estimation problem. A way to overcome it is to use warping techniques (cf. Section 2.6) that have successfully implemented a matching methodology.
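As a small illustration of the parametric viewpoint, the sketch below (our code, not from the thesis) applies the six-parameter form of (2.30) to a grid of pixel coordinates; the simpler models are obtained by constraining the parameters:

```python
import numpy as np

def affine_warp_coords(xs, ys, params):
    """Map (x, y) to (x', y') with the (a1..a6) form of (2.30)."""
    a1, a2, a3, a4, a5, a6 = params
    return a1 * xs + a2 * ys + a5, a3 * xs + a4 * ys + a6

xs, ys = np.meshgrid(np.arange(4), np.arange(4))
# Identity matrix plus a translation: this degenerates to model (2.25).
xp, yp = affine_warp_coords(xs, ys, (1.0, 0.0, 0.0, 1.0, 2.0, -1.0))
```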

2.4.6 Within a Transform Domain

Estimation methods based on spatio-temporal filters over several pictures have recently been implemented. They are based on the property of the Fourier transform to concentrate the energy within a few coefficients of the transformed frequency domain. The motion estimation can then be achieved by analyzing the importance and the variation of the temporal frequency $\omega_t$. However, the main limitation of the use of such techniques for coding purposes is that they require more than two successive pictures so as to provide a reliable analysis.

2.5 The Block-Matching Algorithm (BMA)

Since its introduction by Jain and Jain in 1981, the Block Matching Algorithm (BMA, [50]) has emerged as the motion estimation technique achieving the best compromise between complexity and quality: a fast estimation procedure allows obtaining a block-based motion field that is transmitted at low cost. An appropriate choice of the block size offers a compromise between adaptation to small moving objects (performed by small blocks) and robustness against noise (performed by large blocks). These properties have earned the BMA its inclusion in most video standards like H.263 [96], MPEG-1/2 [62] and MPEG-4 [104].

2.5.1 BMA Principle

The principle of the BMA is to apply a translational motion model to subblocks of the image. For every block, the matching measure is based on the Displaced Frame Difference (DFD). Fuh and Maragos [36] accordingly consider the BMA, with its two sole free parameters (the translation vector), as a particular case of more elaborate models.

Its implementation in most standards follows the backward procedure (cf. Section 2.3.2.2):

1. the image $I(t)$ is divided into a set of subblocks of size $(K, L)$;

2. for every subblock, the origin of the block in the reference image $I(t-1)$ is searched for within a search area $\|u\| \le \Delta u$, $\|v\| \le \Delta v$ (cf. figure 2.15), according to one of the criteria presented in Section 2.3.1.1;

3. compensation is achieved by reconstructing $\hat{I}(t)$ with the blocks of $I(t-1)$ designated by the motion vectors.
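A compact full-search sketch of this backward procedure under the MAE criterion (our own code, not a standard's; frame dimensions are assumed to be multiples of the block size):

```python
import numpy as np

def block_matching(I_prev, I_curr, block=16, search=8):
    """Return one integer (du, dv) per block of I_curr, matched in I_prev."""
    H, W = I_curr.shape
    field = np.zeros((H // block, W // block, 2), dtype=np.int32)
    for by in range(0, H, block):
        for bx in range(0, W, block):
            cur = I_curr[by:by + block, bx:bx + block].astype(np.int32)
            best, best_d = None, (0, 0)
            for dv in range(-search, search + 1):
                for du in range(-search, search + 1):
                    y, x = by + dv, bx + du
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue            # candidate block leaves the frame
                    ref = I_prev[y:y + block, x:x + block].astype(np.int32)
                    err = np.abs(ref - cur).mean()       # MAE criterion
                    if best is None or err < best:
                        best, best_d = err, (du, dv)
            field[by // block, bx // block] = best_d
    return field
```

Compensation then simply copies, for every block, the block of $I(t-1)$ designated by its vector.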


Figure 2.15: The Block Matching Algorithm search space: a $(K \times L)$ block of the current frame is searched for in a $(K + 2\Delta u) \times (L + 2\Delta v)$ window of the previous frame
Figure 2.15: The Block Matching Algorithm search space<br />

2.5.1.1 Search Techniques<br />

The location of the best match within the search area thanks to an<br />

error criterion (e.g. the MAE) requires intensive computation when<br />

it is per<strong>for</strong>med using a full search (FS) procedure. There<strong>for</strong>e, several<br />

complexity-reduced algorithms have been proposed <strong>for</strong> a faster search<br />

<strong>for</strong> the minimum DFD. These algorithms (listed be<strong>low</strong>) are of particular<br />

interest <strong>for</strong> software implementations. These are (references are given<br />

in [108]):<br />

the three-step algorithm (TSA);

the 2D logarithmic search (2D LM);

the modified motion estimation algorithm (MMEA);

the conjugate direction search (CDS);

...

Table 2.1 compares the computational complexity required by all these techniques, for a maximum displacement $\Delta u = \Delta v = M$ (with pel accuracy, see the next section).


Algorithm   Maximum number of search points   M = 4   M = 8   M = 16
FS          $(2M+1)^2$                        81      289     1089
TSA         $1 + 8\log_2 M$                   17      25      33
2D LM       $2 + 7\log_2 M$                   16      23      30
MMEA        $1 + 6\log_2 M$                   13      19      25
CDS         $3 + 2M$                          11      19      35

Table 2.1: Fast search algorithms for the BMA: number of positions to test

2.5.1.2 Advanced Possibilities

Computation of a dense motion field is possible with the BMA. In this case, the minimization of the error function is computed for each image pixel. In order to reduce the probability of false matches caused by noise, the matching criterion (e.g. the MSE) relies on a square window around the pixel.

Subpel accuracy is possible when the BMA is computed backwards. It means that blocks of image $t$ are searched for in image $t-1$ with a step of half a pel (which of course requires interpolation functions) or with a lower fraction of a pel. Such half-pel accuracy is often used so as to compensate for the interpolation effects engendered by the small displacements of the camera.

2.5.1.3 Result

Figure 2.16 presents some results of the Block Matching Algorithm (full search, half-pel precision). The smaller $K$ and $L$ are, the better the blocks can match the objects, but more computations are needed and the algorithm becomes more sensitive to noise.

Despite its simplicity and wide use, the BMA presents one major limitation because of its assumption that block motion is purely translational. This assumption does not hold in many instances, as different regions undergoing different movements may exist within a same block. Moreover, when adjacent blocks use different vectors, they form straight-line discontinuities in the prediction image, known as blocking artifacts, to which the Human Visual System (HVS) is very sensitive. The bigger the blocks are, the more visible the blocking artifacts. Two classical solutions exist to solve these problems: overlapped BMA, presented in Section 2.5.2, and multigrid or multiscale BMA, an example of which is given in Section 3.1.

Figure 2.16: BMA results: (top) original images, (center) motion field and result with 8×8 blocks, (bottom) ibid. with 16×16 blocks

In order to synthesize the evaluation of the BMA technique, two last points have to be considered: motion field coding and the computational burden of the compensation. The former is subject to various discussions, but one could state that, globally, it is possible to efficiently encode the resulting motion field. The latter is easier: the compensation, which merely consists in applying the selected motion vector to every subblock, is very fast. However, annoying blocking artifacts appear.

2.5.2 Overlapped BMA

In order to reduce such artifacts, overlapped block motion compensation has been proposed [116, 4] and even included as an option in the ITU H.263 standard. The underlying idea is to reconstruct the image using overlapping neighbor blocks. The problem of determining optimal windows with adequate weights may be formulated as an optimal linear estimation problem of pixel intensities. The final value of a pixel in the reconstructed image is thus a weighted sum of the intensities coming from the different neighbor vectors. Figure 2.17 illustrates a particular neighborhood configuration that results in the following weighted equation:

$$\begin{aligned} I(x, y, t) ={} & w_A\, I(x + \vec{A}_x,\, y + \vec{A}_y,\, t-1) + w_B\, I(x + \vec{B}_x,\, y + \vec{B}_y,\, t-1) \\ &+ w_C\, I(x + \vec{C}_x,\, y + \vec{C}_y,\, t-1) + w_D\, I(x + \vec{D}_x,\, y + \vec{D}_y,\, t-1), \end{aligned} \qquad (2.34)$$

where the $w_i$ are adequate weights and $\vec{V}$ denotes the motion vectors.
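A minimal sketch of the reconstruction (2.34) for one pixel (our code), assuming the four neighboring vectors and normalized weights are given:

```python
import numpy as np

def obmc_pixel(I_prev, x, y, vectors, weights):
    """Weighted prediction of pixel (x, y); vectors = [(Ax, Ay), ..., (Dx, Dy)],
    weights = [wA, wB, wC, wD] with sum(weights) == 1."""
    return sum(w * float(I_prev[y + vy, x + vx])
               for (vx, vy), w in zip(vectors, weights))
```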

2.6 Image Warping Techniques

Image warping techniques are correlation-based parametric methods. They were initially developed in order to cope with more displacement configurations than the pure translation of block matching (cf. Section 2.5.1). They consist of three main steps.


Figure 2.17: Overlapped block motion compensation

At first, the input image is split into small patches. When the patch structure (or mesh, or wireframe) is predetermined, the approach is called fixed mesh motion compensation. Otherwise, if the mesh is adaptively built, e.g. according to the image content, it is called adaptive mesh motion compensation. In this case, the patch structure is adaptively deformed to fit the contours of the moving areas or objects. No additional information about the patch structure is needed if the patch adaptation is applied to the decoded image of the previous frame. For such a choice, forward estimation is of course advantageous. It is notable that warping techniques may indifferently be applied forwards or backwards.

Subsequently, the motion vectors of the grid points, or vertices, are estimated. Figure 2.18(a) demonstrates the estimation for a quadrangular mesh [93].

The last step implements the motion compensation (figure 2.18(b)): the displacements of the vertices are sent as motion vectors, and the vectors for the remaining pixels are obtained with a geometric transformation technique (image warping) and some interpolation (e.g. using bilinear or bicubic [82, 54] interpolation). The parameters of the transformation can of course be determined from the vertices' motion vectors.

While the bilinear (equation (2.32)) and perspective (equation (2.31)) transforms require quadrilateral patches, triangular patches correspond to the affine transform (equation (2.30)): the two motion components of each of the three vertices of the triangle determine the six parameters of the affine transform (Figure 2.19).


Figure 2.18: Principle of warping motion estimation and compensation: (a) Estimation of the vertices' motion (b) Warping motion compensation


Figure 2.19: Affine transform between two triangular patches

The two invariants $p$ and $q$ of the transformation help warping any point of a triangle into its affine-deformed version:

$$X = A + p\,\vec{AB} + q\,\vec{AC}, \qquad X' = A' + p\,\vec{A'B'} + q\,\vec{A'C'}. \qquad (2.35)$$
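A small sketch (our code) of this invariant-based warping: the point is expressed in the $(\vec{AB}, \vec{AC})$ frame of the source triangle and re-expressed in the deformed one:

```python
import numpy as np

def warp_point(X, A, B, C, A2, B2, C2):
    """Map X from triangle (A, B, C) to (A2, B2, C2) using (2.35)."""
    M = np.column_stack([B - A, C - A])   # columns are the AB and AC vectors
    p, q = np.linalg.solve(M, X - A)      # the two invariants of (2.35)
    return A2 + p * (B2 - A2) + q * (C2 - A2)

A, B, C = np.array([0., 0.]), np.array([1., 0.]), np.array([0., 1.])
A2, B2, C2 = np.array([0., 0.]), np.array([2., 0.]), np.array([0., 2.])
print(warp_point(np.array([0.25, 0.25]), A, B, C, A2, B2, C2))  # [0.5 0.5]
```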

2.6.1 The Hexagonal Matching Algorithm (HMA)

Using such triangles, Nakaya and Harashima have proposed to estimate motion by means of a Hexagonal Matching Algorithm (HMA, [88]). Once a regular mesh has been overlaid on the picture, the authors propose a two-step algorithm to determine the vector of every mesh vertex:

1. the displacement of the vertices is first estimated with a coarse BMA;

2. then an iterative local minimization of the prediction error refines this initial displacement. The Hexagonal Matching Algorithm (HMA) treats every vertex $x$ sequentially: it fixes its neighbor vertices ($a, b, c, d, e, f$ in figure 2.20) and searches for the new position $x'$ that minimizes the reconstruction error of all the attached patches. The procedure iterates over all vertices whose position has been modified in the previous iteration. It ends when no vertex moves anymore or after a fixed number of iterations. Local convergence is ensured since the solution is always locally improved.

Figure 2.20: Refinement procedure of the Hexagonal Matching Algorithm

An important feature of such mesh-based compensation schemes is that the connectivity of the mesh must be preserved: inconsistent motion vectors like the one of figure 2.21, which result in overlapping mesh elements, have to be avoided.

Figure 2.21: Illustration of an inconsistent motion vector for the HMA

The result of such a procedure is illustrated on figure 2.22 for the original images of figure 2.16 (top). As expected, no blocking artifacts are present in the compensated picture, since rotation and zoom effects are taken into account by the affine transform. However, two drawbacks have to be pointed out:

the estimation requires a lot of computation, as it starts with a classic BMA followed by an iterative algorithm (that can be parallelized);

the fixed grid takes no account of the image content.

2.6.2 Adaptive Hexagonal Matching Algorithm (AHMA)

In order to overcome the second drawback, Dudon extended this work to active meshes that are automatically adapted to the spatial contents of the image [29, 30]. In parallel, Dudon also established other ways of estimating the motion of the vertices (mesh nodes) [28]. Vertices are placed on characteristic points that are detected according to the spatial activity. During the encoding, memory and temporal gradient can be used to concentrate vertices in crucial areas of the image. The mesh is then generated thanks to a Delaunay triangulation [17, 119]. The result of this adaptation is depicted on figure 2.23 with reference to figure 2.22, and the final result is still better, or equivalent for a lower number of vertices (99 in figure 2.23 versus 120 in figure 2.22).

If one wishes to evaluate both the HMA and the AHMA, it stands out that the compensation is not that costly and achieves very pleasant visual results, as no blocking artifacts are present on the compensated image. Yet, the estimation is very computation-demanding, and the motion field is often spurious and less reliable than the one of the BMA. Moreover, in the case of adaptive meshes, the neighborhood relation between the vectors is not obvious and the entropy coding of the motion field is less efficient.

2.7 Conclusion

The present chapter has introduced the problem of motion estimation and compensation in the framework of (VLBR) video coding and has tried to clearly distinguish between estimation and compensation. Once the ambiguities of the estimation problem have been expanded on, the advantages of using such techniques in video codecs have been demonstrated by the Rate-Distortion Theory.


Figure 2.22: Example of hexagonal matching compensation: (a) Fixed mesh on reference image (b) Compensated image

Figure 2.23: Adaptive mesh hexagonal matching: (a) Adaptive mesh (b) Compensated image

The present chapter has tackled the classical additional constraints and methodologies used to estimate motion in video sequences. A direct confrontation of the different approaches is quite complex, as a method may be more pertinent than another in a particular context.


In the context of the present thesis, which deals with (VLBR) video coding, two particular techniques have been identified as the best performing ones. The Block Matching Algorithm, or derived techniques, provides very fast estimation and compensation stages, but its final visual results suffer from so-called blocking artifacts. This BMA is nonetheless part of most video coding standards. The Hexagonal Matching Algorithm, and derived techniques, offers a better visual quality after compensation, but at the cost of more computational effort, as it uses triangular meshes so as to warp images according to the motion information.

The following chapters will present some improvements we have tried to bring into the field of motion estimation and compensation, having always in mind the (very) low bitrate video coding context.


Chapter 3

Study of the Implementation of a Multiscale Block Matching Algorithm

In order to minimize or eliminate the most important drawbacks of the ordinary BMA (cf. Section 2.5.1), Paula Queluz and Benoît Macq have proposed a new motion estimation algorithm, the Adaptive Block Matching Algorithm (ABMA, [110]). Our first doctoral task has been the C++ programming and fine-tuning of the ABMA in order to perform simulations in the COMIS scope (cf. Section 1.4.2). The development environment was made of a C++ class for picture manipulation. We have therefore extended it into a class for motion field manipulation. In this definition process, special attention has been paid to accessing (sub-)blocks of a motion field so as to be appropriate for BMA and ABMA computation.

The ABMA proposes a solution to the lack of adaptiveness of the classical BMA to object contours. Moreover, it tries to solve the following contradiction: in a normal BMA, small blocks are required to respect the uniform motion assumption for each object of the scene, while large blocks are necessary to avoid the noise influence. The functioning principle of the ABMA is presented in Section 3.1.

The chapter also introduces a way of distributing the computational load engendered by the ABMA among several processors. This work has been achieved in the Master thesis of François Vermaut [131], which we have supervised. Section 3.2 presents his original work.

3.1 Adaptive Block Matching Algorithm

The overall ABMA algorithm, presented in the following subsections, can be summarized as follows: at first, global camera motion, e.g. panning and zooming, is estimated. The reference image is then compensated according to the estimated global parameters. A change detector compares this resulting prediction with the original image and outputs a binary mask that allows the distinction between globally and locally moving regions. The third and last step consists in an improved version of the BMA: a split-and-merge procedure considers a hierarchical structure of block sizes (or scales, as the local part of the ABMA is an example of the multiscale methodology, cf. Section 2.3.2.1) so as to overcome the problem of having different moving objects in the same BMA block.

A few intermediate results of the ABMA are presented on figure 3.1. They should be compared with the BMA results of figure 2.16, as they both use the same original images.

3.1.1 Global Motion Estimation

The apparent motion in most image sequences is the result of the camera movement as well as the movement of the objects in a scene. The part chargeable to the camera is generally referred to as global motion, in opposition to the local motion of the objects. The camera movement can be expressed as a combination of the following categories:

- fixed;
- zooming, i.e. change of the camera focal length;
- panning, i.e. rotation around an axis normal to the camera axis;
- rotation, i.e. rotation around the camera axis;
- dollying, i.e. translation along the camera axis;
- tracking and booming, i.e. translation in the plane normal to the camera axis, horizontally (tracking) or vertically (booming).


Even if only zooming and panning are considered to model global motion, Tse and Baker [126] have shown that a two-stage global/local ME approach improves motion prediction and reduces the amount of motion side information. The ABMA uses the following algorithm (sketched in code right after this list) to carry out a suboptimal search for the best pan and zoom:

1. Division of the picture into large blocks (typically, for QCIF images, there is only one block: the picture itself).

2. Only the blocks with a variance greater than or equal to Tvar are considered in the following steps 4 to 6.

3. Selection of a set of zooms 1 + fz, with fz = n·Δfz and n = 0, ±1, ..., ±nmax. Selection of a BMA search window (Δu, Δv).

4. For every valid block, computation of the BMA with every possible zoom. Memorization, for every block, of the best zoom/motion-vector combination (i.e. the one that achieves the lowest MAE).

5. Selection of the zoom with the highest occurrence among the best combinations: fz,OPT.

6. Δfz = Δfz/2, and repetition of step 4 with fz = fz,OPT ± Δfz.

7. After maxstep iterations, computation of the BMA for all the blocks, with the selected zoom.

8. Computation of the globally compensated picture.

This globally compensated image will now serve as reference in the next steps of the algorithm.
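As a concrete reading of this search, here is a hedged C++ sketch of the vote-and-refine loop. bmaBestMAE() is a hypothetical stand-in for the per-block matching (the thesis provides no code), and we use a symmetric candidate range around the current optimum at every step, which slightly simplifies steps 3 to 6:

    #include <functional>
    #include <map>
    #include <vector>

    // bmaBestMAE(block, zoom) -> lowest MAE of the BMA for that block when
    // the reference is rescaled by (1 + zoom); stands in for the real matcher.
    using BmaFn = std::function<double(int block, double zoom)>;

    double bestGlobalZoom(const std::vector<int>& validBlocks, BmaFn bmaBestMAE,
                          double dfz, int nmax, int maxstep) {
        double fzOpt = 0.0;
        for (int step = 0; step < maxstep; ++step) {
            std::map<double, int> votes;              // zoom -> #blocks preferring it
            for (int b : validBlocks) {
                double best = 1e30, bestZoom = fzOpt;
                for (int n = -nmax; n <= nmax; ++n) { // candidate zooms around fzOpt
                    double zoom = fzOpt + n * dfz;
                    double mae = bmaBestMAE(b, zoom);
                    if (mae < best) { best = mae; bestZoom = zoom; }
                }
                ++votes[bestZoom];
            }
            int most = -1;
            for (const auto& kv : votes)              // zoom with highest occurrence
                if (kv.second > most) { most = kv.second; fzOpt = kv.first; }
            dfz /= 2.0;                               // refine the zoom step
        }
        return fzOpt;                                 // 1 + fzOpt is the zoom factor
    }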

3.1.2 Change Detection

The aim of the change detector is to give the position of local (or foreground) moving areas. Its result is a picture locating changes between the two successive pictures. The value of its pixels can be changed, unchanged or uncertain. The final mask is generated in four steps (a code sketch of the first two steps follows the list):

1. If the difference between the pixel value in the original image and the pixel value in the globally-compensated reference image is lower than Tu, the pixel is considered unchanged. If this difference is greater than Tc, the pixel is considered changed. Otherwise, the pixel is considered uncertain.

2. Every uncertain pixel having at least six unchanged neighbors becomes unchanged. The other uncertain pixels become changed.

3. A median filtering of size mf × mf is applied.

4. A pixel with six changed neighbors becomes changed. A pixel with six unchanged neighbors becomes unchanged.

Such a resulting mask is depicted on figure 3.1 (a).
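As announced above, the following minimal C++ sketch (our own types and names) implements the first two steps of this classification; the median filtering and the final votes of steps 3 and 4 would reuse the same neighborhood access pattern:

    #include <cstdlib>
    #include <vector>

    enum class Label { Unchanged, Uncertain, Changed };

    // Three-way thresholding (assumed Tu < Tc), then relaxation of the
    // uncertain pixels according to their 8-neighborhood.
    std::vector<Label> changeMask(const std::vector<int>& cur,
                                  const std::vector<int>& ref,
                                  int w, int h, int Tu, int Tc) {
        std::vector<Label> m(w * h);
        for (int i = 0; i < w * h; ++i) {
            int d = std::abs(cur[i] - ref[i]);
            m[i] = d < Tu ? Label::Unchanged
                 : d > Tc ? Label::Changed : Label::Uncertain;
        }
        std::vector<Label> out = m;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                if (m[y * w + x] != Label::Uncertain) continue;
                int unchanged = 0;                      // count unchanged neighbors
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        int nx = x + dx, ny = y + dy;
                        if ((dx || dy) && nx >= 0 && nx < w && ny >= 0 && ny < h &&
                            m[ny * w + nx] == Label::Unchanged)
                            ++unchanged;
                    }
                out[y * w + x] = unchanged >= 6 ? Label::Unchanged : Label::Changed;
            }
        return out;
    }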

Figure 3.1: Some steps of the Adaptive BMA: (a) Change mask (b) Local motion field (c) Compensated image


3.1.3 Local Motion Estimation

This part of the algorithm aims at estimating the foreground (or "local") motion between two successive images. Although the overall procedure is block-based, two substantial improvements over the classical block-matching algorithm (BMA, cf. Section 2.5.1) have been implemented. At first, on the basis of a coarse-to-fine quad-tree split procedure, a hierarchical (multiscale, cf. Section 2.3.2.1) structure of block sizes is considered. Second, at each level of the tree, a merge procedure is applied to correctly propagate the motion vectors from blocks with reliable motion to blocks with uncertain motion.

Thanks to its hierarchical structure, the ABMA performs a segmentation of the motion field much closer to the object boundaries. Both improvements help obtaining a compensated image with fewer blocking artifacts and a more consistent motion field. Figures 3.1 (b) and (c) illustrate these points, with regards to figure 2.16.

The various steps successively involved in the local estimation of the ABMA are described hereunder. The algorithm starts with a pre-defined initial size for the blocks and a pre-defined maximum number of iterations (or minimal block size). It is only processed for the blocks possessing at least c% of changed pixels in the change detection mask (described in Section 3.1.2). The steps are:

- Local Motion Determination computes a BMA for all the blocks of size (mi, ni) to be processed (i is the iteration number: i = 1, 2, ..., imax). Using a search window of size (s, s), it results, for every block labeled k, in an optimal displacement di(k), corresponding to a minimal mean absolute error MAEi,min(k).

- Estimated Motion Certitude (MC) is computed for every block simultaneously with the BMA determination (a small code sketch of this measure is given after this list). It is defined for a block k at iteration i as follows:

\[
MC_i(k) = \frac{\sum_{t=1}^{N_s}\bigl(MAE_i(k)(t) - MAE_{i,\min}(k)\bigr)}
               {N_s\,\bigl(1 + \varepsilon_i\,MAE_{i,\min}(k)\bigr)},
\tag{3.1}
\]

with Ns = s × s the number of tested values in the BMA and εi a tolerance parameter. If the motion certitude is not high enough (MCi(k) below a given threshold), the vector d'i(k) of a neighbor block may be propagated to block k; the propagation is achieved only if d'i(k) results in a MAE' ≤ (1 + εi)·MAEi,min(k). MAE' is the MAE obtained when the motion vector d'i(k) of the neighbor block is applied to the block undergoing treatment.

It is important to notice that, in order to get rid of the influence of the direction in which the image is traveled, all blocks are treated at once. It means that each block uses the vector values its neighbor blocks had before they were treated.

- Non-linear filtering is then applied in order to homogenize the motion field while preserving the contours. The configurations of figure 3.2 are searched for: the block undergoing filtering is the central one, and the displacement vectors of the neighbor blocks are compared to see if the shadowed blocks possess the same one. If one configuration occurs, the displacement vector di(k) is changed to the common value of the shadowed neighbor blocks. The new MAE must verify MAEnew ≤ (1 + εi)·MAEold to be accepted. The two configurations of figure 3.2 (a) have priority over the four configurations of figure 3.2 (b). Here also, all blocks are treated at once. If two configurations of (a) or (b) simultaneously apply, the one achieving the best MAE is selected.

- Further splitting of the blocks is performed if necessary. The criterion is that, if MAEi(k) is still too high, the block is split into four subblocks (unless the minimal block size has been reached).
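The motion certitude of equation (3.1), as we read it after reconstruction, reduces to a few lines of C++ (the tolerance parameter and the variable names are ours):

    #include <vector>

    // maes holds the Ns = s*s MAE values tested by the BMA for one block,
    // maeMin is the smallest of them, eps the tolerance parameter of (3.1).
    double motionCertitude(const std::vector<double>& maes,
                           double maeMin, double eps) {
        double sum = 0.0;
        for (double mae : maes) sum += mae - maeMin;   // spread around the minimum
        return sum / (maes.size() * (1.0 + eps * maeMin));
    }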


Figure 3.2: Possible non-linear filter operations in the ABMA (panels (a) and (b))

Even if the PSNR is not always improved thanks to the ABMA, it is important to notice that a segmentation of the motion field is provided via the multiscale representation. Moreover, the ABMA motion field can in many cases be coded more efficiently.

Sequence             BMA       ABMA
Akiyo                38.2886   38.1356
Container Ship       32.7014   32.8145
Hall monitor         32.3618   34.0889
Mother & daughter    36.2976   35.3678
Coast guard          27.3405   27.6874
Foreman              28.7040   27.8277
News                 31.4820   32.6105
Silent               32.1818   32.4255
Mobile               23.0172   24.1174
Stefan               22.9990   24.3879
Table tennis         30.1653   30.7733

Table 3.1: Comparison of BMA and ABMA performances (PSNR, in dB)

Figure 3.3 shows the comparative result between images 1 and 4 of the Table Tennis sequence.

Figure 3.3: Table Tennis, BMA vs ABMA: (a) Original image at time t-1 (b) Original image at time t (c) 8x8 BMA motion field (d) BMA compensation (e) ABMA motion field (f) ABMA compensation


3.2 Distributed Version of the Local Motion Estimation

The main objective of François Vermaut's dissertation [131, 132] was to speed up the computation time engendered by the ABMA. In this context, the envisioned solution is to distribute the algorithm among several processors. Only the local motion estimation part (cf. Section 3.1.3) is tackled, since this step is the most effort-demanding¹. Vermaut's work thereafter involved the following steps, which are respectively presented in subsections 3.2.1 to 3.2.4: it starts with an abstract specification of the local motion estimation part of the ABMA, which leads to a distributed model. Once one is in possession of such a model, which resembles a state automaton, one has to practically define the type of data structures and the state transitions of the various elements. Finally, an implementation under the Parallel Virtual Machine (PVM) environment demonstrates a linear speedup.

¹ The "refinement" in terms of blocks of the change detection mask has not been studied for distribution either, because this task has a very low computational burden.

3.2.1 Pseudo-Code of the Sequential Loop

In order to clarify the problem to be solved, a preliminary step was to formalize the algorithm to be distributed. The algorithm pseudo-code has thus been established first:

    Algorithm Local Motion Estimation
    begin
        sizeblock := sizemax;
        while (sizeblock >= sizemin) do
        begin
            for each (untreated block) BMA;
            for each (untreated block) MC Treatment;
            for each (untreated block) Filtering;
            if (sizeblock > sizemin) then
                for each (untreated block) Split Or Done;
            sizeblock := sizeblock / 2;
        end
    end

From this pseudo-code, it is possible to better understand how the sequential algorithm operates. First of all, it is obvious that the main "object" manipulated by this algorithm is the block entity, which has a "life" that evolves as the algorithm proceeds.

If one has a closer look at the sequential loop of the Local Motion Estimation, it appears that:

1. One resulting motion vector is known for all the blocks that are not to be treated (including the ones to which the change detection mask has associated a motion vector equal to zero).

2. The BMA computation is first performed on every block. It is a totally independent operation.

3. For the treatment of the motion certitude (MC Treatment), the BMA vector, the MAE and the motion certitude of all blocks to be treated have to be known. In fact, for a particular block, this information has to be known only for the block itself and its four neighbors.

4. So as to achieve the Filtering step, the result of the motion certitude treatment is waited for. Here also, a particular block could be "filtered" once the motion certitude results are known for the block and its eight neighbors.

5. All blocks to be treated are always at the same iteration level of the sequential loop. They all have the same size.

3.2.2 Model of Distribution

From what has been identified in the previous section, precise states of the block "life" can be distinguished. These states, presented on figure 3.4, are separated by transitions (computations) that can be carried out once precise conditions are satisfied. The local motion estimation can then be modeled as a finite state automaton [35].

From the pseudo-code of the sequential algorithm (cf. Section 3.2.1), four intermediate steps emerge out of the block "life": one after the BMA computation, one after the treatment of the motion certitude, one after the non-linear filtering, and a final stage when the block is assigned a definitive vector (Done) or is split into four subblocks (Split). However, not all these steps are real automaton states. The real states that have to be considered are those which prevent a block from being totally independent from the other ones, i.e. the states that are reached through a transition requiring precise conditions.

Figure 3.4: State diagram (states: Unknown, Known, Propagated, Done, Split)

In the initial state, all blocks to be treated are said to be "Unknown": no information is known about their possible motion vector. All the other blocks, which do not have to be treated², are in a state called "Done" for their treatment is over.

² Either because they have been assigned a null motion vector by the change detection mask of Section 3.1.2, or because they have already been successfully treated.

The blocks to be treated can quit their initial state. During the transition, the BMA is computed for the block. Once the BMA computation ends, the block state becomes "Known" because the BMA vector, MAE and motion certitude are determined.

The "Known" state serves as a synchronization point: a particular block in this state needs its four neighbors to be in the same state so as to exploit the motion certitude information. Since this step requires the propagation of information from neighbor blocks, it puts the block in "Propagated" mode once the treatment is actually achieved.

The "Propagated" state is also a synchronization anchor prior to the filtering. However, if one has a close look at the sequential loop (Section 3.2.1), one can notice that no intermediate state is needed after the filtering operation: no additional information is needed for the algorithm to decide whether the block must be split or not, and no synchronization with the neighbor blocks is needed. This step can therefore be carried out consecutively. After the "Propagated" state, the block directly ends its "life": either it reaches the "Done" state, where its final motion vector value is known, or it reaches the "Split" state, where the block does not exist anymore as it is divided into four new blocks.

The subblocks for which the inherited vector satisfies the final conditions directly reach the "Done" state. Otherwise, the subblocks are "Unknown" and will run the loop during a new iteration.
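These states translate naturally into code. The sketch below is our transcription of the life-cycle of figure 3.4, not Vermaut's implementation; it encodes the states and the first-iteration synchronization rule discussed above:

    #include <vector>

    enum class State { Unknown, Known, Propagated, Done, Split };

    struct Block {
        State state = State::Unknown;
        std::vector<int> neighbors4;   // indices of the four side neighbors
    };

    // A block may leave "Known" only when its four side neighbors have at
    // least reached "Known" too (the synchronization point described above);
    // the multi-size rules of Section 3.2.3.2 would refine this test.
    bool canPropagate(const std::vector<Block>& blocks, int b) {
        if (blocks[b].state != State::Known) return false;
        for (int n : blocks[b].neighbors4)
            if (blocks[n].state == State::Unknown) return false;
        return true;
    }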

Such a model gives the possibility to create a sequential version as well as different distributed versions that will still produce the same results as the sequential one. A simple distributed structure has been chosen: only the BMA, which is the most effort-demanding step, is distributed among several processors. This step can therefore be carried out for several blocks in parallel.

A classical master-and-slaves structure is then set up (figure 3.5), where the slaves only perform one action, i.e. BMA computation, while the master is used to distribute the different blocks, to receive the results of the computation of the slaves and to perform motion certitude treatment, filtering and decision. The difference with the sequential version is that the three last steps are not applied to all blocks at once but as soon as a block is ready. It implies that one knows what the precise conditions for a transition to occur are, which is described in the next section.

3.2.3 Practical Implementation

Slaves are allowed to achieve only one action, i.e. the BMA computation. The master provides them with the position and the size of the "Unknown" block they have to treat (cf. figure 3.5). The master then gathers the results, that is to say a vector, the associated MAE and the associated motion certitude.

The master also performs the motion certitude treatment, the filtering and the decision procedure as soon as possible. This idea allows the block to be an independent computation unit in constant relation with neighboring blocks.

Figure 3.5: Master-slaves structure (the master sends a block position and size to a slave; the slave returns the best vector, its MAE and the motion certitude)

3.2.3.1 Data Structures

It is first assumed that the master and all the slaves possess the two images at times t-1 and t. This is the only information slaves have access to. The master manages four data structures: the multigrid, which is the final result of the ABMA; a list with the "Unknown" blocks to be treated; a list with the available slaves; and one last structure of couples (slave_id, block_id) (one couple for every block undergoing BMA estimation).
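In a minimal form, these four structures could look as follows; the types and the dispatch helper are ours, since the thesis only names the structures:

    #include <deque>
    #include <utility>
    #include <vector>

    struct MotionVector { int dx, dy; };

    struct Master {
        std::vector<MotionVector> multigrid;        // final ABMA result, per block
        std::deque<int>           unknownBlocks;    // blocks still to be treated
        std::deque<int>           idleSlaves;       // slaves waiting for work
        std::vector<std::pair<int,int>> inFlight;   // (slave_id, block_id) couples

        // Hand the next "Unknown" block to the next idle slave, if any.
        bool dispatch() {
            if (unknownBlocks.empty() || idleSlaves.empty()) return false;
            inFlight.emplace_back(idleSlaves.front(), unknownBlocks.front());
            idleSlaves.pop_front();
            unknownBlocks.pop_front();
            return true;   // under PVM this would be followed by a message send
        }
    };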

3.2.3.2 State Transitions

To describe the state transitions, i.e. the conditions that enable the related computation to be performed, an exemplary block "life" is presented. As soon as a slave is available, a block in the "Unknown" state is sent to that slave for BMA completion. When the slave returns the results, the block goes into the "Known" state.

The process becomes more complex when the master wants to treat the motion certitude (figure 3.6): not all vectors for all multigrid blocks are needed, but only the vectors of four specific neighbors. The transition can occur once these four blocks are also in the "Known" state. Some of the neighbor blocks can already have their certitude treated and be in the "Propagated" state. In order to obtain the same result as the sequential version (where all blocks are treated at once), it is the vector value before motion certitude treatment that must be used (and stored to this purpose).


Figure 3.6: Known to Propagated transition conditions: all four neighbors must already be Known

Figure 3.7: Known to Propagated transition conditions: larger blocks have to be split or definitively treated (Done)

However, these conditions are only sufficient during the first iteration of the loop. At this stage, all blocks have the same original size. During the next iterations, blocks of various sizes coexist. A block in the "Known" state can have as neighbors larger blocks in the "Done" or "Propagated" states, because the previous iteration for these neighbors has not completed yet. One has to wait for such "Propagated" blocks to fall into the "Done" or "Split" state (figure 3.7). In the latter case, the transition condition is determined by the appropriate subblock.

Similarly to what has just been presented, the non-linear filtering of a block (and the decision algorithm) can start as soon as its eight neighbors are either in the "Propagated" state, or in the "Done" or "Split" state. In every case, it is the vector value resulting from the "Propagated" state that is to be used. Once again, some neighbor blocks can have a larger size. They can only be neighbors via the diagonal, because horizontal and vertical neighbors had to be past the "Propagated" state during the previous step. One has to wait until such larger blocks fall into the "Done" or "Split" state.

So as to somehow complete this informal description of the transition conditions, some additional properties should be described. Figure 3.8 presents an impossible situation where a block has as neighbors bigger blocks in the "Known" or "Unknown" state. This case is impossible because the block undergoing treatment results from a bigger block that has been split after the motion certitude treatment and the filtering (which implies that all its neighbors were at least in the "Propagated" state).

Figure 3.8: Impossible situation

But there can be more than one difference of size between two neighbor blocks. If it is the case, the biggest blocks must absolutely be in the "Done" state, as illustrated on figure 3.9.

To achieve the practical implementation, Vermaut has developed all the necessary structures to manage the block data and to keep in memory the block position in the image, its size, its present state and related information, and a list of neighbors. Efficient ways of performing the wakening of a block have also been set up.


Figure 3.9: Two neighbors with more than one step of size between them (the biggest blocks are Done)

3.2.4 Experimental Results

Such a model of distributed ABMA has been implemented on a network of conventional workstations, using a PVM platform. Tests have been conducted on an Ethernet linking 10 SUN computers. The parameters were the number of slaves and the complexity of the test sequence in terms of motion information (cf. appendix A). Generally speaking, a linear speedup with an efficiency of 50% was observed. Figure 3.10 illustrates one of the benchmarks.

3.3 Conclusion

The present chapter aimed at briefly introducing a scheme that has been developed in order to improve the performances of one out of several motion estimation techniques. The ABMA does actually allow one to obtain a motion field that better matches the data than the one provided by a classical BMA.

However, this improvement of the result is only possible with an increase of the computational burden. It was then proposed to François Vermaut to tackle this problem and to try distributing the load among several computational units. An implementation of the established distributed model demonstrates the possibility to perform the ABMA in real-time using several processors.


Figure 3.10: Computation time (o) and speedup (+) for the "Table Tennis" sequence, plotted against the number of slaves (1 to 11)


Chapter 4

Image Pre-Processing for VLBR Video Coding

In the introduction of Chapter 1 to video coding, the particularity of very-low bitrate (VLBR) coding has been highlighted: VLBR compression schemes generally have to debase more information than just the irrelevant part of the signal; drastic debasement of the pictures has to be achieved in order to reach the requested rates.

In this context, the COMIS scheme (cf. Section 1.4.2) has chosen the following approach: instead of letting (the bitrate regulation part of) the coder automatically perform a strong quantization, the images are voluntarily simplified prior to coding. It is then expected that these simplified images will be easier to encode, as only their irrelevancies would have to be removed.

As the insertion of such a pre-processing within COMIS has not allowed the coder to surpass its usual performances, the present chapter intends to analyze whether VLBR transmissions could benefit or not from this treatment and thereafter achieve a better quality (in comparison to the original images, before pre-processing) at equivalent rates.

Section 4.1 introduces the hypothesis about the possible gain of pre-processing. Section 4.2 considers it in a rate-distortion framework and theoretically settles the conditions for improving the coder performances. The next section (Section 4.3) experimentally analyses the behavior and the actual improvement brought about by such a pre-processing when inserted in a precise very-low bitrate coder: H.263 [96]. Finally, the last section draws some conclusions.


4.1 Intuitive Rationale

The hypothesis that constitutes the core of this chapter is the following: it could be more pertinent to operate the motion estimation between two images that have been "simplified" in the same way, so as to raise their correlation. Thereafter, it is also expected that the resulting residues could be encoded at a lower cost, improving the whole bitrate-versus-quality ratio. An example will help clarify this hypothesis.

Because of its very finalized status, H.263 ([96], cf. Section 1.4.1) has been chosen to test the validity of the idea. Like many other standards, H.263 performs its motion estimation thanks to the Block-Matching Algorithm (BMA [50], cf. Section 2.5.1). Let us just recall that the BMA is a correlation-based method which assumes that changes between successive images result from a local translational motion. With Tziritas and Labit [127] (in their chapter 4), one may define the correlation¹ ρ between two successive pictures I(x, y, t-1) and I(x, y, t):

\[
\rho = \frac{E[\,I(x,y,t)\,I(x-u,\,y-v,\,t-1)\,]}{E[\,I^2(x,y,t)\,]},
\tag{4.1}
\]

where E denotes the mathematical expectation and (u, v) is the displacement vector between (blocks of) the two consecutive images. It has been proven that the efficiency of the BMA directly depends on this correlation coefficient.

¹ Of course, ρ only depends on the variable t; x and y are present in equation (4.1) to indicate that the expectation is computed over all pixels of the image.
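Numerically, equation (4.1) is straightforward to evaluate; the following C++ sketch (a direct reading of the formula, with border handling of our own choosing) computes ρ for a given displacement (u, v):

    #include <vector>

    double correlation(const std::vector<double>& It,   // image at time t
                       const std::vector<double>& It1,  // image at time t-1
                       int w, int h, int u, int v) {
        double num = 0.0, den = 0.0;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                int px = x - u, py = y - v;
                if (px < 0 || px >= w || py < 0 || py >= h) continue;
                num += It[y * w + x] * It1[py * w + px]; // E[I(t) I(t-1) displaced]
                den += It[y * w + x] * It[y * w + x];    // E[I^2(t)]
            }
        return den > 0.0 ? num / den : 0.0;
    }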

One may then use this coefficient as a first indication about the BMA performances in various coding configurations.

Figures 4.1 (a) and (b) depict two original images of the Akiyo sequence². Their correlation is computed after image (a) has been compensated via a 16×16 BMA at pel precision: its value is ρ = 0.998. The motion field detected by the BMA is presented on figure 4.1 (c): as only the head of the speaker moves (slightly), only three blocks receive a non-zero vector. The residual image (or DFD), i.e. the difference image between image (b) and the motion compensation of image (a) by motion field (c), is presented on figure 4.1 (d). It is characterized by a very low variance, equal to σ² = 9.25.

² Appendix A briefly introduces the various test sequences that are used.


Figure 4.1: Temporal correlation, first case: (a) and (b) Original Akiyo images #101 and #104, (c) Detected motion field between (a) and (b), (d) Residual image after motion compensation

But, in the coding process, H.263 performs its motion estimation (and compensation) between the already decoded version of the previous image and the new original one. At very low bitrates, the reference image is highly debased because of the limited channel capacity. Instead of possessing I(x-u, y-v, t-1), the motion estimation uses the reconstructed image Î(x-u, y-v, t-1), which is only a rough version of the original. The correlation factor with the next original image is thereafter reduced. Figure 4.2 (a) shows the coded version of figure 4.1 (a). The next image to be coded remains the same (figure 4.2 (b)). In this case, the correlation falls down to ρ = 0.984 and the resulting motion field is noisier (figure 4.2 (c) with respect to figure 4.1 (c)). The residual image is also much more important: its variance raises to σ² = 47.66.

Figure 4.2: Temporal correlation, second case: (a) Akiyo image #101 when H.263-coded at 10 kbit/s, 10 Hz, (b) Original Akiyo image #104, (c) Detected motion field between (a) and (b), (d) Residual image after motion compensation

We here consider that it could be more pertinent to operate the motion estimation between the decoded version of the previous image and a pre-processed version of the new one³.

³ Originally, coders would perform their motion estimation between two original images. Then, experiments demonstrated that it was interesting to use the decoded image as reference image. It is here suggested to go one step further and to "simplify" the (new) image to be estimated.

Figure 4.3: Temporal correlation, third case: (a) Akiyo image #101 when H.263-coded at 10 kbit/s, 10 Hz, (b) Akiyo image #104 intra-coded with H.263 (QP=11), (c) Detected motion field between (a) and (b), (d) Residual image after motion compensation


The underlying assumption is that very-low bitrate coding does not simply introduce additive white noise, but rather deteriorates the image in a way that is dependent on the image itself: the noise is correlated with the original signal. The goal of the pre-processing would thereafter be to "simplify" the new picture in a similar way. For instance, if one first intra-codes the new incoming image (figure 4.3 (b)) and tries to predict it from the previously (de)coded one (figure 4.3 (a)), the correlation raises back to ρ = 0.990, and the estimated motion field appears simpler to encode (figure 4.3 (c)). However, the complexity of the residual image has not been reduced: its variance is equal to σ² = 50.66. The whole coding process will only benefit from the pre-processing if the gain of the motion field coding is more important than the possible loss of the residual coding.

It has to be pointed out that, although the H.263 residual coding that leads to the image of figure 4.3 (a) is achieved with a quantization step of 16, the new incoming image (b) is pre-processed as an intra image with a quantization step of only 11: it seems logical to have a pre-processing that enables quality improvement whenever the bitrate allows it.

Be<strong>for</strong>e establishing a theoretical framework to <strong>for</strong>mally set the problem,<br />

it is important to stress the di erence between the proposed preprocessing<br />

<strong>and</strong> other processings which also aim at reducing the <strong>bitrate</strong>.<br />

These are:<br />

Image (temporal <strong>and</strong>/or spatial) downsampling. Typically, H.263<br />

encodes QCIF sequences at 8:33 Hz instead of full screen images<br />

at 25 Hz. The proposed pre-processing is applied in addition to<br />

that type of image simpli cation.<br />

Selective ltering of the residual in<strong>for</strong>mation (it can consist in a<br />

mere threshold) according to the <strong>motion</strong> pertinence [43] ortothe<br />

relevance of the residues [90]. Our initial aim is to directly simplify<br />

the whole image without making any selection.<br />

Low-pass ltering applied to the compensated image so as to smooth<br />

it prior to the DFD computation, like the loop lter of H.261 [95].<br />

Here, ltering (pre-processing) is directly applied to the original<br />

image. It there<strong>for</strong>e acts upon both the <strong>motion</strong> eld <strong>and</strong> the residual<br />

image.


4.2 Rate Distortion Conditions

In its presentation of the Rate-Distortion theory [6] (a summary of which may be found in appendix B), Berger points out that "[...] in systems designed to transmit pictures and telemetry data - namely after a complex encoding and decoding technique is developed at considerable expense in order to transmit data over a channel with high reliability at rates approaching capacity - it is not clear just what information should be sent." This assertion reinforces the idea to pre-treat the images prior to inter-coding: it seems irrelevant to predict data (i.e. the finest details of the images) that have already disappeared in the previously coded image which serves as a reference.

The present section aims at analyzing this hypothesis with the help of the Rate-Distortion theory. Once an analytical model of images is chosen (Section 4.2.1), the specificity of very-low bitrate coding is highlighted (Section 4.2.2). The various modes of transmission are then compared and the conditions for a possible gain thanks to the pre-processing are derived.

4.2.1 Image Model

With O'Neal and Natarajan [98], one will suppose that images are zero-mean, without any loss of generality. Images are described by a Gaussian (isotropic) model of spatial covariance. The covariance of the image at time t-1, I_{t-1} = I(x, y, t-1), is

\[
\Gamma_{I_{t-1}} = E[\,I(x',y',t-1)\,I(x'-x,\,y'-y,\,t-1)\,]
               = \sigma_I^2\, e^{-\alpha\sqrt{x^2+y^2}},
\tag{4.2}
\]

where E denotes the mathematical expectation and σ_I² is the image variance. One obtains the spectral density of the image as a function of the spatial frequencies ω_x and ω_y:

\[
\Phi_{I_{t-1}}(\omega_x,\omega_y)
  = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty}
    \Gamma_{I_{t-1}}\, e^{-j\omega_x x}\, e^{-j\omega_y y}\, dx\, dy
  = \frac{2\pi\,\omega_0\,\sigma_I^2}{(\omega_0^2+\omega^2)^{3/2}},
\tag{4.3}
\]

where ω = √(ω_x² + ω_y²) and ω_0 = α.

Tziritas and Labit [127] consider digital (sampled) pictures as band-limited random fields. The use of an ideal, rotation-invariant low-pass filter allows one to use Φ_{I_{t-1}} as defined by equation (4.3) only when ω ≤ π√2 (ω is a normalized frequency); Φ_{I_{t-1}} is equal to zero otherwise.

Figure 4.4: Model of video sequence: image at time t is a displaced version of the previous one I_{t-1}, with some additional noise N(t)

The next image in the video sequence can be modeled according to figure 4.4: it results from a displacement of image I_{t-1} and some additive white noise which represents the illumination changes, etc. The motion effect is defined as a Dirac function δ(x-u, y-v), and the spectral density of the image at time t is

\[
\Phi_{I_t}(\omega_x,\omega_y) = \Phi_{I_{t-1}}(\omega_x,\omega_y) + \sigma_N^2,
\tag{4.4}
\]

where σ_N² is the variance of the white noise. The direct correlation between successive pictures is defined as

\[
\rho_0 = \frac{E[\,I(x,y,t)\,I(x,y,t-1)\,]}{E[\,I^2(x,y,t)\,]}
\tag{4.5}
\]

and is linked to the motion-compensated correlation of equation (4.1) by

\[
\rho_0 = \rho\, e^{-\omega_0\sqrt{u^2+v^2}}.
\tag{4.6}
\]

Of course, ρ_0 ≤ ρ.

4.2.2 Intra Coding of Images

One can compute the Rate-Distortion function of intra-coding I_{t-1}. It can be achieved by using polar coordinates [47]. A simpler expression is obtained when the restrictive hypothesis of memoryless coding is applied [6]. The rate R necessary to encode the image I_{t-1} with a distortion D is then:

\[
R = \begin{cases}
\dfrac{1}{2}\log_2\dfrac{\sigma_I^2}{D} & \text{if } D \le \sigma_I^2,\\[4pt]
0 & \text{otherwise.}
\end{cases}
\tag{4.7}
\]
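For instance, allowing a distortion of one quarter of the image variance (σ_I²/D = 4) yields R = ½·log₂ 4 = 1 bit per sample; halving the allowed distortion again adds half a bit per sample.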


Such an equation is used [127] in order to define theoretical limits that allow a coder to automatically decide when inter-coding should be used instead of intra-coding. However, what is needed here is to derive a model of images after intra-coding, so as to determine whether it is useful or not to pre-treat the images prior to motion estimation.

Intra-coding is modeled in a twofold way:

- At first, coding achieves some selection of the information to transmit. Schemes like H.263 [96] or MPEG-4 [42] have chosen to transmit as many low frequencies as possible in order to first reconstruct a rough approximation of the image. Higher frequencies (i.e. the image details) are only transmitted afterwards, if the bitrate allows it. One may thus consider the reconstructed image as a low-pass version of the original one: very-low bitrate coding uses a relatively strong low-pass filter, while this effect is almost null at high bitrates.

- Secondly, some coding noise is added to the image. It is considered to be white. It is the unique effect of coding present at high bitrates, but it exists at any bitrate.

The spectral density of an intra-coded image Ĩ_{t-1} is thus:

\[
\Phi_{\tilde I_{t-1}}(\omega_x,\omega_y) =
\begin{cases}
\Phi_{I_{t-1}}(\omega_x,\omega_y) + \sigma_C^2 & \text{if } \omega \le \omega_{cut},\\
\sigma_C^2 & \text{if } \omega_{cut} < \omega \le \pi\sqrt{2},
\end{cases}
\tag{4.8}
\]

where σ_C² is the variance of the coding noise and ω_cut the cutoff frequency of the low-pass effect.


4.2.3 Inter Coding without Motion Compensation

Figure 4.5: Inter-coding of I_t without motion compensation: (a) usual scheme versus (b) pre-processing scheme (in both cases the error signal E_t is the difference between the incoming image and the intra-coded reference Ĩ_{t-1}; in (b) the incoming image is pre-treated first)

4.2.3.1 Usual Scheme

The spectral density of the prediction error to be encoded is

\[
\Phi_E = \Phi_{\tilde I_{t-1}} + \Phi_{I_t} - 2\,\Phi_{\tilde I_{t-1} I_t}
= \begin{cases}
2\,(1-\rho_0)\,\Phi_{I_{t-1}} + \sigma_N^2 + \sigma_C^2 & \text{if } \omega \le \omega_{cut},\\
\Phi_{I_{t-1}} + \sigma_N^2 + \sigma_C^2 & \text{if } \omega_{cut} < \omega \le \pi\sqrt{2}.
\end{cases}
\tag{4.9}
\]

Integrating this density over the frequency band yields the variance σ_E² of the error signal (equation (4.10)); the two transmission modes are then compared for error signals encoded with the same distortion D (constant quality scheme).

4.2.3.2 Proposed Scheme

The scheme depicted on figure 4.5 (b) assumes that the new image I_t has been pre-processed. So as to simplify the new picture in a way that is similar to the intra-coding of the reference frame, the pre-processing consists in applying a low-pass filter (of cutoff ω'_cut) to the new image. As this pre-processing may not be perfect, it also produces some noise C'. Of course, ω_cut ≤ ω'_cut (cf. Section 4.1), and the noise C' is in no way correlated with the noise C that affects Ĩ_{t-1}. The residual information of the pre-processed signal is characterized by:

\[
\sigma'^2_E \simeq \frac{\omega_{cut}^2}{4}\Bigl[\,2(1-\rho_0)\,\sigma_I^2 + \sigma_N^2 + \sigma_C^2 + \sigma_{C'}^2\,\Bigr]
+ \frac{\omega'^2_{cut}-\omega_{cut}^2}{4}\Bigl[\,\sigma_I^2 + \sigma_N^2 + \sigma_C^2 + \sigma_{C'}^2\,\Bigr]
+ \frac{4-\omega'^2_{cut}}{4}\Bigl[\,\sigma_C^2 + \sigma_{C'}^2\,\Bigr].
\tag{4.11}
\]

From the comparison between equations (4.10) and (4.11), one can conclude that the proposed method is more effective if:

\[
\sigma_{C'}^2 < \frac{4-\omega'^2_{cut}}{4}\,\bigl(\sigma_I^2+\sigma_N^2\bigr).
\tag{4.12}
\]

One may argue that this comparison is valid only if one considers the error signals E and E' to be encoded with the same distortion, which should not be the case. Actually, in the proposed scheme, the error signal is expected to be encoded with nearly no errors, as it helps reconstructing an already debased picture. This would result in a rate approaching infinity! Nevertheless, as ω'_cut is chosen larger than ω_cut, E' can still be distorted to maintain a constant quality.

Considering equation (4.12), it appears that some gain can be expected only if the low-pass effect of pre-processing is predominant with regard to the additive noise. In this case, the reduction of the error signal is logical, as the spectral density of both signals has been limited.

4.2.4 Inter Coding with Motion Compensation

Inter-coding with motion compensation exploits motion estimation in order to obtain a prediction of the new image and to reduce the error signal. Nevertheless, motion estimation rarely succeeds in detecting the real motion (u, v) and guesses it with an error Δd = (Δu, Δv). The motion compensation box of figure 4.6 is therefore modeled by a Dirac function δ(x - u + Δu, y - v + Δv).

Figure 4.6: Inter-coding with motion compensation

According to Tziritas and Labit [127], one can consider the estimation error Δd to be centered, isotropic and Gaussian. Its characteristic function is then:

\[
\Phi_{\Delta d}(\omega_x,\omega_y) = \frac{1}{\bigl((\sigma_d\,\omega/2)^2+1\bigr)^{3/2}},
\tag{4.13}
\]

with σ_d the variance of this error. The mutual spectral density of two consecutive images is (note the use of ρ instead of ρ_0 in order to take the motion into account):

\[
\Phi_{\tilde I_{t-1} I_t} = \frac{\rho}{\bigl(1+\sigma_d\,\omega_0/2\bigr)^{2}}\;\Phi_{I_{t-1}}.
\tag{4.14}
\]

If one considers that the proposed pre-processing has no influence on the motion estimation precision, the condition of equation (4.12) is still valid. However, it is expected that the pre-processing enables the motion estimation to perform better, which results in an error variance σ'_d < σ_d. The proposed scheme is then more interesting once:

\[
\sigma_{C'}^2 < \frac{4-\omega'^2_{cut}}{4}\bigl[\sigma_I^2+\sigma_N^2\bigr]
+ \frac{\omega_{cut}^2}{2}\;\sigma_I^2\,\rho\,\omega_0\,(\sigma_d-\sigma'_d)\;
\frac{1+\frac{\omega_0}{4}(\sigma'_d+\sigma_d)}
     {\bigl(1+\frac{\sigma_d\,\omega_0}{2}\bigr)^2\bigl(1+\frac{\sigma'_d\,\omega_0}{2}\bigr)^2}.
\tag{4.15}
\]

4.2.5 Theoretical Conclusion

From all these rate-distortion equations, three major deductions, which will need to be validated through experiments, can be made. They are:


- Pre-processing (low-pass filtering) of the new image decreases the rate required to encode it.

- Under the assumption that pre-processing improves the precision of the motion estimation, the overall gain of pre-processing will be reinforced.

- The theoretical model announces an improvement of the rate-distortion ratio. This means that, at an equivalent rate, the quality of the encoded pre-processed sequence with respect to the original pre-processed sequence will be higher than the quality of the encoded original one with respect to the original sequence. This does not ensure that the quality of the encoded pre-processed sequence surpasses the quality of the encoded original one, both with respect to the original sequence.

4.3 Experimental Results

In order to experimentally (in)validate the present hypothesis and the relevance of equation (4.15), a preliminary experiment consisted in computing the correlation $\rho$, the variance $\sigma^2_E$ of the residual error, and the variances of the motion field components ($\sigma^2_u$ and $\sigma^2_v$). H.263^4 [96] has been used. Table 4.1 presents the results on several MPEG-4 test sequences.

^4 Thanks to the software tmn 2.0 provided by Telenor at http://www.nta.no/brukere/DVC/

The test conditions are the following: every sequence has been coded at variable bitrate with a constant quantization step of 10. This results in a wide variety of bitrates according to the sequence content: starting from 18 kbit/s for Akiyo or 22 kbit/s for Mother & Daughter, up to 235 kbit/s for Stefan. For every image to be inter-coded, two schemes are compared: the first line of table 4.1 (for a given sequence) presents the correlation and the variances when motion estimation & compensation is performed between the new original image and the reference coded one. The second line presents the same results when all the original images have first been intra-treated with a quantization factor of 5.

The results of table 4.1 are broadly in agreement with equation (4.15): the correlation rises thanks to the pre-processing and, in most cases, the energy of the residual error decreases.



Sequence            | ρ        | σ_E       | σ_u       | σ_v
Akiyo               | 0.985642 | 6.158601  | 2.760048  | 3.000152
  pre-processed     | 0.987657 | 6.030156  | 0.938596  | 1.278907
Container ship      | 0.969350 | 8.288046  | 9.128978  | 2.802829
  pre-processed     | 0.972665 | 8.045071  | 9.664501  | 3.490324
Hall monitor        | 0.967627 | 9.043882  | 3.149510  | 3.888562
  pre-processed     | 0.970485 | 8.873044  | 3.612144  | 3.903613
Mother & daughter   | 0.967063 | 6.634926  | 3.421437  | 5.444000
  pre-processed     | 0.969207 | 6.386776  | 3.909316  | 5.994232
Coast guard         | 0.939101 | 12.353096 | 6.217164  | 3.344785
  pre-processed     | 0.940386 | 12.219138 | 6.736460  | 3.714005
Foreman             | 0.944067 | 12.883052 | 8.774803  | 7.610945
  pre-processed     | 0.944958 | 12.811791 | 9.066337  | 7.844493
News                | 0.964210 | 10.356456 | 6.839849  | 5.447062
  pre-processed     | 0.966480 | 10.236923 | 6.309831  | 1.970497
Silent              | 0.964399 | 10.359665 | 4.286384  | 4.248433
  pre-processed     | 0.967709 | 10.071909 | 4.306487  | 4.285518
Mobile              | 0.902015 | 19.373144 | 1.879478  | 0.847039
  pre-processed     | 0.901617 | 19.488099 | 1.879190  | 0.844782
Stefan              | 0.781131 | 20.427418 | 12.814752 | 5.307830
  pre-processed     | 0.781557 | 20.438889 | 12.876472 | 5.701028
Table tennis        | 0.914535 | 11.021105 | 7.547181  | 7.476545
  pre-processed     | 0.919952 | 10.767871 | 7.911386  | 8.445316

Table 4.1: Correlation and variances with or without pre-processing

On the other hand, one cannot really maintain that the motion field appears to be more coherent: for almost all sequences, the motion variance rises, which indicates a sparser (and probably more difficult to encode) motion field. Only Akiyo, News and Mobile (the latter with an increase of the residual error) present a "better" motion field. As has been theoretically demonstrated, the reduction of $\sigma_E$ is probably the result of the low-pass filtering implicitly applied by the intra coding.

The second experiment then consisted in applying the pre-processing inside the coding loop, so as to measure the real impact of this operation.

[Figure 4.7: Results of the various pre-processings on "Akiyo": mean PSNR versus bitrate (kbit/s) for normal coding, Gaussian 0.5/0.9/1.2, median 3/5/7, morphological 3/5/7 and intra Q10/Q20/Q30 pre-processing.]

As the intra pre-processing (with a quantization step of 10, 20 or 30) is probably not the one that achieves the best compromise between low-pass filtering and additive noise, three other types of pre-processing have also been tested (a code sketch of the three filters follows the list):

- Gaussian filtering with a standard deviation of 0.5, 0.9 or 1.2 (which results in separable filters of size 3, 5 or 7).
- Median filtering^5 with a square window of size 3, 5 or 7.
- Morphological filtering: open-close with reconstruction^6, with a square structuring element of size 3, 5 or 7.

^5 "Median filtering consists of a sliding window encompassing an odd number of pixels. The center pixel in the window is replaced by the median of the pixels within the window. The median of a discrete sequence $a_1, a_2, \ldots, a_N$ for $N$ odd is that member of the sequence for which $(N-1)/2$ elements are smaller or equal in value, and $(N-1)/2$ elements are larger or equal in value." (out of [105], p. 330)

^6 For the definitions of morphological operators like open and close, please refer to [45, 118].
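As an illustration, the following Python sketch applies the three families of filters with scipy.ndimage, assuming 8-bit grayscale frames stored as NumPy arrays. The open-close with reconstruction of the third item is approximated here by a plain open-close: the geodesic reconstruction step is omitted for brevity, so this is a simplified stand-in for the filter actually used in the experiments.

```python
import numpy as np
from scipy import ndimage

def preprocess(image, method="gaussian", size=3, sigma=0.5):
    """Apply one of the three tested pre-processing filters.

    image  -- 2-D uint8 array (one luminance plane)
    method -- "gaussian", "median" or "morpho"
    """
    if method == "gaussian":
        # Separable Gaussian low-pass filter; sigma in {0.5, 0.9, 1.2}
        # roughly corresponds to support sizes 3, 5 and 7.
        return ndimage.gaussian_filter(image, sigma=sigma)
    if method == "median":
        # Median filter with a square window of odd size (3, 5 or 7).
        return ndimage.median_filter(image, size=size)
    if method == "morpho":
        # Open-close with a square structuring element; the geodesic
        # reconstruction used in the thesis is omitted in this sketch.
        opened = ndimage.grey_opening(image, size=(size, size))
        return ndimage.grey_closing(opened, size=(size, size))
    raise ValueError("unknown method: " + method)
```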

After pre-processing of all images, every sequence has been H.263-coded with a quantization step of 10, 20 and 30. Figure 4.7 presents the mean Peak Signal-to-Noise Ratios (PSNR^7) versus the bitrates for the Akiyo sequence. The curves are similar for all the other sequences.

Size of the filter | 3     | 5     | 7
Gaussian           | 37.60 | 29.71 | 27.86
Median             | 30.66 | 26.87 | 25.72
Morpho             | 29.73 | 28.66 | 27.32

Table 4.2: PSNR of "Akiyo" pre-processed sequences

What directly emerges from these diagrams is that pre-processing is never able to surpass the performance of the "normal" (original) coding. This is foreseeable, since the pre-processed sequences, prior to any coding, already have a very low PSNR when compared to the original sequence (cf. table 4.2). However, such a conclusion would be too simple. In order to better understand the effect of pre-processing, table 4.3 introduces some details regarding the pre-processed coding of one exemplary sequence: Akiyo. PSNR′ names the Peak SNR with respect to the equivalent pre-processed sequence prior to coding, while PSNR designates the Peak SNR with respect to the original sequence (without any pre-processing). The column entitled "Vector" presents the average amount of bits devoted to the coding of the motion information.

Pre-processing | PSNR′ | PSNR  | Vector | Bitrate (kbit/s)
Original       | 27.88 | 27.88 | 57     | 4.16
Intra Q10      | 28.48 | 27.89 | 57     | 4.23
Gaussian 0.5   | 29.32 | 27.12 | 55     | 3.65
Median 3       | 30.16 | 26.28 | 54     | 3.40
Morpho 3       | 30.39 | 25.67 | 49     | 3.20

Table 4.3: Coding of "Akiyo" with various pre-processings

^7 The Peak SNR (PSNR) between two images $I$ and $I'$ is defined as $PSNR = 10 \log_{10}(255^2 / \sigma_D^2)$, where $\sigma_D^2$ is the energy of the difference image $D = I - I'$.
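In code, the footnote's definition translates directly; a minimal NumPy sketch, assuming two 8-bit images of identical size:

```python
import numpy as np

def psnr(image, reference):
    """Peak SNR = 10*log10(255^2 / var(D)), D = image - reference."""
    diff = image.astype(np.float64) - reference.astype(np.float64)
    energy = np.mean(diff ** 2)       # sigma_D^2, energy of the difference
    if energy == 0:
        return float("inf")           # identical images
    return 10.0 * np.log10(255.0 ** 2 / energy)
```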



From this table, it appears that:

- As was expected from equation (4.15), PSNR′ rises. On the other hand, PSNR decreases. The constant quantization step used here accounts for these effects.
- One may notice a crescendo in the simplification brought about by the various filters (from the intra pre-processing to the morphological filtering): each time, PSNR′ is improved whereas the bitrate is lowered. This tends to prove that the sequence to encode becomes simpler. This is illustrated in figures 4.8 and 4.10, which present the result of the various pre-processings (all filters with a size of 3) for Coast Guard and Akiyo respectively.
- It is interesting to notice that the amount of bits necessary to encode the motion information is reduced according to the degree of simplification. However, it is difficult to assert that this directly results from the simplification itself (which would be in contradiction with the experiment of table 4.1). This reduction is most probably due to the global cost optimization of H.263, which checks whether it is more appropriate to send some motion vectors with residues or to directly encode a particular block with the DCT. Simplified images of course have a very low cost when DCT-coded.

It thus appears obvious from figure 4.7 that, at an equivalent bitrate, the pre-processing never achieves a better-performing coding in terms of PSNR. One may then argue that the PSNR is an objective measure which, at very low bitrates, is quite far from the real subjective impression of the viewer^8.

Figures 4.9 and 4.11 present the various pre-processed images of figures 4.8 and 4.10 after coding with a quantization step of 30^9. The visual inspection confirms the verdict of the PSNR curves: the images are far too debased, and the gain in terms of bitrate is not worth the quality loss.

^8 Research on an "objective criterion of subjective quality", for instance, tries to better match the subjectivity of the viewer by using models based on the Human Visual System (HVS), like a decomposition into perceptual channels [16] or an extension of these channels to the temporal component [129].

^9 Unfortunately, we had to perform our H.263 simulations with a fixed quantization step. It would have been more appropriate for subjective testing to present the various types of encoding at an equivalent bitrate. The fact is that H.263, so as to perform its bitrate regulation, changes both the quantization step and the number of skipped frames. It would be meaningless to compare images from two sequences coded at 5 kbit/s if one contains 100 coded frames and the other 95.



Sequence     | Original Q=30 | Selective Q=30 | Original Q=20 | Selective Q=20
Akiyo        | 4.16          | 4.10           | 6.79          | 6.65
Coast Guard  | 20.55         | 16.89          | 36.45         | 26.27

Table 4.4: Comparative bitrate (kbit/s) of selective pre-processing of the background

However, this assertion should once more be nuanced. Indeed, the quality is unsatisfactory for the crucial parts of the images, like the speaker, but is, in most cases, sufficient as far as the background is concerned. The idea is then to introduce a selective simplification where appropriate. For instance, figures 4.12 and 4.13 present a selective pre-processing of the images along with their encoded versions. In both cases, an open-close morphological filtering with reconstruction and a square structuring element of size 7 has been applied to the background (only the water in Coast Guard). Not only is the quality subjectively equivalent, but the bitrate is also drastically reduced in case of global motion (background movement). Table 4.4 demonstrates it for Coast Guard: thanks to the simplification of the background, the bitrate can be devoted to encoding the main objects of the scene with a better quality.
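A minimal sketch of such a selective simplification, assuming a binary background mask is available (hand-made, as for the figures): the simplifying filter is applied to the whole frame and only the background pixels are replaced. As in the earlier sketch, the geodesic reconstruction step of the open-close is omitted.

```python
import numpy as np
from scipy import ndimage

def selective_preprocess(image, background_mask, size=7):
    """Simplify only the background of a frame.

    image           -- 2-D uint8 luminance plane
    background_mask -- boolean array, True where the pixel belongs to
                       the (subjectively less important) background
    """
    # Open-close with a square structuring element of size 7, as used
    # for the water region of "Coast Guard" (reconstruction omitted).
    simplified = ndimage.grey_closing(
        ndimage.grey_opening(image, size=(size, size)), size=(size, size))
    out = image.copy()
    out[background_mask] = simplified[background_mask]
    return out
```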

This brings us to the concept of the bitrate regulation scheme [19] of COMIS (cf. Section 1.4.2), which gives priority to the regions that are considered subjectively more important. This implies that the coder can automatically segment the images into objects, track these segments along time^10 and decide which segments are relevant. Such analysis tools still need improvements so as to reinforce their precision and robustness: preliminary results of selective pre-processing with automatic segmentation and tracking tools demonstrate a lot of instability, which is visually very annoying.

^10 For the examples of figures 4.12 and 4.13, hand-made segmentations have been used.



[Figure 4.8: "Coast Guard" sequence: effect of the various pre-processings (original, median, intra, Gaussian, morphological).]



[Figure 4.9: "Coast Guard" sequence: result of coding after pre-processing (Q=30 for all images); panels: original, median, intra, Gaussian, morphological, with bitrates of 20.55, 35.76, 62.85, 34.40 and 34.10 kbit/s.]



[Figure 4.10: "Akiyo" sequence: effect of the various pre-processings (original, median, intra, Gaussian, morphological).]



[Figure 4.11: "Akiyo" sequence: result of coding after pre-processing (Q=30 for all images); panels: original, median, intra, Gaussian, morphological, with bitrates of 4.16, 9.09, 12.68, 9.33 and 9.92 kbit/s.]



[Figure 4.12: Result of coding with selective pre-processing for the "Coast Guard" sequence: original, original Q=30 (20.55 kbit/s), selective, selective Q=30 (16.89 kbit/s), selective Q=20 (26.25 kbit/s).]



[Figure 4.13: Result of coding with selective pre-processing for the "Akiyo" sequence: original, original Q=30 (4.16 kbit/s), selective, selective Q=30 (4.10 kbit/s), selective Q=20 (6.65 kbit/s).]



4.4 Conclusion

The hypothesis that, at VLBR, image pre-processing prior to inter-coding can provide some coding gain has first been formally described in a rate-distortion framework: based on a Gaussian model of images, VLBR transmission has been considered to apply low-pass filtering to the images and to add white noise. The theoretical result of the pre-processing is then a reduction of the residual error. This error is demonstrated to be additionally reduced if the pre-processing can be considered as improving the pertinence of the motion field.

However, experiments prove that this is not exactly the case and that the simplification does not always allow the motion estimation to be more precise. Moreover, the bitrate gain offered by some pre-processing is not worth the resulting loss of quality, except for subjectively less important regions like the scene background. This proves that a coder like H.263 is very well optimized for coding low resolution images (QCIF) at (very) low bitrates; alternatively, the QCIF format can be viewed as a specific pre-processing well-matched to the H.263 coder. A pre-processing on higher resolution pictures for other coders running at unusually low bitrates (for instance, MPEG-1 at 100 kbit/s) should provide more obvious gains.

In case of global camera motion, it can be relevant to automatically segment and track the image background so as to simplify it: several kbit/s can be won without lowering the overall subjective quality of the sequence. According to the target application, one could even go further and decide to voluntarily suppress all camera motion that is not relevant to the scene interpretation, or to only transmit the motion parameters in relevant areas.

As far as the present theoretical and experimental frameworks are concerned, they could be improved by:

- Defining a theoretical link between the variance $\sigma^2_I$ of an image and the variance of its coded version. This would allow a closer analysis of the implications of the rate-distortion equations. These equations could also be re-developed using other image models than the one used here (see for instance [14]).
- Designing pre-processing filters which take the need for temporal coherency along the sequence into account, and which are adapted on-line to the visual degradations present on the reference image.


Chapter 5

Mesh-Based Motion Compensation

In Chapter 2, two techniques for motion estimation have been extensively presented: the Block Matching Algorithm (BMA [50], cf. Section 2.5) and warping techniques (cf. Section 2.6).

While the BMA is the technique mostly used in standard codecs, its compensation stage provides a predicted image that suffers from so-called blocking artifacts. In order to reduce such artifacts, overlapped block motion compensation has been proposed [116, 4] (cf. Section 2.5.2).

The triangular mesh of Nakaya and Harashima [88] (cf. Section 2.6.1) performs a better compensation stage, because the triangles implicitly use the affine transform (which is able, with its six free parameters, to tackle rotations and zoom effects in addition to the translation). Although it can perform in real-time on specific hardware, such a warping solution has a computational burden that largely exceeds that of the BMA, mainly because of its iterative nature. Dudon extended this work to active meshes that are automatically adapted to the spatial contents of the image [29] (cf. Section 2.6.2) and also established other ways of estimating the motion of the vertices (mesh nodes) [28]. Wang and Lee propose to perform a global optimization [136], while Altunbasak and Tekalp focus on how to optimally represent a dense motion field [3].

In addition to the computational burden, a major disadvantage of performing motion estimation with meshes (triangular or quadrilateral in nature) is the transmission cost of the motion parameters, which remains higher than the cost of transmitting the BMA motion field.



The adaptation of the vertices' location to the spatial contents increases the relevance of the estimated motion field through an a priori liaison between the spatial and the temporal information. Another advantage of such adaptive mesh structures is the huge domain of applications and added functionalities they cover: some examples are 3-D modeling [10, 61] or the link with fractal models for intra-coding [18], but also video editing effects like synthetic object transfiguration [122], augmented reality and other functionalities directly related to the Synthetic and Natural Hybrid Coding [26] of MPEG-4 (see e.g. the Core Experiment M2 [121]).

The challenging idea of the present chapter is to combine the advantages of both previously described techniques: the classical BMA and the implicit affine motion model developed by the triangular mesh (which is in fact a wireframe). The principle is as follows: nothing is changed as far as the motion estimation is concerned; a classical BMA is applied and the resulting motion vectors are transmitted as such. The aim of this chapter is to research how to modify the compensation (reconstruction) stage: the motion vectors are not merely used to displace blocks of the reference frame, but rather serve as an information set to warp a mesh that has been built on this reference image. Such a combination seems significant in several ways. First, it enables heightening the subjective quality of the reconstructed pictures by taking spatial information into account (adaptive vertices location) and by allowing a richer motion field parameterization (affine transform)^1. This is performed with no increase either in the computational burden of the estimation or in the transmission cost. Second, and most important, it improves the representation stage and allows for the use of new functionalities without losing compatibility with existing standards, since the bitstream is not modified^2.

This idea of an asymmetric scheme has already been touched upon by Li et al. in their review of video coding and motion estimation techniques [63]: "it seems that it is wise to use wireframe models for image synthesis but not for image analysis". This chapter aims at implementing such an asymmetric scheme.

^1 Although it is expected to improve the subjective quality, it is in no way expected to raise the PSNR (i.e. the objective quality) of the reconstructed pictures. As the BMA is a correlation-based technique which optimizes its estimation for a block-based reconstruction, another type of reconstruction can of course not be expected to achieve a better correlation result.

^2 This issue will be discussed in the conclusion of the chapter, Section 5.5.



Several aspects of image processing will be dealt with here: filtering, used to select the image features serving as vertices of a content-based mesh; interpolation, necessary to transpose the block-based motion information to the mesh representation; and basic computer graphics, involved in the warping of the mesh.

In order to indicate the objectives of the chapter, Section 5.1 uses an example to highlight the key features and specifications of the proposed reconstruction method. Later sections focus on details of the scheme: corner extraction, required to automatically generate the mesh, and the inverse kriging operation, which computes the motion vectors of every vertex of the mesh, are presented in sections 5.2 and 5.3, respectively. Finally, Section 5.4 applies the scheme to several images so as to comment on the results of the proposed scheme. Some conclusions are drawn in the last section, 5.5.

5.1 Estimation, Transcription and Reconstruction

The aim of this section is to introduce the asymmetric scheme while outlining the reconstruction process. The section ends with the identification of specific problems to be tackled in order to reach a satisfactory solution.

In the following sections, the images of figure 5.2 (identical to the ones of Chapter 2) will be used as sample images at time $t$ and $t-1$. It has to be noted that the images are noisy: the square is therefore not made up of a flat texture and its contours are rather fuzzy. Figure 5.2 (c) displays the luminance values of the pixels of the 50th line of the image $I(t-1)$.

5.1.1 The Proposed Reconstruction

The scheme aims at enhancing the reconstruction (compensation) step of the BMA thanks to the injection of a priori spatial information. Instead of merely applying the motion vectors detected for every block, the scheme of figure 5.1 proposes to modify the motion reconstruction in a twofold way, so as to apply it to a mesh structure that has been automatically designed upon the reference image (a sketch of the triangle-warping step follows the list):



[Figure 5.1: Outline of the proposed reconstruction scheme. Coder: BMA estimation between the reference image at time t-1 and the new image at time t; the backward BMA motion field is sent over the transmission channel. Decoder: mesh creation on the reference image at time t-1, determination of the vertices' (forward) motion, and mesh warping yielding the predicted image at time t.]

- the transcription of the motion information is altered (interpolated) in order to define the motion components of every vertex of the mesh structure;
- the compensation stage then consists in warping the mesh according to the newly determined motion information.
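At the heart of the warping stage, each triangle of the mesh is mapped by the affine transform determined by the motion of its three vertices. A minimal NumPy sketch of that single step is given below; the function name and calling convention are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """Affine transform mapping one triangle onto another.

    src_tri, dst_tri -- (3, 2) arrays of (x, y) vertex coordinates
    Returns the 6 affine parameters (a, b, c, d, e, f) such that
    x' = a*x + b*y + c and y' = d*x + e*y + f.
    """
    A = np.hstack([src_tri, np.ones((3, 1))])   # 3x3 system matrix
    px = np.linalg.solve(A, dst_tri[:, 0])      # a, b, c
    py = np.linalg.solve(A, dst_tri[:, 1])      # d, e, f
    return np.concatenate([px, py])
```

Applying the resulting transform to the pixels of the source triangle, for every triangle of the mesh, produces the predicted image.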

The expected result of such a procedure is presented in figure 5.3 for the sample images. What emerges in comparison with figures 2.16, 2.22 and 2.23 is that the blocking artifacts are suppressed by the use of a mesh that takes the object boundaries into account. Moreover, the computational burden at the decoder end is not drastically increased, as the motion vectors are not estimated but merely computed on a limited number of vertices.

5.1.2 Problems to be Addressed

The sections to come will detail the main points of the proposed scheme, but it seems important to first identify the problems to be solved in order to obtain a stable and viable solution. Three specific parts of the algorithm necessitate some preliminary explanation as well as a precise schedule of conditions.



[Figure 5.2: The two original sample images: (a) original image at time t-1; (b) original image at time t; (c) histogram of line 50 (intensity value versus pel number).]

5.1.2.1 Mesh Vertices

The first step of the reconstruction algorithm is the automatic design of a mesh on the reference image. Meshes may be designed in several ways. Dividing the image into a priori equal patches provides a regular mesh. Such meshes are not adapted to our purposes, as they do not reflect the scene content and one patch can contain multiple motions (because it overlays several distinct objects). Hierarchical meshes [122] are not needed either, as they are intended for refinement over time. Knowledge-based mesh design that exploits a priori information (like facial animation in videophony [115]) is useless here, since the scheme intends to deal with any given video sequence.

[Figure 5.3: Toy example: expected result of the new reconstruction process. (a) Optimal wireframe; (b) warped wireframe.]

Content-based meshes, which aim at matching the boundaries of patches with important scene features, appear to be the solution. Such an adaptation of the mesh may be implemented in several ways: based on spatial and temporal activity [28], or in order to minimize a function of several constraints [136]. One of these constraints can be the reconstruction error. However, the aim here is to exploit information already accessible at the decoder end and to fit as much information as possible to the objects. The choice is thus to detect some corners -- or feature points -- of the image. The set-up of the triangular active mesh can then be implemented through a Delaunay triangulation [119] of the previously detected points.
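For instance, once feature points have been extracted, SciPy's Delaunay triangulation yields the mesh directly. A minimal sketch (corner detection itself is the subject of section 5.2):

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate(corners):
    """Build the triangular mesh from detected feature points.

    corners -- (N, 2) array of (x, y) corner locations, N >= 3 and
               not all collinear; the four image corners can be
               appended so that the mesh covers the whole frame.
    """
    tri = Delaunay(np.asarray(corners, dtype=np.float64))
    # tri.simplices is an (M, 3) array of vertex indices, one row
    # per triangle of the mesh.
    return tri.points, tri.simplices
```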

Constraints on the number and the location of the mesh vertices may be pointed out:

- Their number should be high enough to surround all objects present in the scene and to allow these objects to undergo independent movements. However, if there are too many vertices, the mesh could reproduce the discontinuities of the block-based vectors, and thereby keep the blocking artifacts, which should be avoided. In other words, the number of vertices should be limited in order to smooth the motion field and suppress the blocking artifacts.
- They should be correctly located on object boundaries and corners, as appears experimentally from figure 5.4 (out of [138]): considering two original images at time t-1 and t (figure 5.4(a)), one can see the effect of mesh warping when the vertices are arbitrarily placed (b), located on object edges (c), or correctly located on corners and edges (d). Only the latter case prevents spatial degradations.

[Figure 5.4: Experiment on the importance of the location of mesh vertices (rows: time t-1, time t; columns (a)-(d)).]

5.1.2.2 Motion Interpolation

The second point of the transcription change is to interpolate the motion field: the BMA information can be considered as a coarse subsampling of a dense motion field. The motion of the vertices represents another subsampling of the same dense field. The problem is that the dense motion field is not known. Combined with the inversion problem mentioned hereunder, an appropriate interpolation technique should be found. It should also remain as simple as possible, so as not to increase the computational burden too much.



5.1.2.3 Reversing the Motion Information

While adapting the motion information to the active mesh, an important point must be taken into account: this information should often be reversed. The BMA, as implemented in most standards, is effectively computed "backwards", while the mesh, implemented on the image to be motion compensated, needs "forward" vectors. This means that the motion vectors of the BMA indicate where a block of $I(t)$ comes from in $I(t-1)$, whereas the mesh is designed on $I(t-1)$ and the motion vectors must indicate where every node must arrive in $I(t)$.

5.2 Mesh Design

The retained method to automatically design the adaptive mesh thus consists of i) detecting image corners to use them as mesh vertices; and ii) building the mesh by Delaunay triangulation. The latter is well known in the literature [119] and is briefly presented in Appendix D. The corner detector developed here receives some explanation below.

The literature proposes two classes of algorithms for the extraction of corners. The techniques of the first class work directly on the grey-level image. A "cornerness" measure is first computed for each pixel of the image through measurement of gradients and surface curvatures. Cornerness $C$ is defined as the product of the gradient magnitude and the rate of change of the gradient direction. After that, the corners are extracted by applying a threshold on the used measure. The best-known detectors of this kind are the following (a code sketch computing two of these measures is given at the end of this enumeration):

- the one of Beaudet [5], who proposed a detector which looks for extrema of a rotationally invariant operator DET, i.e. the determinant of the Hessian of the image:
\[ \mathrm{DET} = I_{xx}\,I_{yy} - I_{xy}^2, \qquad (5.1) \]
where $I_i$ designates the partial derivative of $I$ with respect to $i$, and $I_{ij}$ a second partial derivative;
- the operator proposed by Dreschler and Nagel [27], which relies on a combination of maximum and hyperbolic points of the Gaussian curvature of the intensity surface:
\[ C = \frac{\mathrm{DET}}{\left(1 + I_x^2 + I_y^2\right)^2}; \qquad (5.2) \]



- the cornerness measure of Kitchen and Rosenfeld [56], defined as the change of the gradient direction along an edge contour multiplied by the local gradient magnitude:
\[ C = \frac{I_{xx}\,I_y^2 + I_{yy}\,I_x^2 - 2\,I_{xy}\,I_x\,I_y}{I_x^2 + I_y^2}; \qquad (5.3) \]
- the operator of Zuniga and Haralick [139], which is based on a facet model approach where images are considered as bi-cubic polynomial surfaces;
- the detector of Harris [46], which makes use of the local autocorrelation function to simultaneously detect corners and edges.

Deriche [25] has highlighted the behavior of such well-known cornerness measures so as to correct their faulty localization. This results in a measure that combines Beaudet's measure and the zero-crossings of the Laplacian. Recently, Rohr [114] has proposed an analytical model of grey-value corners in order to further study the properties of direct corner detectors.
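As an illustration of the first class, the following sketch computes Beaudet's DET (5.1) and the Kitchen-Rosenfeld measure (5.3) from finite-difference derivatives; the derivative scheme and the division guard are implementation choices of this sketch, not prescribed by the references. Thresholding the resulting maps would then extract the corners.

```python
import numpy as np

def cornerness_maps(image):
    """Beaudet's DET (5.1) and Kitchen-Rosenfeld cornerness (5.3)."""
    f = image.astype(np.float64)
    fy, fx = np.gradient(f)           # first derivatives I_y, I_x
    fxy, fxx = np.gradient(fx)        # I_xy, I_xx
    fyy, fyx = np.gradient(fy)        # I_yy, I_yx

    det = fxx * fyy - fxy ** 2        # Beaudet: DET = Ixx*Iyy - Ixy^2
    eps = 1e-9                        # guard against flat areas
    kr = (fxx * fy ** 2 + fyy * fx ** 2 - 2.0 * fxy * fx * fy) \
         / (fx ** 2 + fy ** 2 + eps)  # Kitchen-Rosenfeld (5.3)
    return det, kr
```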

The second class of algorithms explicitly extracts the edges as chain codes at an initial stage. The corners are then found as points belonging to those edges which have a high curvature and whose curvature is a local maximum. Edges are often detected by means of the Canny [79] operator, while the curvature along an edge may be estimated by finite differentiation [79, 83] of the chain of neighboring pixels or by computing the partial derivatives [24] using intensity images.

In order to get rid of the image noise as well as to select only the "strongest" corners, techniques belonging to the first class are usually applied to a filtered version of the images, which results in corner displacement. On the other hand, the Canny edge detector used in the second approach is unable to detect complicated shapes such as locations where many contours meet. Deriche's technique [25] skirts both these problems, as it is able to detect multiple points and tries to correct the corner displacement caused by filtering. However, the latter technique has revealed itself very demanding, and precise only at subpel accuracy. As precise location on real edges seemed a priority (cf. Section 5.1.2.1), a second-class algorithm was eventually chosen.



At first, the half-boundaries detector of Noble [94] is used to detect edges as well as multiple points. It is then combined with the finite differentiation technique of Najman and Vaillant [87] to select high-curvature points.

5.2.1 Detecting Edges

With respect to our application, one of the great improvements brought by Noble's edge detector is that it does not propose to detect edges in a classical way, but rather to find half-boundaries. Moreover, it makes use of morphological operators [45, 118], which perform very fast.

Although Noble's algorithm is originally made out of three distinct steps, namely feature enhancement, boundary following and boundary stitching, only the first two parts have been used in our implementation. While the very first step has not really been modified, some improvements have been brought to the second in order to better fill our needs. The following subsections detail the working principles of both steps.

5.2.1.1 Enhancing Image Boundaries

Real edges are characterized by a second-derivative zero-crossing, i.e. their response to a second-order derivative operator, like the Laplacian, is null at the edge location. However, such edges are often located between the pixels (cf. figure 5.5(a)), which forces algorithms to estimate them by differentiating the response values from two points on either side of the zero-crossing in the direction of maximum change. Such a measurement is too local to allow the algorithm to distinguish between real edges and noise. This is why Noble proposes to track the edges in regions adjacent to boundaries, which she calls half-boundaries (cf. figure 5.5). The edge tracking is therefore not based on the boundary strength anymore, but on the shape of the operator response.

The main requirement to track half-boundaries is then to have an operator that can distinguish between the sides of a boundary. Such an operator is the so-called signed max dilation-erosion residue, which acts as a second-derivative filter. If one defines the erosion^3 residue
\[ f_{er}(f) = f - (f \ominus B), \qquad (5.4) \]

^3 For the definitions of morphological operators like erosion and dilation, please refer to [45, 118].



[Figure 5.5: Location of real edges and half-boundaries: (a) ideal step; (b) narrow ramp or blurred step. Negative and positive half-boundaries lie on either side of the real edge.]

where $f$ is the image function and $B$ the structuring element, and the dilation residue
\[ f_{dr}(f) = (f \oplus B) - f, \qquad (5.5) \]
then the max dilation-erosion residue is defined by:
\[ f_{maxder}(f) = \max\left[\, f_{er}(f),\; f_{dr}(f) \,\right]. \qquad (5.6) \]
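A direct transcription of (5.4)-(5.6) in Python, a sketch assuming 8-bit grey-level images and using scipy.ndimage grey-level morphology with the square structuring element used later in this section:

```python
import numpy as np
from scipy import ndimage

def max_dilation_erosion_residue(image, size=5):
    """Signed max dilation-erosion residue of (5.4)-(5.6).

    Returns the erosion residue, the dilation residue and their
    pointwise maximum, computed with a square structuring element.
    """
    f = image.astype(np.int32)
    er = f - ndimage.grey_erosion(f, size=(size, size))   # (5.4)
    dr = ndimage.grey_dilation(f, size=(size, size)) - f  # (5.5)
    return er, dr, np.maximum(er, dr)                     # (5.6)
```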

This operator is not intended to be used as a morphological edge detector (it would be as noise-sensitive as the Laplacian). It is used to provide connected response regions on either side of a boundary, by classifying pixels according to the type and magnitude of the response.

Positive or negative half-boundaries are the lines formed by connected pixels with a (positive or negative) residual value that are closest to the edges.

Concretely, the boundary enhancement is performed by first computing the $f_{maxder}$ operator for the whole image (for reasons of computational burden, the structuring element used is a square of side 5 pixels). The pixels are then classified according to the type of residual information they possess:

- Pixels where none of the (dilation or erosion) residues are significant are assigned a background label.
- If one of the residues is significant (i.e. non-zero), the pixel receives a label indicating its localization with respect to edges:
  - If the dilation residue is larger than the erosion one, the point is adjacent to the boundary on the darker side of the intensity edge (cf. figure 5.5). The pixel then receives an `n' label, where n stands for negative.
  - Similarly, `p' labels are assigned to positive pixels whose erosion residue is larger than the dilation one.
  - If both residues are different from zero but equal in amplitude, the pixel is an internal or ramp one, i.e. it is located either on a real edge or on a narrow ramp (cf. figure 5.5(b)). It receives an `r' label.

So as to distinguish between low-amplitude responses due to real edge features and non-zero responses due to noise, a threshold is applied. The so-called candidate boundary points are those points which are characterized by a residual value superior to a threshold $T_1$ and which are connected to a point of opposite type (`n' vs `p', or `r') whose residue is also larger than $T_1$. Internal pixels are a special case of candidate points.
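The labeling rules above can be expressed compactly. A minimal sketch, taking the residues of (5.4)-(5.5) as inputs; the 8-connectivity test against an opposite-type neighbor (which turns labeled points into candidate boundary points) is deliberately omitted here:

```python
import numpy as np

def classify_pixels(er, dr, t1):
    """Label pixels from their erosion/dilation residues.

    Returns an array of labels: 'b' background, 'n' negative side,
    'p' positive side, 'r' internal/ramp point. The connectivity
    test against an opposite-type neighbor is not implemented.
    """
    labels = np.full(er.shape, "b", dtype="<U1")
    significant = (er > 0) | (dr > 0)
    labels[significant & (dr > er)] = "n"   # darker side of the edge
    labels[significant & (er > dr)] = "p"   # brighter side of the edge
    labels[significant & (er == dr)] = "r"  # internal or ramp pixel
    # Only 'r' points whose residue exceeds T1 are kept as candidates.
    labels[(labels == "r") & (np.maximum(er, dr) <= t1)] = "b"
    return labels
```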

One major advantage of Noble's operator, followed by thresholding, is that it does not require any pre-processing. The edges and corners are therefore not displaced (cf. the introduction of Section 5.2), and the detected half-boundaries are correctly located along real edges.



[Figure 5.6: Boundaries enhancement: (a) f_maxder after histogram equalization; (b) pixel classification into background, negative, positive and internal `r' points; (c) candidate boundary points.]

Whereas Noble uses a first-order neighborhood (North, South, East, West) while searching for connected points of opposite types, we here perform the search among 8 neighbors (N, S, E, W plus diagonals). The reason for this choice will be detailed in Section 5.2.1.2. Another change is that Noble seemingly keeps all internal pixels as candidate points. We decided to keep only the `r' points whose residual value is superior to $T_1$.



Figure 5.6 demonstrates all the intermediate results of this first step with respect to the original image 5.2(a). Comparing figure (b) with figure (c), one can see that the selection of candidate points already rejects several points as background ones.

5.2.1.2 Following Half Boundaries

Tracking the half-boundaries is then the next stage of the algorithm. This procedure involves two steps: initialization and half-boundary following.

The initialization stage consists in selecting candidate points that are relevant enough to start a tracking procedure. These points are selected according to the following criteria:

- the residual value of the point is above a threshold $T_2$ (of course $T_2 > T_1$);
- the point is connected to at least one other candidate point of the same type (`p' or `n') which has not yet been traced before and which also has a residue greater than $T_2$.

Whereas Noble uses 4-connection for the second criterion, we still use 8-connection, as will be explained later. Noble also adds a third condition, namely that none of the neighbor points may have already been traced. The reason for this is to avoid beginning a contour in a junction region (where many edges meet). As this condition makes the whole process more complex and does not bring a real improvement, we suppressed it in our implementation.

Noble then proposes to perform an oriented tracking of half-boundaries. The contour following direction is defined so that boundaries are traversed in the forward direction keeping the darker side of intensity to the right (i.e. negative labels are tracked in the forward direction keeping the positive labels on the left). Starting from a valid initial candidate point, contour tracking is first performed forwards and then backwards. Noble uses a set of twelve 3 × 3 templates (figure 5.7) allowing to follow straight lines and 90° right or left angles. In the meantime, ends of contours (gaps) as well as multiple junctions (when the next point has already been traced) are detected.

[Figure 5.7: Templates for the direction of tracking for negative labeled pixels. Crosses indicate locations of opposite label types (`p' or `r'). The central pixel (shaded) is assigned the forward and backward directions as indicated by the arrows.]

Moreover, in order to avoid the problems of wrong association in the case of contours passing through an arrow or `W'-junction, highly curved boundaries or thin bars, Noble proposes to perform the tracking on the boundary shape image (the image of candidate points) expanded by a factor of 2. Figure 5.8 depicts the improvement brought to the tracking thanks to such an expansion, which has been used in our implementation.

However, Noble still uses a neighborhood of size 1 (top, right, bottom, left). This means that every time a half-boundary undergoes a 45° angle, it cannot be tracked anymore. Of course, another half-boundary will most certainly be created for the remainder of the points. While such a splitting of a real half-boundary is not annoying for a contour detector, it is very disturbing for the next step of our algorithm, i.e. the curvature measurement. This step requires contours (half-boundaries) as long as possible so as to assess its own measure. Since tracking is performed on an expanded image, the addition of only eight new templates (figure 5.9) subsequently allows tackling 45° angles. These new templates use a neighborhood of size 2. The purpose of coherence throughout the whole process justifies the use of such a neighborhood in the previous steps of the algorithm. It allows obtaining more complete boundaries, as they are no longer split into several pieces every time a 45° angle arises, as figure 5.10(c) demonstrates with respect to figure 5.10(b).



[Figure 5.8: Improvement of the boundary tracking: (a) wrong association in regular dimension (end of contour due to a thin bar); (b) the problem is overcome by tracking on a version of the boundary shape image expanded by a factor of 2.]



[Figure 5.9: Additional templates for the 45° directions of tracking for negative labeled pixels.]

Figure 5.11 illustrates the result of the improved version of Noble's half-boundaries finder. What is important is their precise location along real image borders, and the fact that, for instance, the whole exterior shape of the square is detected as a unique contour. Of course, additional contours are detected due to the noisy character of the image texture (cf. figure 5.2 (c)).

5.2.2 Corner Extraction

Every edge is then traveled in order to determine the high-curvature points. The procedure proposed by Najman and Vaillant is applied. If the chain of pixels forming a half-boundary is denoted $\{p_i\}$ (with $i = 1, \ldots, n$), then for every pixel $p_i$ and for $k = k_{min}, \ldots, m$, one computes:
\[ \vec{a}_{i,k} = \overrightarrow{p_i\,p_{i+k}}, \qquad \vec{b}_{i,k} = \overrightarrow{p_i\,p_{i-k}}, \qquad \cos_{i,k} = \frac{\vec{a}_{i,k} \cdot \vec{b}_{i,k}}{|\vec{a}_{i,k}|\,|\vec{b}_{i,k}|}. \qquad (5.7) \]

For every $p_i$, the highest length $k = k_{max}$ that matches the following relation is retained:
\[ \cos_{i,m} < \cos_{i,m-1}. \]

[Figure 5.10: Improvement of taking 45° angles into account for half-boundary tracking: (a) original image and zoom; (b) half-boundaries with the initial 12 templates (every color indicates a different half-boundary); (c) with the 8 additional templates.]

A pixel $p_i$ is then retained as a corner if its curvature measure satisfies two conditions: i) it is above a given threshold, and ii) it is greater than the curvature of all pixels of the edge whose distance to $p_i$ is lower than $k_{max}/2$.
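A minimal sketch of the cosine measure of (5.7) for one pixel of a chain; the index is assumed to satisfy k_max <= i <= n-1-k_max so that both vectors exist, and chain pixels are assumed distinct:

```python
import numpy as np

def curvature_cosines(chain, i, k_min, k_max):
    """cos_{i,k} of equation (5.7) for pixel p_i of a half-boundary.

    chain -- (n, 2) array of pixel coordinates along the boundary
    Returns one cosine per tested length k: values near +1 indicate
    a sharp angle at p_i, values near -1 a locally straight edge.
    """
    cosines = []
    for k in range(k_min, k_max + 1):
        a = chain[i + k] - chain[i]          # vector p_i -> p_{i+k}
        b = chain[i - k] - chain[i]          # vector p_i -> p_{i-k}
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cosines.append(float(np.dot(a, b)) / denom)
    return np.asarray(cosines)
```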

Gaps and junctions already detected in the previous step are added to the corner list. For the purpose of mesh design, and in order to better surround objects, "corners" are automatically added every time a (piece of) contour exceeds $l$ (typically, $l = 45$) pixels.



[Figure 5.11: Example of half-boundary detection: (a) positive (white) and negative (black) half-boundaries; (b) contours superimposed on the original.]

Finally, only the pairs of adjacent corners (one on a positive and one on a negative half-boundary) are retained. Corners are indeed detected on both the positive and negative half-boundaries. This double information is used to assess the corner extraction and get rid of the noise influence, since a corner on a positive half-boundary has to be a neighbor of a "negative" corner. It is then up to the application to keep double corners (the purpose of keeping pairs of corners will be presented in the next section, about the motion transcription), to select only the `n' or `p' points, or, alternatively, the points located on a real edge in-between two corners of opposite type.

Figure 5.12 presents the results of the overall procedure, as well as the automatically generated mesh once the Delaunay triangulation has been applied. One may argue that no point is detected on the left side of the square: it is the high irregularity of this side that prevents the algorithm from performing well there. Note however that all the other sides are correctly detected as regular edges and are thereafter correctly surrounded. Some false corners are of course also extracted from the contours, due to texture noise (cf. figure 5.2 (c)).

Figure 5.12: Example of corner detection: (a) Detected corners (b) Delaunay triangulation

From the computational point of view, the morphological operator of Noble performs very fast, and the following contour detection and corner extraction only involve one image scan and two contour trackings.

5.3 Motion Transcription

The aim of the motion transcription is to determine the motion vector to be applied to every vertex of the mesh designed in the previous step. The problem has been stated as being of a double nature: first, the "backward" sense of the BMA estimation should be reversed, and second, the values must be interpolated.

5.3.1 Reversing the Sense of the Motion Information

As already stated in the schedule of conditions, the mesh is designed on I(t-1) and the motion vectors must indicate where every vertex must arrive in I(t). It means that, in case of backward BMA, the sense of the motion information should be reversed.

If one considers the estimation performed by the BMA as a coarse subsampling of a dense motion field, the estimated vector $(\hat u, \hat v)$ (cf. Section 2.5.1) can be assigned to the center of the block, at location $\left(\frac{K-1}{2}, \frac{L-1}{2}\right)$. This means that, in the forward direction, it is equivalent to a forward vector $(-\hat u, -\hat v)$ located at position $\left(\frac{K-1}{2} + \hat u,\ \frac{L-1}{2} + \hat v\right)$.
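A minimal Python sketch of this reversal, assuming the backward field is stored as one vector per K×L block (all names are illustrative):

```python
import numpy as np

def reverse_backward_field(u_hat, v_hat, K=16, L=16):
    """Turn a backward BMA field into forward samples (Section 5.3.1).

    u_hat, v_hat : per-block backward motion components, shape (rows, cols)
    Returns the forward vectors (-u, -v) together with the irregular
    positions ((K-1)/2 + u, (L-1)/2 + v) at which they now apply.
    """
    rows, cols = u_hat.shape
    by, bx = np.mgrid[0:rows, 0:cols]
    cx = bx * L + (L - 1) / 2.0       # block centres in I(t)
    cy = by * K + (K - 1) / 2.0
    pos_x = cx + u_hat                # displaced sample positions in I(t-1)
    pos_y = cy + v_hat
    return -u_hat, -v_hat, pos_x, pos_y
```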

5.3.2 Interpolating the Motion Values

Once motion samples, placed on an irregular grid in case of backward BMA, have been obtained, the goal is next to estimate the values of other samples, placed on an(other) irregular grid.

The simple process one may think of in case of forward BMA estimation is to displace every mesh vertex according to the motion vector of the block it belongs to. In case of backward BMA estimation, it is equivalent to a nearest neighbor estimation, i.e. the vertex is assigned the value of the nearest known vector. Figures 5.13 and 5.14 depict the result, respectively for forward and backward estimation.

Although the result is of course more pleasant in case of forward estimation (there are no more discontinuities in the compensated image), it is not entirely satisfactory. The solution is not very smooth and, even if the object contours are already better taken into account, sudden irregularities arise around the



Figure 5.13: Nearest neighbor interpolation after forward estimation: (a) 8×8 BMA compensation (b) DFD resulting from (a) (c) 8×8 BMA, mesh compensation (d) DFD resulting from (c) (e) 16×16 BMA, mesh compensation (f) DFD resulting from (e)
Figure 5.13: Nearest neighbor interpolation after <strong>for</strong>ward <strong>estimation</strong>



Figure 5.14: Nearest neighbor interpolation after backward estimation: (a) 8×8 BMA compensation (b) DFD resulting from (a) (c) 8×8 BMA, mesh compensation (d) DFD resulting from (c) (e) 16×16 BMA, mesh compensation (f) DFD resulting from (e)


vertices (see the right side of the square on figure 5.14(c) and (e)). Moreover, if one of the blocks used has been badly estimated, the compensated image is drastically degraded. It is for instance the case for the top left corner of the square on figure 5.13(e), because the displacement of the corresponding 16×16 block has been wrongly estimated by the BMA.
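A minimal sketch of this nearest neighbor transcription (hypothetical array layout):

```python
import numpy as np

def nearest_neighbour_vertex_motion(vertices, sample_pos, sample_vec):
    """Assign to every mesh vertex the vector of the closest known sample.

    vertices   : (n, 2) mesh vertex coordinates
    sample_pos : (m, 2) positions of the known (reversed) BMA vectors
    sample_vec : (m, 2) corresponding forward motion vectors
    """
    diff = vertices[:, None, :] - sample_pos[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)   # squared distances, shape (n, m)
    nearest = dist2.argmin(axis=1)    # index of the closest sample
    return sample_vec[nearest]
```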

A more global interpolation technique, which would take the whole available information into account and not only a few arbitrary vectors, should be considered. It would also have to achieve some smoothing so as to avoid local traps. Figure 5.15 (see footnote 4) presents some very well-known interpolation techniques. The original data set consists in assigning each pixel the sum of the squares of its x and y coordinates (represented in the color spectrum: low value is blue, high is red). Nearest neighbor interpolation has already been commented on. Linear interpolation is a technique that works best for datasets with a small percentage of unknown values and is well-suited for separable problems; interpolation of motion vector fields is not separable. Kernel smoothing is a technique which, as its name suggests, smoothes the data so as to provide a data set free of irregularities. Although one here wants to lower the impact of wrong vectors, it is also important to keep changes of motion along object borders. Moreover, kernel smoothing does not preserve the original data values. Both these drawbacks are solved by weighted interpolation, which takes the various known data points into account according to a distribution function that relies on a physical distance between these points and the point to be estimated.
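As an illustration of such a weighted interpolation, the sketch below uses an inverse-distance law as the distribution function; this particular law is only one common choice, not necessarily the one retained in the text.

```python
import numpy as np

def weighted_interpolation(targets, sample_pos, sample_vec, power=2.0):
    """Weighted mean of all known vectors, with weights decreasing with the
    physical distance between the known points and the point to estimate."""
    diff = targets[:, None, :] - sample_pos[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2)) + 1e-9   # avoid division by zero
    w = 1.0 / dist ** power
    w /= w.sum(axis=1, keepdims=True)                # normalise the weights
    return w @ sample_vec                            # interpolated vectors
```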

Another theory of 2-D interpolation has been established by the South African mining engineer Krige [58]. In the field of geostatistics and hydrosciences, this technique is referred to as kriging. The theory has been further developed by Matheron [78]. Basically, kriging is a weighted interpolation technique which looks at statistical distances rather than physical ones. Kriging is a very powerful method for taking into account individual elements, like the "clustering effect" of the red point of figure 5.16.

However, kriging requires a model of statistical properties (a variogram in the kriging terminology) so as to define the a priori "shape" of the data covariance. Such a variogram would be very difficult to establish for motion vectors since one simultaneously wishes to obtain a smooth

4. Images come from the following Web site: http://www.fortner.com.



Figure 5.15: Various interpolation techniques: sample data from the original, nearest neighbor, linear interpolation, kernel smoothing, weighted interpolation

Figure 5.16: Comparison between kriging and weighted interpolation: sample data, kriging, weighted interpolation

vector field within objects while allowing motion changes from one object to another.

Inspired by kriging, Decenciere, de Fouquet and Meyer have defined a technique of inverse kriging [22] that has already been successfully



applied to motion vector interpolation problems [21] in the framework of the MORPHECO [84] project.

Intuitively, inverse kriging solves the problem in a roundabout way: the criterion used to determine the unknown values of the searched samples is that, if those were in turn interpolated, they would have to reproduce the known values at the known points.

For this purpose, it uses a kriging operator that maps the unknown vector field onto the known one. In the present situation, this operator is directly offered by the affine transform implicitly implemented by the mesh structure. If one considers a known vector located at a position $X$ of figure 2.19, whose value is $\vec{V}_{exp}(X)$ (with $\vec{V}_{exp}(X) = X' - X$), a direct relation with the unknown motion field $\vec{W}$ of the vertices of the surrounding triangle $ABC$ can be found thanks to equation (2.35):

$$\vec{W}(A) + p\,(\vec{W}(B) - \vec{W}(A)) + q\,(\vec{W}(C) - \vec{W}(A)) = \vec{V}_{exp}(X) \qquad (5.9)$$
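For illustration, the barycentric weights p and q of equation (5.9) follow from a 2×2 linear system; a minimal sketch with hypothetical names:

```python
import numpy as np

def barycentric_pq(A, B, C, X):
    """Solve X = A + p (B - A) + q (C - A) for (p, q), i.e. the weights
    that one inverse kriging constraint (5.9) attaches to the triangle."""
    A, B, C, X = (np.asarray(v, dtype=float) for v in (A, B, C, X))
    M = np.column_stack((B - A, C - A))   # 2x2 matrix [B-A | C-A]
    p, q = np.linalg.solve(M, X - A)      # fails only for degenerate triangles
    return p, q
```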

As every vector of the (reversed) BMA motion field belongs to one triangle of the mesh, such an equation can be iterated. According to the relative numbers of BMA vectors and mesh vertices, and also to the pertinence of all inverse kriging equations, the system may be under-, over- or simply determined. In the two first cases, no exact solution exists, and one faces a linear least-squares problem that may be solved using Singular Value Decomposition (SVD [107]). Nevertheless, one has to be aware that the resolution of a least-squares problem through SVD will be much more coherent and stable if the number of constraints sufficiently exceeds the number of unknowns.

Although the number of equations may be quite large (99 when the BMA is performed on every 16×16 block of a QCIF image), it has to be noted that only three parameters are nonzero in every equation. The massive presence of null parameters in the system matrix speeds up the so-called QR decomposition with column pivoting involved by the SVD algorithm.
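The assembly and resolution of this system can be sketched as follows (hypothetical data layout; numpy's lstsq relies internally on an SVD-based LAPACK routine, in line with the SVD resolution mentioned above):

```python
import numpy as np

def solve_inverse_kriging(n_vertices, constraints):
    """Least-squares solution of the stacked equations (5.9).

    constraints : list of ((ia, ib, ic), (p, q), (vx, vy)) tuples, one per
        reversed BMA vector: the vertex indices of the enclosing triangle,
        its barycentric coordinates, and the expected vector V_exp(X).
    Every equation contributes one sparse row whose only nonzero
    coefficients are (1 - p - q), p and q.
    """
    m = len(constraints)
    A = np.zeros((m, n_vertices))
    bx, by = np.zeros(m), np.zeros(m)
    for row, ((ia, ib, ic), (p, q), (vx, vy)) in enumerate(constraints):
        A[row, ia] = 1.0 - p - q
        A[row, ib] = p
        A[row, ic] = q
        bx[row], by[row] = vx, vy
    wx = np.linalg.lstsq(A, bx, rcond=None)[0]   # SVD-based least squares
    wy = np.linalg.lstsq(A, by, rcond=None)[0]
    return np.column_stack((wx, wy))             # one vector per mesh vertex
```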

5.3.3 Mesh Connectivity

Finally, it must be checked that the motion vectors do not force the mesh to violate its connectivity constraint (cf. figure 2.21). The offending motion vectors are rectified by interpolating the motion vectors of the neighboring nodes (equations (47) and (48) of [3]).
vectors of the neighboring nodes (equations (47) <strong>and</strong> (48) of [3]).



5.3.4 First Results

Figure 5.17 presents the compensated image and the DFD resulting from the inverse kriging vectors determined for the mesh of figure 5.12 (b), in comparison with the result of a classical BMA compensation.

Figure 5.17: Result of inverse kriging for the example images (estimation with 16×16 blocks): (top) BMA compensation (bottom) Asymmetric compensation

One can directly notice that the asymmetric compensation provides an image which is visually more pleasant. However, if one has a close look at the DFD of the asymmetric scheme, one can see that the square rotation is only partially compensated. This apparently surprising result was expected: the global interpolation provided by inverse kriging has smoothed the vector field so as to avoid the local instabilities of figures 5.13 and 5.14. Yet, the



major point is that all blocking artifacts have been suppressed.

So as to restrict this smoothing effect, a solution could be to use pairs of corners, as provided by the half-boundaries. The Delaunay triangulation will then be modified and result in the generation of small triangles along all object borders, as depicted on figure 5.18(a). As it is not very likely that a BMA motion vector is located inside such a small triangle, one may hope to obtain more or less two distinct systems (here one for the square and one for the background) that will allow every part to follow its own motion. In fact, even if the compensated image of figure 5.18(b) is already somehow improved with respect to the square rotation, the system is still not able to provide a perfect solution because the SVD algorithm is applied at once to the whole system. One should consider supervising the asymmetric compensation with a Markov process. This suggestion is detailed at the end of our conclusion (Section 5.5).

5.4 Results

In order to demonstrate the validity of the proposed scheme (see footnote 5), its results will be compared to those of a classical BMA compensation. Figure 5.19 summarizes the various steps of the process with the Akiyo images: the original images 27 (a) and 30 (b) are motion estimated via a classical "backward" BMA. Instead of merely applying the resulting motion field (c), which would result in image (d), an adaptive mesh is superimposed (e) on the original image (a). Thanks to the inverse kriging equations, this mesh is warped in order to obtain the final compensation (f). Images (d) and (f) result from the compensation stage. No residues have been added to them.

In terms of Peak Signal-to-Noise Ratio (PSNR), the proposed reconstruction is 1 dB inferior to the BMA one. The increase of complexity introduced by the asymmetric reconstruction is therefore mainly justified by its ability to produce a more pleasant reconstruction (without blocking discontinuities) and by the possibility to directly interact with the content of the picture. The latter point is of major importance for the new framework offered by the MPEG-4 standard. In this framework, the mesh-based compensation could benefit from the decomposition of the scenes into several AVOs (cf. Section 1.4.3.2) and the coding of residuals with matching pursuits [90, 42]. One separate mesh could be adapted

5. Details of the implementation of the scheme are to be found in Appendix D.



Figure 5.18: Asymmetric compensation with double corners: (a) Delaunay triangulation and zoom (b) Resulting compensation



Figure 5.19: Some steps of the asymmetric process on the Akiyo sequence: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



to every VOC in order to perform the compensation, while matching pursuits would be better suited to encode the residuals (which are no longer located according to a priori blocks).

Figure 5.20 zooms on the results of figure 5.19 in order to highlight some improvements brought by the wireframe reconstruction: the BMA block on the right part of the chin is suppressed and the upper lip of the mouth is no longer doubled. The first correction reflects the significance of taking into account some a priori spatial information, while the second results from the mesh connectivity that prevents duplicating information.

Figure 5.20: Comparative zoom between BMA and the proposed scheme: (a) Zoom on the BMA compensation (b) Zoom on the asymmetric reconstruction

Figure 5.21 presents two other Akiyo images (10 and 30). As more time separates the images, the motion field that links them is sparser and the displacements are larger. Consequently, the quality of the compensation is directly altered. This experiment is intended to demonstrate once again the advantage of using an a priori mesh: its connectivity prevents the image content from being degraded too much (left eye and mouth). The asymmetric reconstruction (figure 5.21 (f)) is of course not able to interpolate all the lost information. In the worst cases, annoying blocking artifacts are replaced by "funny" deformations. In this case, one may wonder about the pertinence of the connectivity constraint imposed by the mesh: the motion vectors to be applied to both lips are so contradictory that it would probably be better to break this connectivity (provided that an edge of the mesh passes through the lips). A hole would then appear in the mouth and would have to be filled with some residual information. A way of managing such a split of the mesh structure is proposed at the very end of this section.
section.<br />

Since the aim of the asymmetric reconstruction is to al<strong>low</strong> further manipulation<br />

of the images while al<strong>low</strong>ing a better subjective reconstruction, it<br />

is non-sense to propose any table of objective numbers, like the PSNR.<br />

However, in order to demonstrate the capabilities of the scheme with various types of content, some other sequences are demonstrated: figure 5.22 presents images 50 and 53 of Hall Monitor, figure 5.23 presents images 40 and 42 of Silent and figure 5.24 presents images 1 and 3 of Table Tennis. One can once more see the cancelling of the blocking artifacts. Of course, in the Hall Monitor case, one can notice that some moving objects (the leg of the man) have been immobilized. This is the drawback of suppressing all blocking artifacts, and it is not always acceptable. Moreover, jerky effects would arise if the scheme were inserted in a decoding loop.

The result of inserting the scheme in a (de)coding loop is presented on figure 5.25. Of course, a sequence with low motion activity (Akiyo) has been chosen in accordance with the previous comments. The lowest possible bitrate is used, i.e. only the motion information is exploited and no residues are transmitted. One can directly notice an absence of blocking artifacts on the mesh-compensated image. The drawback is that successive warping operations strongly interpolate the image and result, after a certain time, in a relatively blurred image. Some residual information should easily correct this effect but, once again, an adaptive technique such as matching pursuits should be chosen.

One last, but important, comment is that all images of the present chapter have been generated with the same set of parameters: no particular



Figure 5.21: Results with a sparse motion field due to more unpredictable movement: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.22: Asymmetric process on Hall Monitor: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.23: Asymmetric process on Silent: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.24: Asymmetric process on Table Tennis: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.25: Comparison of results within a decoding loop: originals #31 and #61, BMA in the loop #31 and #61, mesh in the loop #31 and #61



tuning has been achieved on the algorithms according to the type of input images. These parameters are provided in Appendix D.

5.5 Conclusion

The aim of this chapter was to combine the advantages of two separate techniques used for motion estimation & compensation. On the one hand, the Block Matching Algorithm has been selected for its efficient estimation as well as for its compact representation of the motion field. On the other hand, affine models offer a more sophisticated representation of the motion that avoids blocking artifacts on the object contours. When affine models are implemented via an active mesh that establishes an a priori link with spatial information, they provide the user with new options: 3-D modeling, video editing, transfiguration, augmented reality, etc.

An asymmetric scheme has thus been proposed: after a classical BMA estimation of the motion field, the compensation stage is implemented via mesh warping. In so doing, solutions to several problems have been offered: a fast and accurate corner finder has been set up, the motion information has been reversed, and it has been interpolated thanks to the inverse kriging technique. All these innovations allow the representation of the motion information to be changed automatically, from block-based rigid motion fields to active ones designed on an adaptive mesh.

The proposed scheme has been claimed to be compatible with the bitstream of existing compression standards: the bitstream structure does not have to be modified because a BMA estimation is still used. Of course, if one wants to insert such a scheme in the decoding loop, the asymmetric process should also be included in the coder, which would in turn modify the values of the residual coding. Nevertheless, the scheme could for instance be used to edit and manipulate MPEG video sequences stored in a database without having to recalculate all the motion information.

In addition to the new options the scheme offers, it has been shown to somehow improve the subjective quality of the compensated images. As a summary, table 5.1 compares the performances of the proposed scheme with those of Block Matching (cf. Section 2.5.1), Hexagonal Matching (cf. Section 2.5.1) and Adaptive Hexagonal Matching (cf. Section 2.5.1). What directly emerges from this table is the increase of computational burden at the decoder side. This goes against the



                          BMA       HMA       AHMA           Asymmetric
PSNR                      +         ++        +++            +
Interactivity             -         +         ++             ++
Compliance with
standard bitstream        Yes       Possible  No             Yes
Coder complexity          low       high      high           low
                          (fixed)   (fixed)   (∝ contents)   (fixed)
Decoder complexity        low       medium    medium         high
                          (fixed)   (fixed)   (∝ contents)   (∝ contents)

Table 5.1: Comparison between the different schemes

philosophy of many video algorithms and standards but is necessary to offer more functionalities (in terms of manipulation) to the user.

Further improvements

In addition to the prospects already mentioned with the results, an improvement one may directly think of is to refine the corner detection by taking into account the information of the previous frames of the sequence: previous vertices location, spatio-temporal activity, ...

A second improvement could be to use a constrained triangulation instead of a Delaunay one. Since object edges (half-boundaries) have been detected, prior to corners, they could be exploited by forcing the triangles to coincide with them. This would ensure a better link with the spatial contents of the image. A practical solution is to consider a graph made out of segments corresponding to strong edges, which can then be extended to a triangulation thanks to the geometrical Delaunay criterion (cf. Appendix D).

Another improvement is related to figure 5.18(a), which presents a wireframe generated with double corners along object boundaries. The result of applying the asymmetric compensation to this wireframe is not as good as expected. We think it could be possible to profit more from this specific wireframe structure if one allows different parts of the wireframe to undergo totally different movements. This could be introduced thanks to a Markov model (cf. appendix C): a Markovian wireframe would be authorized to "split" itself into several sub-meshes if some contradictory motion arises along an edge between two vertices. This is

intended to improve the quality, segment the motion field and automatically detect the areas with discovered background. The Markov model could be the following (a minimal sketch of the regularization potential is given after the list):

- the sites are the wireframe vertices;
- the dual lattice is the triangulation (set of wireframe edges) that links the sites;
- the neighborhood system is defined around every vertex: two edges that are related to a same vertex are neighbors;
- a clique is made of any possible combination of such neighbor edges around a vertex;
- the aim of the Markovian process is to determine a line process, i.e. to know which edges must be "broken";
- two potential functions ensure the link with the data:
  - the change of the image gradient intensity along the edge, because a discontinuity is more probable along object edges, and
  - the divergence of the motion constraint at the two extreme vertices of the edge;
- one potential function introduces a regularization term. The two most probable configurations are: no discontinuity at all around the vertex, or two (surrounding of an object corner). Only one discontinuity is highly unlikely. Three, four, five, ... discontinuities (corresponding to a corner common to three, four, five, ... objects) are also less probable.
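A minimal sketch of such a regularization potential; the numeric penalties are purely illustrative and do not come from the text:

```python
def line_process_potential(n_broken):
    """Cost of a configuration with n_broken discontinuities around a vertex:
    0 or 2 broken edges are the two most probable configurations, a single
    discontinuity is highly unlikely, and the cost grows again for 3, 4,
    5, ... discontinuities (corners common to several objects)."""
    if n_broken in (0, 2):
        return 0.0
    if n_broken == 1:
        return 10.0
    return 2.0 * n_broken
```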

Inside the different areas determined by the Markov model, the motion values could be independently computed by inverse kriging. However, one major problem related to mesh connectivity arises: as the mesh is not unique anymore, its global connectivity cannot be ensured, but only the connectivity of every individual sub-mesh. Thereafter, "holes" as well as multiple predictions will appear in the compensated image. A way of dealing with these artifacts should be designed.

Yet another improvement, in terms of complexity, would be to restrict the use of the asymmetric reconstruction to areas where complex movement arises: the effect of the wireframe is effectively useless (except for further manipulation of the images) in flat areas that obey the translational model of the BMA.


Conclusion

With the emergence on the market of low-cost cards allowing customers to decode the ISO MPEG-1 and MPEG-2 standards, digital video coding is becoming a very common functionality of multimedia personal computers. For several years, video coding has moved towards very low bitrates (under 64 kbit/s). A major result of this trend is the ITU H.263 standard, which also has the advantage that the software version of its decoder performs in real-time on a PC. The future ISO MPEG-4 standard goes one step further as it offers the user the possibility to somehow interact with the contents of the pictured scene. Because of the convergence of the target bitrates with the ever-increasing capabilities of existing networks, some people already claim that these two standards are putting an end to research in "pure" video coding.

Among the tools used to achieve compression of the video signal, motion estimation is probably the one offering the greatest compression ratio. A lot of research has been devoted to it during the last two decades. It not only resulted in great outcomes in video coding but also in other fields where motion analysis plays a key role, like semantic interpretation or target tracking. The exploitation of motion in a video coding scheme typically involves three steps: estimation, transmission and compensation. The video coding context and the state of the art in related motion estimation techniques have respectively been presented in Chapter One and Chapter Two. Some emphasis was put on the very low bitrate environment and on the most used motion estimation techniques (BMA and warping techniques). In this sense, the present thesis contributes to the investigation of the possibility to enhance the various steps of the motion chain by somehow taking into account the spatial contents of the image(s).

Chapter Three modifies the estimation performed by a classical BMA so as to adapt the size of the blocks to the object boundaries. It results in a split-and-merge procedure that manages a multiscale algorithm. The so-called Adaptive BMA outputs a quad-tree structure which enables condensing the motion information in spatial areas where complex movements arise. The adaptation to the contents turns out to be very beneficial for the motion estimation as it allows obtaining a much more reliable motion field. Unfortunately, such an adaptation increases the computational burden. Chapter Three therefore proposes a model to distribute the load among several processors. Preliminary results demonstrate a linear speed-up thanks to the distribution using a "master-slave" structure. However, it would be very interesting to pursue the effort and implement the parallel software on an architecture dedicated to Digital Signal Processing (DSP), like the Texas Instruments TMS C80. One could then see whether real-time software performance, as it was more or less the case for the BMA, is possible or not.

Chapter Four evaluates the impact of image pre-treatment on the coding process. Although the theoretical model developed in the Rate-Distortion framework announces a possible reduction of the residual error and hence of the bitrate, experimentation proves that it directly results in a drastic loss of quality. Pre-treatment thus seems to be at a deadlock. However, prospective experiments demonstrate that some gain can be obtained for specific parts of the video signal. Those parts generally belong to moving backgrounds. They are made out of textures that are irrelevant to the human eye, but relevant enough to be considered by the quantizer of the coder. If one wants to further investigate the impact of pre-treatment on video coding, one should probably first tackle the problem of automatic segmentation and tracking of objects (and especially background) in a video sequence. Then, a psycho-visual analysis of the texture could be used to determine which parts of the signal can be pre-treated without altering the overall subjective quality. In relation with Chapter Four, one could also consider developing the Rate-Distortion framework while introducing a more explicit model as far as motion estimation is concerned. The model could then be reversed so as to theoretically determine which pre-treatment would achieve the best results.

Finally, Chapter Five focuses on compensation: it presents an asymmetric scheme whose aim is to achieve the compensation of a BMA motion field thanks to a contents-adapted mesh. The solution includes a novel corner detector for the mesh design and the use of inverse kriging for



interpolation purposes. The scheme provides a first demonstration of the subjective improvement offered by such an asymmetric scheme. However, the scheme would be worth a lot of additional research. Of course, it would be interesting to direct the scheme with a Markov process as suggested in the conclusion of the chapter. Though the initial aim is to improve the subjective quality of the reconstructed pictures, the insertion of the scheme in a complete codec would enable objectively quantifying its performance. In this case, we would suggest using Matching Pursuits as the residual coding method because of their localized (and subjective) nature, which is in concordance with our scheme.

The signal-adapted compensation offered by the mesh warping could also be used to reconstruct motion fields estimated by methods other than the BMA, e.g. parametric models. In this case, one has to define the kriging operator that maps the information from one structure to another.

The present thesis has thus analyzed three possibilities of improving the quality of existing video schemes. The chosen approach was to inject as much signal-adapted information as possible inside existing algorithms, like the BMA. If the experiments indicate that this idea is not always applicable as such, they also highlight the possible gain in specific situations. Every user might expect future systems to be as interactive as possible. Since one usually wants to interact with depicted objects that have a semantic meaning to him/her, it appears important that existing systems should take into account the contents of the data they are manipulating. Bits and bytes are indifferent to the user.

By way of final conclusion, an attempt is made to answer the question raised in the introduction, namely: for which part(s) of the motion exploitation chain (estimation - transmission - compensation) is it useful to take the spatial contents of the images into account? It is obviously very useful in the estimation phase, and it is probably also the place where it is the easiest to achieve. New algorithms in motion estimation, segmentation and tracking should allow coders to selectively transmit only the "interesting" parts of the signal. In this respect, a lot of research should be carried out in order to stabilize the existing techniques and to further investigate the human perception of images. One could then conceive algorithms that are able to automatically characterize the content of images. The challenge of the new ISO MPEG-7 work item is here of great interest. As far as compensation is concerned, quality



gain is not as obvious as the added value in terms of editing and interaction. Nevertheless, emphasis was put on the possibility for the decoder to exploit the information in a way that is different from the one usually applied. This can be of great relevance in a context of indexing of large visual databases: for instance to perform a global analysis of coded material without having to strictly decode it.

My very last word is that I share the general feeling that research in "pure" compression is not a very "hot" research topic anymore. I do believe that research is deeply necessary in signal analysis in order to offer cleverer services to the user. One may even hope that one day terms like "multimedia" or "interactivity" will not be hackneyed anymore but will really meet the user's expectations.


Appendices


Appendix A

VLBR-like video sequences

This appendix briefly presents the video sequences that are used to demonstrate the designed algorithms and illustrate the text. These sequences have been distributed in the framework of MPEG-4, for the tests held in Dallas, in November 1995 [97].

Eleven different sequences are used, coming out of three classes of sequences. Class A (figure A.1) addresses sequences with low motion and low spatial texture: "Akiyo" is a speaker presenting the news; "Container" is a short film showing a container boat sailing slowly through the screen; "Hall Monitor" is a security camera controlling what happens in a corridor and "Mother & Daughter" shows two people having a video phone conversation.

Class B (figure A.2) is made of sequences including either a larger amount of movement or more textures. The "Coastguard" watches the activity on a river; the "Foreman" speaks in front of a building with a very unstable mobile camera; two speakers present the "News" with a movie in the background and "Silent Voice" shows a disabled woman telling her friends in sign language that she is going to Paris.

Class C (figure A.3) is normally intended to be coded at higher bitrates but is used here because of its pertinence for testing motion algorithms. All the sequences contain intensive motion and high spatial activity. "Mobile & Calendar" is a child's room scene, while "Stefan" and "Table Tennis" are two sport movies.

All these images are presented and used (sometimes not, but this is then mentioned within the text) in QCIF format (176×144 pels).



This quarter-CIF format has been specifically designed for very low bitrate communication (cf. table 1.1).

Figure A.1: The MPEG-4 "Class A" test sequences: Akiyo, Container, Hall Monitor, Mother & Daughter



Figure A.2: The MPEG-4 "Class B" test sequences: Coastguard, Foreman, News, Silent Voice



Figure A.3: Some MPEG-4 "Class C" test sequences: Mobile & Calendar, Stefan, Table Tennis


Appendix B

Rate Distortion theory

As the title of Berger's book announces, Rate Distortion theory offers a mathematical basis for data compression [6]. It is indeed defined as:

Rate Distortion Theory: The theoretical discipline that treats data compression from the point of view of information theory.

Information Theory: Mathematical theory dealing with the more fundamental aspects of communication systems.

In practice, Rate Distortion theory aims at addressing two problems: what information should be transmitted, and how should it be transmitted?

From the scheme of figure B.1, it is clear that Rate Distortion theory is concerned with the relation between the channel capacity C and the distortion of Y with regards to X. The problem may then be formulated as follows: "given the source, the user and the channel, under what conditions is it possible to design a system that reproduces the source output for the user with an average distortion that does not exceed some specified upper limit D?" The answer is a curve like the one of figure B.2, which guarantees a quality superior or equal to D if and only if C > R(D).

In order to account for the appearance and the properties of the R(D) curve, appendix B will first revise some basic notions of Information Theory [38]. Then, the Rate Distortion function will be introduced in the context of discrete memoryless sources and single-letter distortion, according to Berger's book [6]. Some properties of the curve will also be discussed.



Figure B.1: Typical transmission chain: the source output X passes through a source encoder and a channel encoder, crosses a channel of capacity C, and is reconstructed as Y by the channel decoder and the source decoder before reaching the user.

Figure B.2: Typical shape of the rate distortion function R(D), decreasing from R(0) at D = 0.



B.1 Information Theory

Let us consider an alphabet $A_M$ of size M, with letters $a_j$:

$$A_M = \{a_0, a_1, \ldots, a_{M-1}\} = \{0, 1, \ldots, M-1\} \qquad (B.1)$$

The probability distribution of this alphabet is a function $p(\cdot)$:

$$p(\cdot) : A_M \rightarrow [0, 1], \qquad \sum_{j=0}^{M-1} p(j) = 1 \qquad (B.2)$$

Based on the finite ensemble $(A_M, p)$, a random variable (r.v.) may be defined: the identity r.v. $X(\cdot)$ ($X(j) = j$ assumes the letter j with a probability p(j)). A r.v. is a real r.v. if it ranges in (a subset of) the real line. The expected value of a real r.v. $f(\cdot)$ is defined as:

$$E[f] = \sum_{j=0}^{M-1} p(j)\, f(j) \qquad (B.3)$$

The information contained in a message is related to the probability of occurrence of this particular message. More precisely, the information is proportional to the inverse of the message probability. For instance, to hear a neighbor telling you that the moon has fallen last night brings much more information than to hear him/her saying "Hello". The probability of the first event, namely the falling of the moon, is indeed very low, and it offers a lot of (surprising because improbable) information in comparison to the other one. The self-information I of an event j is thus defined as:

$$I(j) = \log \frac{1}{p(j)} = -\log p(j), \qquad (B.4)$$

and is expressed in nat if the natural logarithm is used, or in Shannon if $\log_2$ is used. In the latter case, the bit is also commonly used. The logarithm is used to make the information provided by two events inversely proportional to the product of their probabilities.

The expected value of the information is defined as the entropy of the source:

$$H(X) = -\sum_{j=0}^{M-1} p(j) \log p(j) \qquad (B.5)$$



measures the average a priori uncertainty on X.

Considering two alphabets $A_M$ and $A_N$, one can define a product space $A_{MN}$:

$$A_{MN} = \{(j,k) \mid j \in A_M,\ k \in A_N\} \qquad (B.6)$$

The joint distribution p(j,k) and the joint ensemble $(A_{MN}, p(j,k))$ help defining the marginal distributions:

$$p(j) = \sum_{k=0}^{N-1} p(j,k), \qquad q(k) = \sum_{j=0}^{M-1} p(j,k), \qquad (B.7)$$

and the conditional distributions:

$$p(j|k) = \frac{p(j,k)}{q(k)}, \qquad q(k|j) = \frac{p(j,k)}{p(j)} \qquad (B.8)$$

A r.v. Z that assumes the event (j,k) with a probability p(j,k) may be defined as $Z = (X, Y)$ with $X : p(j)$ and $Y : q(k)$. The conditional self-information is the information one receives when told that the event X = j has occurred if one already knows the occurrence of the event Y = k:

$$I(j|k) = -\log p(j|k), \qquad (B.9)$$

and the conditional entropy is:

$$H(X|Y) = -\sum_{j,k} p(j,k) \log p(j|k) \qquad (B.10)$$

The mutual information of both r.v.'s X and Y is thus:

$$I(j;k) = I(j) - I(j|k) = I(k;j), \qquad (B.11)$$

and the average mutual information or average amount of transmitted information:

$$H(X;Y) = H(X) - H(X|Y) = \sum_{j,k} p(j,k) \log \frac{p(j,k)}{p(j)\,q(k)} \qquad (B.12)$$
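For illustration, the entropy of equation (B.5) and the average mutual information of equation (B.12) can be computed numerically as follows (minimal sketch):

```python
import numpy as np

def entropy(p):
    """Entropy of a distribution p(j), equation (B.5), in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                             # convention: 0 log 0 = 0
    return -(p[nz] * np.log2(p[nz])).sum()

def average_mutual_information(p_jk):
    """Average transmitted information of equation (B.12), in bits."""
    p_jk = np.asarray(p_jk, dtype=float)
    p_j = p_jk.sum(axis=1, keepdims=True)  # marginal p(j) of equation (B.7)
    q_k = p_jk.sum(axis=0, keepdims=True)  # marginal q(k) of equation (B.7)
    nz = p_jk > 0
    return (p_jk[nz] * np.log2(p_jk[nz] / (p_j * q_k)[nz])).sum()

# A noiseless binary channel transmits the full bit of source entropy:
p_jk = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
assert abs(average_mutual_information(p_jk) - 1.0) < 1e-12
```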



B.2 Discrete Memoryless Sources and Single-Letter Distortion

Let us consider a discrete parameter family of r.v.'s $\{X_t;\ t = 0, 1, 2, \ldots\}$ and its probability distribution $p_t(\cdot)$. Its entropy is

$$H(X_t) = -\sum_j p_t(j) \log p_t(j) \qquad (B.13)$$

If t becomes a time index $t = (t_1, t_2, t_3, \ldots, t_n)$, then $p_t(x)$ denotes the probability of the vector r.v. $X_t = (X_{t_1}, X_{t_2}, \ldots)$ to equal the vector $x = (x_1, x_2, \ldots)$. The corresponding entropy is:

$$H(X_t) = -\sum_{\text{all } x} p_t(x) \log p_t(x) \qquad (B.14)$$

The random sequence $\{X_t\}$ is called a discrete source and $A_M$ is called the source alphabet. $x$ ($x \in A_M^n$) is a source word of length n. Such a source is known to be stationary if $\forall k, t, x, n : p_{t+k}(x) = p_t(x)$. The probability may then be simply written p(x) and the entropy H(X). The entropy rate of a stationary source is defined as:

$$H = \lim_{n \to \infty} n^{-1} H(X) = \lim_{n \to \infty} n^{-1} H(X_1, \ldots, X_n) \qquad (B.15)$$

The sequence of r.v.'s $\{Z_t\} = \{X_t, Y_t\}$ also constitutes a discrete source. If such a joint source is assumed stationary, then its entropy rate is:

$$H' = \lim_{n \to \infty} n^{-1} H(Y) + \lim_{n \to \infty} n^{-1} H(X|Y), \qquad (B.16)$$

where H(X|Y) is the average uncertainty, and $\lim_{n \to \infty} n^{-1} H(X|Y)$ the equivocation or average rate at which the information is lost while going through the system.
going through the system.<br />

With regards to figure B.1, a discrete memoryless channel (d.m.c.) with input alphabet $A_{\tilde M}$ and output alphabet $A_{\tilde N}$ is described completely by specifying for every ordered pair (j,k) the conditional or transition probability $\tilde q(k|j)$ that the letter k appears at the channel output when the input letter is j. A channel is said to be memoryless if it processes the successive letters of an input word $\tilde x = (\tilde x_1, \ldots, \tilde x_n)$ independently from one another:

$$\tilde q(\tilde y|\tilde x) = \prod_{t=1}^{n} \tilde q(\tilde y_t|\tilde x_t) \qquad (B.17)$$

Every probability distribution $\tilde p(\cdot)$ of the input alphabet defines a joint input-output distribution $\tilde p(j,k) = \tilde p(j)\,\tilde q(k|j)$ and also an output probability distribution $\tilde q(k) = \sum_{j=0}^{\tilde M - 1} \tilde p(j,k)$. The capacity of the channel is defined by the relation

$$C = \max H(\tilde X; \tilde Y) = \max \sum_{j,k} \tilde p(j,k) \log \frac{\tilde p(j,k)}{\tilde p(j)\,\tilde q(k)}, \qquad (B.18)$$

where the maximum is taken with respect to all possible choices of the input distribution $\tilde p(\cdot)$. Shannon's "noisy coding theorem" states that "for a discrete memoryless channel of capacity C and a discrete stationary source with entropy H, the source may be encoded over the channel with an arbitrarily small number of errors if $H \le C$; while if $H > C$, it is impossible to encode it with an equivocation less than $H - C$".

A discrete memoryless source (d.m.s.) is a stationary source that satisfies one additional requirement:

$$p(x) = \prod_{t=1}^{n} p(x_t) \qquad (B.19)$$

It means that the successive letters generated by a d.m.s. are independent and identically distributed (i.i.d.) r.v.'s. To evaluate the quality of the reconstruction of the r.v. $\{X_t; p\}$, a word distortion measure $\rho_n(x,y)$ that specifies the penalty charged for reproducing the source word x by the output vector y must be designed. $\rho_n(x,y)$ is a non-negative cost function. A fidelity criterion is a sequence of word distortion measures:

$$F = \{\rho_n(x,y);\ 1 \le n \le \infty\} \qquad (B.20)$$

A fidelity criterion is called a single-letter fidelity criterion if

$$\rho_n(x,y) = \frac{1}{n} \sum_{t=1}^{n} \rho(x_t, y_t) \qquad (B.21)$$



B.3 Rate Distortion Function

The Rate Distortion function R(D) (cf. figure B.2) specifies the minimum rate that enables one to reproduce the source with an average distortion that does not exceed D. To design the R(D) of a d.m.s. with respect to a single-letter fidelity criterion F, one needs to know the average distortion d(q) associated with the transition probabilities q(k|j) of the channel:

$$d(q) = \sum_{j,k} p(j)\, q(k|j)\, \rho(j,k) \qquad (B.22)$$

The channel defined by q(k|j) is called D-admissible if and only if $d(q) \le D$. $Q_D$ denotes the set of all D-admissible transition probability assignments: $Q_D = \{q(k|j) \mid d(q) \le D\}$. In parallel, every assignment gives rise to an average mutual information:

$$H(q) = \sum_{j,k} p(j)\, q(k|j) \log \frac{q(k|j)}{q(k)} \qquad (B.23)$$

The rate distortion function R(D) of $\{X_t; p\}$ with respect to F is then defined by:

$$R(D) = \min_{q \in Q_D} H(q), \qquad (B.24)$$

with $D \in [0, \infty)$. Since the source is given and not the channel, such an equation determines the minimum rate at which the information about the source must be conveyed to the user in order to achieve a prescribed fidelity (the reverse problem - given channel - generates a distortion rate function). It means that $R(D) \le C$ is a necessary condition for the existence of a communication system that operates with fidelity D.
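As an illustration, points of the curve defined by equation (B.24) can be computed numerically with the classical Blahut-Arimoto iteration; this algorithm is not discussed in the text and is given here only as a minimal sketch:

```python
import numpy as np

def rate_distortion_point(p, rho, s, n_iter=200):
    """One (D, R) point of the R(D) curve for a d.m.s. p(j), a single-letter
    distortion matrix rho[j, k] and a slope parameter s < 0; sweeping s
    traces the whole convex curve. R is returned in bits."""
    M, N = rho.shape
    q_k = np.full(N, 1.0 / N)                  # initial output marginal
    for _ in range(n_iter):
        q_kj = q_k[None, :] * np.exp(s * rho)  # candidate channel q(k|j)
        q_kj /= q_kj.sum(axis=1, keepdims=True)
        q_k = p @ q_kj                         # updated output marginal
    D = (p[:, None] * q_kj * rho).sum()                           # eq. (B.22)
    R = (p[:, None] * q_kj * np.log2(q_kj / q_k[None, :])).sum()  # eq. (B.23)
    return D, R

# Binary symmetric source with Hamming distortion: the points obtained
# this way follow the known closed form R(D) = H(p) - H(D).
p = np.array([0.5, 0.5])
rho = np.array([[0.0, 1.0],
                [1.0, 0.0]])
D, R = rate_distortion_point(p, rho, s=-3.0)
```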

As illustrated on figure B.2, R(D) is a monotonic, decreasing, convex-U function in the interval of interest from D = 0 to $D = D_{max}$, and $R(D) = 0$ for $D = D_{max}$. $D_{max}$ is the average distortion achieved when only the statistics of the source are known but nothing at all about the particular realization of the source has been transmitted. In this case, the decoder can only guess what the source could have been.



B.4 Extension to Moving Pictures

Figure B.3: Basic predictor: the prediction of the signal s is subtracted from it to yield the error e.

The Rate Distortion function that has just been presented may be extended to various sources [6]. Since it is our interest here, one will detail how and under which assumptions it is possible to apply this theory to (moving) images. But let us begin with a note on the basic structure of predictive coding and its main properties (following Tziritas and Labit [127]).

B.4.1 Note on Predictive Coding

A very simple predictor, like the one of figure B.3, offers a prediction gain

$$G_p = \frac{\sigma_s^2}{\sigma_e^2}, \qquad (B.25)$$

where $\sigma_s^2$ (resp. $\sigma_e^2$) is the variance of the signal s (resp. the error signal e). If the predictor is optimum, such a gain is always greater than or equal to 1. The entropy gain for the prediction error e with respect to the entropy of signal s is $\frac{1}{2} \log G_p$ (because $\frac{1}{2} \log \sigma^2$ is commonly admitted as an approximation for the entropy H).
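As a numeric illustration of equation (B.25) (not taken from the text), a first-order autoregressive signal with correlation rho and its optimal one-tap predictor yield a gain $G_p = 1/(1 - \rho^2)$:

```python
import numpy as np

def prediction_gain_ar1(rho=0.95, n=100_000, seed=0):
    """Empirical G_p for s_t = rho * s_{t-1} + w_t predicted by rho * s_{t-1};
    the returned value approaches 1 / (1 - rho**2), about 10.3 for rho=0.95."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    s = np.zeros(n)
    for t in range(1, n):
        s[t] = rho * s[t - 1] + w[t]
    e = s[1:] - rho * s[:-1]      # prediction error (here simply w_t)
    return s.var() / e.var()      # equation (B.25)
```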

In order to obtain a higher compression ratio, a quantizer must be included. As quantization introduces distortion, the decoder will not be able to reconstruct the signal s, but a signal $\bar s$ instead. The quantizer is thereafter included in the prediction loop so as to avoid error propagation and to obtain the same prediction at the coder and the decoder (Figure B.4).

The distortion induced by the quantization is equal to

$$\varepsilon = s - \bar s = e + \hat s - (\bar e + \hat s) = e - \bar e \qquad (B.26)$$


Figure B.4: Predictive coder with quantization (the quantizer and the coder process the error e; the predictor operates inside the loop)

This equation means that the coding distortion is equal to the quantization error. Quantizing with N quantization levels a variable whose variance is \sigma^2 achieves a mean quadratic distortion

D = \frac{A\,\sigma^2}{N^2}, \qquad (B.27)

with A a constant parameter depending on the source probability distribution. With the approximation that A_s \approx A_e, and for a specific bitrate (i.e. N_s = N_e), the distortion ratio is equal to the prediction gain:

\frac{D_s}{D_e} \approx \frac{\sigma_s^2}{\sigma_e^2} = G_p. \qquad (B.28)

Predictive coding reduces the distortion by a factor equal to the prediction gain.
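As a quick numerical illustration (not in the original text): a prediction gain G_p = 10 divides the distortion by ten at a fixed bitrate or, equivalently, saves about \frac{1}{2}\log_2 10 \approx 1.66 bit per sample at a fixed distortion.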

Considering now two successive images I(x, y, t-1) and I(x, y, t), one can advance the hypotheses that the image signal is stationary and zero-mean, and that the displacement vector (u, v) is a constant parameter over the whole image. One assumes that the covariance function of I(x, y, t) is separable in time and space. Figure B.5 introduces the generic structure of a temporal prediction, where the filter h(x, y) is linear and shift-invariant.

Figure B.5: Generic structure of temporal prediction (I(x, y, t-1) is shifted by \delta(x-u, y-v), filtered by h(x, y) and subtracted from I(x, y, t) to produce e(x, y))

\Phi_I(f_x, f_y) is the power spectral density of the spatial part of I(x, y, t). A single temporal correlation coefficient exists then between the two images, and is equal to:

c = \frac{E[I(x,y,t)\, I(x-u, y-v, t-1)]}{E[I^2(x,y,t)]}. \qquad (B.29)

The power spectral density function of e(x, y) is

\Phi_e(f_x, f_y) = \left(1 + |H(f_x, f_y)|^2 - 2c\, \mathrm{Re}\{H(f_x, f_y)\, e^{j2\pi(f_x u + f_y v)}\}\right) \Phi_I(f_x, f_y). \qquad (B.30)



B.4.2 Intra Images

Considering memoryless coding, i.e. without spatial processing, and a mean quadratic criterion, the rate distortion function of an image coded with only intra quantization is given by:

R = \frac{1}{2}\log_2\frac{\sigma_I^2}{D} \quad \text{for } 0 \le D \le \sigma_I^2, \qquad R = 0 \quad \text{otherwise}. \qquad (B.34)

B.4.3 Inter Images without Motion Compensation

In the same framework of memoryless coding, the function becomes

R = \frac{1}{2}\log_2\left(\frac{2\sigma_I^2\left(1 - e^{-2\pi f_0 \sqrt{u^2+v^2}}\right)}{D} + 1\right), \qquad (B.35)

where f_0 \approx 0.05 (equation (4.23) out of Labit and Tziritas [127]). One can notice that the prediction gain is greater than 1 only if

2\,\Gamma_I(u, v) > \sigma_I^2, \qquad (B.36)

where \Gamma_I(x, y) is the spatial covariance of I(x, y, t).

B.4.4 Inter Images with Motion Compensation

If motion compensation is used, the rate distortion function is (equation (4.30) out of Labit and Tziritas [127])

R = \int_0^{1/\sqrt{\pi}} f \log_2\left(\frac{2\left(1 - \rho(f_x, f_y)\right)\Phi_I(f_x, f_y)}{D} + 1\right) df, \qquad (B.37)

where f = \sqrt{f_x^2 + f_y^2}, and

\rho(f_x, f_y) = \frac{1}{\left((\sigma_d f)^2 + 1\right)^{3/2}}. \qquad (B.38)

Such a characteristic function is obtained by assuming an isotropic probability density function of the motion estimation error (d_x, d_y):

p(d_x, d_y) = \frac{2\pi}{\sigma_d^2}\, e^{-\frac{2\pi}{\sigma_d}\sqrt{d_x^2 + d_y^2}}. \qquad (B.39)
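Equations (B.37)-(B.38) can be evaluated by straightforward numerical integration. The sketch below is an added illustration (not thesis code): it integrates (B.37) with the trapezoidal rule, assumes an isotropic \Phi_I (a function of f only), and uses a flat spectrum \Phi_I \equiv 1 as an arbitrary stand-in for the true image spectrum:

#include <cmath>
#include <cstdio>

static const double PI = 3.14159265358979;

// Characteristic function of the motion estimation error, equation (B.38).
static double rho(double f, double sigma_d)
{
    double a = sigma_d * f;
    return 1.0 / std::pow(a * a + 1.0, 1.5);
}

// Rate of equation (B.37), by trapezoidal integration over f in [0, 1/sqrt(pi)].
// The spectrum phiI is assumed isotropic, which keeps the radial integral
// one-dimensional.
static double rate(double D, double sigma_d, double (*phiI)(double))
{
    const int n = 1000;
    const double fmax = 1.0 / std::sqrt(PI);
    double sum = 0.0;
    for (int i = 0; i <= n; i++) {
        double f = fmax * i / n;
        double g = f * std::log2(2.0 * (1.0 - rho(f, sigma_d)) * phiI(f) / D + 1.0);
        sum += (i == 0 || i == n) ? 0.5 * g : g;
    }
    return sum * fmax / n;
}

static double flatSpectrum(double) { return 1.0; }   // arbitrary stand-in

int main()
{
    // The achievable rate decreases as the motion estimation error decreases.
    double sigmas[] = { 4.0, 2.0, 1.0, 0.5 };
    for (int i = 0; i < 4; i++)
        std::printf("sigma_d = %.1f -> R = %.4f bit\n",
                    sigmas[i], rate(0.01, sigmas[i], flatSpectrum));
}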


Appendix C

Markov Random Fields

The use of Markovian techniques in digital image processing is based on the possibility of modeling the image texture, the noise, and even some other criteria, as stochastic processes. Since their remarkable use in image restoration by Geman and Geman [39], the theory of Markovian processes [99] has been extended to Markov Random Fields (MRF) [40], and their field of application extended to images [137] and motion fields [37, 109].

The present appendix aims at giving a quick overview of MRF and the different concepts involved in their use. It first presents Bayesian modeling, which is often related to MRF. Then, image description and neighborhood relationships introduce the definition of MRF. The link with the Gibbs distribution is also tackled. Finally, the possible algorithms to solve the overall minimization problem are referred to.

C.1 Bayesian Model

Bayesian modeling assumes an a priori statistical distribution for the solution of estimation problems. This a priori model (or prior model), p(u), is a probabilistic description of the solution u, or of its properties, before any sensor data is collected.

A second component, the sensor model (or likelihood model), p(d|u), is a description of the noisy or stochastic processes that connect the original (unknown) state u to the sampled input image or sensor values, the data d. Using Bayes' rule, both can be combined in order to obtain an a posteriori model (or posterior model), p(u|d), which describes the estimate of the solution u according to the collected data d:

p(u|d) = \frac{p(d|u)\, p(u)}{p(d)}, \qquad (C.1)

where p(d) = \sum_u p(d|u)\, p(u).

Bayesian modeling is often used to determine the Maximum A Posteriori (MAP) estimate, i.e. the value of the solution u that maximizes the conditional probability p(u|d). The Maximum Likelihood (ML) estimate does not statistically describe u: it considers it as a vector of parameters. The ML is a special case of the MAP in which the a priori model p(u) is constant (uniform distribution); the aim then becomes to maximize the conditional probability p(d|u).

Working with the logarithm of the posterior density, one obtains:

\log p(u|d) = \log p(d|u) + \log p(u) - \log p(d), \qquad (C.2)

where the last term does not depend on u and can be neglected in maximization processes with respect to u. The MAP estimate is thus given by

\left[\frac{\partial \log p(d|u)}{\partial u} + \frac{\partial \log p(u)}{\partial u}\right]_{u = \hat{u}_{MAP}} = 0, \qquad (C.3)

where u = \hat{u}_{MAP} is the solution at the maximum. Similarly, the ML estimate satisfies, with u = \hat{u}_{ML} the solution at the maximum:

\left[\frac{\partial \log p(d|u)}{\partial u}\right]_{u = \hat{u}_{ML}} = 0. \qquad (C.4)
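A minimal scalar sketch contrasting the two estimates (an added illustration; the Gaussian prior and the additive Gaussian noise model are assumptions chosen for this example, not taken from the text):

#include <cstdio>

int main()
{
    // Model: d = u + n, with noise n ~ N(0, s2) and prior u ~ N(mu0, s02).
    // Setting the derivatives of (C.3) and (C.4) to zero gives closed forms.
    double d = 3.0;                     // observed datum
    double s2 = 1.0;                    // noise variance
    double mu0 = 0.0, s02 = 0.25;       // prior mean and variance

    double u_ml  = d;                                   // from (C.4)
    double u_map = (s02 * d + s2 * mu0) / (s02 + s2);   // from (C.3)

    // The prior pulls the MAP estimate towards mu0; with a flat prior
    // (s02 -> infinity) the MAP estimate tends to the ML one.
    std::printf("u_ML = %.3f, u_MAP = %.3f\n", u_ml, u_map);
}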

C.2 Markov Random Fields


Let us consider an image I, in which pixels define a lattice S of sites s:

S = \{\, s = (i, j) \,\}. \qquad (C.5)

To every site s is associated a random variable [99] A_s, whose values a_s belong to a set \Lambda. For instance, \Lambda = \{0, ..., 255\} represents the possible luminance values of a black and white picture, and \Lambda = \{0, ..., 255\}^q the values that can be associated with any pixel of a multispectral image with q channels.


Figure C.1: Homogeneous neighborhoods of order n = 1, 2, 3

The image can then be considered as a random vector A = (A_s, s \in S), of which the vector a = (a_s, s \in S) is a particular realization.

p(A = a) = p(a)

is the probability of configuration a. In fact, it is a joint probability p(A_s = a_s, s \in S).

p(A_s = a_s) = p(a_s)

is the marginal law of A_s.

A neighborhood system N = (N_s, s \in S) is made out of parts N_s of S with the following properties:

s \notin N_s, \qquad (C.6)

s \in N_t \Leftrightarrow t \in N_s. \qquad (C.7)

The set N_s is called the neighborhood of s; t is said to be a neighbor of s if t \in N_s. Among the different types of neighborhoods, the homogeneous ones (figure C.1) are characterized by their order n:

N_s^n = \{\, t \mid \|s - t\|^2 \le k_n,\; t \ne s \,\}, \qquad (C.8)

with k_n taking the values 1, 2, 4, 5, 8, ... for n = 1, 2, 3, 4, 5, ...
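A short added sketch of how (C.8) generates these neighborhoods (illustration only):

#include <cstdio>

int main()
{
    // Offsets (di, dj) belonging to the order-n homogeneous neighborhood
    // of a site, according to (C.8): 0 < di*di + dj*dj <= k_n.
    const int kn[] = { 1, 2, 4, 5, 8 };          // k_n for n = 1..5
    for (int n = 1; n <= 5; n++) {
        int count = 0;
        for (int di = -3; di <= 3; di++)
            for (int dj = -3; dj <= 3; dj++) {
                int d2 = di * di + dj * dj;
                if (d2 > 0 && d2 <= kn[n - 1]) count++;
            }
        // n = 1 gives the 4-connected and n = 2 the 8-connected neighbors.
        std::printf("order %d: %d neighbors\n", n, count);
    }
}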

A random field A is a Markov Random Field (MRF) associated with the system N if and only if:

p(a) > 0, \qquad (C.9)

p(a_s \mid a_t,\; t \in S - \{s\}) = p(a_s \mid a_t,\; t \in N_s). \qquad (C.10)

This means that the conditional law of any site depends only on the small number of its neighbors.

A clique c is a subset of S which is related to the neighborhood system N according to the following properties:

- c is a singleton (pixel site), or
- any two pixels of c are neighbors with respect to N (\forall s, t \in c : s \ne t \Rightarrow s \in N_t).

C is the set of all cliques of S. The order of a clique depends on the number of its elements. The cliques for homogeneous neighborhoods of order 1 and 2 are depicted on figure C.2.

Figure C.2: Cliques for homogeneous neighborhoods of order 1 and 2

C.3 Gibbs measure and Markov fields

Considering the neighborhood system N = \{N_s, s \in S, N_s \subset S\} and the set C of cliques defined on N, the Hammersley-Clifford theorem [7] demonstrates that a random field A is a Markov field associated with the neighborhood system N if and only if its probability distribution p(A = a) is a Gibbs measure defined by:

\forall a \in \Lambda^S, \quad p(a) = \frac{e^{-E_p(a)/T_p}}{Z_p}, \qquad (C.11)

where T_p is a temperature that controls the degree of "peaking", Z_p is a normalization constant (the "partition function", Z_p = \sum_a e^{-E_p(a)/T_p}), and E_p a function called "energy", defined by:

E_p(a) = \sum_{c \in C} V_c(a/c). \qquad (C.12)
c2C<br />

The potential functions Vc are arbitrary functions which only depend<br />

on the elements a belonging to the clique c (which is referred to as the<br />

notation a=c). They help the system designer to clearly de ne its a priori<br />

knowledge about the neighborhood of the Markovian model. Generally,<br />

there are two kinds of potential functions combined so as to obtain a<br />

good model <strong>for</strong> the problem to solve:<br />

Potential functions that ensures the link with the data. They<br />

de ne the properties that the pixels of a same clique must have.<br />

Potential functions that introduce regularization constraints inside<br />

cliques in order to obtain a smooth <strong>and</strong> coherent nal solution.<br />
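As an added sketch of (C.12) combining these two kinds of potentials (the quadratic data term and the Potts-like smoothness penalty are common choices, not taken from the thesis):

#include <vector>

// Energy E_p(a) of (C.12) on a 4-connected lattice (order-1 cliques):
// singleton cliques carry the data term, pair cliques the smoothness term.
static double energy(const std::vector<int>& a,     // current labeling
                     const std::vector<int>& d,     // observed data
                     int w, int h, double lambda)   // lambda: regularization
{
    double E = 0.0;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++) {
            int s = i * w + j;
            double diff = a[s] - d[s];
            E += diff * diff;                                 // link with the data
            if (j + 1 < w) E += lambda * (a[s] != a[s + 1]);  // right pair clique
            if (i + 1 < h) E += lambda * (a[s] != a[s + w]);  // bottom pair clique
        }
    return E;
}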

C.4 System solution

Several methods to obtain the MAP estimate exist. They are only referred to here; the problem of choosing the right one has to be tackled on concrete problems. Simulation methods include the Gibbs sampler introduced by Geman and Geman [39] and the Metropolis algorithm [81], while optimization methods include simulated annealing (SA, [55]), the Iterated Conditional Modes (ICM, [8]) and the High Confidence First (HCF, [15]). Deterministic methods (like the two last ones) are often preferred because of their fast computation. Nevertheless, one has to be aware that they do not guarantee to reach the absolute maximum, but only some local maximum.
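As an added sketch of the deterministic flavour, here is an ICM-style [8] greedy sweep over an energy of the kind sketched above (an illustration, not the thesis implementation):

#include <vector>

// One ICM sweep: at each site, keep the label in {0..L-1} that minimizes
// the local energy; iterating sweeps until no label changes yields a local
// maximum of the posterior (local minimum of E_p).
static bool icm_sweep(std::vector<int>& a, const std::vector<int>& d,
                      int w, int h, int L, double lambda)
{
    bool changed = false;
    for (int s = 0; s < w * h; s++) {
        int best = a[s];
        double bestE = 1e30;
        for (int label = 0; label < L; label++) {
            double diff = label - d[s];
            double E = diff * diff;                 // data term
            int i = s / w, j = s % w;               // smoothness term (4-conn.)
            if (j > 0)     E += lambda * (label != a[s - 1]);
            if (j + 1 < w) E += lambda * (label != a[s + 1]);
            if (i > 0)     E += lambda * (label != a[s - w]);
            if (i + 1 < h) E += lambda * (label != a[s + w]);
            if (E < bestE) { bestE = E; best = label; }
        }
        if (best != a[s]) { a[s] = best; changed = true; }
    }
    return changed;
}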


Appendix D

Complements to Chapter 5

D.1 Triangulation

A triangulation process allows one to obtain the partition of an image from a given set of points. Nevertheless, several different triangulations can be produced from the same set of points. Some criteria thus have to be chosen in order to define the triangulation in a unique and optimal way. These criteria can be purely geometrical, like in the Delaunay case, or take some other initial data into account (reconstruction error, surface energy, convexity, ...).

This section aims at briefly introducing the basic definitions of triangulation, as well as the properties of the Delaunay triangulation that is used in the present work.

D.1.1 Definitions

Definition 1  G = (V, S) designates a graph made of the sets:

- V = \{v_i \mid 1 \le i \le N\}, which is the set of points (or nodes, or vertices);
- S = \{s_i \mid 1 \le i \le m\}, which contains a set of edges such that s_i \cap s_j \in \{V, \emptyset\}.

Definition 2  A triangulation T(G) of a given graph G = (V, S) is a graph G' = (V, S'), where S \subset S'.

Lemma 1  The triangulation T(G) of a given graph G = (V, S) includes 2(N - 1) - B triangles and 3(N - 1) - B edges, where B is the number of vertices of the convex envelope of the set of points V.
of vertices of the convex envelope of the set of points V .


D.1.2 Delaunay Triangulation

A Delaunay triangulation D of V is the geometric dual of the Dirichlet (or Voronoi, or Thiessen) tessellation P constructed on V. Such a tessellation divides the plane into polygonal regions, called tiles. Each tile P_i contains all the points of the plane closer to the tile generating point v_i than to the other generating points v_k (in the sense of the Euclidean distance):

P_i = \{\, N \in \mathbb{R}^2 \mid \forall k \ne i,\; d(N, v_i) \le d(N, v_k) \,\}. \qquad (D.1)

Figure D.1: Graph and associated (constrained) Delaunay triangulation

Delaunay triangulation is optimal from the interpolation point of view, for its triangles are as equiangular as possible ("locally equiangular"). It avoids having "flat" triangles, which are not good for spline interpolation, where the approximation error depends on the triangle "thickness". The Delaunay triangulation is defined as:

Definition 3  The generalized Delaunay triangulation of G = (V, S) is a triangulation T(G) = (V, S') for which the circumscribed circle of every triangle \triangle v_i v_j v_k does not contain any other vertex visible from v_i, v_j or v_k. The edges of the set S' - S are called the Delaunay edges. A vertex u is visible from a vertex v if the segment [u, v] does not cut any edge of the set S on an interior point.

Based on this property of the circumscribed circle, and on the fact that a locally optimal triangulation in the sense of Delaunay is globally a Delaunay triangulation, several implementations have been designed (see for instance [17, 119]).

Figure D.1 presents a graph G and its (constrained) Delaunay triangulation. Dashed lines represent the Delaunay edges.

D.2 Pseudo-code of the implementation

The aim of the present section is to provide implementation details of the system discussed in Chapter 5. It consists of a kind of pseudo-code of the main routines, along with the typical thresholds and other parameters. Our implementation uses C++ classes to describe images (class picture), motion vector fields (class motion) and mesh structures (class wireframe). The latter classes are derived from the first one. In the pseudo-code presented here, only a few very intuitive member functions have been kept (height, width, (i,j)). A lot of declarations have also been omitted. We hope this code will help interested people.

D.2.1 Compensation scheme

The overall scheme is made out of five consecutive steps:

- corner detection, whose code is provided in Section D.2.2;
- Delaunay triangulation, which is implemented according to [119];
- establishment of the equation system for inverse kriging interpolation, presented in Section D.2.3;
- resolution of the system with SVD, using the routines provided by [107];
- image warping, which involves some very basic Computer Graphics to determine the value of the pixels of the new image after warping.

Here is the pseudo-code of the scheme. A small trick is to perform all operations on an extended picture, in order to easily manage vectors that point out of the picture.
point out of the picture.


/* Routine that implements the mesh-based compensation of
   a reference image from a block-based motion field */
picture mesh_compensation(picture pReference, motion mField)
{
    /* So as to deal with vectors pointing out of the picture
       (as authorized within the H.263 and MPEG-4 standards),
       the reference image is appropriately extended.
       The rest of the algorithm will always manipulate
       images of this size */
    // extend picture
    mField.give_limits(MinVector, MaxVector);
    if ((MinVector ...))
        ...;

    /* Steps 1 to 3 take place here: corner detection (Section D.2.2),
       Delaunay triangulation [119], and establishment of the equation
       system for inverse kriging interpolation (Section D.2.3) */
    ...;

    /* Solve the system with SVD; routines svdcmp and svbksb are
       provided in [107] */
    svdcmp(A, #vector*2, #corner*2, W, V);
    Wmax = 0.0;
    for (i = 0; i < #corner*2+1; i++)
    {
        if (W[i] < 1.0e-6)
            W[i] = 0;
        if (W[i] > Wmax)
            Wmax = W[i];
    }
    Wmin = Wmax*1.0e-6;
    for (i = 0; i < #corner*2+1; i++)
        if (W[i] < Wmin)
            W[i] = Wmin;
    svbksb(A, W, V, #vector*2, #corner*2, B, x);

    /* Displace vertices while ensuring the connectivity of the mesh.
       It just involves some basic Computer Graphics to implement the
       affine transform and interpolate the new pixel values. Cubic
       spline interpolation [54] has been used in the present work */
    displace(wF, x);
    // extract image from wireframe
    pExtendedOutput = (picture) wF;
    // return to the normal picture size
    pOutput = extract(pExtendedOutput, MaxVector);
    return pOutput;
}

D.2.2 Corner detection

The corner detector of Section 5.2 involves many different routines; only the main ones are presented hereunder. The implementation of Mathematical Morphology operators is best described in the literature [130]. The section first presents the data structures used and the overall routine for corner extraction. It then introduces the core of the edge detection, along with the recursive function for tracking according to the twenty templates.

One important comment is that the order in which the various configurations of the templates of figures 5.7 and 5.9 are searched for is important. According to the type of tracked half-boundary (positive or negative), it should be made in accordance with figure D.2.

/* define types of border points */
#define B 255 // Background
#define N 0   // Negative
#define P 100 // Positive
#define R 200 // inteRnal


Figure D.2: Order of use of the tracking templates for the negative and positive half-boundaries

/* define types of edge feature */
#define O 0   // bOundary
#define G 50  // Gap
#define S 100 // Self-intersection
#define I 150 // Intersection
#define C 200 // Closed loop

/* define the structure of a point on a half-boundary */
struct point;
typedef point* Point;
struct point {
    short contour;   // contour number
    short vpos;      // vertical position in the image
    short hpos;      // horizontal position in the image
    Point next;      // pointer to the next point of the half-b.
    Point previous;  // pointer to the previous point of the half-b.
    double angle;    // for angle measurement
    short href;      // for extraction of highest curvature points
    short feature;   // type of point: O, G, S, I, C
};

/* define the structure of a half-boundary */
struct contour {
    short number;    // reference number of the half-boundary
    short type;      // type (P or N) of the half-boundary
    Point first;     // pointer to the first point of the contour
};

/* Routine that implements the corner detector based on the
   description of Chapter 5, i.e. a modified version of Noble's
   edge detector combined with Najman and Vaillant's measure of
   angles. Typical parameters are given in the calling routine. */


picture detect_corner(picture pImage, short Threshold1, short Threshold2,
                      short MinLength, short MaxLength, short Ag, short Step)
{
    /* initialize images for the morphological operators, the candidate
       points and the detected corners */
    picture pFDER  (pImage.height(), pImage.width());
    picture pCandi (pImage.height(), pImage.width());
    picture pCorner(pImage.height(), pImage.width());
    /* define a structure for memorizing edges */
    contour *edge;
    edge = (contour*) malloc(0);
    // angle threshold: degrees to radians, then cosine
    double Angle = cos(((double)Ag)*PI/180);
    /* Compute the Morphological Erosion-Dilation residue
       as expressed in [94]; place the absolute value in pFDER,
       and achieve a preliminary classification into positive P,
       negative N or internal (ramp) R points in pCorner. */
    Erosion_Dilation_Residue(pImage, pFDER, pCorner);

    // select candidate points according to Threshold1
    for (all pixels at position (i,j))
    {
        if ((pCorner(i,j) == N) && (pFDER(i,j) > Threshold1))
        {
            // then check for a valid P point or R point in the neighborhood
            test = 0;
            for (every neighbor (k,l) of (i,j))
                if (((pCorner(k,l) == P) || (pCorner(k,l) == R)) &&
                    (pFDER(k,l) > Threshold1))
                    test = 1;
            // if no valid neighbor, the point becomes part of the
            // background points
            if (test)
                pCandi(i,j) = N;
            else
                pCandi(i,j) = B;
        }
        else if ((pCorner(i,j) == P) && (pFDER(i,j) > Threshold1))
        {
            // then check for a valid N point or R point in the neighborhood
            test = 0;
            for (every neighbor (k,l) of (i,j))
                if (((pCorner(k,l) == N) || (pCorner(k,l) == R)) &&
                    (pFDER(k,l) > Threshold1))
                    test = 1;
            // if no valid neighbor, the point becomes part of the
            // background points
            if (test)
                pCandi(i,j) = P;
            else
                pCandi(i,j) = B;
        }
        else if ((pCorner(i,j) == R) && (pFDER(i,j) > Threshold1))
            pCandi(i,j) = R;
        else
            pCandi(i,j) = B;
    }

    /* Then, the edge detection proposed by Noble has to be
       performed, with the 20 template configurations for tracking.
       The code of this routine is provided below. */
    track_contours(pCandi, pFDER, Threshold2, &edge);
    /* A measure of angles has to be performed along edges, and
       the highest curvature point must be retained. This is strictly
       based on Najman and Vaillant's [87] technique, with the only
       addition of a MinLength parameter. */
    pCorner = determine_highCurvature(edge, MinLength, MaxLength, Angle);
    /* According to the application, one may then choose to automatically
       add a corner every Step pixels along an edge. At the present stage
       of the algorithm, corners are detected on both the positive and
       the negative half-boundaries. If wished, only pairs of corners may
       be retained and, among pairs, only the positive or the negative
       corners finally used. */
    return pCorner;
}

/* Routine that implements the edge tracking according to the
   12 original templates of Noble [94] + 8 additional ones */
track_contours(picture pCandi, picture pFDER, short Threshold2,
               contour **edge)
{
    /* Extend the image and the associated structure by a factor 2 so
       as to improve the tracking */
    picture pDouble = upsize(pCandi, 2);
    contour *edged;
    edged = (contour*) malloc(0);
    // images to memorize points that are already tracked
    picture pTrace (2*pCandi.height(), 2*pCandi.width());
    picture pTraced(2*pCandi.height(), 2*pCandi.width());
    // contour number
    short num;
    // no points traced yet
    pTrace = 0;
    pTraced = 0;
    num = 1;
    // selecting starting points for tracking
    for (all pixels (i,j))
        // look for points that are not yet traced
        if ((!pTrace(i,j)) && ((pCandi(i,j) == N) || (pCandi(i,j) == P)))
        {
{


            // define the contour type and the opposite type
            type = pCandi(i,j);
            if (type == N)
                typer = P;
            else
                typer = N;
            test = 0;
            // CRITERION 1: value > Threshold2 ?
            if (pFDER(i,j) > Threshold2)
                test = 1;
            // CRITERION 2: no neighbor of same type already traced
            for (every neighbor (k,l) of (i,j))
                if (test)
                    if ((pTrace(k,l)) && (pCandi(k,l) == type))
                        test = 0;

            // CRITERION 3: at least 1 valid neighbor (> Threshold2)
            // of the opposite type?
            if (test)
            {
                test = 0;
                for (every neighbor (k,l) of (i,j))
                    if ((!test) && (pFDER(k,l) > Threshold2) &&
                        (pCandi(k,l) == typer))
                    {
                        test = 1;
                        neighbi = k;
                        neighbj = l;
                    }
            }

            // tracking of this contour can start
            if (test)
            {
                // locate the init point on the double-size image according
                // to the location of the point of opposite type in the
                // normal image
                if (neighbi == i+1)
                    k = 2*i+1;
                else
                    k = 2*i;
                if (neighbj == j+1)
                    l = 2*j+1;
                else
                    l = 2*j;
                // register the init point in the contour structure
                pTraced(k,l) = num;
                edged = (contour *) realloc(edged, num*sizeof(contour));
                edged[num-1].number = num;
                edged[num-1].type = type;
                edged[num-1].first = new point;
                (*edged[num-1].first).contour = num;
                (*edged[num-1].first).vpos = k;
                (*edged[num-1].first).hpos = l;
                (*edged[num-1].first).next = NULL;
                (*edged[num-1].first).previous = NULL;
(*edged[num-1].first).previous = NULL;


                // start tracking on the double-size image:
                // find the next and preceding points of the first one;
                // look for configurations in a precise order!
                // type N
                if (type == N)
                {
                    test = 0;
                    // 1. check for turn-left configurations
                    if (...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 2. check for 45-degree-left configurations
                    if ((!test) && ...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 3. check for straight-line configurations
                    if ((!test) && ...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 4. check for turn-right configurations
                    if ((!test) && ...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 5. if no configuration has been found yet, it means
                    // that the init point is a GAP -> look for GAPs (only
                    // one neighbor)
                    if ((!test) && ...) {test = 1; fea=G; posvNext=-1;
                              poshNext=-1; posvPrev=.; poshPrev=.;}
                    if ((!test) && ...) {test = 1; fea=G; posvNext=.;
                              poshNext=.; posvPrev=-1; poshPrev=-1;}
                }
                // type P
                if (type == P)
                {
                    // ibidem, with the other order of search: 4,3,2,1,5
                }

                // update point information
                // the point itself
                (*edged[num-1].first).feature = fea;
                // next point
                if (posvNext != -1)
                {
                    (*edged[num-1].first).next = new point;
                    (*(*edged[num-1].first).next).contour = num;
                    (*(*edged[num-1].first).next).vpos = posvNext;
                    (*(*edged[num-1].first).next).hpos = poshNext;
                    (*(*edged[num-1].first).next).next = NULL;
                    (*(*edged[num-1].first).next).previous = edged[num-1].first;
                }
                // previous point
                if (posvPrev != -1)
                {
                    (*edged[num-1].first).previous = new point;
                    (*(*edged[num-1].first).previous).contour = num;
                    (*(*edged[num-1].first).previous).vpos = posvPrev;
                    (*(*edged[num-1].first).previous).hpos = poshPrev;
                    (*(*edged[num-1].first).previous).next = edged[num-1].first;
                    (*(*edged[num-1].first).previous).previous = NULL;
                }
}


            }
            // loop for the next points: this function is provided below
            if ((*edged[num-1].first).next != NULL)
                test = track_forward(edged, type, typer, num, pDouble,
                                     pTraced, (*edged[num-1].first).next);
            // loop for the previous points if not a closed loop
            if (((*edged[num-1].first).previous != NULL) && (test != C))
                track_backward(edged, type, typer, num, pDouble, pTraced,
                               (*edged[num-1].first).previous);
            // go back to the normal-dimension image and contours
            reduction_contour(edge, edged, num, pTrace, pTraced);
            // increment the contour number
            num++;
        }
    // free the whole 'edged' structure
    ...;
    return;
}
}<br />

/* Recursive tracking of half-boundaries in the forward direction */
short track_forward(contour *edged, short type, short typer, short num,
                    picture pDouble, picture pTraced, Point current)
{
    short i, j, k, l;
    short posv, posh;
    short test;
    /* Determine where the contour comes from, in order to
       test only the adequate templates */
    k = (*current).vpos;
    l = (*current).hpos;
    i = k - (*(*current).previous).vpos;
    j = l - (*(*current).previous).hpos;
    test = 0;
    // track the contour according to the templates (types on pDouble)
    case i = ..
    case j = ..
    {
        // type N
        if (type == N)
        {
            // turn left?
            if (...) {test=1; posv=.; posh=.;}
            // 45 degrees left?
            if ((!test) && ...) {test=1; posv=.; posh=.;}
            // straight on?
            if ((!test) && ...) {test=1; posv=.; posh=.;}
            // turn right?
            if ((!test) && ...) {test=1; posv=.; posh=.;}
        }
        else
        // type P: ibidem, from right to left
        {}
    }

    // the point is now traced
    pTraced(k,l) = num;
    if (test)
    {
        // self-intersection?
        if (pTraced(posv,posh) == num)
        {
            // closed loop?
            if (((*edged[num-1].first).vpos == posv) &&
                ((*edged[num-1].first).hpos == posh))
            {
                (*current).feature = C;
                // the next point is the first one
                (*current).next = edged[num-1].first;
                // update the previous point of the first one
                delete (*edged[num-1].first).previous;
                (*edged[num-1].first).previous = current;
                return C;
            }

            else
            {
                (*current).feature = S;
                // the next point is already in the contour
                (*current).next = new point;
                (*(*current).next).contour = num;
                (*(*current).next).vpos = posv;
                (*(*current).next).hpos = posh;
                (*(*current).next).next = NULL;
                (*(*current).next).previous = current;
                return S;
            }
        }
        // intersection
        else if (pTraced(posv,posh))
        {
            (*current).feature = I;
            // the next point belongs to another contour
            (*current).next = new point;
            (*(*current).next).contour = pTraced(posv,posh);
            (*(*current).next).vpos = posv;
            (*(*current).next).hpos = posh;
            (*(*current).next).next = NULL;
            (*(*current).next).previous = current;
            return I;
        }


        // normal case
        else
        {
            (*current).feature = O;
            // next point
            (*current).next = new point;
            (*(*current).next).contour = num;
            (*(*current).next).vpos = posv;
            (*(*current).next).hpos = posh;
            (*(*current).next).next = NULL;
            (*(*current).next).previous = current;
            return track_forward(edged, type, typer, num, pDouble, pTraced,
                                 (*current).next);
        }
    }
    else
    // the point is a gap
    {
        (*current).feature = G;
        // no next point
        (*current).next = NULL;
        return G;
    }
    return 0;
}

/* Recursive tracking of half-boundaries in the backward direction */
short track_backward(contour *edged, short type, short typer, short num,
                     picture pDouble, picture pTraced, Point current)
{
    // similar to the track_forward routine, but for the
    // (*current).previous point
}

D.2.3 Inverse Kriging System

/* Determine the system matrix A and the solution vector B
   for the Inverse Kriging Interpolation, as exposed in Chapter 5 */
determine_system(wireframe wF, motion mField, double **A, double *B)
{
    // initialization
    A = 0;
    for (every block #i of the motion vector field)
    {
        /* The backwards motion field is to be reversed.
           Determine the origin of the motion vector according to the
           center position (posv[i],posh[i]) of the block */
        VertOrigin = MaxVector+posv[i]+vertical_component(mField,i);
        HoriOrigin = MaxVector+posh[i]+horizontal_component(mField,i);
        /* Determine in which triangle, made out of the vertices
           (Top1V,Top1H), (Top2V,Top2H) and (Top3V,Top3H), the origin
           of the motion vector is located */
        reference_triangle(VertOrigin, HoriOrigin, wF, Top1V, Top1H,
                           Top2V, Top2H, Top3V, Top3H);
        /* Compute the invariants p and q of the related affine
           transform */
        Height_12 = Top2V - Top1V;
        Width_12  = Top2H - Top1H;
        Height_13 = Top3V - Top1V;
        Width_13  = Top3H - Top1H;
        denominator = Height_12 * Width_13 - Width_12 * Height_13;
        p = Height_13*(Top1H-HoriOrigin)+Width_13*(VertOrigin-Top1V);
        q = Height_12*(HoriOrigin-Top1H)+Width_12*(Top1V-VertOrigin);
        p /= denominator;
        q /= denominator;
        /* Insert the equation in the system */
        // vertical component of the vector
        A[i*2+1][indice_of_top1] = (double)1-p-q;
        A[i*2+1][indice_of_top2] = p;
        A[i*2+1][indice_of_top3] = q;
        // horizontal component of the vector
        A[i*2+1+1][indice_of_top1] = (double)1-p-q;
        A[i*2+1+1][indice_of_top2] = p;
        A[i*2+1+1][indice_of_top3] = q;
        // independent term
        B[i*2+1]   = -vertical_component(mField, i);
        B[i*2+1+1] = -horizontal_component(mField, i);
    }
}
}


Bibliography

[1] COST 211ter Simulation Subgroup. Focus document. Ankara, October 1996. COST 211ter Workshop. SIM(96)41.

[2] N. Ahmed, T. Natarajan, and K.R. Rao. Discrete cosine transform. IEEE Transactions on Computers, 23:90-93, January 1974.

[3] Y. Altunbasak and M. Tekalp. Closed-form connectivity-preserving solutions for motion compensation using 2-d meshes. IEEE Transactions on Image Processing, 6(9):1255-1269, September 1997.

[4] C. Auyeung, J. Kosmach, M. Orchard, and T. Kalafatis. Overlapped block motion compensation. volume 1818, pages 561-572, Boston, November 1992. SPIE Visual Communications and Image Processing.

[5] P.R. Beaudet. Rotational invariant image operators. pages 579-583. Int. Conf. Pattern Recognition, 1978.

[6] T. Berger. Rate Distortion Theory. A mathematical basis for data compression. Prentice Hall, Englewood Cliffs, N.J., 1971.

[7] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, 2:192-236, 1974.

[8] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48:259-302, 1986.

[9] G. Bjontegaard. Very low bitrate video coding using h.263 and foreseen extensions. pages 825-838, Louvain-la-Neuve, May 1996. European Conference on Multimedia Applications, Services and Techniques (ECMAST).

[10] G. Bozdagi, M. Tekalp, and L. Onural. 3-d motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences. IEEE Transactions on Circuits and Systems for Video Technology, 4(3):246-256, June 1994.

[11] O. Bruyndonckx, E. Hanssens, B. Macq, X. Marichal, M.P. Queluz, and B. Simon. Coding on multigrids image sequences. page section B1, Berlin, October 1994. WIASIC.

[12] Rec. ITU-R BT.601-5. Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. Technical report, ITU-R, Geneva, Switzerland.

[13] C. Chen and T.R. Hsing. Review: Digital coding techniques for visual communications. pages 1-15, 1991.

[14] C.F. Chen and K. Pang. The optimal transform of motion-compensated frame difference images in a hybrid coder. Journal of the Royal Statistical Society, 40(6):393-397, June 1993.

[15] P.B. Chou and C.M. Brown. The theory and practice of bayesian image labeling. International Journal of Computer Vision, 4:185-210, 1990.

[16] S. Comes. Les traitements perceptifs d'images numerisees. PhD thesis, Universite catholique de Louvain, June 1995.

[17] Y. Correc and E. Chapuis. Fast computation of delaunay triangulations. Adv. Eng. Software, 9(2):77-83, 1987.

[18] F. Davoine. Compression d'images par fractales basee sur la triangulation de Delaunay. PhD thesis, Institut National Polytechnique de Grenoble, December 1995.

[19] C. De Vleeschouwer, T. Delmot, X. Marichal, and B. Macq. A fuzzy logic system for content-based bitrate allocation. Signal Processing: Image Communication, pages 115-142, July 1997.

[20] C. De Vleeschouwer, X. Marichal, T. Delmot, and B. Macq. A fuzzy logic system able to automatically detect interesting areas. volume 3016, pages 234-245, San Jose, February 1997. SPIE Human Vision and Electronic Imaging II.

[21] E. Decenciere and P. Salembier. Morpheco deliverable, chapter 3: Application of kriging to image coding. Morpheco Consortium, R2053/UPC/GPS/DS/R/016/b1, January 1996.

[22] E. Decenciere Ferrandiere, C. de Fouquet, and F. Meyer. Applications of kriging to image sequence coding. Accepted for publication in Signal Processing: Image Communication, 1998.

[23] T. Delmot, C. De Vleeschouwer, B. Macq, and X. Marichal. The comis scheme: an approach towards very-low bit-rate image coding. pages 883-902, Louvain-la-Neuve, May 1996. European Conference on Multimedia Applications, Services and Techniques (ECMAST).

[24] R. Deriche and O. Faugeras. 2-d curve matching using high curvature points: Application to stereovision. pages 567-576. Int. Conf. Pattern Recognition, 1990.

[25] R. Deriche and G. Giraudon. A computational approach for corner and vertex detection. International Journal of Computer Vision, 10:101-124, 1993.

[26] P.K. Doenges, T.K. Capin, F. Lavagetto, J. Ostermann, I.S. Pandzic, and E.D. Petajan. Mpeg-4: Audio/video and synthetic graphics/audio for mixed media. Signal Processing: Image Communication, 9(4):433-464, May 1997.

[27] L. Dreschler and H.H. Nagel. On the selection of critical points and local curvature extrema of region boundaries for interframe matching. pages 542-544. Int. Conf. Pattern Recognition, 1981.

[28] M. Dudon. Modelisation du mouvement par Treillis Actifs et methodes d'estimation associees. Application au codage de sequences d'images. PhD thesis, Universite de Rennes I, December 1996.

[29] M. Dudon, O. Avaro, and G. Eude. Object-oriented motion estimation. pages 284-287, Sacramento, September 1994. Picture Coding Symposium (PCS).

[30] M. Dudon, G. Eude, and C. Roux. Motion estimation and triangular active mesh. Revue HF, (4):47-53, December 1995.

[31] F. Dufaux and F. Moscheni. Motion estimation techniques for digital tv: a review and a new contribution. Proceedings of IEEE, 83(6):858-876, June 1995.

[32] F. Dufaux. Multigrid Block Matching Motion Estimation for Generic Video Coding. PhD thesis, Ecole Polytechnique Federale de Lausanne, 1994.

[33] D.P. Elias and K.K. Pang. Obtaining a coherent motion field for motion-based segmentation. pages 541-546, Melbourne, March 1996. Picture Coding Symposium (PCS).

[34] M. Etoh, C.S. Boon, and S. Kadono. An object-based image coding scheme using alpha channel and multiple templates. pages 853-871, Louvain-la-Neuve, May 1996. European Conference on Multimedia Applications, Services and Techniques (ECMAST).

[35] R.W. Floyd and R. Beigel. The Language of Machines - An Introduction to Computability and Formal Languages. Computer Science Press, 1994.

[36] C.-S. Fuh and P. Maragos. Affine models for image matching and motion detection. pages 2409-2412, Toronto, Canada, May 1991. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP).

[37] T. Gaidon. Quantification vectorielle algebrique et ondelettes pour la compression de sequences d'images. PhD thesis, Universite de Nice-Sophia Antipolis, Ecole Doctorale Sciences Pour l'Ingenieur, December 1993.

[38] R.G. Gallager. Information Theory and Reliable Communication. John Wiley & Sons, Inc., New York, 1968.

[39] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721-741, November 1984.

[40] C. Graffigne. Approche region: methodes markoviennes. In J.P. Cocquerez and S. Philipp, editors, Analyse d'images: filtrage et segmentation, chapter XI, pages 281-304. Masson, Paris, October 1995.

[41] R.M. Gray. Vector quantization. IEEE ASSP Magazine, pages 4-29, April 1984.

[42] MPEG Video group. MPEG-4 Video Verification Model, version 10.1, ISO/IEC SC29/WG11 (M3464). Tokyo, March 1998.

[43] E. Hanssens, B. Chupeau, J.D. Legat, and B. Macq. Selective prediction of error transmission using motion information. Signal Processing: Image Communication, 12(1):71-81, March 1998.

[44] R.M. Haralick and L.G. Shapiro. Glossary of computer vision terms. In Computer and Robot Vision, volume 2, chapter 21, pages 571-614. Addison-Wesley Publishing Company, 1993.

[45] R.M. Haralick, S.R. Sternberg, and X. Zhuang. Image analysis using mathematical morphology. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4):532-550, July 1987.

[46] C. Harris and M. Stephens. A combined corner and edge detector. pages 147-151, Manchester, 1988. Proc. Alvey Vision Conference.

[47] J.F. Hayes, A. Habibi, and P.A. Wintz. Rate distortion function for a gaussian source model of images. IEEE Transactions on Information Theory, IT-16(4):507-509, July 1970.

[48] B. Horn. Robot Vision. MIT Press, Cambridge, Massachusetts, The MIT Electrical Engineering and Computer Science Series, 1986.

[49] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, (17):185-203, 1981.

[50] J.R. Jain and A.K. Jain. Displacement measurement and its application in interframe image coding. IEEE Transactions on Communications, 29(12):1799-1806, December 1981.

[51] R. Jain. Direct computation of the focus of expansion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(1):58-64, January 1983.

[52] A.J. Jerri. The shannon sampling theorem - its various extensions and applications: A tutorial review. Proceedings of IEEE, 65(11):1565-1596, November 1977.

[53] H. Jozawa. Motion compensated video coding using rotation and scaling models. volume 1, pages 309-312, Melbourne, March 1996. Picture Coding Symposium (PCS).

[54] R.G. Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1153-1160, December 1981.

[55] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220, May 1983.

[56] L. Kitchen and A. Rosenfeld. Gray-level corner detection. Pattern Recognition Letters, 1:95-102, December 1982.

[57] R. Koenen, F. Pereira, and L. Chiariglione. Mpeg-4: Context and objectives. Signal Processing: Image Communication, 9(4):295-304, May 1997.

[58] D.G. Krige. A statistical approach to some mine valuations and allied problems on the Witwatersrand, 1951.

[59] M. Kunt, M. Benard, and R. Leonardi. Recent results in high-compression image coding. IEEE Transactions on Circuits and Systems, 34(11):1306-1336, November 1987.

[60] G.G. Langdon Jr. An introduction to arithmetic coding. IBM Journal of Research and Development, 28(2):135-149, March 1984.

[61] F. Lavagetto and S. Curinga. Object-oriented scene modeling for interpersonal video communication at very low bit-rate. Signal Processing: Image Communication, (6):379-395, June 1994.

[62] D. Le Gall. Mpeg: A video compression standard for multimedia applications. Communications of the ACM, 34(4):46-58, April 1991.

[63] H. Li, A. Lundmark, and R. Forchheimer. Image sequence coding at very low bitrates: A review. IEEE Transactions on Image Processing, 3(5):589-609, September 1994.

[64] B. Macq. A universal entropy coder for transform or hybrid coding. pages 12.1.1-12.1.2, Boston, 1990. Picture Coding Symposium (PCS).

[65] B. Macq, M.P. Queluz, and B. Simon. Very low bit-rate image coding on adaptive multigrids. Signal Processing: Image Communication, 7(4-6):313-331, November 1995.

[66] S.G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693, July 1989.

[67] X. Marichal. Universal entropy coding of segmentation trees, June 1994.

[68] X. Marichal. An original approach towards very-low bit-rate image coding. Revue HF, (4):29-46, December 1995.

[69] X. Marichal, C. De Vleeschouwer, T. Delmot, and B. Macq. Automatic detection of interest areas of an image or of a sequence of images. volume III, pages 371-374, Lausanne, September 1996. Int. Conf. on Image Processing (ICIP).

[70] X. Marichal, C. De Vleeschouwer, T. Delmot, and B. Macq. Object based coding through multigrid representation. volume 2668, pages 6-17, San Jose, January 1996. SPIE Digital Visual Communications: algorithms and technologies.

[71] X. Marichal, C. De Vleeschouwer, and B. Macq. Towards visual search engine based on fuzzy logic. pages 135-141, Louvain-la-Neuve, June 1997. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'97).

[72] X. Marichal, T. Delmot, C. De Vleeschouwer, and B. Macq. Determination automatique des regions d'interet d'une image ou d'une sequence d'images. pages 228-234, Grenoble, February 1996. Journees d'etudes et d'echanges du CNET.

[73] X. Marichal, T. Delmot, B. Macq, F. Oger, V. Warscotte, J.P. Thiran, and B. Simon. Towards object-based coding through multigrid representation. pages H-1, Tokyo, November 1995. International Workshop on Coding Techniques for Very Low Bit-rate Video (VLBV).

[74] X. Marichal and B. Macq. Asymmetric motion estimation/compensation. volume III, pages 775-778, Lausanne, September 1996. Int. Conf. on Image Processing (ICIP).

[75] X. Marichal and B. Macq. Motion reconstruction with wireframe structures. pages 633-637, Melbourne, March 1996. Picture Coding Symposium (PCS).

[76] X. Marichal and B. Macq. Active mesh reconstruction of block-based motion information. volume V, pages 2605-2608, Seattle, May 1998. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP).

[77] X. Marichal, B. Macq, and M.P. Queluz. Generic coder for binary sources: the m-coder. IEE Electronics Letters, 31(7):544-545, March 1995.

[78] G. Matheron. Estimating and Choosing. Springer-Verlag, Berlin, 1989.

[79] G. Medioni and Y. Yasumuto. Corner detection and curve representation using cubic b-splines. Computer Vision, Graphics, and Image Processing, 39:267-278, 1987.

[80] Jerry M. Mendel. Fuzzy logic systems for engineering: A tutorial. Proceedings of IEEE, 83(3):345-377, March 1995.

[81] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equations of state calculations by fast computing machines. Journal Chem. Phys., 21:1087-1091, 1953.

[82] D.P. Mitchell and A.N. Netravali. Reconstruction filters in computer graphics. Computer Graphics, 22(4):221-228, August 1988.

[83] F. Mokhtarian and A.K. Mackworth. Scale-based description and recognition of planar curves and 2d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1):34-43, 1986.

[84] MORPHECO. Morphological segmentation-based coding of image sequences. pages 2.2.1-2.2.8, Hannover, December 1993. COST 211ter European Workshop on New Techniques for Coding of Video Signals at Very Low Bitrates.

[85] H.G. Musmann, M. Hotter, and J. Ostermann. Object-oriented analysis-synthesis coding of moving images. Image Communication, 1(2):117-138, October 1989.

[86] H.G. Musmann, P. Pirsch, and H.-J. Grallert. Advances in picture coding. Proceedings of the IEEE, 73(4):523–548, April 1985.

[87] L. Najman and R. Vaillant. Topological and geometrical corners by watershed. volume LNCS 970, pages 262–269, Prague, September 1995. Computer Analysis of Images and Patterns.

[88] Y. Nakaya and H. Harashima. An iterative motion estimation method using triangular patches for motion compensation. pages 546–557. Society of Photo-Instrumentation Engineers, November 1991.

[89] N.M. Nasrabadi and R.A. King. Image coding using vector quantization: a review. IEEE Transactions on Communications, 36(8):81–95, August 1988.

[90] R. Neff and A. Zakhor. Very low bit rate video coding based on matching pursuits. IEEE Transactions on Circuits and Systems for Video Technology, 7(1):158–171, February 1997.

[91] A.N. Netravali and J.D. Robbins. Motion-compensated television coding: Part I. Bell System Technical Journal, 58(3):631–670, March 1979.

[92] H. Nicolas and C. Labit. Motion and illumination variation estimation using a hierarchy of models: application to image sequence coding. Technical report, IRISA, Rennes, June 1993.

[93] J. Nieweglowski, T. Campbell, and H. Haavisto. A novel video coding scheme based on temporal prediction using digital image warping. IEEE Transactions on Consumer Electronics, 39:141–150, August 1993.

[94] J.A. Noble. Finding half boundaries and junctions in images. Image and Vision Computing, 10(4):219–232, May 1992.

[95] Telecommunication Standardization Sector of ITU. ITU-T Recommendation H.261 - Video codec for audiovisual services at p × 64 kbit/s. ITU Recommendations, March 1993.

[96] Telecommunication Standardization Sector of ITU. Draft ITU-T Recommendation H.263. ITU Recommendations, July 1995.



[97] Ad Hoc Group on MPEG-4 Test Procedures. MPEG-4 test/evaluation procedures document - revision 2.0. MPEG-4 meeting, May 1995.

[98] J.B. O'Neal and T.R. Natarajan. Coding isotropic images. IEEE Transactions on Information Theory, 23(6):697–707, November 1977.

[99] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 3rd edition, 1991.

[100] F. Pereira. MPEG-4: a new challenge for the representation of audiovisual information. pages 7–16, Melbourne, March 1996. Picture Coding Symposium (PCS).

[101] F. Pereira. First proposals for MPEG-7 visual requirements. Bristol meeting, April 1997.

[102] F. Pereira and P. Geada. Sketch-based database searching: a demonstration of an MPEG-7 application. Bristol meeting, April 1997.

[103] F. Pereira and R. Koenen. Very low bit-rate audio-visual applications. Signal Processing: Image Communication, 9(1):55–77, November 1996.

[104] F. Pereira, K. O'Connell, R. Koenen, and M. Etoh. Special issue on MPEG-4, part 1: invited papers. Signal Processing: Image Communication, 9(4), May 1997.

[105] W.K. Pratt. Digital Image Processing. John Wiley & Sons, New York, 1978.

[106] W.K. Pratt. Photometry and colorimetry. In Digital Image Processing [105], chapter 3, pages 50–90.

[107] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C - The Art of Scientific Computing. Cambridge University Press, Cambridge, 2nd edition, 1992.

[108] M.P. Queluz. Motion estimation for video coding: a review. Revue HF, (4):5–28, December 1995.

[109] M.P. Queluz. Multiscale Motion Estimation and Video Compression. PhD thesis, Université catholique de Louvain, April 1996.



[110] M.P. Queluz and B. Macq. Signal-adapted motion compensation for video compression. Paris, September 1992. ISSES, URSI.

[111] M.P. Queluz, X. Marichal, and B. Macq. Efficient entropy coding of tree data structures. pages 478–481, Sacramento, September 1994. Picture Coding Symposium (PCS).

[112] M.P. Queluz, B. Simon, and B. Macq. Towards a spatio-temporal segmentation technique for very-low bitrate image coding. Hannover, December 1993. COST 211ter European Workshop on New Techniques for Coding of Video Signals at Very Low Bitrates.

[113] M. Rabbani and P.W. Jones. Digital Image Compression Techniques. SPIE Optical Engineering Press, Georgia Institute of Technology, 2nd edition, 1991.

[114] K. Rohr. Localization properties of direct corner detectors. Journal of Mathematical Imaging and Vision, 4:139–150, 1994.

[115] M. Rydfalk. CANDIDE: a parametrised face. Technical report, Dept. Electr. Eng., Linköping University, Linköping, Sweden, 1987.

[116] D.G. Sampson, E.A. da Silva, and M. Ghanbari. Interframe image sequence coding using overlapped motion estimation and wavelet lattice quantisation. pages 16–20, Edinburgh, July 1995. Fifth International Conference on Image Processing and its Applications.

[117] H. Sanson. Motion affine models identification and application to television image coding. volume 1605-2, pages 570–581, Boston, November 1991. Visual Communication and Image Processing.

[118] J. Serra. Image Analysis and Mathematical Morphology. Academic Press, London, 1982.

[119] S.W. Sloan. A fast algorithm for constructing Delaunay triangulations in the plane. Adv. Eng. Software, 9(1):34–55, 1987.

[120] P. Strobach. Tree-structured scene adaptive coder. IEEE Transactions on Communications, 38(4):477–486, April 1990.

[121] M. Tekalp et al. The status of core experiment M2. Technical report, MPEG 96/1102, July 1996.



[122] C. Toklu, T. Erdem, I. Sezan, and M. Tekalp. Tracking motion and intensity variations using hierarchical 2-D mesh modeling for synthetic object transfiguration. Computer Vision, Graphics, and Image Processing: Graphical Models and Image Processing, 58(6):553–573, November 1996.

[123] R.Y. Tsai and T.S. Huang. Estimating three-dimensional motion parameters of a rigid planar patch. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1147–1152, December 1981.

[124] R.Y. Tsai and T.S. Huang. Estimating three-dimensional motion parameters of a rigid planar patch, III: finite point correspondences and the three-view problem. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2):213–220, April 1984.

[125] R.Y. Tsai, T.S. Huang, and W.L. Zhu. Estimating three-dimensional motion parameters of a rigid planar patch, II: singular value decomposition. IEEE Transactions on Acoustics, Speech and Signal Processing, 30(4):525–534, August 1982.

[126] Y. Tse and R. Baker. Global zoom/pan estimation and compensation for video compression. pages 2725–2728. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 1991.

[127] G. Tziritas and C. Labit. Motion Analysis for Image Sequence Coding. Elsevier, Amsterdam, 1994.

[128] J. Vaisey and A. Gersho. Image compression with variable block size segmentation. IEEE Transactions on Acoustics, Speech and Signal Processing, 40(8):2040–2060, August 1992.

[129] C. van den Branden Lambrecht and O. Verscheure. Perceptual quality measure using a spatio-temporal model of the human visual system. volume 2668, pages 450–461, San Jose, February 1996. SPIE Electronic Imaging: Science and Technology.

[130] M. Van Droogenbroeck and H. Talbot. Fast computation of morphological operations with arbitrary structuring elements. Pattern Recognition Letters, 17:1451–1460, 1996.

[131] F. Vermaut. Un algorithme distribué pour la compensation de mouvements: Distributed ABMA, June 1997.



[132] F. Vermaut, Y. Deville, B. Macq, and X. Marichal. A distributed adaptive block matching algorithm: DIS-ABMA. Accepted for publication, Island of Rhodes, September 1998. EUSIPCO'98.

[133] A. Verri and T. Poggio. Motion field and optical flow: qualitative properties. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:490–498, May 1989.

[134] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, June 1991.

[135] G.K. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34(4):30–44, April 1991.

[136] Y. Wang and O. Lee. Active mesh - a feature seeking and tracking image sequence representation scheme. IEEE Transactions on Image Processing, 3(5):610–624, September 1994.

[137] J.W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23(5):846–850, October 1978.

[138] Y. Yokoyama, Y. Miyamoto, and M. Ohta. Very low bitrate video coding using warping prediction adaptive to object contours. pages M-4, Tokyo, November 1995. International Workshop on Coding Techniques for Very Low Bit-rate Video (VLBV).

[139] O.A. Zuniga and R.M. Haralick. Corner detection using the facet model. pages 30–37. Int. Conf. Pattern Recognition, 1983.
