
LABORATOIRE DE TELECOMMUNICATIONS ET TELEDETECTION
B-1348 Louvain-la-Neuve, Belgium

MOTION ESTIMATION AND COMPENSATION
FOR VERY LOW BITRATE VIDEO CODING

Xavier MARICHAL

Thesis presented for the degree of
Docteur en Sciences Appliquées

Jury composed of:

Benoît MACQ (UCL/TELE) - Supervisor
Paul DELOGNE (UCL/TELE) - Examiner
Jean-Didier LEGAT (UCL/DICE) - Examiner
Ferran MARQUES (UPC - Barcelona) - Examiner
Thomas SIKORA (HHI - Berlin) - Examiner
Luc VAN GOOL (KUL - Leuven) - Examiner
Piotr SOBIESKI (UCL/TELE) - President

May 1998


We can easily communicate from one continent to another, yet one man still does not know how to enter into contact with another man.

Vaclav Havel


Foreword

Completing a doctoral thesis certainly requires from the student a considerable amount of personal work, as well as a certain dose of tenacity. Nevertheless, it would have been impossible for me to accomplish such an effort without the support of many people, to whom I would like to express my gratitude here.

First of all, I would like to thank Benoît Macq, my supervisor. First, for the confidence he showed in me by proposing that I undertake this work. Then, for his guidance, just firm enough to keep me from scattering my efforts, yet never encroaching on my freedom of research. I also thank the FRIA for the financial support it granted me during these three and a half years; and my sincere gratitude goes to Messrs Watteyne, Preux and Mertes, who supported my application for this grant.

I wish to thank the members of my jury, Paul Delogne, Jean-Didier Legat, Ferran Marques, Thomas Sikora, Luc Van Gool and Piotr Sobieski, for their remarks and comments, as judicious as they were precise, towards improving the final version of the text.

Carried out within the Laboratoire de Télécommunications et Télédétection, the present thesis benefited greatly from the material and human environment offered there. I particularly wish to underline the exceptional contribution of the people with whom I shared office A.176 for several years: Xavier, who played the role of mentor in my early days; Olivier, whose energy served as a catalyst; Vicent, for his ever-present smile and availability; Jean-François, whose outspokenness is matched only by his dynamism; and above all Christophe, to whom I am notably indebted for an annotated proofreading of the present manuscript. It is unfortunately impossible for me to name everyone, but I extend my warmest regards to all the other members of the Laboratory's scientific staff, without whom it would not be what it is. With a particular nod to those who shared the COMIS experience. Many thanks also to the "technical" staff for their precious help in solving numerous practical problems.

Travelling companions, parents, sisters, grandparents, family, in-laws, friends: may these few lines convey my affection and my gratitude for your support throughout all these years.

Unfairly, it always falls to the dearest ones to be thanked last. To Véronique, who had the patience to help me tirelessly correct this text, written in a language whose subtleties I do not fully master. Companion of every moment, she is the source that gives me the strength to undertake and to achieve. To Joauma who, if not always an inspiring muse, brought a formidable ray of sunshine into our lives. My most tender thanks are reserved for them both.

Xavier
8 May 1998


Abstract

Motion estimation is a key issue in the field of moving-image analysis. In the framework of video compression, it is combined with motion compensation in order to exploit the spatio-temporal correlation of image sequences along the motion trajectory. It thereby achieves one of the largest compression factors of a video coder. The research presented in this thesis is mainly concerned with improving classical motion estimation and compensation techniques in the context of very-low bitrate transmissions, thanks to content-adapted information that can be extracted from the images. This concept is consecutively, but independently, applied to the various steps of exploiting motion in a video coding scheme: estimation, transmission and compensation.

The manuscript starts with a brief overview of the state of the art in video compression, introducing the underlying concepts while putting some emphasis on very-low bitrate conditions. Motion estimation is further detailed in the following chapter, which includes a description of motion representation and of disturbing phenomena, as well as a review of the most commonly used estimation techniques. The contributions of the thesis are then presented.

First, a reliable motion estimation technique based on multiscale algorithms is introduced: it outputs segmented and more coherent motion fields that are adapted to the spatial contents of the images and can be coded more efficiently. A model for distributing the engendered computational burden is also proposed and demonstrates a linear speed-up. Secondly, the transmission of moving images is analyzed in the light of Rate-Distortion theory. Because of the very-low bitrate constraint, some spatial pre-processing is performed prior to motion estimation in order to raise the correlation between the already encoded images and the new ones. Prospects are also formulated for selective pre-processing according to content relevance. Thirdly, the reconstruction (compensation) of block-based motion fields is subjectively improved by using image warping techniques. A new corner detector is set up so as to automatically design an active mesh on the reference image, thanks to the Delaunay triangulation. Inverse kriging interpolation then allows one to determine the motion of the mesh vertices from the original vector field. The result is the suppression of blocking artefacts, while the mesh structure offers the possibility to further edit and modify the images. The present thesis thus analyses three prospects for improving the quality of existing video schemes.


Résumé

Motion estimation is of capital importance in the field of moving-image analysis. Combined with motion compensation, it makes it possible to best exploit the spatio-temporal correlation inherent in image sequences, and thereby provides the most significant compression factor of a video coder. The present thesis has mainly set out to improve classical motion estimation and compensation techniques, in a context of very-low bitrate transmission, by using signal-adapted information that can be extracted from the images themselves. This approach is applied independently to the different steps of exploiting motion information within a video coding scheme: estimation, transmission and compensation.

The text begins with an overview of the state of the art in video compression. It introduces the basic concepts and puts the emphasis on the conditions specific to very-low bitrate coding. The principle of motion estimation is then exposed in detail, including an account of the most widely used techniques. The contributions of the work are then presented.

First of all, a motion estimation technique based on multiscale tools is proposed: it yields coherent motion fields, segmented and adapted to the spatial content of the images, which allows their efficient coding. A model for distributing the computational load of the algorithm is established and results in a linear speed-up. Secondly, the transmission of image sequences is analyzed in the light of Rate-Distortion theory. In view of the very-low bitrate constraint, a spatial pre-processing of the images is performed prior to motion estimation. The goal is to increase the correlation between the images already coded and the others. Avenues are also traced for selective processing based on content. Finally, the reconstruction (compensation) of block-based motion descriptions is improved by image warping techniques. A new "corner" detection tool is developed in order to automatically build an active mesh on the reference image thanks to the Delaunay triangulation. Inverse kriging then makes it possible to attribute a motion vector to each node of the mesh by interpolating the original field. Blocking artefacts are thus suppressed. Moreover, the mesh technique offers extended possibilities for image manipulation. The present thesis thus analyses three avenues for improving the quality of current video systems.


Contents

Introduction
  Preamble
  Introduction and Thesis Outline
  Contributions of the Thesis

1 Digital Video Coding at Very-Low BitRate
  1.1 Digital Video
  1.2 Video Coding
    1.2.1 Image Compression
    1.2.2 Video Compression
  1.3 Coding at Very-Low BitRate
  1.4 Some (Very-) Low BitRate Codecs
    1.4.1 H.263
      1.4.1.1 Intra Macroblocks
      1.4.1.2 Inter Macroblocks and Residues
      1.4.1.3 Options
      1.4.1.4 Future Improvements: H.263+
    1.4.2 "COMIS", the UCL Approach
      1.4.2.1 Intra Images
      1.4.2.2 Inter Images
    1.4.3 The Emerging MPEG-4 Standard
      1.4.3.1 Developments to Be Supported
      1.4.3.2 A New Challenge for the Representation of Audio-Visual Information
      1.4.3.3 Video Coding in MPEG-4
  1.5 Discussion

2 Motion in the Framework of Video Coding
  2.1 Image Formation and Motion
    2.1.1 Apparent versus Real Motion
    2.1.2 Unsolvable Problems of Motion Estimation
  2.2 Rate Distortion Theory
  2.3 Practical Approaches to Motion Estimation
    2.3.1 Additional Constraints
      2.3.1.1 Preservation Constraint
      2.3.1.2 Coherence Constraint
    2.3.2 Estimation Methods
      2.3.2.1 Multigrid and Multiscale Optimization Methods
      2.3.2.2 Forward versus Backward Estimation
  2.4 Background Techniques for Motion Estimation
    2.4.1 Linear Regression
    2.4.2 Iterative Motion Estimation
    2.4.3 Pel-Recursive Algorithms
    2.4.4 Stochastic Estimation Relying on Markov Random Field
    2.4.5 Parametric Models of the Motion Field
    2.4.6 Within a Transform Domain
  2.5 The Block-Matching Algorithm (BMA)
    2.5.1 BMA Principle
      2.5.1.1 Search Techniques
      2.5.1.2 Advanced Possibilities
      2.5.1.3 Result
    2.5.2 Overlapped BMA
  2.6 Image Warping Techniques
    2.6.1 The Hexagonal Matching Algorithm (HMA)
    2.6.2 Adaptive Hexagonal Matching Algorithm (AHMA)
  2.7 Conclusion

3 Multiscale Block Matching Algorithm
  3.1 Adaptive Block Matching Algorithm
    3.1.1 Global Motion Estimation
    3.1.2 Change Detection
    3.1.3 Local Motion Estimation
    3.1.4 Results
  3.2 Distributed Version of the Local Motion Estimation
    3.2.1 Pseudo-Code of the Sequential Loop
    3.2.2 Model of Distribution
    3.2.3 Practical Implementation
      3.2.3.1 Data Structures
      3.2.3.2 State Transitions
    3.2.4 Experimental Results
  3.3 Conclusion

4 Image Pre-Processing for VLBR Video Coding
  4.1 Intuitive Rationale
  4.2 Rate Distortion Conditions
    4.2.1 Image Model
    4.2.2 Intra Coding of Images
    4.2.3 Inter Coding without Motion Compensation
      4.2.3.1 Usual Scheme
      4.2.3.2 Proposed Scheme
    4.2.4 Inter Coding with Motion Compensation
    4.2.5 Theoretical Conclusion
  4.3 Experimental Results
  4.4 Conclusion

5 Mesh-Based Motion Compensation
  5.1 Estimation, Transcription and Reconstruction
    5.1.1 The Proposed Reconstruction
    5.1.2 Problems to be Addressed
      5.1.2.1 Mesh Vertices
      5.1.2.2 Motion Interpolation
      5.1.2.3 Reversing the Motion Information
  5.2 Mesh Design
    5.2.1 Detecting Edges
      5.2.1.1 Enhancing Image Boundaries
      5.2.1.2 Following Half Boundaries
    5.2.2 Corner Extraction
  5.3 Motion Transcription
    5.3.1 Reversing the Sense of the Motion Information
    5.3.2 Interpolating the Motion Values
    5.3.3 Mesh Connectivity
    5.3.4 First Results
  5.4 Results
  5.5 Conclusion

Conclusion

A VLBR-like Video Sequences

B Rate Distortion Theory
  B.1 Information Theory
  B.2 Discrete Memoryless Sources and Single-Letter Distortion
  B.3 Rate Distortion Function
  B.4 Extension to Moving Pictures
    B.4.1 Note on Predictive Coding
    B.4.2 Intra Images
    B.4.3 Inter Images without Motion Compensation
    B.4.4 Inter Images with Motion Compensation

C Markov Random Fields
  C.1 Bayesian Model
  C.2 Markov Random Fields
  C.3 Gibbs Measure and Markov Fields
  C.4 System Solution

D Complements to Chapter 5
  D.1 Triangulation
    D.1.1 Definitions
    D.1.2 Delaunay Triangulation
  D.2 Pseudo-Code of the Implementation
    D.2.1 Compensation Scheme
    D.2.2 Corner Detection
    D.2.3 Inverse Kriging System

Bibliography


Introduction

Preamble

1998

Since the early ages of prehistory, mankind has slowly but inextinguishably tried to understand its environment so as to adapt to it and, whenever possible, influence it. It probably started in Mesopotamia, the "cradle of civilization", where the Sumerians (3500-3000 BC) erected the first cities and invented the wheel. Writing, agriculture and many other characteristics of our modern age were invented at that time. The well-known civilizations of Egypt and of the Indus valley pursued the initial movement of what we today mean by "progress".

Ancient Greece initiated the aspiration to understand both natural mechanisms, which found an accomplishment in the first space shuttles, and human nature. The latter profoundly affected Western philosophy through the discourse of eminent thinkers like Socrates (c. 470-c. 399 BC), whose contribution was essentially ethical in character, for instance when he asserted that "Bad men live that they may eat and drink, whereas good men eat and drink that they may live". His pupil Plato (c. 428-c. 347 BC) probably is the father of political thought, while Aristotle (384-322 BC) offers a perfect synthesis of what a man can be as both a scientist and a philosopher.

Societies have been organizing themselves, and progress was made in various technical fields, not only in the West. In 271 AD, Chinese mathematicians are reputed to have used a simple compass. The invention was not put to use in Europe for navigation before the 13th century. Printing from carved wood blocks was invented in China in the 6th century. The first book known to have been printed from wood blocks was a Chinese edition of the Diamond Sutra, a Buddhist text, dating from 868, i.e. hundreds of years before the first printed edition of the Bible by Gutenberg (c. 1400-1468).

In 1543, after 25 years of work, the Polish astronomer Nicolaus Copernicus (1473-1543) forced men and women to rethink their place within the solar system. It is only after 150 years that the theory that constitutes the Copernican revolution, reinforced by Galileo's (1564-1642) observations with his telescope, became widely accepted. Since then, the understanding of physical laws has been improved by many scientists, like Johannes Kepler (1571-1630), Sir Isaac Newton (1642-1727), Gottfried Leibniz (1646-1716), James Maxwell (1831-1879) and Albert Einstein (1879-1955). One should not forget Charles Darwin's (1809-1882) other revolution, when he claimed that man was not the center of creation.

Meanwhile, the desire among men and women for freedom had led to other revolutions in France and the USA. Another major change was the Industrial Revolution, i.e. the shift, at different times in different countries, from a traditional agriculturally based economy to an economy based on the mechanized mass production of manufactured goods in large-scale firms.

Nowadays some people already consider the end of our 2nd millennium to be that of the Information Revolution. If revolution there is, it all started with the patent of Alexander Graham Bell (1847-1922), which opened the way to telephony and finally to telecommunication networks. In the 1950s, television appeared and quickly became part of any living room: image was then accompanying sound. Besides this invention, the "Analytical machine" of Charles Babbage (1792-1871) and the invention of the transistor in 1947 resulted in the development of computers. Those came into common use in government and industry during the 1960s, while the 1980s brought small, powerful and inexpensive Personal Computers (PC) into the home. Experts have predicted that by the year 2000, the worldwide revenues of the computer industry will be second to agricultural revenues.

For a few years now, the convergence of the digitized world of computers with the worlds of audio-visual data and telecommunication has taken us to an age of multimedia and information technology. Such a trend is especially visible on the WWW, where pages (created with a computer) that one can download (across the Internet) contain more and more visual information. This evolution is quickly changing the organization of our society as it abolishes both time and distances. Data, sounds and images are now traveling around the world at the speed of light thanks to the nascent information highways. Nevertheless, technological evolution requires an evolution of mentalities. Moreover, the end-user of the proposed services is and remains the human being. A first step in this direction is the stress put by recent systems on the concept of interactivity: the user is not passive anymore but really becomes the pilot of the application.

However, one can question the content of the transmitted information and its real use. Some people already claim that the next revolution is going to be social, but we will stop the preamble here. It is up to every reader to bring answers, up to the masses to make other revolutions, and up to time to help build new worlds after new worlds.

2010

The movie was particularly long and when Joauma got out of the theater it was already dark. She had been deeply moved (or was it intrigued?) by the story and she was not paying attention to the road. When Daddy made the car turn right and stopped in front of the house, she suddenly had to get out though she did not want to. She slowly entered home and climbed the stairs. She put on her pyjamas but could not get to sleep. She opened the window, looked at the deserted street under the faint light of the lamp and listened to the silence of the night.

One has to say that this adaptation of J. Gaarder's "Sophie's World" was particularly brilliant, but for sure not easy to tackle for a twelve-year-old girl. Of course she agreed with the foreword of Goethe that says that "Whoever can not draw the lessons from 3000 years only lives from day to day", but it is not easy to discover Democritus, Socrates, the history of Athens, Plato, Aristotle, Hellenism, various aspects of the Middle Ages, the Renaissance and the baroque, Descartes, Spinoza, Locke, Hume, Berkeley, Kant, Hegel, Kierkegaard, Marx, Darwin, Freud, Einstein, the Big Bang, and all at once. And all these guys alive and kicking thanks to the magic of cinema!

She glanced at the stars as she never did before and finally went to sleep.

Heather Nova slowly started whispering "Heal" but every minute the song was becoming louder... When it changed to "Inside my Head" from Waitstate, the song was just bathing the whole room. Joauma woke up. She looked at her watch: ten o'clock. This Panasonic integrated system ("Pansense, the ultimate solution for senses entertainment" stated the ad), driven with fuzzy logic, had just perfectly memorized (understood?) the way she liked to be woken up on Fridays: a random compilation of her top ten "soft" singles with a crescendo volume.

Friday, the first day of the week-end. As the stars had announced it, the weather was just splendid. And birds were singing. The beginning of a great week-end! Unconsciously, Joauma changed the Pansense program and asked for downloading the week-end edition of "VideoBelche", an online video magazine that appears twice a week and tells her more about her country. After viewing the report, she quickly had a look at her Email. Before leaving the room, she required the system to download all possible information about the movie, and also about the actors, whose pictures she introduced into the scanner box. She asked the system to reserve a video-conferencing line for eleven o'clock and finally went down the stairs.

In the kitchen, breakfast was waiting for her and she had a large slice of bread with some compote of apples (88% apples, sugar, citric acid, ascorbic acid, 12% added sugar) and a glass of orange juice. She cleared the table and went back to her room to get dressed. You usually want to look nice while video-conferencing...

10.55. A video-conferencing line has effectively been reserved. She just dials Moma's (her grandmother's) number.

11.00. The screen-saver animation (including some ads) of her ITU H.666 communication module starts playing on the screen. A few minutes later, the image of Moma appears on the screen. "The image is not as perfect as with Antoine. For sure she only has a 609 system!", she thinks.

- "Hi Joauma! What a nice surprise!" says her grandmother.
- "Hi Moma!", she answered, "How are you doing?"
- ...

She surreptitiously pushes the record button.

She finally logs off, answers "positive" to the automatic query about the quality of transmission and switches to the video editing table. In analysis mode, she opens the recording of her conversation and selects the center of the screen. Once the tracking algorithm has performed its task, she can save the segmented head and body of her grandmother into a separate file.

She then intermingles images, past and present, reality and fiction, films and cartoons. She has her grandmother jump on Terminator's motorbike, and join a sabbath of great historical figures. Should she have her dance with Attila or Mandela? Would she have her wear M. Monroe's dress?

Joauma finally got it! This animation perfectly introduces, with her twelve-year-old humor, the compilation she achieved with all the video archives of the family. Moma's birthday present is ready. She just hopes her grandmother will like it!

Introduction and Thesis Outline


Among all the information that circulates through telecommunications networks, (moving) images occupy an ever-increasing place, with respect to both their contents and their volume. Images effectively require very important storage and transmission capacities. Digitization has opened the way to the automatic treatment of data by computers. As far as (moving) images are concerned, digitization allows easier manipulation and modification, and the possibility to extract characteristics from the images and to encode them: an appropriate analysis of the image data allows modifying the representation of images so as to more easily detect the redundancies (parts of the signal that are similar to other ones) and the irrelevancies (parts of the signal that are not perceived by the human eye). Coding algorithms thus aim at compressing the signal by reducing redundancies and suppressing irrelevancies. International video coding standards, which gather "state of the art" techniques (at the time of the standard) into a uniquely described scheme, enable providing video services using existing networks and storage facilities.

Chapter One introduces such concepts as digitization, redundancy reduction and video coding. It also clarifies the specificity of very-low bitrate coding, which is central to the present work. Very-low bitrate generally refers to a bitrate below 64 kbit/s, which allows transmissions over audio channels (e.g. for audio-visual personal communication services). Moreover, this chapter briefly presents some overall video coding schemes: the ITU H.263 standard and the COMIS scheme, which is a contribution of the present research. It ends by introducing the future ISO MPEG-4 standard, which adds a new dimension to video coding as it offers the possibility to separately encode the different objects of a scene and therefore to interact with these objects at the decoder end.

Time-varying image sequences can be compressed by independently coding each frame (intra-frame image coding) or by extending spatial coding techniques to the time dimension (e.g. 3D transform coding). However, the main characteristic of a video sequence is precisely its spatio-temporal component: most of the information in an image sequence is the result of motion. A lot of effort has therefore been put into motion analysis of video sources. The range of applications of such an analysis includes, but is not limited to, automatic tracking of targets, piloting of robots, event detection for surveillance, tridimensional reconstruction of objects, image restoration... In a video coding context, motion analysis is mainly used to reduce the inter-image redundancy: instead of coding every new frame on its own basis, references are searched for in the previously coded image. From a practical point of view, it means that one searches for parts of the new picture which are already present in the previous frame and which have just undergone some movement. Once the motion parameters have been estimated and transmitted to the decoder, their application provides a very good prediction of the new image. This technique, referred to as "motion estimation and compensation", achieves one of the largest compression factors in a video coder thanks to its radical reduction of the spatio-temporal redundancy.
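This search-and-predict loop can be sketched in a few lines of Python. The following is an illustrative toy, not the thesis's implementation: it assumes 8-bit grayscale numpy frames whose dimensions are multiples of the block size, an exhaustive search, and the sum of absolute differences (SAD) as matching criterion.

import numpy as np

def block_matching(prev, curr, block=16, search=7):
    # For each block of the new frame, find the displacement into the
    # previous frame (within +/- search pels) that minimises the SAD.
    h, w = curr.shape
    field = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            target = curr[by:by + block, bx:bx + block].astype(int)
            best, best_sad = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        sad = np.abs(target - prev[y:y + block, x:x + block]).sum()
                        if sad < best_sad:
                            best, best_sad = (dy, dx), sad
            field[by // block, bx // block] = best
    return field

def compensate(prev, field, block=16):
    # The decoder side: predict the new frame by copying each block from
    # its matched position in the reference frame.
    pred = np.empty_like(prev)
    for i, j in np.ndindex(*field.shape[:2]):
        dy, dx = field[i, j]
        pred[i * block:(i + 1) * block, j * block:(j + 1) * block] = \
            prev[i * block + dy:i * block + dy + block,
                 j * block + dx:j * block + dx + block]
    return pred

Only the vector field is transmitted; the decoder rebuilds the prediction from the previously decoded frame, and the coder only needs to send the (much cheaper) prediction residue.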

Since understanding image formation is a prerequisite for fully grasping the methods to recover motion information from images, Chapter Two starts with a short description of how images are generated and how the real tridimensional motion of objects results in motion in the bidimensional picture plane. Chapter Two simultaneously presents the phenomena that can disturb or prevent a correct motion estimation, and stresses the ill-posed nature of the problem. The chapter provides a Rate-Distortion justification of the use of motion estimation in video coders and introduces the various models and methodologies that can constitute the basis of different motion algorithms. Classical and emerging motion estimation techniques developed for image coding purposes are detailed, and the two of them (namely the Block-Matching Algorithm, BMA, and the image warping technique) which are used in the present work conclude the chapter.

More broadly, the present thesis mainly deals with motion estimation and compensation in a video coding context. While the BMA is the most widely used technique for motion estimation, because it has emerged as the one achieving the best compromise between complexity and quality, it does not at all take the content depicted in the images into account. The aim of our work is thus to explore new possibilities for helping the BMA to benefit as much as possible from the content-adapted information that can be extracted from the images. This concept is consecutively, but independently, applied to the various steps of exploiting motion in a video coding scheme: estimation, transmission and compensation.

Chapter Three first tackles the motion estimation problem. It carries on with the research achieved by M.P. Queluz, i.e. a new algorithm to estimate the motion between successive frames. The Adaptive BMA (ABMA) implements a measure of the (un)certainty of the motion analysis achieved by the BMA, so as to adapt the block size along object edges and avoid having different moving objects in the same block. On the other hand, a merge procedure, based on motion confidence measures, is applied to correctly propagate the motion vectors from blocks with reliable motion to blocks with uncertain motion. The ABMA is a multiscale BMA algorithm which uses a quad-tree structure to represent the various steps of its split-and-merge procedure. The ABMA thereby performs an adaptation of the motion field estimation to the spatial contents of the image.
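The splitting half of this procedure can be caricatured as follows. The sketch below is a strongly simplified stand-in, not the ABMA itself: the thesis's confidence measures are replaced by a crude ambiguity test on the SAD surface, and the merge procedure is omitted. A block whose best match is not clearly unique, as happens when it straddles two differently moving objects, is split into four quadrants that are estimated separately.

import numpy as np

def sad_surface(prev, curr, by, bx, size, search=7):
    # SAD of the block at (by, bx) against every candidate displacement;
    # candidates falling outside the frame get an infinite cost.
    h, w = curr.shape
    target = curr[by:by + size, bx:bx + size].astype(int)
    surf = np.full((2 * search + 1, 2 * search + 1), np.inf)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= h - size and 0 <= x <= w - size:
                surf[dy + search, dx + search] = \
                    np.abs(target - prev[y:y + size, x:x + size]).sum()
    return surf

def adaptive_estimate(prev, curr, by, bx, size, min_size=4, search=7):
    # Crude ambiguity test: if several displacements come within 10% of
    # the best SAD, the motion of this block is considered uncertain and
    # the quad-tree is split one level further.
    surf = sad_surface(prev, curr, by, bx, size, search)
    ambiguous = (surf <= 1.1 * surf.min() + 1e-9).sum() > 1
    if ambiguous and size > min_size:
        half = size // 2
        return [adaptive_estimate(prev, curr, by + oy, bx + ox, half,
                                  min_size, search)
                for oy in (0, half) for ox in (0, half)]
    dy, dx = np.unravel_index(int(surf.argmin()), surf.shape)
    return (by, bx, size, (dy - search, dx - search))  # a leaf of the quad-tree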

Although our contribution to the ABMA mainly consists in implementing and fine-tuning the scheme, we also had the opportunity to supervise a Master's thesis whose aim was to distribute the computational burden of the ABMA among several processors, because the calculation of the various confidence measures, and of several BMAs for the same block, seriously strains the BMA performances. A distributed model of the ABMA has thus been theoretically established. A practical "master-slave" version has been derived from this model and demonstrates a linear speed-up.

Chapter Four indirectly takes on the transmission of motion parameters. Since very-low bitrate channels force video coders to debase more information than only the irrelevant part of the signal, the image used as a reference for motion estimation (i.e. the previously coded one) no longer contains the information one would like to find in it, and especially the information needed to correctly predict the new image. Motion estimation is then partly noise-driven, and the result is a sparse motion field that is more difficult to encode efficiently. This analysis is detailed at the beginning of Chapter Four and leads to the hypothesis that voluntarily simplifying the contents of the images to code could be pertinent. The motion estimation would then be performed between two images with the same characteristics in terms of the accuracy of the contents description, and it is expected that the resulting motion field will be easier to encode.

This idea of pre-processing is first formalized with the help of Rate-Distortion theory: both intra-coding and pre-processing are modeled as the combination of a low-pass filter and additive white noise. Conditions for improving the coder performances are derived from this model. The influence of various types of pre-processing on coding image sequences with the ITU H.263 standard is then experimentally tested. These pre-processings are: intra-coding and Gaussian, median or morphological filtering. Prospects are also formulated for selective pre-processing according to content relevance.

The aim of Chapter Five, which focuses on compensation, is to demonstrate that it is possible to subjectively improve the result of the motion compensation stage (at the decoder) without modifying the estimation (at the encoder) or the transmission (the bitstream)¹. This is possible by taking the spatial contents of the reference image into account in order to adapt the motion information to it.

The image warping techniques that have been described at the end of Chapter Two offer such a possibility, namely to easily adapt the motion information to an irregular grid. However, their estimation phase is very effort-demanding because of its iterative nature. On the contrary, the BMA reveals itself a very efficient estimation method, while its compensation stage generates an image that suffers from so-called "blocking artifacts". We therefore suggest using an asymmetric scheme that consists of a BMA estimation and a warping compensation. The warping technique involves meshes made out of triangular patches (which ensures a direct link with the affine transform). Moreover, warping techniques (should they use triangular, quadrilateral or other structures) are very well suited for further editing and modification of images, manipulations that an increasing number of users like to achieve.

¹ Of course, if one wants to use the proposed reconstruction scheme in the coding loop, one has to modify the coder so as to include the developed tool, which in turn will modify the contents (not the structure) of the bitstream.



Once the proposed scheme is outlined, Chapter Five presents solutions to its two main steps: the automatic design of a mesh adapted to the spatial contents of the reference image, and the adaptation of the BMA vector field so as to be applicable to the mesh. The first problem is addressed by detecting object contours in the image and tracking these contours so as to extract "corners" (i.e. points of maximum curvature). The selected corners then serve as vertices of a triangular mesh generated by Delaunay triangulation. The second problem is solved by an interpolation technique known as "inverse kriging". Finally, the chapter concludes by presenting some subjective results.
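The structure of this pipeline can be sketched with standard SciPy tools. In the sketch below, the corner coordinates and the BMA field are dummies, and plain linear interpolation stands in for the inverse kriging system actually developed in the thesis.

import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import griddata

# Dummy corners detected on the reference image, as (row, col) points:
corners = np.array([[10., 12.], [100., 30.], [60., 90.],
                    [20., 70.], [110., 100.]])
mesh = Delaunay(corners)        # triangles, as index triples into `corners`

# Centres of a 16x16 block grid and a dummy BMA motion field:
centers = np.mgrid[8:120:16, 8:120:16].reshape(2, -1).T.astype(float)
vectors = np.random.default_rng(0).normal(size=(len(centers), 2))

# Attribute a motion vector to every mesh vertex by interpolating each
# component of the block-based field (linear here, kriging in the thesis);
# outside the convex hull of the block centres, fall back to the nearest block.
lin = np.column_stack([griddata(centers, vectors[:, k], corners,
                                method="linear") for k in (0, 1)])
near = np.column_stack([griddata(centers, vectors[:, k], corners,
                                 method="nearest") for k in (0, 1)])
vertex_motion = np.where(np.isnan(lin), near, lin)

print(mesh.simplices)           # the triangles of the active mesh
print(vertex_motion)            # one motion vector per mesh vertex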

Finally, a general conclusion is drawn from the obtained results. It reviews the various contributions of the present work, but also raises the question of the parts of the motion exploitation chain (estimation - transmission - compensation) in which it is useful to take the spatial contents of the images into account.

Contributions of the Thesis

The key contributions of the present research are summarized in the figure below. Some of these contributions are personal; others are the result of a close collaboration with other members of the TELE laboratory². Some are also the subject of previous publications [111, 77, 11, 73, 68, 70, 23, 69, 19, 20, 71, 75, 74, 76].

[Figure: Overview of the thesis contributions within the coding framework. At the coder, estimation (motion analysis): Chapter 3, the Adaptive Block-Matching Algorithm. On the channel, transmission: Chapter 4, a study of the impact of pre-processing. At the decoder, motion compensation: Chapter 5, mesh-based reconstruction of BMA fields. Common thread: improve the BMA with content-adapted techniques.]

² Most of the work related to the COMIS scheme has been achieved in collaboration with many other researchers in image processing at the laboratory: M.P. Queluz, B. Simon, O. Bruyndonckx, V. Warscotte, T. Delmot, C. Devleeschouwer.



With the exception of the COMIS contribution presented in Chapter One, all the contributions obey the same approach: to use content-adapted information in order to improve the performances of the reference Block-Matching Algorithm. This approach involves the following achievements. Chapter Three requires the programming and fine-tuning of the existing Adaptive Block Matching Algorithm (ABMA) in order to improve motion estimation. It also details how to distribute the computational burden of the ABMA among several processors. Chapter Four establishes a model of the behavior of very-low bitrate video coding. It proposes an analytical study of the impact of pre-processing on such a VLBR coding. Some experimentation aims at validating the developed model and at practically determining whether pre-processing offers some gain when inserted in a complete coding scheme. Chapter Five deals with the conception and development of an original scheme for motion compensation. It first involves developing a tool for corner extraction, which is achieved by improving the half-boundaries detector of Noble. The application of interpolation to motion vector fields is also tackled, so as to transpose the information they provide from a regular grid to a non-regular one. Finally, Chapter Five integrates and simulates the proposed asymmetric motion estimation and compensation process, relying on mesh-based reconstruction.


Chapter 1

Digital Video Coding at Very-Low BitRate

"Visual communication is commonly regarded as the next generation communication tool beyond the conventional voice communication (... and) to achieve more efficient visual communication and fully utilize limited channel bandwidth and storage space, video compression (or coding) that reduces data amount is necessary." (C.T. Chen)

Aware of this relevant challenge, research has been focusing on video compression for more than twenty years already. People from both industry and research teams have collaborated in groups like the International Telecommunications Union (ITU) or the Moving Picture Experts Group (MPEG) to design several international standards for the transmission and storage of digital video.

The present chapter aims at introducing the general framework of the doctoral work. It first very briefly describes the structure of a digital video signal, and expands on the generic principles of video compression according to the type of correlation which is exploited: spatial or spatio-temporal. It then highlights the specificity of Very-Low Bitrate Coding (VLBR) with regard to high-bitrate coding: VLBR is unable to transmit a visually perfect copy of the source images, as the limited channel capacity prevents the coder from sending all the information needed. Secondly, it reviews three video coders-decoders (codecs): the H.263 standard from the ITU, the scheme to be standardized in November 1998 under the acronym MPEG-4, and some trials implemented at UCL within a scheme named COMIS.
within a scheme named COMIS.



Finally, the state of the art of video coding is discussed and some new investigations in image analysis are introduced.

1.1 Digital Video

Digital images or frames consist of luminance (i.e. brightness) and chrominance (i.e. color) intensities of regularly sampled points (the picture elements¹). The sampling process is performed either on a natural scene (digital camera) or on an "analog" image (digital scanning). The spatial information (in the two-dimensional space) is characterized by the image resolution. Instead of characterizing this resolution in pel/cm or pel/inch, it is generally expressed as the product pels/line x lines, which is a measure independent of the screen size. A (digital) video sequence is a succession of (digital) images whose characteristic is the temporal resolution in terms of frames/s or images/s. This temporal domain information (the changes of image intensity along the time axis) is specific to video transmission and raises the problems addressed in the present thesis, namely motion estimation and compensation. In its Recommendation 601 [12], the ITU-R (International Telecommunication Union - Radiocommunication, formerly CCIR) has defined a way to digitize images in standard format (720x576). An extension of this definition provides one with the different resolutions which should be used according to the target application (table 1.1), which directly introduces the required bitrate if no compression is achieved.

¹ Commonly abbreviated as "pixels" or "pels".

Application      Luminance     Chrominance   Aspect   Temporal   Bitrate
                 resolution    resolution    ratio    (fr./s)    (Mbit/s)
HDTV             1920 x 1152   960 x 576     16/9     50         1800
TV (broadcast)   720 x 576     360 x 576     4/3      25         166
TV (CD rec.)     360 x 288     180 x 144     4/3      25         31
Video phone      360 x 288     180 x 144     4/3      10         12.4
Mobile video     180 x 144     90 x 72       4/3      5          1.6

Table 1.1: CCIR 601 formats for moving pictures applications

A few remarks can be made concerning table 1.1. First, the spatial resolution of the chrominance components is lower than that of the luminance. These differences result from the less important impact of colors on the Human Visual System (HVS). Secondly, a change of aspect ratio² (16/9, as for movies) accompanies the High Definition TV (HDTV) resolution. Another comment deals with the range of values of the luminance and chrominance components: digital images refer to components described by a finite number of bits, namely 8 bits for every luminance or chrominance pel. This 8-bit range allows values to go from 0 (black) to 255 (white)³.

All these characteristics help compute the required bandwidths: table 1.1 introduces the necessary bitrates in Mbit/s. If one expects to transmit video telephony across an ISDN network (64 kbit/s), one must first find a way to compress the signal by a factor greater than 194, while for mobile video transmission a compression ratio of only 25 is needed on ISDN networks. This factor rises again to 160 for mobile channels at 10 kbit/s.
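These factors follow directly from table 1.1 under the assumption of 8 bits per luminance or chrominance sample; a quick check in Python:

def raw_bitrate(luma, chroma, fps, bits=8):
    # Raw bitrate of an uncompressed format: one luminance plane plus two
    # chrominance planes, `bits` bits per sample, `fps` frames per second.
    (ly, lx), (cy, cx) = luma, chroma
    return (ly * lx + 2 * cy * cx) * bits * fps        # in bit/s

videophone = raw_bitrate((288, 360), (144, 180), 10)   # ~12.4 Mbit/s
mobile     = raw_bitrate((144, 180), (72, 90), 5)      # ~1.6 Mbit/s
print(videophone / 64e3)   # ~194 : video telephony over 64 kbit/s ISDN
print(mobile / 64e3)       # ~25  : mobile video over ISDN
print(mobile / 10e3)       # ~156, i.e. roughly the 160 quoted for 10 kbit/s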

Appendix A introduces some typical very-low bitrate sequences used to test the video compression algorithms.

1.2 Video Coding

1.2.1 Image Compression

Compression is thus needed, and it is possible thanks to the reduction of the high redundancy and of the irrelevancy present in the data of a video sequence. The redundancy relates to the correlation and the predictability inherent to the pictures. Its use for compression purposes does not involve any loss of information. The irrelevancy reduction exploits the perceptual limits of the HVS so as to avoid the transmission of invisible information. It introduces irreversible degradations.

From a practical point of view, compression involves a succession of many different steps which aim at detecting what is redundant or irrelevant and how to encode it differently. The following description proposes a closer look at some of the underlying concepts of video compression.

2 The aspect ratio is the ratio between the image width and height.

3 Generally, a color image may be described by a combination of three chromatic stimuli. Color television for instance uses the Red, Green and Blue (RGB) primary colors. Another possibility, usually used for video compression, is the Y luminance plus two difference chrominance signals, Cb and Cr, at lower spatial resolution as in table 1.1. People interested in more advanced readings concerning color treatment may consult [106].
may consult [106].


The scheme of figure 1.1 introduces the classical chain of tools used to compress moving pictures.

[Figure: input image(s) pass through transform/analysis (T), quantization (Q) and transcription (G) stages, then entropy coding (enC) and channel coding (chC); the first group forms the source coding, the last the channel coding.]

Figure 1.1: Operators classically involved in a compression process of (moving) images

Hereafter is a brief description of the aim of every step. The way the process is reversed at the decompression stage is also tackled. A more detailed overview of compression techniques may be found in [86, 59, 113].

Compression

- First, the image characteristics are analyzed: the analysis may consist in frequential waveform analysis, like wavelets [66] or matching pursuits [90], or more simply in block transforms (e.g. the Discrete Cosine Transform, DCT [2], as in JPEG [135]), or in spatial contour and texture analysis. It may also consist in an estimation of the motion field between two successive frames. This analysis step aims at detecting the redundant parts of the images and proposing another way of transmitting the same information.

- The resulting parameters are then usually quantized in order to suppress all irrelevancies of the signal. Quantization can be scalar or vector [41, 89].

- The redundancy and irrelevancy reduction of the input signal by the transform and the quantization can be more globally viewed as a transcription (or a projection) of the input signal into a decorrelated, relevant representation (or space).

- This transcription is encoded (entropy coding) and sent to the channel interface: this coding step is referred to as source coding [38], and aims at reducing the bitrate without corrupting the data. Such a reduction is possible thanks to an appropriate exploitation of the statistical properties of the signal.

- Finally, some channel coding may be introduced. It adds specific redundancies (error-correcting codes and synchronization words) in order to protect the signal in case of erroneous transmission.

Decompression

- After a decoding step (both channel and entropy decoding if necessary), the decoder recovers the transcription generated by the coder.

- Then two possibilities exist:

  - Either the transcription is merely reversed in order to obtain a description of the images. The obtained images are more or less identical to the original ones, according to the type of analysis and quantization (the channel and entropy coding-decoding phases are assumed to be lossless).

  - Or a reconstruction stage may be added: the transcription is no longer simply reversed but specifically treated in order to better take advantage of the available information. An example of such a reconstruction process is the use of overlapping functions to recover a block description. Reconstruction is different from post-processing, which tries to improve the image quality after inversion of the transcription, without taking the transcription structure into account.
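To make the chain above concrete, here is a minimal Python sketch (an illustration only, assuming an orthonormal 8×8 DCT and a uniform scalar quantizer; a real coder adds entropy coding and many refinements) of the transform, quantization and inverse steps:

    import numpy as np

    def dct_matrix(n=8):
        """Orthonormal DCT-II basis matrix."""
        k = np.arange(n)
        c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
        c[0, :] = np.sqrt(1.0 / n)
        return c

    C = dct_matrix()

    def analyze(block):             # transform step (T)
        return C @ block @ C.T

    def quantize(coef, step=16):    # scalar quantization (Q), irreversible
        return np.round(coef / step).astype(int)

    def dequantize(q, step=16):     # inverse quantization at the decoder
        return q * step

    def synthesize(coef):           # inverse transform: reverse the transcription
        return C.T @ coef @ C

    block = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 luminance block
    q = quantize(analyze(block))                       # what entropy coding would see
    rec = synthesize(dequantize(q))                    # decoded block, close but not equal
    print(np.abs(rec - block).max())                   # quantization error remains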

All these parts of a coding algorithm may vary from one implementation to another. Sometimes two or more steps are merged together, while in other cases one step is not used. However, the different steps have to be coherent if one wants the scheme to be efficient: the type of analysis made often guides the rest of the algorithm.

Up to now, two different philosophies have been used in (VLBR) video coding. The first class is the block-based one [62, 96], where pictures are divided into subblocks according to an a priori grid: every subblock may be independently coded according to its own characteristics. The other class is the segmentation-based or object-based one. Such schemes [85, 84, 34] aim to better preserve essential characteristics: high compression ratios are obtained by removing insignificant objects and by encoding textures more coarsely, while contours are considered as essential features for image description. After a detection of the different pictured regions (objects), every region is separately coded: its shape (contour) is first described and is followed by texture (or motion) information.

1.2.2 Video Compression

[Figure: original images feed an intra-images encoder and a motion estimation unit; an intra-images decoder and a frame memory close the prediction loop that drives motion compensation; intra data and motion vectors are multiplexed towards the channel coder.]

Figure 1.2: Typical video coder: intra-images, motion estimation & compensation and residues are used

Independently of these different approaches to the image contents, one can point out that both categories of (VLBR) coding schemes make an exhaustive use of the two following types of pictures:

Intra-images: When coding an image in intra mode, only the spatial correlation inherent to the picture itself is exploited. In fact, an intra-image is an isolated picture and is compressed as such (cf. the JPEG [135] algorithm for still picture compression).

Inter-images: These images are specific to video sequences in comparison with still pictures. They allow the coder to exploit the spatio-temporal correlation between the present image and the previous one(s) within a sequence. The classical tool used to achieve compression on such images is motion estimation & compensation, which allows one to obtain a prediction of the present image from the previous one(s).

- Residues: As the estimation & compensation of an inter-image is not perfect, the difference between the estimate and the original image is computed: the so-called Displaced Frame Difference (DFD) has to be sent if it is relevant enough. It can only be transmitted on its own basis, i.e. as an intra-image.

The first image of a sequence must of course be encoded as an intra-image. For the subsequent images, a scene-cut detector compares the new image with the previous one(s) to check whether some temporal correlation still exists. If so, the new image will be inter-coded. Otherwise, the coder considers that a scene cut has occurred: the new image differs so much from the previous one(s) that it is better to encode it as an intra-image. The user may also enforce intra-image coding to take place every x seconds to offer specific functions like fast retrieval, temporal scalability or resynchronization in case of a noisy channel.

According to these considerations, one can affirm that the scheme presented in figure 1.2 (from [13]) is typical of most (VLBR) video coders. The first image is intra-coded. Then the second image is motion-estimated on the basis of the first one (original or reconstructed). The motion vectors of course need to be sent to the decoder in order to allow it to behave the same way. The motion vector field is applied to the first image, thereby obtaining a prediction of image two, which is used for the DFD computation. This DFD is the residual image that also needs to be coded. The process goes on until some scene cut arises or until a defined threshold enforces a restart with an intra-image.
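The loop can be summarized as follows; this is only a sketch in which every helper (scene_cut, intra_encode, estimate_motion, compensate, residual_encode, decode_last) is a hypothetical placeholder for the tools discussed in this chapter, not an actual implementation:

    def encode_sequence(frames, intra_period):
        """Sketch of the generic VLBR coding loop of figure 1.2 (hypothetical helpers)."""
        stream, reference = [], None
        for t, frame in enumerate(frames):
            # intra-code the first image, on a scene cut, or every `intra_period` frames
            if reference is None or t % intra_period == 0 or scene_cut(frame, reference):
                stream.append(('INTRA', intra_encode(frame)))
            else:
                vectors = estimate_motion(frame, reference)    # encoder-side analysis
                prediction = compensate(reference, vectors)    # what the decoder rebuilds
                dfd = frame - prediction                       # displaced frame difference
                stream.append(('INTER', vectors, residual_encode(dfd)))
            reference = decode_last(stream)    # track the decoder's reconstructed image
        return stream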



Although the coding technique of the DFD is often similar to intra-coding, Strobach has demonstrated [120] the inefficiency of applying the same technique to intra and residual coding, because of the drastic reduction of the spatial correlation after motion compensation. This inadequacy can also be explained by the fact that the information of an intra image is uniformly spread out, while the content of residual images is located in specific areas of the images (e.g. the borders of the moving objects). For instance, Matching Pursuits [90] are designed for residual coding.

1.3 Coding at Very-Low BitRate

Very-Low BitRate transmission can be understood as transmission across bandwidths of 5 to 64 kbit/s, or sometimes up to 128 kbit/s. Some of the applications which might be addressed are: wired video phones (28.8 kbit/s modems and below), wireless video phones (under 13 kbit/s), Internet video conferencing (28.8 kbit/s modems and below), remote monitoring, tele-operation, tele-working,... Appendix A presents the typical VLBR-like video sequences that are used as test sequences in the present document.

The first video coding standards considered higher bitrates. For instance, MPEG-1 [62] is intended to ensure domestic-use quality (CD-ROM use) under 1.5 Mbit/s, and MPEG-2 [62] proposes digital TV broadcasting below 10 Mbit/s or HDTV at 60 Mbit/s. So, what constitutes the main difference between codecs working at such bitrates and VLBR ones?

The first difference was already introduced in table 1.1: in order to quickly reach low bitrates, lower spatial and temporal resolutions are used. But another major difference can be pointed out: to transmit (moving) pictures across high-bitrate channels, only the irrelevant part of the image(s) has to be suppressed, while VLBR compression schemes generally have to suppress more information. Therefore, VLBR video coding introduces more artifacts. A key issue for VLBR coding is thus an efficient management of the visual degradations that have to be accepted. The influence of the bitrate mainly arises at the residual level: very-low bitrates prevent the coder from sending a sufficient amount of residual information. The imperfections of the motion estimation & compensation phase are not entirely corrected and many artifacts remain visible.


1.4 Some (Very-) Low BitRate Codecs

It appears that new solutions have to be set up if one wants to achieve VLBR video coding: just using the same techniques as for high-bitrate coding would not compress the signal sufficiently, or would cause too many artifacts. At least the coding parameters and the tables for entropy coding should be adapted. Several codecs specifically devoted to VLBR have been designed. The present section aims at introducing three coders: two international standards, namely the existing H.263, which has been explicitly designed for VLBR, and the future MPEG-4 standard, which allows one to address VLBR although its primary goal is to provide the user with added functionalities; and an original trial from UCL, the COMIS scheme.

Only the main outline of these three algorithms is presented here. Readers interested in a more detailed presentation of these algorithms should refer to the cited bibliography. The various motion analysis tools utilized by the codecs are fully described in the next chapter.

1.4.1 H.263

The ITU-T Recommendation H.263 [96] is the very first VLBR codec able to ensure good subjective quality at low bitrates. This is why it served as an anchor in the MPEG-4 tests (cf. Section 1.4.3.3).

Figure 1.3 depicts the overall scheme of the H.263 coder. This scheme is very similar to the generic one of figure 1.2. H.263 mainly consists in a particularization of the MPEG-1 and MPEG-2 codecs: every picture is divided into 16×16-pel macroblocks (MB). The main difference is that all parameters (entropy coding tables,...) have been specifically tuned for low bitrates.

Every picture is labelled intra or inter but, for every macroblock, the coding control unit decides on the most efficient way to code it. In addition, at least once every 132 times it is coded, a macroblock shall be coded in intra mode in order to prevent looped error propagation, even if the inter mode is more efficient or if the rest of the picture is inter-coded.

The various types of macroblocks are presented in table 1.2. The specificities of every mode are detailed in the following sections.


[Figure: H.263 encoder - video in, transform (T) and quantizer (Q), with inverse quantizer and inverse transform feeding a picture memory with motion-compensated variable delay (P), all under a coding control unit (CC); the side information sent to the video multiplex coder is p (flag for INTRA/INTER), t (flag for transmitted or not), qz (quantizer indication), q (quantizing index for transform coefficients) and v (motion vector).]

Figure 1.3: H.263 encoder

1.4.1.1 Intra Macroblocks

H.263 belongs to the transform coding class of algorithms (Chapter 10 of [113]). The macroblocks to be intra-coded are linearly transformed by the Discrete Cosine Transform (DCT, [2]) that decomposes the signal into its different frequency components. Every coefficient is then separately quantized and the set of coefficients is entropy-coded. The DCT is applied to blocks of 8×8 pels, i.e. a macroblock is made out of four DCT blocks for the luminance and two DCT blocks for the chrominance component (one for Cb, one for Cr). The quantization matrix applied to the frequency-based matrix of coefficients can be regularly updated during the encoding process. There are thus three ways of transmitting an intra macroblock, as illustrated in the first three lines of table 1.2. "Stuffing" means that the macroblock remains exactly the same as in the previous frame.

Picture type   MB type    DCT coef.   Motion vector   Residual coding   Change quant.
Intra          Intra
Intra          Intra
Intra          Stuffing
Inter          Intra
Inter          Intra
Inter          Inter
Inter          Inter
Inter          Inter
Inter          Stuffing

Table 1.2: Macroblock types and included data elements in H.263

1.4.1.2 Inter Macroblocks and Residues

In inter mode, a motion vector is first searched for, so as to determine the origin of the macroblock. This is achieved by a Block Matching Algorithm (BMA, cf. Section 2.5.1). Then, according to the relevance of the result, the macroblock is either coded in intra mode or via the motion vector and (possibly) additional residues: one of the inter-picture modes of table 1.2 is selected. In the case of inter coding with residues, the DFD is coded as an intra macroblock with the DCT.
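For illustration, a full-search block matching of this kind can be sketched as follows (a generic SAD-based search over a ±7 pel window; the exact H.263 procedure differs in its details):

    import numpy as np

    def block_match(current, reference, x, y, block=16, search=7):
        """Full-search BMA: best (dy, dx) for the block at (y, x) under a SAD criterion."""
        target = current[y:y + block, x:x + block].astype(int)
        h, w = reference.shape
        best, best_vec = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ry, rx = y + dy, x + dx
                if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
                    continue                  # restricted vectors: stay inside the image
                candidate = reference[ry:ry + block, rx:rx + block].astype(int)
                sad = np.abs(target - candidate).sum()
                if best is None or sad < best:
                    best, best_vec = sad, (dy, dx)
        return best_vec, best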

1.4.1.3 Options

In addition to this basic algorithm, four options can be used to improve the quality of the decoded images (at the same bitrate). Each option may be used separately or in combination with some others. The options are:

- Unrestricted motion vectors: the BMA can select vectors that locate the origin of a MB outside the reference image. The last line of pels on the image border is then reproduced (for more details, see Section 2.5.1).

- Arithmetic coding [60]: instead of Huffman codes [38], a more sophisticated entropy coding can be used.

- Advanced prediction mode: if necessary, a motion vector can be assigned to every 8×8 block. In addition, the motion reconstruction is achieved with overlapping (cf. Section 2.5.2).

- PB-frames: a PB-frame consists in two pictures being coded as one unit. A predicted (P) frame is a normal inter-coded one, while a bi-directionally predicted (B) frame is computed on the basis of both the previous frame and the next P frame (which is located in the future of B along the time axis).

1.4.1.4 Future Improvements: H.263+

Thanks to its fine tuning, H.263 already achieves very good results, combined with low complexity and fast computation. A software coder able to treat 5 frames of 144×176 pels per second and a decoder working faster than 30 frames/s are provided by Telenor Corp. at http://www.nta.no/brukere/DVC/.

Moreover, Bjontegaard [9] has made a few improvements that are to be added to the existing standard in order to define a new one: H.263+.

1.4.2 "COMIS", the UCL Approach

Initially developed by M.P. Queluz, B. Simon and B. Macq in the context of the European COST 211ter research group [112], and further refined by C. Devleeschouwer, T. Delmot, X. Marichal and B. Macq [70], the UCL has proposed an original scheme for VLBR video coding in which spatial information and temporal changes are encoded using similar tools: binary trees are used to transcribe a multiscale decomposition of the information, which explains the name of the algorithm: "COding on Multiscales Image Sequences" (COMIS). A complete description of the algorithm may be found in [65, 68].

Designed as a VLBR codec, COMIS tries to combine the cheap block-based picture description provided by the multiscale representation and a region-oriented understanding of the pictures, in order to improve the analysis and the reconstruction stages. Figure 1.4 presents the codec scheme. One can notice that a "region model" box is present in both the encoder and the decoder. Therefore, no object information needs to be transmitted: the object detection can be simultaneously generated on both sides (for instance, via a morphological watershed procedure [134]), every time the decoder receives additional picture information.

[Figure: COMIS codec - preprocessing at the coder, a "region model" box on both coder and decoder sides with an optional communication channel between them, and reconstruction with user's interaction at the decoder.]

Figure 1.4: Scheme of the COMIS codec

An original bitrate regulation scheme [19] then gives priority to the regions that are considered subjectively more important.

The following sections present the way images are coded, and the peculiarity of COMIS, which voluntarily simplifies images (both in intra and inter mode) in order to reach VLBR with a good subjective quality.

1.4.2.1 Intra Images

Figure 1.5 describes the way a picture is treated in intra mode. The following paragraphs provide explanations about its different parts. All the images presenting intermediate results will refer to a letter (A to F) of the diagram.

Tree decomposition. The objective of the spatial segmentation is to identify the regions with uniform luminance or with random textures. At first, the image is split into non-overlapping 16×16 blocks4. Starting from such large blocks, the aim is to obtain homogeneous block components by a succession of binary decisions. A binary tree ensures the correspondence with a multiscale representation of the picture (Figure 1.6).

4 Any size 2^x would be convenient: if 16×16 is used, it is because analysis of test images [128] has shown that larger blocks are always inhomogeneous in a QCIF (144×176 pixels) picture.

[Figure: intra coding chain - original image (A) goes through the tree decomposition (B); a watershed computation yields region labels (C); interest criteria yield interest labels (D); a merge procedure (E) is followed by the reconstruction (F).]

Figure 1.5: Block diagram to intra-code a picture in the COMIS coder

[Figure: binary tree of 0/1 split decisions with the corresponding variable-size block grid.]

Figure 1.6: Example of binary tree segmentation

The split process is repeated until reaching homogeneous blocks or a block of predefined minimal size. Once a block is considered homogeneous enough, the luminance of all its pixels is replaced by the mean luminance of the block (plus midtread quantization) and is attached to the appropriate leaf of the tree. A criterion (see references) is added to prevent regions with random textures from splitting. Such regions are assumed not to be perceptually important and are also described by their mean value.
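A minimal sketch of such a split procedure (for illustration; a plain variance threshold stands in for the actual homogeneity and texture criteria of the cited references):

    import numpy as np

    def split(block, tree, min_size=2, var_thresh=100.0):
        """Recursively split a block (alternating cut directions) until homogeneous;
        homogeneous leaves are replaced by their mean luminance."""
        h, w = block.shape
        if block.var() <= var_thresh or max(h, w) <= min_size:
            tree.append(0)                      # leaf: stop splitting
            block[:] = round(block.mean())      # mean value attached to the leaf
            return
        tree.append(1)                          # internal node: one binary decision
        if h >= w:                              # cut the longer side in two
            split(block[:h // 2, :], tree, min_size, var_thresh)
            split(block[h // 2:, :], tree, min_size, var_thresh)
        else:
            split(block[:, :w // 2], tree, min_size, var_thresh)
            split(block[:, w // 2:], tree, min_size, var_thresh)

    image = np.random.randint(0, 256, (144, 176)).astype(float)   # QCIF-sized luminance
    tree = []
    for by in range(0, 144, 16):                # initial non-overlapping 16x16 blocks
        for bx in range(0, 176, 16):
            split(image[by:by + 16, bx:bx + 16], tree)
    # `tree` is the binary structure to entropy-code; `image` now holds the leaf means.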



[Figure.]

Figure 1.7: Result of the tree decomposition - (left) "Foreman", (right) artificial image - (top) original A, (center, bottom) multiscale decomposition B

Figure 1.7 illustrates the principle and the result of the decomposition.


[Figure.]

Figure 1.8: Result of the tree decomposition (B) with coarser thresholds - "Foreman" (left) and an artificial image (right)

One of its advantages is the use it can make of the correlation between the luminance and chrominance components: for color images, the same tree is used to describe the three components.

If the reader has a close look at the "Foreman" decomposition (image 1.7 (center)), he or she will notice that the image is rather complicated. This complexity has to be reduced to transmit such a picture over a VLBR channel. A first way to achieve such a reduction of the complexity is to use coarser thresholds (Figure 1.8).

The visual result is not particularly pleasant. The solution adopted here is to interpret the pictures on a region basis: after a segmentation step, all regions are classified according to their subjective relevance, and the algorithm tries to keep a good quality on the interesting regions (e.g. the speaker's head,...) while coarsely coding the other regions.

Interesting regions determination. For the purpose of region selection, an interest criterion is defined [72, 20, 19]. It can automatically distinguish between what is important and what is not. Three criteria are combined so as to provide the final classification:

- the image border criterion: it is based on the fact that the most important part of an image in a video sequence is correctly centered, and that the eye precision is best in the center of the vision area. The criterion gives less priority to regions with most of their pixels located along image borders.

- the interactive criterion: it can reinforce the previous criterion or fight against it. Its principle is to eliminate regions with no pixels inside a pre-defined window of the image. This criterion is interactive as the user can easily select another "interest window".

- the face texture criterion: mainly designed for video-phone or video-conference, this criterion rejects all the regions whose chrominance components do not coincide with a set of skin samples.

These criteria are combined using the Fuzzy Logic Theory [80] and result in a classification of the image regions that is used in the next step of the algorithm (Figure 1.9 (b) depicts the final classification).
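For illustration only, the combination could take the following shape in pseudocode; the membership functions and the min/max operators below are assumptions, as the precise fuzzy rules are given in the cited references [72, 20, 19]:

    def interest(region, window, skin_samples):
        """Hypothetical fuzzy combination of the three criteria for one region
        (memberships in [0, 1]; min acts as a fuzzy AND, max as a fuzzy OR)."""
        border = 1.0 - region.fraction_of_pixels_on_border()   # image border criterion
        window_hit = 1.0 if region.overlaps(window) else 0.0   # interactive criterion
        skin = region.chrominance_similarity(skin_samples)     # face texture criterion
        return min(border, max(window_hit, skin))

    # Regions are then ranked by this score, and the best ones are protected
    # during the merge procedure described next.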

Merge procedure. The tree description of the image can be considered as a split procedure. A classical way of dealing with such split algorithms is to combine them with a merge procedure [59] that aims at homogenizing their result and correcting the split errors engendered by local traps.

The split tree, a watershed analysis of the original image and an associated region classification are the inputs of the present merge procedure. The aim of this algorithm is to homogenize the value of all the subblocks belonging to a region that has not been labeled "interesting". Figure 1.9 shows the result of the complete algorithm with reference to figure 1.5. One can notice a dark spot on the nose of the "Foreman": it results from a bad classification of the associated region. Nevertheless, this problem would be directly solved if no over-segmentation were present: classification would be much more stable if it dealt with coherent regions (for instance one region for the entire head, and not twenty).

According to the available bitrate, a different percentage of regions may be protected (Figure 1.10).

Parameters coding. The resulting parameters then have to be transmitted. Two elements are necessary to fully describe the image: the binary tree structure and the attached leaf values. The tree is entropy-coded with the M-Coder [77, 111] previously developed [67]. The leaves obtained in the analysis-transcription phase (the luminance and chrominance means) are first decorrelated by optimum linear prediction, using the nearest neighbors. The resulting prediction errors are encoded by the Universal Variable Length Coder (UVLC [64]).

[Figure.]

Figure 1.9: Split and merge procedure on "Foreman": (a) original A, (b) interesting regions determination D (dark = interesting, light = not interesting) and (c,d) merged image and associated multigrid E

[Figure.]

Figure 1.10: "Foreman": comparison between a merge with region protection (left) and a full merge (right).

Reconstruction. Describing tree-segmented images as the juxtaposition of variable-size blocks (with one single mean value associated to every block) leads to large blocking artifacts (cf. figures 1.7, 1.8 and 1.9), and leads to the conclusion that the description should not be the mere reverse operation of the transcription. Overlapping functions, which respect transitions between small and large blocks but are also spread out in low-resolution areas, are used. The step impulse response is replaced by a longer impulse response, but only locally, in order to respect transitions in the image. Low-resolution areas receive an important overlap, while high-resolution areas need to keep the interpolation functions disjoint.

The visual improvement offered by this reconstruction technique is depicted in figure 1.11.

1.4.2.2 Inter Images

The Adaptive Block-Matching Algorithm. The overall motion estimation procedure used in the COMIS codec is based on the BMA (cf. Section 2.5.1), but some substantial improvements have been brought by P. Queluz. The resulting scheme manages a hierarchical structure of block sizes. It is thus called an Adaptive Block Matching Algorithm (ABMA, [110]) and is presented in detail in Section 3.1.

[Figure.]

Figure 1.11: Overlapping reconstruction - (top) original A, (center) decoding B,E, (bottom) reconstruction F

Image pre-treatment prior to motion estimation. The idea is still to voluntarily simplify the images prior to motion estimation and coding, and the new image is first treated as an intra-image. Once arrived at point E of figure 1.5, it is used for motion estimation. The impact of such an approach is studied in Chapter 4.

To intra-treat the picture, the determination of interesting regions (cf. Section 1.4.2.1) is needed. Two criteria have been added so as to take the motion between two successive frames into account5.

- a motion criterion: based on a succession of motion estimations between the images at the highest frequency (even if the sequence is coded at 8.33 Hz, the criterion will compute three motion estimations between images at 25 Hz), it eliminates regions with no motion and no texture (the Human Visual System does not focus on these areas) as well as the highly textured regions with a very important motion (which are too noisy to be correctly perceived).

- the continuity criterion: some tracking helps the algorithm take the classification of the region in the previous image into account, in order to ensure the temporal stability of the final criterion.

5 A motion estimation is thus performed to help computing these criteria. Nevertheless, this motion estimation is not the one used later by the coding algorithm and may even use a different technique.

Motion transcription and coding. The high level of smoothness provided by the ABMA method justifies the use of contour/content coding of the motion vector field. The same kind of transcription and coding is used to take advantage of what is done in the intra mode.

1.4.3 The Emerging MPEG-4 Standard

The ISO group ISO/IEC JTC1/SC29/WG11 (MPEG) has been working on a new work item since November 1992. After a lot of procrastination, MPEG-4 finally found a suitable match at the Singapore meeting in November 1994. The first Proposal Package Description (PPD) was established, focusing on three major trends of today's world:

- the trend towards wireless communications,
- the trend towards interactive computer applications, and
- the trend towards the integration of audio-visual data into an ever-increasing number of applications.



This PPD considers that Multimedia is the convergence point of three industries, i.e. the computer, telecommunications and TV/film industries. Although focusing at first on all the very-low bitrate applications [103], MPEG-4 has decided to extend its scope so as to be a much more open and evolutive system: flexibility and extensibility are its driving forces.

The following sections highlight [57] the key points of MPEG-4 motivation(s) and principle(s). More details may be found in the Image Communication Special Issue [104] that explores the various aspects of the future standard from a video point of view.

1.4.3.1 Developments to Be Supported

In addition to the increasing need for audio-visual communications at very-low bitrate [103], some new trends concerning the use of audio-visual information have to be taken into account. These are mainly:

- The way images are produced. Computer-generated images are very commonly used. MPEG-4 addresses both through "Synthetic & Natural Hybrid Coding" (SNHC [26]).

- The way multimedia content is delivered. VLBR is needed across "Global System for Mobile communications" (GSM) networks, but the same content could also be transmitted across ADSL (Asymmetric Digital Subscriber Loop) or ATM (Asynchronous Transfer Mode) nets, at a rate of several Megabits per second. Such a large range of bitrates requires scalability. Moreover, a scalable description of the information will enable the decoder to adapt itself to the processing power of the machine.

- The consumption of multimedia is rapidly evolving as it takes an ever larger place in the way people communicate. The main changes in consumer habits are the need for interactive supports, possibilities to re-use the information, and software implementations.

1.4.3.2 A New Challenge for the Representation of Audio-Visual Information

In order to fill all these needs, MPEG-4 represents an audio-visual scene in a new way [100]: rather than considering a frame-based video, it defines the scene as a coded representation of Audio-Visual Objects (AVO). An AVO can be a Video Object Component (VOC), an audio one (AOC) or a combination of both. These AVOs obey a scenario describing their relations in space and time. The scene depicted in figure 1.12 can be interpreted as a combination of five AVOs: the background plus four elements.

[Figure.]

Figure 1.12: Audio-visual scene described in terms of AVOs

All these objects are separately encoded and the various object bitstreams are multiplexed before being transmitted (Figure 1.13). In addition to this object-based representation of the information, a scene compositor ensures the possibility of re-using objects and extends the interaction capabilities of the scheme. Of course, this compositor is standardized (MPEG-4, just like many image coding standards, is a decoding standard), unlike the segmentation stage prior to coding (see Section 1.5 for a brief discussion of this topic).

1.4.3.3 Video Coding in MPEG-4

H.263 was used as an anchor for the subjective tests established with all the proposed codecs. Finally, the tools used to encode and compress the video information in MPEG-4 are very similar to the ones of H.263 (cf. Section 1.4.1), with the slight difference that they are applied to objects whose borders should also be described.



A major innovation of MPEG-4 is to guarantee access to multimedia information from every part of the world. Therefore, the bitstream structure is specific in a twofold way: it includes coding tools and a system layer that can cope with severe channel errors, and it allows scalability at the bitstream level. This scalability ensures universal accessibility as any decoder can select the information it can decode.

The first Committee Draft (CD) has been set up (October 97, Fribourg meeting), and the Draft International Standard is planned for November 98. However, in order to be able to include more sophisticated tools that still need development and assessment (like the matching pursuits [90]), a second CD will be established in November 98 and will become a Draft International Standard (MPEG-4 version 2) in November 1999. People interested in being kept informed of the rapid evolution of the standard should refer to:

- the general MPEG web page (http://drogo.cselt.stet.it/mpeg/),
- and the video group one (http://wwwam.HHI.DE/mpeg-video/).

[Figure: MPEG-4 system - stored or local objects (coded/uncoded) are separately encoded, multiplexed, transmitted, demultiplexed, decoded and assembled by a compositor; video and audio objects, with user interaction at the compositor.]

Figure 1.13: Schematic overview of an MPEG-4 system


1.5 Discussion

This first chapter has presented different facets of video compression, with emphasis on very-low bitrates. A clear distinction has been made between intra-coding and inter-coding. Inter-coding, which tries to exploit the spatio-temporal correlation between successive images, offers the framework of the present thesis. VLBR video transmission is characterized by the impossibility for the decoder to reach a very good approximation of the original images. It is commonly admitted that the information must be deteriorated in order to reach the requested bitrates.

Three VLBR codecs have been briefly reviewed and some new trends can be emphasized. With MPEG-4, the video coding community has realized that compression is not the only aim anymore. The convergence of both the ever-increasing channel capacities and the compression performances appears to be sufficient to fill most needs. Yet, the market seems to expect new features like interaction, manipulation, software support, copyright protection,...

Is it then the end of research in video coding? Definitely not! But new requirements have to be taken into account. Among the various emerging topics, one may cite:

- Joint source and channel coding for specific transmissions in error-prone environments.

- As stated in the description of MPEG-4 (cf. Section 1.4.3), this future standard uses segmented objects but does not standardize how to obtain them. Tools to perform segmentation and tracking prior to coding will offer industrial competitors a field in which to distinguish themselves from one another.

- MPEG has decided to move beyond pure coding: the next standard to come, MPEG-7 [101], deals with image analysis in the framework of audio-visual search engines [102], which promises to raise some very interesting new challenges.

Aware of this evolution, many people have already started working on these new topics. Among them, one may cite the European COST 211ter (quater) project [1].


Chapter 2

Motion in the Framework of Video Coding

The extraction of motion information from a sequence of time-varying images has numerous applications in the field of image processing: medical image analysis, mobile robot navigation, automatic tracking of moving objects, interpretation of atmosphere observations (remote sensing), image interpolation and restoration,... as well as digital video compression. As seen in the introduction to (VLBR) video coding (cf. Chapter 1), motion estimation & compensation plays a key role in video compression as it provides the largest compression gains.

This chapter does not claim to propose a complete overview of all existing motion estimation techniques in the video coding context. This task has already been successfully achieved, with numerous results and comments, in specialized books like [127] or articles like [108, 31], in which extensive references are given. Yet, it seems important to review the most classical techniques in order to present the state of the art and to highlight the contribution put forward by the present work.

Another aim is to emphasize the distinction between the estimation and the compensation stages of a codec. Figure 2.1 briefly reminds one that estimation is performed at the encoder side in order to extract the motion parameters of the video sequence, while the decoder uses the estimated motion information during the compensation phase. A parallel can be established between the general principles of video compression (cf. Section 1.2) and motion estimation & compensation processes, which are based on three main steps:
which are based on three main steps:


1. The motion estimation: it is the analysis performed on the encoder side. Two images at time t and t-1 (or even more reference images) are compared, and the algorithm attempts to find the motion between the two frames. It results in a motion description: dense or discrete motion field, affine parameters,...

2. The transcription phase takes the result of the motion estimation and tries to describe it in the most compact representation. The aim of this step is efficient coding (so as to reach the required bitrate). This step is reversed at the decoder in order to get the initial motion estimation parameters back. It can be done with or without losses, according to the type of transcription.

3. The motion compensation takes place at the decoder (and also at the encoder, to simulate the decoder behavior). It aims at predicting the image at time t on the basis of both the motion parameters and the image at time t-1 (or more images).
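For a block-based motion description (one integer-pel vector per 16×16 block, as in the BMA of Section 2.5.1), the compensation step of point 3 can be sketched as follows (an illustration; border handling by clipping is an assumption):

    import numpy as np

    def compensate(reference, vectors, block=16):
        """Predict the frame at time t by displacing blocks of the frame at t-1."""
        h, w = reference.shape
        predicted = np.zeros_like(reference)
        for by in range(0, h, block):
            for bx in range(0, w, block):
                dy, dx = vectors[by // block, bx // block]   # vector decoded from the stream
                src_y = np.clip(by + dy, 0, h - block)       # assumption: clip at borders
                src_x = np.clip(bx + dx, 0, w - block)
                predicted[by:by + block, bx:bx + block] = \
                    reference[src_y:src_y + block, src_x:src_x + block]
        return predicted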

It seems important to take this distinction into account when evaluating motion estimation techniques in a video coding context. Both estimation and compensation have to be evaluated with regard to their computational costs. This is particularly important when one has real-time industrial applications in mind. But, while the aim of estimation and transcription is the extraction of the motion parameters and their cheap transmission (because of the (VLBR) coding context), compensation has to be evaluated on the basis of the visual quality of the compensated image.

But before evaluating some classical techniques, the first section of this chapter explains why motion computation is an estimation problem. The reason is that the motion present in the real scene rendered by the image sequence is not directly observable: only its effects on the pictured scene are observable. Consequently, it cannot be measured but only estimated. Moreover, the problem is said to be ill-posed in the sense that the available data insufficiently constrain the solution, which may be non-unique or non-existent.

The Rate-Distortion theory will demonstrate in Section 2.2 the efficiency of motion estimation & compensation for video compression, provided it is performed with a sufficient accuracy. Section 2.3 presents the hypotheses that may be used to sufficiently constrain the problem. It also expands on the various methodologies that may be adopted to solve it. Section 2.4 introduces some background techniques in motion estimation & compensation (for image coding), while the two particular techniques that are used in the present work are fully described in sections 2.5 and 2.6. Finally, Section 2.7 concludes the chapter.

[Figure: coder side - previous image(s) and the source image at time t feed the motion analysis (estimation); the resulting motion parameters are transcribed on a representation grid, coded and sent over the transmission channel; decoder side - after decoding, motion compensation applies the parameters to the previous image(s) to produce the estimated image at time t.]

Figure 2.1: Distinction between motion estimation and compensation

2.1 Image Formation and Motion

An image is a two-dimensional (2D) pattern of brightness resulting from the projection of a three-dimensional (3D) scene onto a 2D plane. This projection, which may be a perspective one or an orthographic one, brings about some loss of depth information, which engenders several problems such as aperture, occlusions,...

If one considers the way a physical scene is filmed (Figure 2.2), two coordinate systems have to cooperate: the camera and the image plane.

[Figure: perspective projection geometry - camera axes X, Y, Z with focal length f; a 3D object point R is projected to r = (x, y) on the 2D image plane along the camera axis Z.]

Figure 2.2: Perspective projection geometry: from 3D to 2D

The first is a 3D Cartesian coordinate system with its origin at the camera lens and the Z-axis corresponding to the camera axis. The second is the 2D (and space-limited) system of the image plane. Let R = (X Y Z)^T be the position vector of a point in the 3D space and r = (x y)^T be the position vector of the projected point on the image plane. Under a perspective projection, the image r of R is the intersection of the plane with a ray linking the camera lens and R. Considering the similar triangles in the projection geometry, the following relations are obtained:

$$\frac{f}{x} = \frac{Z}{X}, \qquad \frac{f}{y} = \frac{Z}{Y}, \qquad (2.1)$$

where f is the distance between the camera and the image plane, referred to as the focal length, and related to the focus of expansion [51]. The perspective projection equation for every point is thus:

$$r = \begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z} \begin{pmatrix} X \\ Y \end{pmatrix}. \qquad (2.2)$$
: (2.2)


Under orthographic projection, the ray also starts at R but is perpendicular to the image plane (parallel to the camera axis). The orthographic projection equation is:

$$r = \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} X \\ Y \end{pmatrix}. \qquad (2.3)$$

It is clear that perspective projection reduces to orthographic projection if the focal length f of the camera is much larger than the depth Z. The assumption of orthographic projection is also valid in all cases where the variation of the object shape is small or when the pictured object is small, both in comparison with the focal length f. In these situations, the depth may be considered constant and a kind of parallel projection appears, but a scaling factor is still present. The orthographic projection implies a further reduction of the information about the 3D scene compared with the perspective projection, since the depth information is totally lost.
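For illustration, equations (2.2) and (2.3) amount to the following (with arbitrary numeric values):

    def perspective(R, f):
        """Perspective projection (2.2): r = (f/Z) * (X, Y)."""
        X, Y, Z = R
        return (f * X / Z, f * Y / Z)

    def orthographic(R):
        """Orthographic projection (2.3): the depth Z is simply dropped."""
        X, Y, Z = R
        return (X, Y)

    print(perspective((2.0, 1.0, 10.0), f=1.0))   # (0.2, 0.1): scaled by f/Z
    print(orthographic((2.0, 1.0, 10.0)))         # (2.0, 1.0): depth information lost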

A more difficult question of image formation is the determination of the brightness of a particular point of the image. This very complicated problem will not be tackled here, since it involves many other data: the source of light, the angle of reflection on the object, the reflection properties of the object, the spectral sensitivity of the sensor,... [48]

Nevertheless, the fact that a 3D scene is projected onto a 2D image explains that a difference exists between the apparent and the real motion (Section 2.1.1). Moreover, this difference and some other considerations about the nature of images prevent one from solving all ambiguities of motion estimation (Section 2.1.2).
<strong>estimation</strong> (Section 2.1.2).<br />

2.1.1 Apparent versus Real Motion

In the image formation process, the 3D scene is projected by the camera onto the 2D image plane. Thereafter, the 3D motion information of the objects is also projected onto the image plane. The presence of motion manifests itself on the image plane by changes of the intensity values of the pixels along the time axis. These changes are used to recover the motion of the objects.

The 3D motion of the objects is called the real motion, in opposition to the 2D velocity field that represents the apparent motion of the objects on the image plane and is referred to as optical flow [49]. Both are different from the velocity field that would result from the projection on the image plane of the true 3D velocity field. Illumination changes, shadows and occlusions are phenomena that are interpreted as motion effects by the optical flow.

Figure 2.3: Illustration of optical flow: (a) Sphere at time $t-1$ (b) Sphere at time $t$ (c) Optical flow

Optical flow is exactly what motion estimation tries to produce as a result of its analysis. It is defined as follows (from [44], illustration in figure 2.3):

Optical Flow: Image in which the value of each pixel is the estimated projected translational velocity arising from a surface point of an object in motion relative to the camera. Some pixels may have no optical flow information because the projected velocity is not always estimable.

One of the aims of object-oriented coding [84, 85], in the framework of motion estimation & compensation, is to establish 3D models of the scene in order to overcome this problem of restricted access to the apparent motion only. As such models are out of the scope of the present work, they will not be detailed, even though they are often close to the parametric models (cf. Section 2.4.5).

2.1.2 Unsolvable Problems of Motion Estimation

Many different algorithms exist to compute 3D and 2D motion from image sequences. However, many questions remain open because of the boundary effects that make motion estimation a non-trivial problem. Hereunder are some well-known phenomena that cause trouble when one tries to estimate the true 2D motion field:


Figure 2.4: The aperture problem

Figure 2.5: The bicycle wheel: ambiguity in the correspondence problem because of aliasing

The aperture problem is illustrated on figure 2.4. Any operator that sees the moving edge through a local window A can only compute the component of motion perpendicular to the edge. It means that on figure 2.4, any of the vectors $e$, $f$ or $g$ would be suitable. The optical flow is therefore not uniquely determined by the local information in the changing image. The problem is not sufficiently constrained: it is ill-posed.

The correspondence problem, depicted on figure 2.5, prevents estimation algorithms from correctly relating the intensity values of successive frames, and results from the spatio-temporal sampling achieved during digital image acquisition. Indeed, it is not always possible to respect the Nyquist frequency [52], particularly in the case of high spatial frequencies undergoing fast motions. A typical illustration of this kind of temporal aliasing is the "wheel" (figure 2.5): if the angular velocity of the wheel is greater than $\pi/n$ times the frame rate, where $n$ is the number of spokes, there will be an ambiguity in the correspondence process and the wheel seems to turn in the opposite direction.

Figure 2.6: The optical flow is not always equal to the motion field. (a) Null optical flow during non-null motion field. (b) Reverse situation.

Because motion is estimated by establishing correspondences between successive image intensities, any noise (camera noise, quantization noise, ...) will cause additional difficulties. Moreover, illumination changes will be interpreted as motion effects and will distance the optical flow from the true motion field. Figure 2.6(a) shows a smooth sphere rotating under constant illumination: the projected image does not change, yet the true motion field is non-null. On figure 2.6(b), a fixed sphere is illuminated by a moving source: shadow changes engender an optical flow, even though the true motion field is null.

Occlusions between moving objects, such as the appearance or disappearance of object parts (because of (un)covering), create regions where the observed intensities at time $t$ do not have any correspondence in image $t-1$. And the partial loss of the depth information does not provide enough information to recover the true motion.

2.2 Rate Distortion Theory

Before eventually reviewing some techniques of motion estimation & compensation, it seems important to highlight its main goal in a video compression context, which is to reduce the amount of data necessary to transmit moving pictures across bandwidth-limited channels. The various difficulties that might arise when estimating motion have been presented in the previous section, but motion estimation & compensation is still used and regarded as the most compressing tool of a complete codec.

To justify why motion estimation is used in video coders, Tziritas and Labit [127] use Rate Distortion theory (a brief summary of which may be found in appendix B). With some additional hypotheses, this theory clearly indicates the advantages and limits of motion estimation in the video compression framework.

With the results of appendix B.4, one can obtain a function of the distortion $D$ that defines the rate $R$ required to code a picture in intra mode:

$$R = \begin{cases} \dfrac{1}{2}\log_2 \dfrac{\sigma_I^2}{D} & \text{if } 0 \le D \le \sigma_I^2, \\ 0 & \text{if } D > \sigma_I^2, \end{cases} \qquad (2.4)$$

where $\sigma_I^2$ is the input image variance.

When inter image coding is performed without motion compensation, the image $I(x, y, t-1)$ is merely used as a prediction of $I(x, y, t)$. The residual coding then requires a bitrate of:

$$R = \frac{1}{2}\log_2\left( \frac{2\sigma_I^2\left(1 - \rho\, e^{-2\pi f_0 \sqrt{u^2 + v^2}}\right)}{D} + 1 \right), \qquad (2.5)$$

where $(u, v)$ is the true displacement, $\rho$ (defined in equation (B.29) of appendix B.4) is the temporal correlation of the two images, and $f_0 \approx 0.05$. Equation (2.5) outperforms equation (2.4) only if:

$$\sqrt{u^2 + v^2} < \frac{1}{2\pi f_0}\ln(2\rho) \qquad (\rho > 0). \qquad (2.6)$$


This condition clearly points out that inter image coding without motion compensation is only worthwhile if the displacement is small in comparison to the spatial variations of the picture. It also depends on the temporal correlation: in case of a scene cut, or of large displacements, $\rho$ is either very low or null and it is better to intra code the new picture.

Inter image coding with motion compensation is achieved with an estimated motion field $(u', v')$ different from $(u, v)$. We denote by $(d_x, d_y)$ the estimation error $(u' - u, v' - v)$. Such a coding also requires the coding of the residues¹. In this case, the rate is

$$R = \int_0^{1/\sqrt{2}} f \log_2\left( \frac{2\left(1 - \Phi(f_x, f_y)\right)\Phi_I(f_x, f_y)}{D} + 1 \right) df, \qquad (2.7)$$

where $\Phi(f_x, f_y)$ is the characteristic function of the motion estimation error $(d_x, d_y)$ and $\Phi_I(f_x, f_y)$ the power spectral density of the images (more details are available in appendix B.4). By comparing equations (2.4) and (2.7), one can demonstrate that inter image coding with motion compensation is more effective than intra coding only if

$$\sigma_d < \frac{\sqrt{2} - 1}{2\pi f_0} \qquad (f_0 > 0), \qquad (2.8)$$

where $\sigma_d^2$ is the variance of the motion estimation error $(d_x, d_y)$. Motion compensation techniques are thus only interesting if their accuracy is high enough with regard to the picture content. Finally, motion compensation improves inter image prediction (equation (2.5)) if

$$\sigma_d < \sqrt{u^2 + v^2}. \qquad (2.9)$$

These two last conditions not only justify the use of motion compensation in video coders whenever the motion estimation is precise enough, but also explain why all coders use several modes of transmission. They provide criteria to decide which mode should be used according to the spatio-temporal activity of the image sequence.

¹ In this calculus, the coding cost of the motion information is neglected.
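As a purely schematic illustration, the reconstructed conditions (2.6), (2.8) and (2.9) can be turned into such a mode-decision hint. The function below is our own sketch, not part of the thesis or of any standard, and it assumes $\rho$, $f_0$ and $\sigma_d$ are already known or estimated:

```python
import math

def mode_hints(sigma_d, u, v, f0, rho):
    """Evaluate the three simplified rate-distortion criteria for one picture."""
    d = math.hypot(u, v)  # magnitude of the true displacement
    return {
        "inter beats intra (2.6)":
            rho > 0 and d < math.log(2.0 * rho) / (2.0 * math.pi * f0),
        "MC inter beats intra (2.8)":
            sigma_d < (math.sqrt(2.0) - 1.0) / (2.0 * math.pi * f0),
        "MC improves plain inter (2.9)":
            sigma_d < d,
    }

# A large displacement with good estimation accuracy: plain inter coding
# fails criterion (2.6), while motion compensation satisfies (2.8) and (2.9).
print(mode_hints(sigma_d=0.3, u=3.0, v=4.0, f0=0.05, rho=0.9))
```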

2.3 Practical Approaches to Motion Estimation

If one is now convinced of the possible impact of motion estimation & compensation for video compression purposes, it is time to reveal how this complex problem can be solved. As the problem is an ill-posed one, constraints have to be added so as to reach a unique solution. The type of constraints is a first criterion that distinguishes one technique from another. The second criterion is the methodology that is adopted to solve the problem.

Recall that in a video coding context the aim of motion estimation is to detect the motion field $M$ (the optical flow) that is present between two successive images at time $t-1$, $I(x, y, t-1)$, and $t$, $I(x, y, t)$, where $(x, y)$ defines the pixel position in the image. Motion compensation then applies this motion field $M$ to the reference image $I(x, y, t-1)$ in order to obtain a prediction $\hat{I}(x, y, t)$ of $I(x, y, t)$.

2.3.1 Additional Constraints

By venturing hypotheses about the nature of the scene objects, additional constraints can be formulated. These constraints establish an error criterion that is exploited by a minimization process. There are two main types of constraints: the preservation constraint, which assumes that the object properties in terms of reflection and luminance are kept constant, and the coherence constraint, which is bound to the notion of object cohesion.

2.3.1.1 Preservation Constraint

This constraint considers that, if a luminous point of the scene is visible at time $t-1$, it is also visible at time $t$. Moreover, it assumes that the luminance of a pixel is invariant with respect to the motion, i.e. that any temporal modification of the luminance distribution over the pixels is directly attributable to the pixels' motion. Such a hypothesis is correct when the scene illumination is constant and uniform, and when the objects' reflectance is Lambertian [133]. The preservation constraint has two classical formulations.

DFD-based formulation. The Displaced Frame Difference (DFD) expresses the difference between the luminance of the image at time $t$ and the luminance of the image at time $t + dt$ having undergone some displacement $(dx, dy)$²:

$$DFD(x, y, dx, dy, t) = I(x + dx, y + dy, t + dt) - I(x, y, t). \qquad (2.10)$$

² $(dx/dt, dy/dt) = (u, v)$ describes the optical flow at position $(x, y)$.


The preservation constraint consists in assuming that a motion vector $(dx, dy)$ that nullifies the DFD exists. If such a vector does not exist, the aim of the motion estimation is to determine the vector that minimizes the DFD. The methods that use such a minimization process are called correlation-based. To compute the value of the DFD over a given region, the criteria that are most frequently used are:

the absolute value of the DFD over all region pixels,

$$\sum_{\mathrm{region}(i,j)} \left| DFD(i, j, u, v) \right|; \qquad (2.11)$$

the squared value of the DFD over all region pixels,

$$\sum_{\mathrm{region}(i,j)} \left( DFD(i, j, u, v) \right)^2; \qquad (2.12)$$

both these criteria can be divided by the total number of pixels taken into account, and are then respectively called the Mean Absolute Error (MAE) and the Mean Square Error (MSE).

The first criterion is often used as it requires less computation than the others.
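A minimal NumPy sketch of these criteria (our notation, not the thesis'; the caller must keep the displaced block inside the frame):

```python
import numpy as np

def dfd(prev, curr, region, d):
    """DFD of (2.10) over a block region = (y0, x0, h, w) of the current
    frame, displaced by d = (dy, dx) in the previous frame."""
    y0, x0, h, w = region
    dy, dx = d
    moved = prev[y0 + dy:y0 + dy + h, x0 + dx:x0 + dx + w].astype(np.int32)
    fixed = curr[y0:y0 + h, x0:x0 + w].astype(np.int32)
    return moved - fixed

def mae(prev, curr, region, d):
    return np.abs(dfd(prev, curr, region, d)).mean()   # (2.11) / nb of pixels

def mse(prev, curr, region, d):
    return (dfd(prev, curr, region, d) ** 2).mean()    # (2.12) / nb of pixels
```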

Differential formulation. If one considers the image function $I$ continuous and differentiable, a first-order Taylor expansion provides:

$$I(x + dx, y + dy, t + dt) = I(x, y, t) + I_x(x, y, t)\,dx + I_y(x, y, t)\,dy + I_t(x, y, t)\,dt, \qquad (2.13)$$

where $I_i$ indicates the partial derivative of $I$ with respect to $i$. The combination of equations (2.10) and (2.13) leads to the optical flow equation:

$$I_x(x, y, t)\,u + I_y(x, y, t)\,v + I_t(x, y, t) = 0. \qquad (2.14)$$

The optical flow equation only allows one to compute the component of motion in the direction of the spatial gradient (cf. the aperture problem, Section 2.1.2), and requires additional hypotheses to suppress all uncertainties.


2.3.1.2 Coherence Constraint

This constraint assumes the cohesion of all the elements of a unique object. It is valid if the motion variation between the neighboring elements of an area is limited. It can be implicitly expressed in two different ways thanks to the neighborhood information: either with a region-based approach (all pixels of the region obey the same motion parameters), or by the adoption of iterative or recursive solving methods that propagate the estimates of the neighboring pixels.

The coherence constraint can also be explicitly expressed when restrictions are formulated about the nature of the motion (a priori information), or when regularization is ensured by smoothing criteria.

Chicken and Egg Problem. A remark should be made concerning the implicit formulation in terms of regions: on the one hand, segmentation is required in order to determine the various regions on which the coherence constraint should be applied. On the other hand, this segmentation should take the motion information into account so as to respect motion transitions. A chicken and egg problem arises, as motion estimation requires segmentation, which requires motion estimation. Consequently, emerging techniques try to jointly solve the two problems.

2.3.2 Estimation Methods

Estimating the motion between successive pictures generally consists in minimizing a function that expresses some of the constraints presented above. Minimization can be achieved in several ways. One can distinguish three main families of methods:

Differential methods, which are based on gradient measures. Direct differential methods aim at nullifying the gradient of the function to be minimized, while indirect differential methods converge towards a solution according to the gradient direction. Iterative (Section 2.4.2) and pel-recursive (Section 2.4.3) motion estimation algorithms are part of this class of methods.

Matching methods are based on an explicit search for the best match between two structures (one at time $t$ and another at time $t-1$). Any kind of primitive can serve as a structure: pixels, blocks of pixels, regions, segments, etc. The search for the best match generally involves trying all the solutions of the search space. Block-Matching (Section 2.5.1) and Hexagonal-Matching (Section 2.6.1) belong to this class.

Stochastic methods use random choices to drive the exploration of the parameter space. They include Bayesian estimation, Markov models (Section 2.4.4) and genetic algorithms.

These different approaches to the motion estimation problem can be variously implemented: although only one methodology is generally adopted, it may be of interest to use a hierarchy of models [92]. In addition, Section 2.3.2.1 introduces two special types of implementations that have been successfully applied to several methods: the multigrid and multiscale optimization methods. Before reviewing some techniques, Section 2.3.2.2 briefly tackles the choice of the sense of estimation.

2.3.2.1 Multigrid and Multiscale Optimization Methods

In order to avoid convergence to local minima and to speed up the convergence, the motion estimation methods described previously are sometimes coupled with multigrid or multiscale optimization techniques (figure 2.7).

Multigrid algorithms operate on a hierarchy of resolution levels (image pyramid) that are built with low-pass filtering and sub-sampling by a factor 2 in each direction. Multiscale methods are based on the original resolution of the image data but, like multigrid schemes, they produce a pyramid representation of the motion data.

To develop a multigrid or multiscale algorithm, several components must be specified:

the number of levels;

a restriction operation that maps a solution at a fine level to a coarser level;

a prolongation operation that maps from the coarse to the fine level;

a coordination scheme that specifies the number of iterations at each level of the pyramid and the sequence of prolongations and restrictions.

Figure 2.7: Multigrid and multiscale optimization

The coordination scheme which is most frequently used is a simple coarse-to-fine algorithm, where the prolongated coarse solution is used as a starting point for the next finer level. In this case, simple repetition and bilinear interpolation are the most commonly used prolongation methods. More sophisticated schemes implement a fine-to-coarse-to-fine [32] sequence in order to further refine the solution.

When multigrid methods are applied to motion estimation, low spatial frequencies are used to measure large displacements with a low accuracy. Higher-frequency information is then used to improve the accuracy of the estimation by incrementally estimating small displacements. Besides achieving a computationally efficient estimation, this also reduces the aliasing introduced by high spatial frequency components undergoing large motions.

Multigrid optimization has been successfully applied to iterative or pel-recursive approaches, as large displacements are generally not reachable with these methods. Multigrid has also been applied to the BMA, which gives a hierarchical search BMA. But of course the effectiveness of multigrid methods depends on the image content. If the image mainly contains high spatial frequencies, then, after low-pass filtering, there may be insufficient information to allow a reliable estimation. To overcome this limitation, multiscale methods can be used: the ABMA [109], presented in Section 3.1, illustrates this principle.

2.3.2.2 Forward versus Backward Estimation

To estimate the motion between two images at time $t-1$ and $t$, there are two possibilities: forward or backward motion estimation (figure 2.8). The search in image $t$ for a displaced object of image $t-1$ is a forward search, while the backward search consists in looking for an object of the present image $t$ in the previous one.

The backward sense is generally used with region matching techniques, so that the displaced regions cover the whole image surface. On the contrary, the forward sense allows both the coder and the decoder to exploit their memory so as to automatically create a partition of the image at time $t-1$. Moreover, some interpolation schemes and motion analysis tools use both senses of estimation so as to overcome prediction errors resulting from occlusions [33].


One must point out that in the H.263 [96] and MPEG [62] context, backward and forward are used with another meaning: when bidirectional coding is applied, a macroblock of image $t$ can be obtained with a forward prediction (from image $t-1$) or with a backward one (from image $t+1$).

Figure 2.8: "Forward" versus "backward" motion estimation

2.4 Background Techniques for Motion Estimation

2.4.1 Linear Regression

Linear regression uses both the preservation constraint of equation (2.14) and a translational model of displacement for determined regions. The resolution over all the region pixels is achieved by a least squares method, and provides the following solution:

$$(\hat{u}, \hat{v}) = \arg\min_{(u,v)} \sum_{(i,j) \in R} \left( I_x(x, y, t)\,u + I_y(x, y, t)\,v + I_t \right)^2, \qquad (2.15)$$


where $I_i$ denotes the partial derivative of $I$ with respect to $i$. Usually, the difference between the two frames serves as the temporal gradient, and the spatial gradients are digitally computed on the previous image. Small displacements can be measured with this method.
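A possible implementation sketch of this least-squares resolution (our code; it solves the 2×2 normal equations of (2.15) for a single region given as a boolean mask):

```python
import numpy as np

def linear_regression_flow(I_prev, I_curr, region):
    """Estimate one translational (u, v) for `region` by least squares."""
    Iy, Ix = np.gradient(I_prev.astype(np.float64))     # spatial gradients on t-1
    It = I_curr.astype(np.float64) - I_prev.astype(np.float64)  # temporal gradient
    ix, iy, it = Ix[region], Iy[region], It[region]
    # Normal equations of the two-parameter least-squares problem.
    A = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    b = -np.array([np.sum(ix * it), np.sum(iy * it)])
    return np.linalg.solve(A, b)                        # (u, v)
```

The 2×2 system becomes singular in flat areas, where the aperture problem leaves $(u, v)$ undetermined.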

2.4.2 Iterative Motion Estimation

A gradient-based motion estimation was proposed in [49] so as to determine the optical flow. It was one of the very first methods established to solve equation (2.14). In order to correctly constrain the problem, Horn and Schunck have added an a priori smoothness condition on the resulting optical flow: the value of the gradient module has to be as small as possible. The problem then turns into the minimization of a cost function expressed as:

$$\iint \left( (I_x u + I_y v + I_t)^2 + \lambda\,(u_x^2 + u_y^2 + v_x^2 + v_y^2) \right) dx\,dy, \qquad (2.16)$$

where $u_x, u_y, v_x$ and $v_y$ are the first partial derivatives of the two optical flow components $(u, v)$, and $\lambda$ is a (Lagrange) constant that balances the importance of the error in the motion equation against the penalty of departure from smoothness. A solution provided to this minimization problem is:

$$\hat{u} = u_m - I_x \frac{P}{D}, \qquad \hat{v} = v_m - I_y \frac{P}{D}, \qquad (2.17)$$

where $u_m$ and $v_m$ are local averages of $u$ and $v$, and

$$P = I_x u_m + I_y v_m + I_t, \qquad D = \lambda + I_x^2 + I_y^2. \qquad (2.18)$$

The final determination of the optical flow can then be based on an iterative Gauss-Seidel method, refining $(\hat{u}_i, \hat{v}_i)$ using $(\hat{u}_{i-1}, \hat{v}_{i-1})$ ($i$ being the iteration number) until a certain convergence criterion is reached.
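A compact sketch of this scheme (our implementation, using scipy.ndimage for the local averaging; `lam` stands for the constant $\lambda$ of (2.16), and a fixed iteration count replaces the convergence test):

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, lam=100.0, n_iter=100):
    Iy, Ix = np.gradient(I1.astype(np.float64))
    It = I2.astype(np.float64) - I1.astype(np.float64)
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    # 4-neighbour averaging kernel producing the local means um, vm.
    avg = np.array([[0., .25, 0.], [.25, 0., .25], [0., .25, 0.]])
    for _ in range(n_iter):
        um, vm = convolve(u, avg), convolve(v, avg)
        P = Ix * um + Iy * vm + It      # numerator of (2.18)
        D = lam + Ix ** 2 + Iy ** 2     # denominator of (2.18)
        u = um - Ix * P / D             # update (2.17)
        v = vm - Iy * P / D
    return u, v
```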

As expected, the resulting motion field contains only smooth spatial variations, which is not always correct with regard to divergent object motions, as illustrated on figure 2.9.

Moreover, as such techniques provide a dense motion field (one motion vector for every pixel), the coding cost is very high. This is the major reason why they are practically never used for coding but for analysis instead. Another way of using this type of motion estimation which avoids the extensive transmission cost consists in performing the estimation at the decoder side, between the decoded frames at time $t-2$ and time $t-1$. The computed motion field can then be applied to image $t-1$ in order to predict frame $t$ (figure 2.10). It avoids transmitting the dense motion field but forces a lot of computation to be achieved at the decoder side. Furthermore, many problems of occlusion and uncovered background can occur.

Figure 2.9: Limit of iterative motion determination: (a) Sphere at time $t$ (b) Applied motion field on $I(t-1)$ (c) Detected optical flow

Figure 2.10: Estimation performed at the decoder



2.4.3 Pel-Recursive Algorithms

Pel-recursive [91] motion estimation algorithms estimate 2D motion recursively on a pixel basis: given an initial estimate $d_i = (u_i, v_i)$ for every point, a correction is carried out according to the resulting DFD:

$$d_{i+1} = d_i + \Delta d_i, \qquad (2.19)$$

with $\Delta d_i = (\Delta u_i, \Delta v_i)$ the update term of iteration $i$. The iteration can be executed along a scan line, from line to line, or from frame to frame; the technique is then respectively denoted pel-recursive estimation with horizontal, vertical or temporal recursion. The basic assumption of this technique is that the DFD converges locally to zero when the estimated motion converges to the actual movement of the object point. The aim is thus to recursively minimize the squared value of the DFD using a steepest-descent (gradient) method:

$$d_{i+1} = d_i - \varepsilon\, DFD(x, y, u, v)\, \nabla_{d_i}\!\left( DFD(x, y, u, v) \right), \qquad (2.20)$$

di+1 = di , "DF D(x; y; u; v)rdi(DFD(x; y; u; v)) (2.20)<br />

where rdi is the gradient operator with respect to di <strong>and</strong> " a positive<br />

constant. Noting that<br />

rdi(DFD(x; y; u; v)) = rI(x , u; y , v; t , ) (2.21)<br />

where rI is the spatial image gradient, we obtain:<br />

di+1 = di , "DF D(x; y; u; v)rI(x , u; y , v; t , ): (2.22)<br />

" is a regulating parameter that achieves quick but sometimes oscillating<br />

convergence if it is high, or s<strong>low</strong> but accurate estimate if it is small. More<br />

advanced techniques use a variable " to improve both the convergence<br />

speed <strong>and</strong> the solution accuracy.<br />
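A self-contained sketch of one update step (2.22) at a single pixel (our code; integer-rounded sampling keeps it short where a real coder would interpolate, and border pixels are assumed to be avoided by the caller):

```python
import numpy as np

def pel_recursive_update(I_prev, I_curr, x, y, d, eps=1e-3):
    """One steepest-descent correction of the estimate d = (u, v)."""
    u, v = d
    xs, ys = int(round(x - u)), int(round(y - v))       # displaced position
    dfd = float(I_curr[y, x]) - float(I_prev[ys, xs])   # backward DFD
    # Central-difference spatial gradient of the previous image.
    gx = (float(I_prev[ys, xs + 1]) - float(I_prev[ys, xs - 1])) / 2.0
    gy = (float(I_prev[ys + 1, xs]) - float(I_prev[ys - 1, xs])) / 2.0
    return np.array([u - eps * dfd * gx, v - eps * dfd * gy])
```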

The evaluation of pel-recursive methods is very similar to that of iterative methods. In fact, the pel-recursive methodology has been applied to interframe coding using the scheme of figure 2.10, which implies a lot of computation. In addition to the problem of properly computing the gradient, pel-recursive methods also estimate a motion field that is generally too smooth (like iterative methods). This is why an original approach to the problem of discontinuities has been proposed by Gaidon [37]. Although it implies very heavy computation, it is worth presenting because of the novelty it introduces.


Figure 2.11: The dual lattice for field segmentation

2.4.4 Stochastic Estimation Relying on Markov Random Fields

In order to manage discontinuities, motion fields can be modeled as Markov Random Fields [40] (MRF, summary in appendix C) within a Bayesian framework. Therefore, a lattice dual to the one designed by the image sites is set up. This dual lattice has its sites located between the pixels, and represents the possible discontinuities (edges) of the field (Figure 2.11).

The sampling grid of this dual lattice has a quincunx structure. Its neighborhood can be defined according to equation (C.8), with $k_n = 1/2, 1, 2, 5/2, 4, \ldots$ for $n = 1, 2, 3, 4, 5, \ldots$ (the distance computed from the line center). Figure 2.12 shows the configurations of the dual lattice neighborhood for $n = 1$ and $2$.

Based on this lattice, Geman and Geman [99] introduced a line or discontinuity process, which allows neighboring sites to have different interpretations. The only cost lies in the introduction of the line process, which is modeled as an MRF. The elements of this MRF are either active ("on") or inactive ("off"): an active line means that a discontinuity occurs in the motion field at the line location.

Without such a line process, the energy function to be minimized is usually of the form:

$$E(u, v) = (1 - \lambda)\,E_d(u, v) + \lambda\,E_p(u, v) = (1 - \lambda) \sum_{x,y} DFD^2(x, y, u, v) + \lambda \sum_{x,y} \frac{1}{4}\left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right), \qquad (2.23)$$

which includes both an energy term ensuring conformity to the data and a stabilizing (regulating) term that smooth-constrains the solution; $\lambda$ is a (Lagrange) parameter.

Instead, in order to include the discontinuities, the function is made dependent on both the displacement $(u, v)$ and the line process $l$ (particularized here to $b, b', c, c'$, cf. figure 2.13):


Figure 2.12: Dual lattice neighborhood: (a) first-order neighborhood (b) second-order neighborhood

Figure 2.13: Particular configuration of the line process


Figure 2.14: Comparison of motion field estimation: (a) With smoothness constraint (b) With Markov Random Field model

$$\begin{aligned} E(u, v \mid l) ={} & (1 - \lambda) \sum_{x,y} DFD^2(x, y, u, v) \\ &+ \frac{\lambda}{2} \Big( \sum_{i,j} u_x^2(i,j)\,(1 - b_{i,j}) + \sum_{i,j} v_x^2(i,j)\,(1 - b'_{i,j}) + \sum_{i,j} u_y^2(i,j)\,(1 - c_{i,j}) + \sum_{i,j} v_y^2(i,j)\,(1 - c'_{i,j}) \Big) \\ &+ \mu \sum_{i,j} \left( b_{i,j} + b'_{i,j} + c_{i,j} + c'_{i,j} \right), \qquad (2.24) \end{aligned}$$

with $\mu$ the cost of introducing one discontinuity in the motion field: either no discontinuity is introduced ($b, b', c$ or $c' = 0$) and the algorithm performs as before, or a discontinuity is used ($b, b', c$ or $c' = 1$), which eliminates the smoothness constraint but incurs the additional cost $\mu$. Although the Markov model induces more computation, the result is much closer to reality, as figure 2.14 demonstrates.

2.4.5 Parametric Models of the Motion Field

All the techniques described up to now may be classified as non-parametric methods. In this section, parametric methods that explicitly describe the motion of individual pixels within a region with a small number of parameters are introduced. The problem of motion estimation is then equivalent to a problem of parameter estimation. Since all the pixels within a region can contribute to this estimation, highly reliable results may be obtained. The parametric models that are mostly used implicitly assume that objects are rigid planar surfaces undergoing 3D motion [123, 125, 124].

From Two to Twelve and More Parameters. Parametric models are often characterized by their number of parameters. Starting from the simplest model, the translation hypothesis, the position $(x', y')$ of a pixel with respect to the position $(x, y)$ of the pixel in the reference image is:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.25)$$

Using the classification of Jozawa [53], this model may be built up step by step. The integration of a unique horizontal and vertical scaling factor $C$ gives a three-parameter model:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = C \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.26)$$

One additional parameter can be included, either to separately specify the scaling factors with

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} C_x & 0 \\ 0 & C_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, \qquad (2.27)$$

or to take into account a rotation with

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = C \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.28)$$

Combining both improvements, one obtains a five-parameter model:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} C_x & 0 \\ 0 & C_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.29)$$

Finally, a distinction between the $x$- and $y$-axis rotations leads to the six-parameter affine transform:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} C_x \cos\theta_x & -C_y \sin\theta_y \\ C_x \sin\theta_x & C_y \cos\theta_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} a_5 \\ a_6 \end{pmatrix}. \qquad (2.30)$$

The affine transform results from the orthographic projection of the motion of a planar surface. Under perspective projection, an eight-parameter perspective transform is built:

$$x' = \frac{a_1 + a_2 x + a_3 y}{1 + a_7 x + a_8 y}, \qquad y' = \frac{a_4 + a_5 x + a_6 y}{1 + a_7 x + a_8 y}. \qquad (2.31)$$

Another commonly used transform is the bilinear transform:

$$x' = a_1 x + a_2 y + a_3 xy + a_4, \qquad y' = a_5 x + a_6 y + a_7 xy + a_8. \qquad (2.32)$$

Higher-level models also take into account acceleration effects. Sanson [117] for instance proposes a twelve-parameter model:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a^x_x & a^x_y \\ a^y_x & a^y_y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} b^x_{x^2} & b^x_{xy} & b^x_{y^2} \\ b^y_{x^2} & b^y_{xy} & b^y_{y^2} \end{pmatrix} \begin{pmatrix} x^2 \\ xy \\ y^2 \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}. \qquad (2.33)$$

Because of the presence of numerous moving and possibly overlapping objects in the scene, the above parametric models do not hold, in general, throughout the whole image plane. A solution to this problem is provided by assuming that every object is characterized by its own motion parameters. This leads to the "chicken-and-egg" combined segmentation & estimation problem. A way to overcome it is to use warping techniques (cf. Section 2.6) that have successfully implemented a matching methodology.
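As a small illustration of the parametric viewpoint, the sketch below (our code, not from the thesis) applies the six-parameter form of (2.30) to a grid of pixel coordinates; the simpler models are obtained by constraining the parameters:

```python
import numpy as np

def affine_warp_coords(xs, ys, params):
    """Map (x, y) to (x', y') with the (a1..a6) form of (2.30)."""
    a1, a2, a3, a4, a5, a6 = params
    return a1 * xs + a2 * ys + a5, a3 * xs + a4 * ys + a6

xs, ys = np.meshgrid(np.arange(4), np.arange(4))
# Identity matrix plus a translation: this degenerates to model (2.25).
xp, yp = affine_warp_coords(xs, ys, (1.0, 0.0, 0.0, 1.0, 2.0, -1.0))
```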

2.4.6 Within a Transform Domain

Estimation methods based on spatio-temporal filters over several pictures have recently been implemented. They are based on the property of the Fourier transform to concentrate the energy within a few coefficients of the transformed frequency domain. The motion estimation can then be achieved by analyzing the importance and the variation of the temporal frequency $\omega_t$. However, the main limitation of the use of such techniques for coding purposes is that they require more than two successive pictures so as to provide a reliable analysis.

2.5 The Block-Matching Algorithm (BMA)

Since its introduction by Jain and Jain in 1981, the Block Matching Algorithm (BMA, [50]) has emerged as the motion estimation technique achieving the best compromise between complexity and quality: a fast estimation procedure allows obtaining a block-based motion field that is transmitted at low cost. An appropriate choice of the block size offers a compromise between adaptation to small moving objects (performed by small blocks) and robustness against noise (performed by large blocks). These properties have earned the BMA its inclusion in most video standards like H.263 [96], MPEG-1/2 [62] and MPEG-4 [104].

2.5.1 BMA Principle

The principle of the BMA is to apply a translational motion model to subblocks of the image. For every block, the matching measure is based on the Displaced Frame Difference (DFD). Fuh and Maragos [36] accordingly consider the BMA, with its two sole free parameters (the translation vector), as a particular case of more elaborate models.

Its implementation in most standards follows the backward procedure (cf. Section 2.3.2.2):

1. the image $I(t)$ is divided into a set of subblocks of size $(K, L)$;

2. for every subblock, the origin of the block in the reference image $I(t-1)$ is searched for within a search area $\|u\| \le \Delta u$, $\|v\| \le \Delta v$ (cf. figure 2.15), according to one of the criteria presented in Section 2.3.1.1;

3. compensation is achieved by reconstructing $\hat{I}(t)$ with the blocks of $I(t-1)$ designated by the motion vectors.
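A compact full-search sketch of this backward procedure under the MAE criterion (our own code, not a standard's; frame dimensions are assumed to be multiples of the block size):

```python
import numpy as np

def block_matching(I_prev, I_curr, block=16, search=8):
    """Return one integer (du, dv) per block of I_curr, matched in I_prev."""
    H, W = I_curr.shape
    field = np.zeros((H // block, W // block, 2), dtype=np.int32)
    for by in range(0, H, block):
        for bx in range(0, W, block):
            cur = I_curr[by:by + block, bx:bx + block].astype(np.int32)
            best, best_d = None, (0, 0)
            for dv in range(-search, search + 1):
                for du in range(-search, search + 1):
                    y, x = by + dv, bx + du
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue            # candidate block leaves the frame
                    ref = I_prev[y:y + block, x:x + block].astype(np.int32)
                    err = np.abs(ref - cur).mean()       # MAE criterion
                    if best is None or err < best:
                        best, best_d = err, (du, dv)
            field[by // block, bx // block] = best_d
    return field
```

Compensation then simply copies, for every block, the block of $I(t-1)$ designated by its vector.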


Figure 2.15: The Block Matching Algorithm search space: a $(K \times L)$ block of the current frame is searched for in a $(K + 2\Delta u) \times (L + 2\Delta v)$ window of the previous frame
Figure 2.15: The Block Matching Algorithm search space<br />

2.5.1.1 Search Techniques<br />

The location of the best match within the search area thanks to an<br />

error criterion (e.g. the MAE) requires intensive computation when<br />

it is per<strong>for</strong>med using a full search (FS) procedure. There<strong>for</strong>e, several<br />

complexity-reduced algorithms have been proposed <strong>for</strong> a faster search<br />

<strong>for</strong> the minimum DFD. These algorithms (listed be<strong>low</strong>) are of particular<br />

interest <strong>for</strong> software implementations. These are (references are given<br />

in [108]):<br />

the three-step algorithm (TSA);

the 2D logarithmic search (2D LM);

the modified motion estimation algorithm (MMEA);

the conjugate direction search (CDS);

...

Table 2.1 compares the computational complexity required by all these techniques, for a maximum displacement $\Delta u = \Delta v = M$ (with pel accuracy, see the next section).


Algorithm   Maximum number of search points   M = 4   M = 8   M = 16
FS          $(2M+1)^2$                        81      289     1089
TSA         $1 + 8\log_2 M$                   17      25      33
2D LM       $2 + 7\log_2 M$                   16      23      30
MMEA        $1 + 6\log_2 M$                   13      19      25
CDS         $3 + 2M$                          11      19      35

Table 2.1: Fast search algorithms for the BMA: number of positions to test

2.5.1.2 Advanced Possibilities

Computation of a dense motion field is possible with the BMA. In this case, the minimization of the error function is computed for each image pixel. In order to reduce the probability of false matches caused by noise, the matching criterion (e.g. the MSE) relies on a square window around the pixel.

Subpel accuracy is possible when the BMA is computed backwards. It means that blocks of image $t$ are searched for in image $t-1$ with a step of half a pel (which of course requires interpolation functions) or with a lower fraction of a pel. Such half-pel accuracy is often used so as to compensate for the interpolation effects engendered by the small displacements of the camera.

2.5.1.3 Result

Figure 2.16 presents some results of the Block Matching Algorithm (full search, half-pel precision). The smaller $K$ and $L$ are, the better the blocks can match the objects, but more computations are needed and the algorithm becomes more sensitive to noise.

Despite its simplicity and wide use, the BMA presents one major limitation because of its assumption that block motion is purely translational. This assumption does not hold in many instances, as different regions undergoing different movements may exist within a same block. Moreover, when adjacent blocks use different vectors, they form straight-line discontinuities in the prediction image, known as blocking artifacts, to which the Human Visual System (HVS) is very sensitive. The bigger the blocks are, the more visible the blocking artifacts. Two classical solutions exist to solve these problems: overlapped BMA, presented in Section 2.5.2, and multigrid or multiscale BMA, an example of which is given in Section 3.1.

Figure 2.16: BMA results: (top) original images, (center) motion field and result with 8×8 blocks, (bottom) ibid. with 16×16 blocks

In order to synthesize the evaluation of the BMA technique, two last points have to be considered: motion field coding and the computational burden of the compensation. The former is subject to various discussions, but one could state that, globally, it is possible to efficiently encode the resulting motion field. The latter is easier: the compensation, which merely consists in applying the selected motion vector to every subblock, is very fast. However, annoying blocking artifacts appear.

2.5.2 Overlapped BMA

In order to reduce such artifacts, overlapped block motion compensation has been proposed [116, 4] and even included as an option in the ITU H.263 standard. The underlying idea is to reconstruct the image using overlapping neighbor blocks. The problem of determining optimal windows with adequate weights may be formulated as an optimal linear estimation problem of pixel intensities. The final value of a pixel in the reconstructed image is thus a weighted sum of the intensities coming from the different neighbor vectors. Figure 2.17 illustrates a particular neighborhood configuration that results in the following weighted equation:

$$\begin{aligned} I(x, y, t) ={} & w_A\, I(x + \vec{A}_x,\, y + \vec{A}_y,\, t-1) + w_B\, I(x + \vec{B}_x,\, y + \vec{B}_y,\, t-1) \\ &+ w_C\, I(x + \vec{C}_x,\, y + \vec{C}_y,\, t-1) + w_D\, I(x + \vec{D}_x,\, y + \vec{D}_y,\, t-1), \end{aligned} \qquad (2.34)$$

where the $w_i$ are adequate weights and $\vec{V}$ denotes the motion vectors.
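A minimal sketch of the reconstruction (2.34) for one pixel (our code), assuming the four neighboring vectors and normalized weights are given:

```python
import numpy as np

def obmc_pixel(I_prev, x, y, vectors, weights):
    """Weighted prediction of pixel (x, y); vectors = [(Ax, Ay), ..., (Dx, Dy)],
    weights = [wA, wB, wC, wD] with sum(weights) == 1."""
    return sum(w * float(I_prev[y + vy, x + vx])
               for (vx, vy), w in zip(vectors, weights))
```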

2.6 Image Warping Techniques

Image warping techniques are correlation-based parametric methods. They were initially developed in order to cope with more displacement configurations than the pure translation of block matching (cf. Section 2.5.1). They consist of three main steps.


Figure 2.17: Overlapped block motion compensation

At first, the input image is split into small patches. When the patch structure (or mesh, or wireframe) is predetermined, the approach is called fixed mesh motion compensation. Otherwise, if the mesh is adaptively built, e.g. according to the image content, it is called adaptive mesh motion compensation. In this case, the patch structure is adaptively deformed to fit the contours of the moving areas or objects. No additional information about the patch structure is needed if the patch adaptation is applied to the decoded image of the previous frame. For such a choice, forward estimation is of course advantageous. It is notable that warping techniques may indifferently be applied forwards or backwards.

Subsequently, the motion vectors of the grid points, or vertices, are estimated. Figure 2.18(a) demonstrates the estimation for a quadrangular mesh [93].

The last step implements the motion compensation (figure 2.18(b)): the displacements of the vertices are sent as motion vectors, and the vectors for the remaining pixels are obtained with a geometric transformation technique (image warping) and some interpolation (e.g. using bilinear or bicubic [82, 54] interpolation). The parameters of the transformation can of course be determined from the vertices' motion vectors.

While the bilinear (equation (2.32)) and perspective (equation (2.31)) transforms require quadrilateral patches, triangular patches correspond to the affine transform (equation (2.30)): the two motion components of each of the three vertices of the triangle determine the six parameters of the affine transform (Figure 2.19).


Figure 2.18: Principle of warping motion estimation and compensation: (a) Estimation of the vertices' motion (b) Warping motion compensation


Figure 2.19: Affine transform between two triangular patches

The two invariants $p$ and $q$ of the transformation help warping any point of a triangle into its affine-deformed version:

$$X = A + p\,\vec{AB} + q\,\vec{AC}, \qquad X' = A' + p\,\vec{A'B'} + q\,\vec{A'C'}. \qquad (2.35)$$
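A small sketch (our code) of this invariant-based warping: the point is expressed in the $(\vec{AB}, \vec{AC})$ frame of the source triangle and re-expressed in the deformed one:

```python
import numpy as np

def warp_point(X, A, B, C, A2, B2, C2):
    """Map X from triangle (A, B, C) to (A2, B2, C2) using (2.35)."""
    M = np.column_stack([B - A, C - A])   # columns are the AB and AC vectors
    p, q = np.linalg.solve(M, X - A)      # the two invariants of (2.35)
    return A2 + p * (B2 - A2) + q * (C2 - A2)

A, B, C = np.array([0., 0.]), np.array([1., 0.]), np.array([0., 1.])
A2, B2, C2 = np.array([0., 0.]), np.array([2., 0.]), np.array([0., 2.])
print(warp_point(np.array([0.25, 0.25]), A, B, C, A2, B2, C2))  # [0.5 0.5]
```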

2.6.1 The Hexagonal Matching Algorithm (HMA)

Using such triangles, Nakaya and Harashima have proposed to estimate motion by means of a Hexagonal Matching Algorithm (HMA, [88]). Once a regular mesh has been overlaid on the picture, the authors propose a two-step algorithm to determine the vector of every mesh vertex:

1. the displacement of the vertices is first estimated with a coarse BMA;

2. then an iterative local minimization of the prediction error refines this initial displacement. The Hexagonal Matching Algorithm (HMA) treats every vertex $x$ sequentially: it fixes its neighbor vertices ($a, b, c, d, e, f$ in figure 2.20) and searches for the new position $x'$ that minimizes the reconstruction error of all the attached patches. The procedure iterates over all vertices whose position has been modified in the previous iteration. It ends when no vertex moves anymore or after a fixed number of iterations. Local convergence is ensured since the solution is always locally improved.

Figure 2.20: Refinement procedure of the Hexagonal Matching Algorithm

An important feature of such mesh-based compensation schemes is that the connectivity of the mesh must be preserved: inconsistent motion vectors like the one of figure 2.21, which result in overlapping mesh elements, have to be avoided.

Figure 2.21: Illustration of an inconsistent motion vector for the HMA

The result of such a procedure is illustrated on figure 2.22 for the original images of figure 2.16 (top). As expected, no blocking artifacts are present in the compensated picture, since rotation and zoom effects are taken into account by the affine transform. However, two drawbacks have to be pointed out:

the estimation requires a lot of computation, as it starts with a classic BMA followed by an iterative algorithm (that can be parallelized);

the fixed grid takes no account of the image content.

2.6.2 Adaptive Hexagonal Matching Algorithm (AHMA)

In order to overcome the second drawback, Dudon extended this work to active meshes that are automatically adapted to the spatial contents of the image [29, 30]. In parallel, Dudon also established other ways of estimating the motion of the vertices (mesh nodes) [28]. Vertices are placed on characteristic points that are detected according to the spatial activity. During the encoding, memory and temporal gradient can be used to concentrate vertices in crucial areas of the image. The mesh is then generated thanks to a Delaunay triangulation [17, 119]. The result of this adaptation is depicted on figure 2.23 with reference to figure 2.22, and the final result is still better, or equivalent for a lower number of vertices (99 in figure 2.23 versus 120 in figure 2.22).

If one wishes to evaluate both the HMA and the AHMA, it stands out that the compensation is not that costly and achieves very pleasant visual results, as no blocking artifacts are present on the compensated image. Yet, the estimation is very computation-demanding, and the motion field is often spurious and less reliable than the one of the BMA. Moreover, in the case of adaptive meshes, the neighborhood relation between the vectors is not obvious and the entropy coding of the motion field is less efficient.

2.7 Conclusion

The present chapter has introduced the problem of motion estimation and compensation in the framework of (VLBR) video coding and has tried to clearly distinguish between estimation and compensation. Once the ambiguities of the estimation problem have been expanded on, the advantages of using such techniques in video codecs have been demonstrated by the Rate-Distortion Theory.


Figure 2.22: Example of hexagonal matching compensation: (a) Fixed mesh on reference image (b) Compensated image

Figure 2.23: Adaptive mesh hexagonal matching: (a) Adaptive mesh (b) Compensated image

The present chapter has tackled the classical additional constraints and methodologies used to estimate motion in video sequences. A direct confrontation of the different approaches is quite complex, as a method may be more pertinent than another in a particular context.


In the context of the present thesis, which deals with (VLBR) video coding, two particular techniques have been identified as the best performing ones. The Block Matching Algorithm, or derived techniques, provides very fast estimation and compensation stages, but its final visual results suffer from so-called blocking artifacts. This BMA is nonetheless part of most video coding standards. The Hexagonal Matching Algorithm, and derived techniques, offers a better visual quality after compensation, but at the cost of more computational effort, as it uses triangular meshes so as to warp images according to the motion information.

The following chapters will present some improvements we have tried to bring into the field of motion estimation and compensation, having always in mind the (very) low bitrate video coding context.


Chapter 3

Study of the Implementation of a Multiscale Block Matching Algorithm

In order to minimize or eliminate the most important drawbacks of the ordinary BMA (cf. Section 2.5.1), Paula Queluz and Benoît Macq have proposed a new motion estimation algorithm, the Adaptive Block Matching Algorithm (ABMA, [110]). Our first doctoral task has been the C++ programming and fine-tuning of the ABMA in order to perform simulations in the COMIS scope (cf. Section 1.4.2). The development environment was made of a C++ class for picture manipulation. We have therefore extended it into a class for motion field manipulation. In this definition process, special attention has been paid to accessing (sub-)blocks of a motion field so as to be appropriate for BMA and ABMA computation.

The ABMA proposes a solution to the lack of adaptiveness of the classical BMA to object contours. Moreover, it tries to solve the following contradiction: in a normal BMA, small blocks are required to respect the uniform motion assumption for each object of the scene, while large blocks are necessary to avoid the noise influence. The functioning principle of the ABMA is presented in Section 3.1.

The chapter also introduces a way of distributing the computational load engendered by the ABMA among several processors. This work has been achieved in the Master thesis of François Vermaut [131], which we have supervised. Section 3.2 presents his original work.

3.1 Adaptive Block Matching Algorithm

The overall ABMA algorithm, presented in the following subsections, can be summarized as follows: at first, global camera motion, e.g. panning and zooming, is estimated. The reference image is then compensated according to the estimated global parameters. A change detector compares this resulting prediction with the original image and outputs a binary mask that allows the distinction between globally and locally moving regions. The third and last step consists in an improved version of the BMA: a split-and-merge procedure considers a hierarchical structure of block sizes (or scales, as the local part of the ABMA is an example of the multiscale methodology, cf. Section 2.3.2.1) so as to overcome the problem of having different moving objects in the same BMA block.

A few intermediate results of the ABMA are presented on figure 3.1. They should be compared with the BMA results of figure 2.16, as they both use the same original images.

3.1.1 Global Motion Estimation

The apparent motion in most image sequences is the result of the camera movement as well as the movement of the objects in a scene. The part chargeable to the camera is generally referred to as global motion, in opposition to the local motion of the objects. The camera movement can be expressed as a combination of the following categories:

- fixed;
- zooming, i.e. change of the camera focal length;
- panning, i.e. rotation around an axis normal to the camera axis;
- rotation, i.e. rotation around the camera axis;
- dollying, i.e. translation along the camera axis;
- tracking and booming, i.e. translation in the plane normal to the camera axis, horizontally (tracking) or vertically (booming).


Even if only zooming and panning are considered to model global motion, Tse and Baker [126] have shown that a two-stage global/local ME approach improves motion prediction and reduces the amount of motion side information. The ABMA uses the following algorithm (sketched in code right after this list) to carry out a suboptimal search for the best pan and zoom:

1. Division of the picture into large blocks (typically, for QCIF images, there is only one block: the picture itself).

2. Only the blocks with a variance greater than or equal to Tvar are considered in the following steps 4 to 6.

3. Selection of a set of zooms 1 + fz, with fz = n·Δfz and n = 0, ±1, ..., ±nmax. Selection of a BMA search window (Δu, Δv).

4. For every valid block, computation of the BMA with every possible zoom. Memorization, for every block, of the best zoom/motion-vector combination (i.e. the one that achieves the lowest MAE).

5. Selection of the zoom with the highest occurrence among the best combinations: fz,OPT.

6. Δfz = Δfz/2, and repetition of step 4 with fz = fz,OPT ± Δfz.

7. After maxstep iterations, computation of the BMA for all the blocks, with the selected zoom.

8. Computation of the globally compensated picture.

This globally compensated image will now serve as reference in the next steps of the algorithm.
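As a concrete reading of this search, here is a hedged C++ sketch of the vote-and-refine loop. bmaBestMAE() is a hypothetical stand-in for the per-block matching (the thesis provides no code), and we use a symmetric candidate range around the current optimum at every step, which slightly simplifies steps 3 to 6:

    #include <functional>
    #include <map>
    #include <vector>

    // bmaBestMAE(block, zoom) -> lowest MAE of the BMA for that block when
    // the reference is rescaled by (1 + zoom); stands in for the real matcher.
    using BmaFn = std::function<double(int block, double zoom)>;

    double bestGlobalZoom(const std::vector<int>& validBlocks, BmaFn bmaBestMAE,
                          double dfz, int nmax, int maxstep) {
        double fzOpt = 0.0;
        for (int step = 0; step < maxstep; ++step) {
            std::map<double, int> votes;              // zoom -> #blocks preferring it
            for (int b : validBlocks) {
                double best = 1e30, bestZoom = fzOpt;
                for (int n = -nmax; n <= nmax; ++n) { // candidate zooms around fzOpt
                    double zoom = fzOpt + n * dfz;
                    double mae = bmaBestMAE(b, zoom);
                    if (mae < best) { best = mae; bestZoom = zoom; }
                }
                ++votes[bestZoom];
            }
            int most = -1;
            for (const auto& kv : votes)              // zoom with highest occurrence
                if (kv.second > most) { most = kv.second; fzOpt = kv.first; }
            dfz /= 2.0;                               // refine the zoom step
        }
        return fzOpt;                                 // 1 + fzOpt is the zoom factor
    }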

3.1.2 Change Detection

The aim of the change detector is to give the position of local (or foreground) moving areas. Its result is a picture locating changes between the two successive pictures. The value of its pixels can be changed, unchanged or uncertain. The final mask is generated in four steps (a code sketch of the first two steps follows the list):

1. If the difference between the pixel value in the original image and the pixel value in the globally-compensated reference image is lower than Tu, the pixel is considered unchanged. If this difference is greater than Tc, the pixel is considered changed. Otherwise, the pixel is considered uncertain.

2. Every uncertain pixel having at least six unchanged neighbors becomes unchanged. The other uncertain pixels become changed.

3. A median filtering of size mf × mf is applied.

4. A pixel with six changed neighbors becomes changed. A pixel with six unchanged neighbors becomes unchanged.

Such a resulting mask is depicted on figure 3.1 (a).
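As announced above, the following minimal C++ sketch (our own types and names) implements the first two steps of this classification; the median filtering and the final votes of steps 3 and 4 would reuse the same neighborhood access pattern:

    #include <cstdlib>
    #include <vector>

    enum class Label { Unchanged, Uncertain, Changed };

    // Three-way thresholding (assumed Tu < Tc), then relaxation of the
    // uncertain pixels according to their 8-neighborhood.
    std::vector<Label> changeMask(const std::vector<int>& cur,
                                  const std::vector<int>& ref,
                                  int w, int h, int Tu, int Tc) {
        std::vector<Label> m(w * h);
        for (int i = 0; i < w * h; ++i) {
            int d = std::abs(cur[i] - ref[i]);
            m[i] = d < Tu ? Label::Unchanged
                 : d > Tc ? Label::Changed : Label::Uncertain;
        }
        std::vector<Label> out = m;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                if (m[y * w + x] != Label::Uncertain) continue;
                int unchanged = 0;                      // count unchanged neighbors
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        int nx = x + dx, ny = y + dy;
                        if ((dx || dy) && nx >= 0 && nx < w && ny >= 0 && ny < h &&
                            m[ny * w + nx] == Label::Unchanged)
                            ++unchanged;
                    }
                out[y * w + x] = unchanged >= 6 ? Label::Unchanged : Label::Changed;
            }
        return out;
    }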

Figure 3.1: Some steps of the Adaptive BMA: (a) Change mask (b) Local motion field (c) Compensated image


3.1.3 Local Motion Estimation

This part of the algorithm aims at estimating the foreground (or "local") motion between two successive images. Although the overall procedure is block-based, two substantial improvements over the classical block-matching algorithm (BMA, cf. Section 2.5.1) have been implemented. At first, on the basis of a coarse-to-fine quad-tree split procedure, a hierarchical (multiscale, cf. Section 2.3.2.1) structure of block sizes is considered. Second, at each level of the tree, a merge procedure is applied to correctly propagate the motion vectors from blocks with reliable motion to blocks with uncertain motion.

Thanks to its hierarchical structure, the ABMA performs a segmentation of the motion field much closer to the object boundaries. Both improvements help obtaining a compensated image with fewer blocking artifacts and a more consistent motion field. Figures 3.1 (b) and (c) illustrate these points, with regards to figure 2.16.

The various steps successively involved in the local estimation of the ABMA are described hereunder. The algorithm starts with a pre-defined initial size for the blocks and a pre-defined maximum number of iterations (or minimal block size). It is only processed for the blocks possessing at least c% of changed pixels in the change detection mask (described in Section 3.1.2). The steps are:

- Local Motion Determination computes a BMA for all the blocks of size (mi, ni) to be processed (i is the iteration number: i = 1, 2, ..., imax). Using a search window of size (s, s), it results, for every block labeled k, in an optimal displacement di(k), corresponding to a minimal mean absolute error MAEi,min(k).

- Estimated Motion Certitude (MC) is computed for every block simultaneously with the BMA determination (a small code sketch of this measure is given after this list). It is defined for a block k at iteration i as follows:

\[
MC_i(k) = \frac{\sum_{t=1}^{N_s}\bigl(MAE_i(k)(t) - MAE_{i,\min}(k)\bigr)}
               {N_s\,\bigl(1 + \varepsilon_i\,MAE_{i,\min}(k)\bigr)},
\tag{3.1}
\]

with Ns = s × s the number of tested values in the BMA and εi a tolerance parameter. If the motion certitude is not high enough (MCi(k) below a given threshold), the vector d'i(k) of a neighbor block may be propagated to block k; the propagation is achieved only if d'i(k) results in a MAE' ≤ (1 + εi)·MAEi,min(k). MAE' is the MAE obtained when the motion vector d'i(k) of the neighbor block is applied to the block undergoing treatment.

It is important to notice that, in order to get rid of the influence of the direction in which the image is traveled, all blocks are treated at once. It means that each block uses the vector values its neighbor blocks had before they were treated.

- Non-linear filtering is then applied in order to homogenize the motion field while preserving the contours. The configurations of figure 3.2 are searched for: the block undergoing filtering is the central one, and the displacement vectors of the neighbor blocks are compared to see if the shadowed blocks possess the same one. If one configuration occurs, the displacement vector di(k) is changed to the common value of the shadowed neighbor blocks. The new MAE must verify MAEnew ≤ (1 + εi)·MAEold to be accepted. The two configurations of figure 3.2 (a) have priority over the four configurations of figure 3.2 (b). Here also, all blocks are treated at once. If two configurations of (a) or (b) simultaneously apply, the one achieving the best MAE is selected.

- Further splitting of the blocks is performed if necessary. The criterion is that, if MAEi(k) is still too high, the block is split into four subblocks (unless the minimal block size has been reached).
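The motion certitude of equation (3.1), as we read it after reconstruction, reduces to a few lines of C++ (the tolerance parameter and the variable names are ours):

    #include <vector>

    // maes holds the Ns = s*s MAE values tested by the BMA for one block,
    // maeMin is the smallest of them, eps the tolerance parameter of (3.1).
    double motionCertitude(const std::vector<double>& maes,
                           double maeMin, double eps) {
        double sum = 0.0;
        for (double mae : maes) sum += mae - maeMin;   // spread around the minimum
        return sum / (maes.size() * (1.0 + eps * maeMin));
    }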


Figure 3.2: Possible non-linear filter operations in the ABMA (panels (a) and (b))

Even if the PSNR is not always improved thanks to the ABMA, it is important to notice that a segmentation of the motion field is provided via the multiscale representation. Moreover, the ABMA motion field can in many cases be coded more efficiently.

Sequence             BMA       ABMA
Akiyo                38.2886   38.1356
Container Ship       32.7014   32.8145
Hall monitor         32.3618   34.0889
Mother & daughter    36.2976   35.3678
Coast guard          27.3405   27.6874
Foreman              28.7040   27.8277
News                 31.4820   32.6105
Silent               32.1818   32.4255
Mobile               23.0172   24.1174
Stefan               22.9990   24.3879
Table tennis         30.1653   30.7733

Table 3.1: Comparison of BMA and ABMA performances (PSNR, in dB)

Figure 3.3 shows the comparative result between images 1 and 4 of the Table Tennis sequence.

Figure 3.3: Table Tennis, BMA vs ABMA: (a) Original image at time t-1 (b) Original image at time t (c) 8x8 BMA motion field (d) BMA compensation (e) ABMA motion field (f) ABMA compensation


3.2 Distributed Version of the Local Motion Estimation

The main objective of François Vermaut's dissertation [131, 132] was to speed up the computation time engendered by the ABMA. In this context, the envisioned solution is to distribute the algorithm among several processors. Only the local motion estimation part (cf. Section 3.1.3) is tackled, since this step is the most effort-demanding¹. Vermaut's work thereafter involved the following steps, which are respectively presented in subsections 3.2.1 to 3.2.4: it starts with an abstract specification of the local motion estimation part of the ABMA, which leads to a distributed model. Once one is in possession of such a model, which resembles a state automaton, one has to practically define the type of data structures and the state transitions of the various elements. Finally, an implementation under the Parallel Virtual Machine (PVM) environment demonstrates a linear speedup.

¹ The "refinement" in terms of blocks of the change detection mask has not been studied for distribution either, because this task has a very low computational burden.

3.2.1 Pseudo-Code of the Sequential Loop

In order to clarify the problem to be solved, a preliminary step was to formalize the algorithm to be distributed. The algorithm pseudo-code has thus been established first:

    Algorithm Local Motion Estimation
    begin
        sizeblock := sizemax;
        while (sizeblock >= sizemin) do
        begin
            for each (untreated block) BMA;
            for each (untreated block) MC Treatment;
            for each (untreated block) Filtering;
            if (sizeblock > sizemin) then
                for each (untreated block) Split Or Done;
            sizeblock := sizeblock / 2;
        end
    end

From this pseudo-code, it is possible to better understand how the sequential algorithm operates. First of all, it is obvious that the main "object" manipulated by this algorithm is the block entity, which has a "life" that evolves as the algorithm proceeds.

If one has a closer look at the sequential loop of the Local Motion Estimation, it appears that:

1. One resulting motion vector is known for all the blocks that are not to be treated (including the ones to which the change detection mask has associated a motion vector equal to zero).

2. The BMA computation is first performed on every block. It is a totally independent operation.

3. For the treatment of the motion certitude (MC Treatment), the BMA vector, the MAE and the motion certitude of all blocks to be treated have to be known. In fact, for a particular block, this information has to be known only for the block itself and its four neighbors.

4. So as to achieve the Filtering step, the result of the motion certitude treatment is waited for. Here also, a particular block could be "filtered" once the motion certitude results are known for the block and its eight neighbors.

5. All blocks to be treated are always at the same iteration level of the sequential loop. They all have the same size.

3.2.2 Model of Distribution

From what has been identified in the previous section, precise states of the block "life" can be distinguished. These states, presented on figure 3.4, are separated by transitions (computations) that can be carried out once precise conditions are satisfied. The local motion estimation can then be modeled as a finite state automaton [35].

From the pseudo-code of the sequential algorithm (cf. Section 3.2.1), four intermediate steps emerge out of the block "life": one after the BMA computation, one after the treatment of the motion certitude, one after the non-linear filtering, and a final stage when the block is assigned a definitive vector (Done) or is split into four subblocks (Split). However, not all these steps are real automaton states. The real states that have to be considered are those which prevent a block from being totally independent from the other ones, i.e. the states that are reached through a transition requiring precise conditions.

Figure 3.4: State diagram (states: Unknown, Known, Propagated, Done, Split)

In the initial state, all blocks to be treated are said to be "Unknown": no information is known about their possible motion vector. All the other blocks, which do not have to be treated², are in a state called "Done" for their treatment is over.

² Either because they have been assigned a null motion vector by the change detection mask of Section 3.1.2, or because they have already been successfully treated.

The blocks to be treated can quit their initial state. During the transition, the BMA is computed for the block. Once the BMA computation ends, the block state becomes "Known" because the BMA vector, MAE and motion certitude are determined.

The "Known" state serves as a synchronization point: a particular block in this state needs its four neighbors to be in the same state so as to exploit the motion certitude information. Since this step requires the propagation of information from neighbor blocks, it puts the block in "Propagated" mode once the treatment is actually achieved.

The "Propagated" state is also a synchronization anchor prior to the filtering. However, if one has a close look at the sequential loop (Section 3.2.1), one can notice that no intermediate state is needed after the filtering operation: no additional information is needed for the algorithm to decide whether the block must be split or not, and no synchronization with the neighbor blocks is needed. This step can therefore be carried out consecutively. After the "Propagated" state, the block directly ends its "life": either it reaches the "Done" state, where its final motion vector value is known, or it reaches the "Split" state, where the block does not exist anymore as it is divided into four new blocks.

The subblocks for which the inherited vector satisfies the final conditions directly reach the "Done" state. Otherwise, the subblocks are "Unknown" and will run the loop during a new iteration.
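These states translate naturally into code. The sketch below is our transcription of the life-cycle of figure 3.4, not Vermaut's implementation; it encodes the states and the first-iteration synchronization rule discussed above:

    #include <vector>

    enum class State { Unknown, Known, Propagated, Done, Split };

    struct Block {
        State state = State::Unknown;
        std::vector<int> neighbors4;   // indices of the four side neighbors
    };

    // A block may leave "Known" only when its four side neighbors have at
    // least reached "Known" too (the synchronization point described above);
    // the multi-size rules of Section 3.2.3.2 would refine this test.
    bool canPropagate(const std::vector<Block>& blocks, int b) {
        if (blocks[b].state != State::Known) return false;
        for (int n : blocks[b].neighbors4)
            if (blocks[n].state == State::Unknown) return false;
        return true;
    }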

Such a model gives the possibility to create a sequential version as well as different distributed versions that will still produce the same results as the sequential one. A simple distributed structure has been chosen: only the BMA, which is the most effort-demanding step, is distributed among several processors. This step can therefore be carried out for several blocks in parallel.

A classical master-and-slaves structure is then set up (figure 3.5), where the slaves only perform one action, i.e. BMA computation, while the master is used to distribute the different blocks, to receive the results of the computation of the slaves and to perform motion certitude treatment, filtering and decision. The difference with the sequential version is that the three last steps are not applied to all blocks at once but as soon as a block is ready. It implies that one knows what the precise conditions for a transition to occur are, which is described in the next section.

3.2.3 Practical Implementation

Slaves are allowed to achieve only one action, i.e. the BMA computation. The master provides them with the position and the size of the "Unknown" block they have to treat (cf. figure 3.5). The master then gathers the results, that is to say a vector, the associated MAE and the associated motion certitude.

The master also performs the motion certitude treatment, the filtering and the decision procedure as soon as possible. This idea allows the block to be an independent computation unit in constant relation with neighboring blocks.

Figure 3.5: Master-slaves structure (the master sends a block position and size to a slave; the slave returns the best vector, its MAE and the motion certitude)

3.2.3.1 Data Structures

It is first assumed that the master and all the slaves possess the two images at times t-1 and t. This is the only information slaves have access to. The master manages four data structures: the multigrid, which is the final result of the ABMA; a list with the "Unknown" blocks to be treated; a list with the available slaves; and one last structure of couples (slave_id, block_id) (one couple for every block undergoing BMA estimation).
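In a minimal form, these four structures could look as follows; the types and the dispatch helper are ours, since the thesis only names the structures:

    #include <deque>
    #include <utility>
    #include <vector>

    struct MotionVector { int dx, dy; };

    struct Master {
        std::vector<MotionVector> multigrid;        // final ABMA result, per block
        std::deque<int>           unknownBlocks;    // blocks still to be treated
        std::deque<int>           idleSlaves;       // slaves waiting for work
        std::vector<std::pair<int,int>> inFlight;   // (slave_id, block_id) couples

        // Hand the next "Unknown" block to the next idle slave, if any.
        bool dispatch() {
            if (unknownBlocks.empty() || idleSlaves.empty()) return false;
            inFlight.emplace_back(idleSlaves.front(), unknownBlocks.front());
            idleSlaves.pop_front();
            unknownBlocks.pop_front();
            return true;   // under PVM this would be followed by a message send
        }
    };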

3.2.3.2 State Transitions

To describe the state transitions, i.e. the conditions that enable the related computation to be performed, an exemplary block "life" is presented. As soon as a slave is available, a block in the "Unknown" state is sent to that slave for BMA completion. When the slave returns the results, the block goes into the "Known" state.

The process becomes more complex when the master wants to treat the motion certitude (figure 3.6): not all vectors for all multigrid blocks are needed, but only the vectors of four specific neighbors. The transition can occur once these four blocks are also in the "Known" state. Some of the neighbor blocks can already have their certitude treated and be in the "Propagated" state. In order to obtain the same result as the sequential version (where all blocks are treated at once), it is the vector value before motion certitude treatment that must be used (and stored to this purpose).


Figure 3.6: Known to Propagated transition conditions: all four neighbors must already be Known

Figure 3.7: Known to Propagated transition conditions: larger blocks have to be split or definitively treated (Done)

However, these conditions are only sufficient during the first iteration of the loop. At this stage, all blocks have the same original size. During the next iterations, blocks of various sizes coexist. A block in the "Known" state can have as neighbors larger blocks in the "Done" or "Propagated" states, because the previous iteration for these neighbors has not completed yet. One has to wait for such "Propagated" blocks to fall into the "Done" or "Split" state (figure 3.7). In the latter case, the transition condition is determined by the appropriate subblock.

Similarly to what has just been presented, the non-linear filtering of a block (and the decision algorithm) can start as soon as its eight neighbors are either in the "Propagated" state, or in the "Done" or "Split" state. In every case, it is the vector value resulting from the "Propagated" state that is to be used. Once again, some neighbor blocks can have a larger size. They can only be neighbors via the diagonal, because horizontal and vertical neighbors had to be past the "Propagated" state during the previous step. One has to wait until such larger blocks fall into the "Done" or "Split" state.

So as to somehow complete this informal description of the transition conditions, some additional properties should be described. Figure 3.8 presents an impossible situation where a block has as neighbors bigger blocks in the "Known" or "Unknown" state. This case is impossible because the block undergoing treatment results from a bigger block that has been split after the motion certitude treatment and the filtering (which implies that all its neighbors were at least in the "Propagated" state).

Figure 3.8: Impossible situation

But there can be more than one difference of size between two neighbor blocks. If it is the case, the biggest blocks must absolutely be in the "Done" state, as illustrated on figure 3.9.

To achieve the practical implementation, Vermaut has developed all the necessary structures to manage the block data and to keep in memory the block position in the image, its size, its present state and related information, and a list of neighbors. Efficient ways of performing the wakening of a block have also been set up.


Figure 3.9: Two neighbors with more than one step of size between them (the biggest blocks are Done)

3.2.4 Experimental Results

Such a model of distributed ABMA has been implemented on a network of conventional workstations, using a PVM platform. Tests have been conducted on an Ethernet linking 10 SUN computers. The parameters were the number of slaves and the complexity of the test sequence in terms of motion information (cf. appendix A). Generally speaking, a linear speedup with an efficiency of 50% was observed. Figure 3.10 illustrates one of the benchmarks.

3.3 Conclusion

The present chapter aimed at briefly introducing a scheme that has been developed in order to improve the performances of one out of several motion estimation techniques. The ABMA does actually allow one to obtain a motion field that better matches the data than the one provided by a classical BMA.

However, this improvement of the result is only possible with an increase of the computational burden. It was then proposed to François Vermaut to tackle this problem and to try distributing the load among several computational units. An implementation of the established distributed model demonstrates the possibility to perform the ABMA in real-time using several processors.


Figure 3.10: Computation time (o) and speedup (+) for the "Table Tennis" sequence, plotted against the number of slaves (1 to 11)


Chapter 4

Image Pre-Processing for VLBR Video Coding

In the introduction of Chapter 1 to video coding, the particularity of very-low bitrate (VLBR) coding has been highlighted: VLBR compression schemes generally have to debase more information than just the irrelevant part of the signal; drastic debasement of the pictures has to be achieved in order to reach the requested rates.

In this context, the COMIS scheme (cf. Section 1.4.2) has chosen the following approach: instead of letting (the bitrate regulation part of) the coder automatically perform a strong quantization, the images are voluntarily simplified prior to coding. It is then expected that these simplified images will be easier to encode, as only their irrelevancies would have to be removed.

As the insertion of such a pre-processing within COMIS has not allowed the coder to surpass its usual performances, the present chapter intends to analyze whether VLBR transmissions could benefit or not from this treatment and thereafter achieve a better quality (in comparison to the original images, before pre-processing) at equivalent rates.

Section 4.1 introduces the hypothesis about the possible gain of pre-processing. Section 4.2 considers it in a rate-distortion framework and theoretically settles the conditions for improving the coder performances. The next section (Section 4.3) experimentally analyses the behavior and the actual improvement brought about by such a pre-processing when inserted in a precise very-low bitrate coder: H.263 [96]. Finally, the last section draws some conclusions.


4.1 Intuitive Rationale

The hypothesis that constitutes the core of this chapter is the following: it could be more pertinent to operate the motion estimation between two images that have been "simplified" in the same way, so as to raise their correlation. Thereafter, it is also expected that the resulting residues could be encoded at a lower cost, improving the whole bitrate-versus-quality ratio. An example will help clarify this hypothesis.

Because of its very finalized status, H.263 ([96], cf. Section 1.4.1) has been chosen to test the validity of the idea. Like many other standards, H.263 performs its motion estimation thanks to the Block-Matching Algorithm (BMA [50], cf. Section 2.5.1). Let us just recall that the BMA is a correlation-based method which assumes that changes between successive images result from a local translational motion. With Tziritas and Labit [127] (in their chapter 4), one may define the correlation¹ ρ between two successive pictures I(x, y, t-1) and I(x, y, t):

\[
\rho = \frac{E[\,I(x,y,t)\,I(x-u,\,y-v,\,t-1)\,]}{E[\,I^2(x,y,t)\,]},
\tag{4.1}
\]

where E denotes the mathematical expectation and (u, v) is the displacement vector between (blocks of) the two consecutive images. It has been proven that the efficiency of the BMA directly depends on this correlation coefficient.

¹ Of course, ρ only depends on the variable t; x and y are present in equation (4.1) to indicate that the expectation is computed over all pixels of the image.
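Numerically, equation (4.1) is straightforward to evaluate; the following C++ sketch (a direct reading of the formula, with border handling of our own choosing) computes ρ for a given displacement (u, v):

    #include <vector>

    double correlation(const std::vector<double>& It,   // image at time t
                       const std::vector<double>& It1,  // image at time t-1
                       int w, int h, int u, int v) {
        double num = 0.0, den = 0.0;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                int px = x - u, py = y - v;
                if (px < 0 || px >= w || py < 0 || py >= h) continue;
                num += It[y * w + x] * It1[py * w + px]; // E[I(t) I(t-1) displaced]
                den += It[y * w + x] * It[y * w + x];    // E[I^2(t)]
            }
        return den > 0.0 ? num / den : 0.0;
    }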

One may then use this coefficient as a first indication about the BMA performances in various coding configurations.

Figures 4.1 (a) and (b) depict two original images of the Akiyo sequence². Their correlation is computed after image (a) has been compensated via a 16×16 BMA at pel precision: its value is ρ = 0.998. The motion field detected by the BMA is presented on figure 4.1 (c): as only the head of the speaker moves (slightly), only three blocks receive a non-zero vector. The residual image (or DFD), i.e. the difference image between image (b) and the motion compensation of image (a) by motion field (c), is presented on figure 4.1 (d). It is characterized by a very low variance, equal to σ² = 9.25.

² Appendix A briefly introduces the various test sequences that are used.


Figure 4.1: Temporal correlation, first case: (a) and (b) Original Akiyo images #101 and #104, (c) Detected motion field between (a) and (b), (d) Residual image after motion compensation

But, in the coding process, H.263 performs its motion estimation (and compensation) between the already decoded version of the previous image and the new original one. At very low bitrates, the reference image is highly debased because of the limited channel capacity. Instead of possessing I(x-u, y-v, t-1), the motion estimation uses the reconstructed image Î(x-u, y-v, t-1), which is only a rough version of the original. The correlation factor with the next original image is thereafter reduced. Figure 4.2 (a) shows the coded version of figure 4.1 (a). The next image to be coded remains the same (figure 4.2 (b)). In this case, the correlation falls down to ρ = 0.984 and the resulting motion field is noisier (figure 4.2 (c) with respect to figure 4.1 (c)). The residual image is also much more important: its variance raises to σ² = 47.66.

Figure 4.2: Temporal correlation, second case: (a) Akiyo image #101 when H.263-coded at 10 kbit/s, 10 Hz, (b) Original Akiyo image #104, (c) Detected motion field between (a) and (b), (d) Residual image after motion compensation

We here consider that it could be more pertinent to operate the motion estimation between the decoded version of the previous image and a pre-processed version of the new one³.

³ Originally, coders would perform their motion estimation between two original images. Then, experiments demonstrated that it was interesting to use the decoded image as reference image. It is here suggested to go one step further and to "simplify" the (new) image to be estimated.

Figure 4.3: Temporal correlation, third case: (a) Akiyo image #101 when H.263-coded at 10 kbit/s, 10 Hz, (b) Akiyo image #104 intra-coded with H.263 (QP=11), (c) Detected motion field between (a) and (b), (d) Residual image after motion compensation


The underlying assumption is that very-low bitrate coding does not simply introduce additive white noise, but rather deteriorates the image in a way that is dependent on the image itself: the noise is correlated with the original signal. The goal of the pre-processing would thereafter be to "simplify" the new picture in a similar way. For instance, if one first intra-codes the new incoming image (figure 4.3 (b)) and tries to predict it from the previously (de)coded one (figure 4.3 (a)), the correlation raises back to ρ = 0.990, and the estimated motion field appears simpler to encode (figure 4.3 (c)). However, the complexity of the residual image has not been reduced: its variance is equal to σ² = 50.66. The whole coding process will only benefit from the pre-processing if the gain of the motion field coding is more important than the possible loss of the residual coding.

It has to be pointed out that, although the H.263 residual coding that leads to the image of figure 4.3 (a) is achieved with a quantization step of 16, the new incoming image (b) is pre-processed as an intra image with a quantization step of only 11: it seems logical to have a pre-processing that enables quality improvement whenever the bitrate allows it.

Be<strong>for</strong>e establishing a theoretical framework to <strong>for</strong>mally set the problem,<br />

it is important to stress the di erence between the proposed preprocessing<br />

<strong>and</strong> other processings which also aim at reducing the <strong>bitrate</strong>.<br />

These are:<br />

Image (temporal <strong>and</strong>/or spatial) downsampling. Typically, H.263<br />

encodes QCIF sequences at 8:33 Hz instead of full screen images<br />

at 25 Hz. The proposed pre-processing is applied in addition to<br />

that type of image simpli cation.<br />

Selective ltering of the residual in<strong>for</strong>mation (it can consist in a<br />

mere threshold) according to the <strong>motion</strong> pertinence [43] ortothe<br />

relevance of the residues [90]. Our initial aim is to directly simplify<br />

the whole image without making any selection.<br />

Low-pass ltering applied to the compensated image so as to smooth<br />

it prior to the DFD computation, like the loop lter of H.261 [95].<br />

Here, ltering (pre-processing) is directly applied to the original<br />

image. It there<strong>for</strong>e acts upon both the <strong>motion</strong> eld <strong>and</strong> the residual<br />

image.


4.2 Rate Distortion Conditions

In its presentation of the Rate-Distortion theory [6] (a summary of which may be found in appendix B), Berger points out that "[...] in systems designed to transmit pictures and telemetry data - namely after a complex encoding and decoding technique is developed at considerable expense in order to transmit data over a channel with high reliability at rates approaching capacity - it is not clear just what information should be sent." This assertion reinforces the idea to pre-treat the images prior to inter-coding: it seems irrelevant to predict data (i.e. the finest details of the images) that have already disappeared in the previously coded image which serves as a reference.

The present section aims at analyzing this hypothesis with the help of the Rate-Distortion theory. Once an analytical model of images is chosen (Section 4.2.1), the specificity of very-low bitrate coding is highlighted (Section 4.2.2). The various modes of transmission are then compared and the conditions for a possible gain thanks to the pre-processing are derived.

4.2.1 Image Model

With O'Neal and Natarajan [98], one will suppose that images are zero-mean, without any loss of generality. Images are described by a Gaussian (isotropic) model of spatial covariance. The covariance of the image at time t-1, I_{t-1} = I(x, y, t-1), is

\[
\Gamma_{I_{t-1}} = E[\,I(x',y',t-1)\,I(x'-x,\,y'-y,\,t-1)\,]
               = \sigma_I^2\, e^{-\alpha\sqrt{x^2+y^2}},
\tag{4.2}
\]

where E denotes the mathematical expectation and σ_I² is the image variance. One obtains the spectral density of the image as a function of the spatial frequencies ω_x and ω_y:

\[
\Phi_{I_{t-1}}(\omega_x,\omega_y)
  = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty}
    \Gamma_{I_{t-1}}\, e^{-j\omega_x x}\, e^{-j\omega_y y}\, dx\, dy
  = \frac{2\pi\,\omega_0\,\sigma_I^2}{(\omega_0^2+\omega^2)^{3/2}},
\tag{4.3}
\]

where ω = √(ω_x² + ω_y²) and ω_0 = α.

Tziritas and Labit [127] consider digital (sampled) pictures as band-limited random fields. The use of an ideal, rotation-invariant low-pass filter allows one to use Φ_{I_{t-1}} as defined by equation (4.3) only when ω ≤ π√2 (ω is a normalized frequency); Φ_{I_{t-1}} is equal to zero otherwise.

Figure 4.4: Model of video sequence: image at time t is a displaced version of the previous one I_{t-1}, with some additional noise N(t)

The next image in the video sequence can be modeled according to figure 4.4: it results from a displacement of image I_{t-1} and some additive white noise which represents the illumination changes, etc. The motion effect is defined as a Dirac function δ(x-u, y-v), and the spectral density of the image at time t is

\[
\Phi_{I_t}(\omega_x,\omega_y) = \Phi_{I_{t-1}}(\omega_x,\omega_y) + \sigma_N^2,
\tag{4.4}
\]

where σ_N² is the variance of the white noise. The direct correlation between successive pictures is defined as

\[
\rho_0 = \frac{E[\,I(x,y,t)\,I(x,y,t-1)\,]}{E[\,I^2(x,y,t)\,]}
\tag{4.5}
\]

and is linked to the motion-compensated correlation of equation (4.1) by

\[
\rho_0 = \rho\, e^{-\omega_0\sqrt{u^2+v^2}}.
\tag{4.6}
\]

Of course, ρ_0 ≤ ρ.

4.2.2 Intra Coding of Images

One can compute the Rate-Distortion function of intra-coding I_{t-1}. It can be achieved by using polar coordinates [47]. A simpler expression is obtained when the restrictive hypothesis of memoryless coding is applied [6]. The rate R necessary to encode the image I_{t-1} with a distortion D is then:

\[
R = \begin{cases}
\dfrac{1}{2}\log_2\dfrac{\sigma_I^2}{D} & \text{if } D \le \sigma_I^2,\\[4pt]
0 & \text{otherwise.}
\end{cases}
\tag{4.7}
\]
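For instance, allowing a distortion of one quarter of the image variance (σ_I²/D = 4) yields R = ½·log₂ 4 = 1 bit per sample; halving the allowed distortion again adds half a bit per sample.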


Such an equation is used [127] in order to define theoretical limits that allow a coder to automatically decide when inter-coding should be used instead of intra-coding. However, what is needed here is to derive a model of images after intra-coding, so as to determine whether it is useful or not to pre-treat the images prior to motion estimation.

Intra-coding is modeled in a twofold way:

- At first, coding achieves some selection of the information to transmit. Schemes like H.263 [96] or MPEG-4 [42] have chosen to transmit as many low frequencies as possible in order to first reconstruct a rough approximation of the image. Higher frequencies (i.e. the image details) are only transmitted afterwards, if the bitrate allows it. One may thus consider the reconstructed image as a low-pass version of the original one: very-low bitrate coding uses a relatively strong low-pass filter, while this effect is almost null at high bitrates.

- Secondly, some coding noise is added to the image. It is considered to be white. It is the unique effect of coding present at high bitrates, but it exists at any bitrate.

The spectral density of an intra-coded image Ĩ_{t-1} is thus:

\[
\Phi_{\tilde I_{t-1}}(\omega_x,\omega_y) =
\begin{cases}
\Phi_{I_{t-1}}(\omega_x,\omega_y) + \sigma_C^2 & \text{if } \omega \le \omega_{cut},\\
\sigma_C^2 & \text{if } \omega_{cut} < \omega \le \pi\sqrt{2},
\end{cases}
\tag{4.8}
\]

where σ_C² is the variance of the coding noise and ω_cut the cutoff frequency of the low-pass effect.


4.2.3 Inter Coding without Motion Compensation

Figure 4.5: Inter-coding of I_t without motion compensation: (a) usual scheme versus (b) pre-processing scheme (in both cases the error signal E_t is the difference between the incoming image and the intra-coded reference Ĩ_{t-1}; in (b) the incoming image is pre-treated first)

4.2.3.1 Usual Scheme

The spectral density of the prediction error to be encoded is

\[
\Phi_E = \Phi_{\tilde I_{t-1}} + \Phi_{I_t} - 2\,\Phi_{\tilde I_{t-1} I_t}
= \begin{cases}
2\,(1-\rho_0)\,\Phi_{I_{t-1}} + \sigma_N^2 + \sigma_C^2 & \text{if } \omega \le \omega_{cut},\\
\Phi_{I_{t-1}} + \sigma_N^2 + \sigma_C^2 & \text{if } \omega_{cut} < \omega \le \pi\sqrt{2}.
\end{cases}
\tag{4.9}
\]

Integrating this density over the frequency band yields the variance σ_E² of the error signal (equation (4.10)); the two transmission modes are then compared for error signals encoded with the same distortion D (constant quality scheme).

4.2.3.2 Proposed Scheme

The scheme depicted on figure 4.5 (b) assumes that the new image I_t has been pre-processed. So as to simplify the new picture in a way that is similar to the intra-coding of the reference frame, the pre-processing consists in applying a low-pass filter (of cutoff ω'_cut) to the new image. As this pre-processing may not be perfect, it also produces some noise C'. Of course, ω_cut ≤ ω'_cut (cf. Section 4.1), and the noise C' is in no way correlated with the noise C that affects Ĩ_{t-1}. The residual information of the pre-processed signal is characterized by:

\[
\sigma'^2_E \simeq \frac{\omega_{cut}^2}{4}\Bigl[\,2(1-\rho_0)\,\sigma_I^2 + \sigma_N^2 + \sigma_C^2 + \sigma_{C'}^2\,\Bigr]
+ \frac{\omega'^2_{cut}-\omega_{cut}^2}{4}\Bigl[\,\sigma_I^2 + \sigma_N^2 + \sigma_C^2 + \sigma_{C'}^2\,\Bigr]
+ \frac{4-\omega'^2_{cut}}{4}\Bigl[\,\sigma_C^2 + \sigma_{C'}^2\,\Bigr].
\tag{4.11}
\]

From the comparison between equations (4.10) and (4.11), one can conclude that the proposed method is more effective if:

\[
\sigma_{C'}^2 < \frac{4-\omega'^2_{cut}}{4}\,\bigl(\sigma_I^2+\sigma_N^2\bigr).
\tag{4.12}
\]

One may argue that this comparison is valid only if one considers the error signals E and E' to be encoded with the same distortion, which should not be the case. Actually, in the proposed scheme, the error signal is expected to be encoded with nearly no errors, as it helps reconstructing an already debased picture. This would result in a rate approaching infinity! Nevertheless, as ω'_cut is chosen larger than ω_cut, E' can still be distorted to maintain a constant quality.

Considering equation (4.12), it appears that some gain can be expected only if the low-pass effect of pre-processing is predominant with regard to the additive noise. In this case, the reduction of the error signal is logical, as the spectral density of both signals has been limited.

4.2.4 Inter Coding with Motion Compensation

Inter-coding with motion compensation exploits motion estimation in order to obtain a prediction of the new image and to reduce the error signal. Nevertheless, motion estimation rarely succeeds in detecting the real motion (u, v) and guesses it with an error Δd = (Δu, Δv). The motion compensation box of figure 4.6 is therefore modeled by a Dirac function δ(x - u + Δu, y - v + Δv).

Figure 4.6: Inter-coding with motion compensation

According to Tziritas and Labit [127], one can consider the estimation error Δd to be centered, isotropic and Gaussian. Its characteristic function is then:

\[
\Phi_{\Delta d}(\omega_x,\omega_y) = \frac{1}{\bigl((\sigma_d\,\omega/2)^2+1\bigr)^{3/2}},
\tag{4.13}
\]

with σ_d the variance of this error. The mutual spectral density of two consecutive images is (note the use of ρ instead of ρ_0 in order to take the motion into account):

\[
\Phi_{\tilde I_{t-1} I_t} = \frac{\rho}{\bigl(1+\sigma_d\,\omega_0/2\bigr)^{2}}\;\Phi_{I_{t-1}}.
\tag{4.14}
\]

If one considers that the proposed pre-processing has no influence on the motion estimation precision, the condition of equation (4.12) is still valid. However, it is expected that the pre-processing enables the motion estimation to perform better, which results in an error variance σ'_d < σ_d. The proposed scheme is then more interesting once:

\[
\sigma_{C'}^2 < \frac{4-\omega'^2_{cut}}{4}\bigl[\sigma_I^2+\sigma_N^2\bigr]
+ \frac{\omega_{cut}^2}{2}\;\sigma_I^2\,\rho\,\omega_0\,(\sigma_d-\sigma'_d)\;
\frac{1+\frac{\omega_0}{4}(\sigma'_d+\sigma_d)}
     {\bigl(1+\frac{\sigma_d\,\omega_0}{2}\bigr)^2\bigl(1+\frac{\sigma'_d\,\omega_0}{2}\bigr)^2}.
\tag{4.15}
\]

4.2.5 Theoretical Conclusion

From all these rate-distortion equations, three major deductions, which will need to be validated through experiments, can be made. They are:


- Pre-processing (low-pass filtering) of the new image decreases the rate required to encode it.

- Under the assumption that pre-processing improves the precision of the motion estimation, the overall gain of pre-processing will be reinforced.

- The theoretical model announces an improvement of the rate-distortion ratio. This means that, at an equivalent rate, the quality of the encoded pre-processed sequence with respect to the original pre-processed sequence will be higher than the quality of the encoded original one with respect to the original sequence. This does not ensure that the quality of the encoded pre-processed sequence surpasses the quality of the encoded original one, both with respect to the original sequence.

4.3 Experimental Results

In order to experimentally (in)validate the present hypothesis and the relevance of equation (4.15), a preliminary experiment consisted in computing the correlation $\rho$, the variance $\sigma^2_E$ of the residual error, and the variances of the motion field components ($\sigma^2_u$ and $\sigma^2_v$). H.263^4 [96] has been used. Table 4.1 presents the results on several MPEG-4 test sequences.

^4 Thanks to the software tmn 2.0 provided by Telenor at http://www.nta.no/brukere/DVC/

The test conditions are the following: every sequence has been coded at variable bitrate with a constant quantization step of 10. This results in a wide variety of bitrates according to the sequence content: starting from 18 kbit/s for Akiyo or 22 kbit/s for Mother & Daughter, up to 235 kbit/s for Stefan. For every image to be inter-coded, two schemes are compared: the first line of table 4.1 (for a given sequence) presents the correlation and the variances when motion estimation & compensation is performed between the new original image and the reference coded one. The second line presents the same results when all the original images have first been intra-treated with a quantization factor of 5.

The results of table 4.1 are broadly in agreement with equation (4.15): the correlation rises thanks to the pre-processing and, in most cases, the energy of the residual error decreases.



Sequence            | ρ        | σ_E       | σ_u       | σ_v
Akiyo               | 0.985642 | 6.158601  | 2.760048  | 3.000152
  pre-processed     | 0.987657 | 6.030156  | 0.938596  | 1.278907
Container ship      | 0.969350 | 8.288046  | 9.128978  | 2.802829
  pre-processed     | 0.972665 | 8.045071  | 9.664501  | 3.490324
Hall monitor        | 0.967627 | 9.043882  | 3.149510  | 3.888562
  pre-processed     | 0.970485 | 8.873044  | 3.612144  | 3.903613
Mother & daughter   | 0.967063 | 6.634926  | 3.421437  | 5.444000
  pre-processed     | 0.969207 | 6.386776  | 3.909316  | 5.994232
Coast guard         | 0.939101 | 12.353096 | 6.217164  | 3.344785
  pre-processed     | 0.940386 | 12.219138 | 6.736460  | 3.714005
Foreman             | 0.944067 | 12.883052 | 8.774803  | 7.610945
  pre-processed     | 0.944958 | 12.811791 | 9.066337  | 7.844493
News                | 0.964210 | 10.356456 | 6.839849  | 5.447062
  pre-processed     | 0.966480 | 10.236923 | 6.309831  | 1.970497
Silent              | 0.964399 | 10.359665 | 4.286384  | 4.248433
  pre-processed     | 0.967709 | 10.071909 | 4.306487  | 4.285518
Mobile              | 0.902015 | 19.373144 | 1.879478  | 0.847039
  pre-processed     | 0.901617 | 19.488099 | 1.879190  | 0.844782
Stefan              | 0.781131 | 20.427418 | 12.814752 | 5.307830
  pre-processed     | 0.781557 | 20.438889 | 12.876472 | 5.701028
Table tennis        | 0.914535 | 11.021105 | 7.547181  | 7.476545
  pre-processed     | 0.919952 | 10.767871 | 7.911386  | 8.445316

Table 4.1: Correlation and variances with or without pre-processing

On the other hand, one cannot really maintain that the motion field appears to be more coherent: for almost all sequences, the motion variance rises, which indicates a sparser (and probably more difficult to encode) motion field. Only Akiyo, News and Mobile (the latter with an increase of the residual error) present a "better" motion field. As has been theoretically demonstrated, the reduction of $\sigma_E$ is probably the result of the low-pass filtering implicitly applied by the intra coding.

The second experiment then consisted in applying the pre-processing inside the coding loop, so as to measure the real impact of this operation.

[Figure 4.7: Results of the various pre-processings on "Akiyo": mean PSNR versus bitrate (kbit/s) for normal coding, Gaussian 0.5/0.9/1.2, median 3/5/7, morphological 3/5/7 and intra Q10/Q20/Q30 pre-processing.]

As the intra pre-processing (with a quantization step of 10, 20 or 30) is probably not the one that achieves the best compromise between low-pass filtering and additive noise, three other types of pre-processing have also been tested (a code sketch of the three filters follows the list):

- Gaussian filtering with a standard deviation of 0.5, 0.9 or 1.2 (which results in separable filters of size 3, 5 or 7).
- Median filtering^5 with a square window of size 3, 5 or 7.
- Morphological filtering: open-close with reconstruction^6, with a square structuring element of size 3, 5 or 7.

^5 "Median filtering consists of a sliding window encompassing an odd number of pixels. The center pixel in the window is replaced by the median of the pixels within the window. The median of a discrete sequence $a_1, a_2, \ldots, a_N$ for $N$ odd is that member of the sequence for which $(N-1)/2$ elements are smaller or equal in value, and $(N-1)/2$ elements are larger or equal in value." (out of [105], p. 330)

^6 For the definitions of morphological operators like open and close, please refer to [45, 118].
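As an illustration, the following Python sketch applies the three families of filters with scipy.ndimage, assuming 8-bit grayscale frames stored as NumPy arrays. The open-close with reconstruction of the third item is approximated here by a plain open-close: the geodesic reconstruction step is omitted for brevity, so this is a simplified stand-in for the filter actually used in the experiments.

```python
import numpy as np
from scipy import ndimage

def preprocess(image, method="gaussian", size=3, sigma=0.5):
    """Apply one of the three tested pre-processing filters.

    image  -- 2-D uint8 array (one luminance plane)
    method -- "gaussian", "median" or "morpho"
    """
    if method == "gaussian":
        # Separable Gaussian low-pass filter; sigma in {0.5, 0.9, 1.2}
        # roughly corresponds to support sizes 3, 5 and 7.
        return ndimage.gaussian_filter(image, sigma=sigma)
    if method == "median":
        # Median filter with a square window of odd size (3, 5 or 7).
        return ndimage.median_filter(image, size=size)
    if method == "morpho":
        # Open-close with a square structuring element; the geodesic
        # reconstruction used in the thesis is omitted in this sketch.
        opened = ndimage.grey_opening(image, size=(size, size))
        return ndimage.grey_closing(opened, size=(size, size))
    raise ValueError("unknown method: " + method)
```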

After pre-processing of all images, every sequence has been H.263-coded with a quantization step of 10, 20 and 30. Figure 4.7 presents the mean Peak Signal-to-Noise Ratios (PSNR^7) versus the bitrates for the Akiyo sequence. The curves are similar for all the other sequences.

Size of the filter | 3     | 5     | 7
Gaussian           | 37.60 | 29.71 | 27.86
Median             | 30.66 | 26.87 | 25.72
Morpho             | 29.73 | 28.66 | 27.32

Table 4.2: PSNR of "Akiyo" pre-processed sequences

What directly emerges from these diagrams is that pre-processing is never able to surpass the performance of the "normal" (original) coding. This is foreseeable, since the pre-processed sequences, prior to any coding, already have a very low PSNR when compared to the original sequence (cf. table 4.2). However, such a conclusion would be too simple. In order to better understand the effect of pre-processing, table 4.3 introduces some details regarding the pre-processed coding of one exemplary sequence: Akiyo. PSNR′ names the Peak SNR with respect to the equivalent pre-processed sequence prior to coding, while PSNR designates the Peak SNR with respect to the original sequence (without any pre-processing). The column entitled "Vector" presents the average amount of bits devoted to the coding of the motion information.

Pre-processing | PSNR′ | PSNR  | Vector | Bitrate (kbit/s)
Original       | 27.88 | 27.88 | 57     | 4.16
Intra Q10      | 28.48 | 27.89 | 57     | 4.23
Gaussian 0.5   | 29.32 | 27.12 | 55     | 3.65
Median 3       | 30.16 | 26.28 | 54     | 3.40
Morpho 3       | 30.39 | 25.67 | 49     | 3.20

Table 4.3: Coding of "Akiyo" with various pre-processings

^7 The Peak SNR (PSNR) between two images $I$ and $I'$ is defined as $PSNR = 10 \log_{10}(255^2 / \sigma_D^2)$, where $\sigma_D^2$ is the energy of the difference image $D = I - I'$.
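In code, the footnote's definition translates directly; a minimal NumPy sketch, assuming two 8-bit images of identical size:

```python
import numpy as np

def psnr(image, reference):
    """Peak SNR = 10*log10(255^2 / var(D)), D = image - reference."""
    diff = image.astype(np.float64) - reference.astype(np.float64)
    energy = np.mean(diff ** 2)       # sigma_D^2, energy of the difference
    if energy == 0:
        return float("inf")           # identical images
    return 10.0 * np.log10(255.0 ** 2 / energy)
```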



From this table, it appears that:

- As was expected from equation (4.15), PSNR′ rises. On the other hand, PSNR decreases. The constant quantization step used here accounts for these effects.
- One may notice a crescendo in the simplification brought about by the various filters (from the intra pre-processing to the morphological filtering): each time, PSNR′ is improved whereas the bitrate is lowered. This tends to prove that the sequence to encode becomes simpler. This is illustrated in figures 4.8 and 4.10, which present the result of the various pre-processings (all filters with a size of 3) for Coast Guard and Akiyo respectively.
- It is interesting to notice that the amount of bits necessary to encode the motion information is reduced according to the degree of simplification. However, it is difficult to assert that this directly results from the simplification itself (which would be in contradiction with the experiment of table 4.1). This reduction is most probably due to the global cost optimization of H.263, which checks whether it is more appropriate to send some motion vectors with residues or to directly encode a particular block with the DCT. Simplified images of course have a very low cost when DCT-coded.

It thus appears obvious from figure 4.7 that, at an equivalent bitrate, the pre-processing never achieves a better-performing coding in terms of PSNR. One may then argue that the PSNR is an objective measure which, at very low bitrates, is quite far from the real subjective impression of the viewer^8.

Figures 4.9 and 4.11 present the various pre-processed images of figures 4.8 and 4.10 after coding with a quantization step of 30^9. The visual inspection confirms the verdict of the PSNR curves: the images are far too debased, and the gain in terms of bitrate is not worth the quality loss.

^8 Research on an "objective criterion of subjective quality", for instance, tries to better match the subjectivity of the viewer by using models based on the Human Visual System (HVS), like a decomposition into perceptual channels [16] or an extension of these channels to the temporal component [129].

^9 Unfortunately, we had to perform our H.263 simulations with a fixed quantization step. It would have been more appropriate for subjective testing to present the various types of encoding at an equivalent bitrate. The fact is that H.263, so as to perform its bitrate regulation, changes both the quantization step and the number of skipped frames. It would be meaningless to compare images from two sequences coded at 5 kbit/s if one contains 100 coded frames and the other 95.



Sequence     | Original Q=30 | Selective Q=30 | Original Q=20 | Selective Q=20
Akiyo        | 4.16          | 4.10           | 6.79          | 6.65
Coast Guard  | 20.55         | 16.89          | 36.45         | 26.27

Table 4.4: Comparative bitrate (kbit/s) of selective pre-processing of the background

However, this assertion should once more be nuanced. Indeed, the quality is unsatisfactory for the crucial parts of the images, like the speaker, but is, in most cases, sufficient as far as the background is concerned. The idea is then to introduce a selective simplification where appropriate. For instance, figures 4.12 and 4.13 present a selective pre-processing of the images along with their encoded versions. In both cases, an open-close morphological filtering with reconstruction and a square structuring element of size 7 has been applied to the background (only the water in Coast Guard). Not only is the quality subjectively equivalent, but the bitrate is also drastically reduced in case of global motion (background movement). Table 4.4 demonstrates it for Coast Guard: thanks to the simplification of the background, the bitrate can be devoted to encoding the main objects of the scene with a better quality.
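A minimal sketch of such a selective simplification, assuming a binary background mask is available (hand-made, as for the figures): the simplifying filter is applied to the whole frame and only the background pixels are replaced. As in the earlier sketch, the geodesic reconstruction step of the open-close is omitted.

```python
import numpy as np
from scipy import ndimage

def selective_preprocess(image, background_mask, size=7):
    """Simplify only the background of a frame.

    image           -- 2-D uint8 luminance plane
    background_mask -- boolean array, True where the pixel belongs to
                       the (subjectively less important) background
    """
    # Open-close with a square structuring element of size 7, as used
    # for the water region of "Coast Guard" (reconstruction omitted).
    simplified = ndimage.grey_closing(
        ndimage.grey_opening(image, size=(size, size)), size=(size, size))
    out = image.copy()
    out[background_mask] = simplified[background_mask]
    return out
```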

This brings us to the concept of the bitrate regulation scheme [19] of COMIS (cf. Section 1.4.2), which gives priority to the regions that are considered subjectively more important. This implies that the coder can automatically segment the images into objects, track these segments along time^10 and decide which segments are relevant. Such analysis tools still need improvements so as to reinforce their precision and robustness: preliminary results of selective pre-processing with automatic segmentation and tracking tools demonstrate a lot of instability, which is visually very annoying.

^10 For the examples of figures 4.12 and 4.13, hand-made segmentations have been used.



[Figure 4.8: "Coast Guard" sequence: effect of the various pre-processings (original, median, intra, Gaussian, morphological).]



[Figure 4.9: "Coast Guard" sequence: result of coding after pre-processing (Q=30 for all images); panels: original, median, intra, Gaussian, morphological, with bitrates of 20.55, 35.76, 62.85, 34.40 and 34.10 kbit/s.]



[Figure 4.10: "Akiyo" sequence: effect of the various pre-processings (original, median, intra, Gaussian, morphological).]



[Figure 4.11: "Akiyo" sequence: result of coding after pre-processing (Q=30 for all images); panels: original, median, intra, Gaussian, morphological, with bitrates of 4.16, 9.09, 12.68, 9.33 and 9.92 kbit/s.]



[Figure 4.12: Result of coding with selective pre-processing for the "Coast Guard" sequence: original, original Q=30 (20.55 kbit/s), selective, selective Q=30 (16.89 kbit/s), selective Q=20 (26.25 kbit/s).]



[Figure 4.13: Result of coding with selective pre-processing for the "Akiyo" sequence: original, original Q=30 (4.16 kbit/s), selective, selective Q=30 (4.10 kbit/s), selective Q=20 (6.65 kbit/s).]



4.4 Conclusion

The hypothesis that, at VLBR, image pre-processing prior to inter-coding can provide some coding gain has first been formally described in a rate-distortion framework: based on a Gaussian model of images, VLBR transmission has been considered to apply low-pass filtering to the images and to add white noise. The theoretical result of the pre-processing is then a reduction of the residual error. This error is demonstrated to be additionally reduced if the pre-processing can be considered as improving the pertinence of the motion field.

However, experiments prove that this is not exactly the case and that the simplification does not always allow the motion estimation to be more precise. Moreover, the bitrate gain offered by some pre-processing is not worth the resulting loss of quality, except for subjectively less important regions like the scene background. This proves that a coder like H.263 is very well optimized for coding low resolution images (QCIF) at (very) low bitrates; alternatively, the QCIF format can be viewed as a specific pre-processing well-matched to the H.263 coder. A pre-processing on higher resolution pictures for other coders running at unusually low bitrates (for instance, MPEG-1 at 100 kbit/s) should provide more obvious gains.

In case of global camera motion, it can be relevant to automatically segment and track the image background so as to simplify it: several kbit/s can be won without lowering the overall subjective quality of the sequence. According to the target application, one could even go further and decide to voluntarily suppress all camera motion that is not relevant to the scene interpretation, or to only transmit the motion parameters in relevant areas.

As far as the present theoretical and experimental frameworks are concerned, they could be improved by:

- Defining a theoretical link between the variance $\sigma^2_I$ of an image and the variance of its coded version. This would allow a closer analysis of the implications of the rate-distortion equations. These equations could also be re-developed using other image models than the one used here (see for instance [14]).
- Designing pre-processing filters which take the need for temporal coherency along the sequence into account, and which are adapted on-line to the visual degradations present on the reference image.


Chapter 5

Mesh-Based Motion Compensation

In Chapter 2, two techniques for motion estimation have been extensively presented: the Block Matching Algorithm (BMA [50], cf. Section 2.5) and warping techniques (cf. Section 2.6).

While the BMA is the technique mostly used in standard codecs, its compensation stage provides a predicted image that suffers from so-called blocking artifacts. In order to reduce such artifacts, overlapped block motion compensation has been proposed [116, 4] (cf. Section 2.5.2).

The triangular mesh of Nakaya and Harashima [88] (cf. Section 2.6.1) performs a better compensation stage, because the triangles implicitly use the affine transform (which is able, with its six free parameters, to tackle rotations and zoom effects in addition to the translation). Although it can perform in real-time on specific hardware, such a warping solution has a computational burden that largely exceeds that of the BMA, mainly because of its iterative nature. Dudon extended this work to active meshes that are automatically adapted to the spatial contents of the image [29] (cf. Section 2.6.2) and also established other ways of estimating the motion of the vertices (mesh nodes) [28]. Wang and Lee propose to perform a global optimization [136], while Altunbasak and Tekalp focus on how to optimally represent a dense motion field [3].

In addition to the computational burden, a major disadvantage of performing motion estimation with meshes (triangular or quadrilateral in nature) is the transmission cost of the motion parameters, which remains higher than the cost of transmitting the BMA motion field.



The adaptation of the vertices' location to the spatial contents increases the relevance of the estimated motion field through an a priori liaison between the spatial and the temporal information. Another advantage of such adaptive mesh structures is the huge domain of applications and added functionalities they cover: some examples are 3-D modeling [10, 61] or the link with fractal models for intra-coding [18], but also video editing effects like synthetic object transfiguration [122], augmented reality and other functionalities directly related to the Synthetic and Natural Hybrid Coding [26] of MPEG-4 (see e.g. the Core Experiment M2 [121]).

The challenging idea of the present chapter is to combine the advantages of both previously described techniques: the classical BMA and the implicit affine motion model developed by the triangular mesh (which is in fact a wireframe). The principle is as follows: nothing is changed as far as the motion estimation is concerned; a classical BMA is applied and the resulting motion vectors are transmitted as such. The aim of this chapter is to research how to modify the compensation (reconstruction) stage: the motion vectors are not merely used to displace blocks of the reference frame, but rather serve as an information set to warp a mesh that has been built on this reference image. Such a combination seems significant in several ways. First, it enables heightening the subjective quality of the reconstructed pictures by taking spatial information into account (adaptive vertices location) and by allowing a richer motion field parameterization (affine transform)^1. This is performed with no increase either in the computational burden of the estimation or in the transmission cost. Second, and most important, it improves the representation stage and allows for the use of new functionalities without losing compatibility with existing standards, since the bitstream is not modified^2.

This idea of an asymmetric scheme has already been touched upon by Li et al. in their review of video coding and motion estimation techniques [63]: "it seems that it is wise to use wireframe models for image synthesis but not for image analysis". This chapter aims at implementing such an asymmetric scheme.

^1 Although it is expected to improve the subjective quality, it is in no way expected to raise the PSNR (i.e. the objective quality) of the reconstructed pictures. As the BMA is a correlation-based technique which optimizes its estimation for a block-based reconstruction, another type of reconstruction can of course not be expected to achieve a better correlation result.

^2 This issue will be discussed in the conclusion of the chapter, Section 5.5.



Several aspects of image processing will be dealt with here: filtering, used to select the image features serving as vertices of a content-based mesh; interpolation, necessary to transpose the block-based motion information to the mesh representation; and basic computer graphics, involved in the warping of the mesh.

In order to indicate the objectives of the chapter, Section 5.1 uses an example to highlight the key features and specifications of the proposed reconstruction method. Later sections focus on details of the scheme: corner extraction, required to automatically generate the mesh, and the inverse kriging operation, which computes the motion vectors of every vertex of the mesh, are presented in sections 5.2 and 5.3, respectively. Finally, Section 5.4 applies the scheme to several images so as to comment on the results of the proposed scheme. Some conclusions are drawn in the last section, 5.5.

5.1 Estimation, Transcription and Reconstruction

The aim of this section is to introduce the asymmetric scheme while outlining the reconstruction process. The section ends with the identification of specific problems to be tackled in order to reach a satisfactory solution.

In the following sections, the images of figure 5.2 (identical to the ones of Chapter 2) will be used as sample images at time $t$ and $t-1$. It has to be noted that the images are noisy: the square is therefore not made up of a flat texture and its contours are rather fuzzy. Figure 5.2 (c) displays the luminance values of the pixels of the 50th line of the image $I(t-1)$.

5.1.1 The Proposed Reconstruction

The scheme aims at enhancing the reconstruction (compensation) step of the BMA thanks to the injection of a priori spatial information. Instead of merely applying the motion vectors detected for every block, the scheme of figure 5.1 proposes to modify the motion reconstruction in a twofold way, so as to apply it to a mesh structure that has been automatically designed upon the reference image (a sketch of the triangle-warping step follows the list):



[Figure 5.1: Outline of the proposed reconstruction scheme. Coder: BMA estimation between the reference image at time t-1 and the new image at time t; the backward BMA motion field is sent over the transmission channel. Decoder: mesh creation on the reference image at time t-1, determination of the vertices' (forward) motion, and mesh warping yielding the predicted image at time t.]

- the transcription of the motion information is altered (interpolated) in order to define the motion components of every vertex of the mesh structure;
- the compensation stage then consists in warping the mesh according to the newly determined motion information.
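At the heart of the warping stage, each triangle of the mesh is mapped by the affine transform determined by the motion of its three vertices. A minimal NumPy sketch of that single step is given below; the function name and calling convention are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """Affine transform mapping one triangle onto another.

    src_tri, dst_tri -- (3, 2) arrays of (x, y) vertex coordinates
    Returns the 6 affine parameters (a, b, c, d, e, f) such that
    x' = a*x + b*y + c and y' = d*x + e*y + f.
    """
    A = np.hstack([src_tri, np.ones((3, 1))])   # 3x3 system matrix
    px = np.linalg.solve(A, dst_tri[:, 0])      # a, b, c
    py = np.linalg.solve(A, dst_tri[:, 1])      # d, e, f
    return np.concatenate([px, py])
```

Applying the resulting transform to the pixels of the source triangle, for every triangle of the mesh, produces the predicted image.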

The expected result of such a procedure is presented in figure 5.3 for the sample images. What emerges in comparison with figures 2.16, 2.22 and 2.23 is that the blocking artifacts are suppressed by the use of a mesh that takes the object boundaries into account. Moreover, the computational burden at the decoder end is not drastically increased, as the motion vectors are not estimated but merely computed on a limited number of vertices.

5.1.2 Problems to be Addressed

The sections to come will detail the main points of the proposed scheme, but it seems important to first identify the problems to be solved in order to obtain a stable and viable solution. Three specific parts of the algorithm necessitate some preliminary explanation as well as a precise schedule of conditions.



[Figure 5.2: The two original sample images: (a) original image at time t-1; (b) original image at time t; (c) histogram of line 50 (intensity value versus pel number).]

5.1.2.1 Mesh Vertices

The first step of the reconstruction algorithm is the automatic design of a mesh on the reference image. Meshes may be designed in several ways. Dividing the image into a priori equal patches provides a regular mesh. Such meshes are not adapted to our purposes, as they do not reflect the scene content and one patch can contain multiple motions (because it overlays several distinct objects). Hierarchical meshes [122] are not needed either, as they are intended for refinement over time. Knowledge-based mesh design that exploits a priori information (like facial animation in videophony [115]) is useless here, since the scheme intends to deal with any given video sequence.

[Figure 5.3: Toy example: expected result of the new reconstruction process. (a) Optimal wireframe; (b) warped wireframe.]

Content-based meshes, which aim at matching the boundaries of patches with important scene features, appear to be the solution. Such an adaptation of the mesh may be implemented in several ways: based on spatial and temporal activity [28], or in order to minimize a function of several constraints [136]. One of these constraints can be the reconstruction error. However, the aim here is to exploit information already accessible at the decoder end and to fit as much information as possible to the objects. The choice is thus to detect some corners -- or feature points -- of the image. The set-up of the triangular active mesh can then be implemented through a Delaunay triangulation [119] of the previously detected points.
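For instance, once feature points have been extracted, SciPy's Delaunay triangulation yields the mesh directly. A minimal sketch (corner detection itself is the subject of section 5.2):

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate(corners):
    """Build the triangular mesh from detected feature points.

    corners -- (N, 2) array of (x, y) corner locations, N >= 3 and
               not all collinear; the four image corners can be
               appended so that the mesh covers the whole frame.
    """
    tri = Delaunay(np.asarray(corners, dtype=np.float64))
    # tri.simplices is an (M, 3) array of vertex indices, one row
    # per triangle of the mesh.
    return tri.points, tri.simplices
```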

Constraints on the number and the location of the mesh vertices may be pointed out:

- Their number should be high enough to surround all objects present in the scene and to allow these objects to undergo independent movements. However, if there are too many vertices, the mesh could reproduce the discontinuities of the block-based vectors, and thereby keep the blocking artifacts, which should be avoided. In other words, the number of vertices should be limited in order to smooth the motion field and suppress the blocking artifacts.
- They should be correctly located on object boundaries and corners, as appears experimentally from figure 5.4 (out of [138]): considering two original images at time t-1 and t (figure 5.4(a)), one can see the effect of mesh warping when the vertices are arbitrarily placed (b), located on object edges (c), or correctly located on corners and edges (d). Only the latter case prevents spatial degradations.

[Figure 5.4: Experiment on the importance of the location of mesh vertices (rows: time t-1, time t; columns (a)-(d)).]

5.1.2.2 Motion Interpolation

The second point of the transcription change is to interpolate the motion field: the BMA information can be considered as a coarse subsampling of a dense motion field. The motion of the vertices represents another subsampling of the same dense field. The problem is that the dense motion field is not known. Combined with the inversion problem mentioned hereunder, an appropriate interpolation technique should be found. It should also remain as simple as possible, so as not to increase the computational burden too much.



5.1.2.3 Reversing the Motion Information

While adapting the motion information to the active mesh, an important point must be taken into account: this information should often be reversed. The BMA, as implemented in most standards, is effectively computed "backwards", while the mesh, implemented on the image to be motion compensated, needs "forward" vectors. This means that the motion vectors of the BMA indicate where a block of $I(t)$ comes from in $I(t-1)$, whereas the mesh is designed on $I(t-1)$ and the motion vectors must indicate where every node must arrive in $I(t)$.

5.2 Mesh Design

The retained method to automatically design the adaptive mesh thus consists of i) detecting image corners to use them as mesh vertices; and ii) building the mesh by Delaunay triangulation. The latter is well known in the literature [119] and is briefly presented in Appendix D. The corner detector developed here receives some explanation below.

The literature proposes two classes of algorithms for the extraction of corners. The techniques of the first class work directly on the grey-level image. A "cornerness" measure is first computed for each pixel of the image through measurement of gradients and surface curvatures. Cornerness $C$ is defined as the product of the gradient magnitude and the rate of change of the gradient direction. After that, the corners are extracted by applying a threshold on the used measure. The best-known detectors of this kind are the following (a code sketch computing two of these measures is given at the end of this enumeration):

- the one of Beaudet [5], who proposed a detector which looks for extrema of a rotationally invariant operator DET, i.e. the determinant of the Hessian of the image:
\[ \mathrm{DET} = I_{xx}\,I_{yy} - I_{xy}^2, \qquad (5.1) \]
where $I_i$ designates the partial derivative of $I$ with respect to $i$, and $I_{ij}$ a second partial derivative;
- the operator proposed by Dreschler and Nagel [27], which relies on a combination of maximum and hyperbolic points of the Gaussian curvature of the intensity surface:
\[ C = \frac{\mathrm{DET}}{\left(1 + I_x^2 + I_y^2\right)^2}; \qquad (5.2) \]



- the cornerness measure of Kitchen and Rosenfeld [56], defined as the change of the gradient direction along an edge contour multiplied by the local gradient magnitude:
\[ C = \frac{I_{xx}\,I_y^2 + I_{yy}\,I_x^2 - 2\,I_{xy}\,I_x\,I_y}{I_x^2 + I_y^2}; \qquad (5.3) \]
- the operator of Zuniga and Haralick [139], which is based on a facet model approach where images are considered as bi-cubic polynomial surfaces;
- the detector of Harris [46], which makes use of the local autocorrelation function to simultaneously detect corners and edges.

Deriche [25] has highlighted the behavior of such well-known cornerness measures so as to correct their faulty localization. This results in a measure that combines Beaudet's measure and the zero-crossings of the Laplacian. Recently, Rohr [114] has proposed an analytical model of grey-value corners in order to further study the properties of direct corner detectors.
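As an illustration of the first class, the following sketch computes Beaudet's DET (5.1) and the Kitchen-Rosenfeld measure (5.3) from finite-difference derivatives; the derivative scheme and the division guard are implementation choices of this sketch, not prescribed by the references. Thresholding the resulting maps would then extract the corners.

```python
import numpy as np

def cornerness_maps(image):
    """Beaudet's DET (5.1) and Kitchen-Rosenfeld cornerness (5.3)."""
    f = image.astype(np.float64)
    fy, fx = np.gradient(f)           # first derivatives I_y, I_x
    fxy, fxx = np.gradient(fx)        # I_xy, I_xx
    fyy, fyx = np.gradient(fy)        # I_yy, I_yx

    det = fxx * fyy - fxy ** 2        # Beaudet: DET = Ixx*Iyy - Ixy^2
    eps = 1e-9                        # guard against flat areas
    kr = (fxx * fy ** 2 + fyy * fx ** 2 - 2.0 * fxy * fx * fy) \
         / (fx ** 2 + fy ** 2 + eps)  # Kitchen-Rosenfeld (5.3)
    return det, kr
```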

The second class of algorithms explicitly extracts the edges as chain codes at an initial stage. The corners are then found as points belonging to those edges which have a high curvature and whose curvature is a local maximum. Edges are often detected by means of the Canny [79] operator, while the curvature along an edge may be estimated by finite differentiation [79, 83] of the chain of neighboring pixels or by computing the partial derivatives [24] using intensity images.

In order to get rid of the image noise as well as to select only the "strongest" corners, techniques belonging to the first class are usually applied to a filtered version of the images, which results in corner displacement. On the other hand, the Canny edge detector used in the second approach is unable to detect complicated shapes such as locations where many contours meet. Deriche's technique [25] skirts both these problems, as it is able to detect multiple points and tries to correct the corner displacement caused by filtering. However, the latter technique has revealed itself very demanding, and precise only at subpel accuracy. As precise location on real edges seemed a priority (cf. Section 5.1.2.1), a second-class algorithm was eventually chosen.



At first, the half-boundaries detector of Noble [94] is used to detect edges as well as multiple points. It is then combined with the finite differentiation technique of Najman and Vaillant [87] to select high-curvature points.

5.2.1 Detecting Edges

With respect to our application, one of the great improvements brought by Noble's edge detector is that it does not propose to detect edges in a classical way, but rather to find half-boundaries. Moreover, it makes use of morphological operators [45, 118], which perform very fast.

Although Noble's algorithm is originally made out of three distinct steps, namely feature enhancement, boundary following and boundary stitching, only the first two parts have been used in our implementation. While the very first step has not really been modified, some improvements have been brought to the second in order to better fill our needs. The following subsections detail the working principles of both steps.

5.2.1.1 Enhancing Image Boundaries

Real edges are characterized by a second-derivative zero-crossing, i.e. their response to a second-order derivative operator, like the Laplacian, is null at the edge location. However, such edges are often located between the pixels (cf. figure 5.5(a)), which forces algorithms to estimate them by differentiating the response values from two points on either side of the zero-crossing in the direction of maximum change. Such a measurement is too local to allow the algorithm to distinguish between real edges and noise. This is why Noble proposes to track the edges in regions adjacent to boundaries, which she calls half-boundaries (cf. figure 5.5). The edge tracking is therefore not based on the boundary strength anymore, but on the shape of the operator response.

The main requirement to track half-boundaries is then to have an operator that can distinguish between the sides of a boundary. Such an operator is the so-called signed max dilation-erosion residue, which acts as a second-derivative filter. If one defines the erosion^3 residue
\[ f_{er}(f) = f - (f \ominus B), \qquad (5.4) \]

^3 For the definitions of morphological operators like erosion and dilation, please refer to [45, 118].



[Figure 5.5: Location of real edges and half-boundaries: (a) ideal step; (b) narrow ramp or blurred step. Negative and positive half-boundaries lie on either side of the real edge.]

where $f$ is the image function and $B$ the structuring element, and the dilation residue
\[ f_{dr}(f) = (f \oplus B) - f, \qquad (5.5) \]
then the max dilation-erosion residue is defined by:
\[ f_{maxder}(f) = \max\left[\, f_{er}(f),\; f_{dr}(f) \,\right]. \qquad (5.6) \]
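A direct transcription of (5.4)-(5.6) in Python, a sketch assuming 8-bit grey-level images and using scipy.ndimage grey-level morphology with the square structuring element used later in this section:

```python
import numpy as np
from scipy import ndimage

def max_dilation_erosion_residue(image, size=5):
    """Signed max dilation-erosion residue of (5.4)-(5.6).

    Returns the erosion residue, the dilation residue and their
    pointwise maximum, computed with a square structuring element.
    """
    f = image.astype(np.int32)
    er = f - ndimage.grey_erosion(f, size=(size, size))   # (5.4)
    dr = ndimage.grey_dilation(f, size=(size, size)) - f  # (5.5)
    return er, dr, np.maximum(er, dr)                     # (5.6)
```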

This operator is not intended to be used as a morphological edge detector (it would be as noise-sensitive as the Laplacian). It is used to provide connected response regions on either side of a boundary, by classifying pixels according to the type and magnitude of the response.

Positive or negative half-boundaries are the lines formed by connected pixels with a (positive or negative) residual value that are closest to the edges.

Concretely, the boundary enhancement is performed by first computing the $f_{maxder}$ operator for the whole image (for reasons of computational burden, the structuring element used is a square of side 5 pixels). The pixels are then classified according to the type of residual information they possess:

- Pixels where none of the (dilation or erosion) residues are significant are assigned a background label.
- If one of the residues is significant (i.e. non-zero), the pixel receives a label indicating its localization with respect to edges:
  - If the dilation residue is larger than the erosion one, the point is adjacent to the boundary on the darker side of the intensity edge (cf. figure 5.5). The pixel then receives an `n' label, where n stands for negative.
  - Similarly, `p' labels are assigned to positive pixels whose erosion residue is larger than the dilation one.
  - If both residues are different from zero but equal in amplitude, the pixel is an internal or ramp one, i.e. it is located either on a real edge or on a narrow ramp (cf. figure 5.5(b)). It receives an `r' label.

So as to distinguish between low-amplitude responses due to real edge features and non-zero responses due to noise, a threshold is applied. The so-called candidate boundary points are those points which are characterized by a residual value superior to a threshold $T_1$ and which are connected to a point of opposite type (`n' vs `p', or `r') whose residue is also larger than $T_1$. Internal pixels are a special case of candidate points.
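The labeling rules above can be expressed compactly. A minimal sketch, taking the residues of (5.4)-(5.5) as inputs; the 8-connectivity test against an opposite-type neighbor (which turns labeled points into candidate boundary points) is deliberately omitted here:

```python
import numpy as np

def classify_pixels(er, dr, t1):
    """Label pixels from their erosion/dilation residues.

    Returns an array of labels: 'b' background, 'n' negative side,
    'p' positive side, 'r' internal/ramp point. The connectivity
    test against an opposite-type neighbor is not implemented.
    """
    labels = np.full(er.shape, "b", dtype="<U1")
    significant = (er > 0) | (dr > 0)
    labels[significant & (dr > er)] = "n"   # darker side of the edge
    labels[significant & (er > dr)] = "p"   # brighter side of the edge
    labels[significant & (er == dr)] = "r"  # internal or ramp pixel
    # Only 'r' points whose residue exceeds T1 are kept as candidates.
    labels[(labels == "r") & (np.maximum(er, dr) <= t1)] = "b"
    return labels
```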

One major advantage of Noble's operator, followed by thresholding, is that it does not require any pre-processing. The edges and corners are therefore not displaced (cf. the introduction of Section 5.2), and the detected half-boundaries are correctly located along real edges.



[Figure 5.6: Boundaries enhancement: (a) f_maxder after histogram equalization; (b) pixel classification into background, negative, positive and internal `r' points; (c) candidate boundary points.]

Whereas Noble uses a first-order neighborhood (North, South, East, West) while searching for connected points of opposite types, we here perform the search among 8 neighbors (N, S, E, W plus diagonals). The reason for this choice will be detailed in Section 5.2.1.2. Another change is that Noble seemingly keeps all internal pixels as candidate points. We decided to keep only the `r' points whose residual value is superior to $T_1$.



Figure 5.6 demonstrates all the intermediate results of this first step with respect to the original image 5.2(a). Comparing figure (b) with figure (c), one can see that the selection of candidate points already rejects several points as background ones.

5.2.1.2 Following Half Boundaries

Tracking the half-boundaries is then the next stage of the algorithm. This procedure involves two steps: initialization and half-boundary following.

The initialization stage consists in selecting candidate points that are relevant enough to start a tracking procedure. These points are selected according to the following criteria:

- the residual value of the point is above a threshold $T_2$ (of course $T_2 > T_1$);
- the point is connected to at least one other candidate point of the same type (`p' or `n') which has not yet been traced before and which also has a residue greater than $T_2$.

Whereas Noble uses 4-connection for the second criterion, we still use 8-connection, as will be explained later. Noble also adds a third condition, namely that none of the neighbor points may have already been traced. The reason for this is to avoid beginning a contour in a junction region (where many edges meet). As this condition makes the whole process more complex and does not bring a real improvement, we suppressed it in our implementation.

Noble then proposes to perform an oriented tracking of half-boundaries. The contour following direction is defined so that boundaries are traversed in the forward direction keeping the darker side of intensity to the right (i.e. negative labels are tracked in the forward direction keeping the positive labels on the left). Starting from a valid initial candidate point, contour tracking is first performed forwards and then backwards. Noble uses a set of twelve 3 × 3 templates (figure 5.7) allowing to follow straight lines and 90° right or left angles. In the meantime, ends of contours (gaps) as well as multiple junctions (when the next point has already been traced) are detected.

[Figure 5.7: Templates for the direction of tracking for negative labeled pixels. Crosses indicate locations of opposite label types (`p' or `r'). The central pixel (shaded) is assigned the forward and backward directions as indicated by the arrows.]

Moreover, in order to avoid the problems of wrong association in the case of contours passing through an arrow or `W'-junction, highly curved boundaries or thin bars, Noble proposes to perform the tracking on the boundary shape image (the image of candidate points) expanded by a factor of 2. Figure 5.8 depicts the improvement brought to the tracking thanks to such an expansion, which has been used in our implementation.

However, Noble still uses a neighborhood of size 1 (top, right, bottom, left). This means that every time a half-boundary undergoes a 45° angle, it cannot be tracked anymore. Of course, another half-boundary will most certainly be created for the remainder of the points. While such a splitting of a real half-boundary is not annoying for a contour detector, it is very disturbing for the next step of our algorithm, i.e. the curvature measurement. This step requires contours (half-boundaries) as long as possible so as to assess its own measure. Since tracking is performed on an expanded image, the addition of only eight new templates (figure 5.9) subsequently allows tackling 45° angles. These new templates use a neighborhood of size 2. The purpose of coherence throughout the whole process justifies the use of such a neighborhood in the previous steps of the algorithm. It allows obtaining more complete boundaries, as they are no longer split into several pieces every time a 45° angle arises, as figure 5.10(c) demonstrates with respect to figure 5.10(b).



[Figure 5.8: Improvement of the boundary tracking: (a) wrong association in regular dimension (end of contour due to a thin bar); (b) the problem is overcome by tracking on a version of the boundary shape image expanded by a factor of 2.]



[Figure 5.9: Additional templates for the 45° directions of tracking for negative labeled pixels.]

Figure 5.11 illustrates the result of the improved version of Noble's half-boundaries finder. What is important is their precise location along real image borders, and the fact that, for instance, the whole exterior shape of the square is detected as a unique contour. Of course, additional contours are detected due to the noisy character of the image texture (cf. figure 5.2 (c)).

5.2.2 Corner Extraction

Every edge is then traveled in order to determine the high-curvature points. The procedure proposed by Najman and Vaillant is applied. If the chain of pixels forming a half-boundary is denoted $\{p_i\}$ (with $i = 1, \ldots, n$), then for every pixel $p_i$ and for $k = k_{min}, \ldots, m$, one computes:
\[ \vec{a}_{i,k} = \overrightarrow{p_i\,p_{i+k}}, \qquad \vec{b}_{i,k} = \overrightarrow{p_i\,p_{i-k}}, \qquad \cos_{i,k} = \frac{\vec{a}_{i,k} \cdot \vec{b}_{i,k}}{|\vec{a}_{i,k}|\,|\vec{b}_{i,k}|}. \qquad (5.7) \]

For every $p_i$, the highest length $k = k_{max}$ that matches the following relation is retained:
\[ \cos_{i,m} < \cos_{i,m-1}. \]

[Figure 5.10: Improvement of taking 45° angles into account for half-boundary tracking: (a) original image and zoom; (b) half-boundaries with the initial 12 templates (every color indicates a different half-boundary); (c) with the 8 additional templates.]

A pixel $p_i$ is then retained as a corner if its curvature measure satisfies two conditions: i) it is above a given threshold, and ii) it is greater than the curvature of all pixels of the edge whose distance to $p_i$ is lower than $k_{max}/2$.
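A minimal sketch of the cosine measure of (5.7) for one pixel of a chain; the index is assumed to satisfy k_max <= i <= n-1-k_max so that both vectors exist, and chain pixels are assumed distinct:

```python
import numpy as np

def curvature_cosines(chain, i, k_min, k_max):
    """cos_{i,k} of equation (5.7) for pixel p_i of a half-boundary.

    chain -- (n, 2) array of pixel coordinates along the boundary
    Returns one cosine per tested length k: values near +1 indicate
    a sharp angle at p_i, values near -1 a locally straight edge.
    """
    cosines = []
    for k in range(k_min, k_max + 1):
        a = chain[i + k] - chain[i]          # vector p_i -> p_{i+k}
        b = chain[i - k] - chain[i]          # vector p_i -> p_{i-k}
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cosines.append(float(np.dot(a, b)) / denom)
    return np.asarray(cosines)
```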

Gaps and junctions already detected in the previous step are added to the corner list. For the purpose of mesh design, and in order to better surround objects, "corners" are automatically added every time a (piece of) contour exceeds $l$ (typically, $l = 45$) pixels.



[Figure 5.11: Example of half-boundary detection: (a) positive (white) and negative (black) half-boundaries; (b) contours superimposed on the original.]

Finally, only the pairs of adjacent corners (one on a positive and one on a negative half-boundary) are retained. Corners are indeed detected on both the positive and negative half-boundaries. This double information is used to assess the corner extraction and get rid of the noise influence, since a corner on a positive half-boundary has to be a neighbor of a "negative" corner. It is then up to the application to keep double corners (the purpose of keeping pairs of corners will be presented in the next section, about the motion transcription), to select only the `n' or `p' points, or, alternatively, the points located on a real edge in-between two corners of opposite type.

Figure 5.12 presents the results of the overall procedure, as well as the automatically generated mesh once the Delaunay triangulation has been applied. One may argue that no point is detected on the left side of the square: it is the high irregularity of this side that prevents the algorithm from performing well there. Note however that all the other sides are correctly detected as regular edges and are thereafter correctly surrounded. Some false corners are of course also extracted from the contours, due to texture noise (cf. figure 5.2 (c)).

Figure 5.12: Example of corner detection: (a) Detected corners (b) Delaunay triangulation

From the computational point of view, the morphological operator of Noble performs very fast, and the following contour detection and corner extraction only involve one image scan and two contour trackings.

5.3 Motion Transcription

The aim of the motion transcription is to determine the motion vector to be applied to every vertex of the mesh designed in the previous step. The problem has been stated as being of a double nature: first, the "backward" sense of the BMA estimation should be reversed, and second, the values must be interpolated.

5.3.1 Reversing the Sense of the Motion Information

As already stated in the schedule of conditions, the mesh is designed on I(t-1) and the motion vectors must indicate where every vertex must arrive in I(t). It means that, in case of backward BMA, the sense of the motion information should be reversed.

If one considers the estimation performed by the BMA as a coarse subsampling of a dense motion field, the estimated vector $(\hat u, \hat v)$ (cf. Section 2.5.1) can be assigned to the center of the block, at location $\left(\frac{K-1}{2}, \frac{L-1}{2}\right)$. This means that, in the forward direction, it is equivalent to a forward vector $(-\hat u, -\hat v)$ located at position $\left(\frac{K-1}{2} + \hat u,\ \frac{L-1}{2} + \hat v\right)$.
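A minimal Python sketch of this reversal, assuming the backward field is stored as one vector per K×L block (all names are illustrative):

```python
import numpy as np

def reverse_backward_field(u_hat, v_hat, K=16, L=16):
    """Turn a backward BMA field into forward samples (Section 5.3.1).

    u_hat, v_hat : per-block backward motion components, shape (rows, cols)
    Returns the forward vectors (-u, -v) together with the irregular
    positions ((K-1)/2 + u, (L-1)/2 + v) at which they now apply.
    """
    rows, cols = u_hat.shape
    by, bx = np.mgrid[0:rows, 0:cols]
    cx = bx * L + (L - 1) / 2.0       # block centres in I(t)
    cy = by * K + (K - 1) / 2.0
    pos_x = cx + u_hat                # displaced sample positions in I(t-1)
    pos_y = cy + v_hat
    return -u_hat, -v_hat, pos_x, pos_y
```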

5.3.2 Interpolating the Motion Values

Once motion samples, placed on an irregular grid in case of backward BMA, have been obtained, the goal is next to estimate the values of other samples, placed on an(other) irregular grid.

The simple process one may think of in case of forward BMA estimation is to displace every mesh vertex according to the motion vector of the block it belongs to. In case of backward BMA estimation, it is equivalent to a nearest neighbor estimation, i.e. the vertex is assigned the value of the nearest known vector. Figures 5.13 and 5.14 depict the result, respectively for forward and backward estimation.

Although the result is of course more pleasant in case of forward estimation (there are no more discontinuities in the compensated image), it is not entirely satisfactory. The solution is not very smooth and, even if the object contours are already better taken into account, sudden irregularities arise around the



Figure 5.13: Nearest neighbor interpolation after forward estimation: (a) 8×8 BMA compensation (b) DFD resulting from (a) (c) 8×8 BMA, mesh compensation (d) DFD resulting from (c) (e) 16×16 BMA, mesh compensation (f) DFD resulting from (e)
Figure 5.13: Nearest neighbor interpolation after <strong>for</strong>ward <strong>estimation</strong>



Figure 5.14: Nearest neighbor interpolation after backward estimation: (a) 8×8 BMA compensation (b) DFD resulting from (a) (c) 8×8 BMA, mesh compensation (d) DFD resulting from (c) (e) 16×16 BMA, mesh compensation (f) DFD resulting from (e)


vertices (see the right side of the square on figure 5.14(c) and (e)). Moreover, if one of the blocks used has been badly estimated, the compensated image is drastically degraded. It is for instance the case for the top left corner of the square on figure 5.13(e), because the displacement of the corresponding 16×16 block has been wrongly estimated by the BMA.
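A minimal sketch of this nearest neighbor transcription (hypothetical array layout):

```python
import numpy as np

def nearest_neighbour_vertex_motion(vertices, sample_pos, sample_vec):
    """Assign to every mesh vertex the vector of the closest known sample.

    vertices   : (n, 2) mesh vertex coordinates
    sample_pos : (m, 2) positions of the known (reversed) BMA vectors
    sample_vec : (m, 2) corresponding forward motion vectors
    """
    diff = vertices[:, None, :] - sample_pos[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)   # squared distances, shape (n, m)
    nearest = dist2.argmin(axis=1)    # index of the closest sample
    return sample_vec[nearest]
```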

A more global interpolation technique, which would take the whole available information into account and not only a few arbitrary vectors, should be considered. It would also have to achieve some smoothing so as to avoid local traps. Figure 5.15 (see footnote 4) presents some very well-known interpolation techniques. The original data set consists in assigning each pixel the sum of the squares of its x and y coordinates (represented in the color spectrum: low value is blue, high is red). Nearest neighbor interpolation has already been commented on. Linear interpolation is a technique that works best for datasets with a small percentage of unknown values and is well-suited for separable problems; interpolation of motion vector fields is not separable. Kernel smoothing is a technique which, as its name suggests, smoothes the data so as to provide a data set free of irregularities. Although one here wants to lower the impact of wrong vectors, it is also important to keep changes of motion along object borders. Moreover, kernel smoothing does not preserve the original data values. Both these drawbacks are solved by weighted interpolation, which takes the various known data points into account according to a distribution function that relies on a physical distance between these points and the point to be estimated.
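As an illustration of such a weighted interpolation, the sketch below uses an inverse-distance law as the distribution function; this particular law is only one common choice, not necessarily the one retained in the text.

```python
import numpy as np

def weighted_interpolation(targets, sample_pos, sample_vec, power=2.0):
    """Weighted mean of all known vectors, with weights decreasing with the
    physical distance between the known points and the point to estimate."""
    diff = targets[:, None, :] - sample_pos[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2)) + 1e-9   # avoid division by zero
    w = 1.0 / dist ** power
    w /= w.sum(axis=1, keepdims=True)                # normalise the weights
    return w @ sample_vec                            # interpolated vectors
```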

Another theory of 2-D interpolation has been established by the South African mining engineer Krige [58]. In the field of geostatistics and hydrosciences, this technique is referred to as kriging. The theory has been further developed by Matheron [78]. Basically, kriging is a weighted interpolation technique which looks at statistical distances rather than physical ones. Kriging is a very powerful method for taking into account individual elements, like the "clustering effect" of the red point of figure 5.16.

However, kriging requires a model of statistical properties (a variogram in the kriging terminology) so as to define the a priori "shape" of the data covariance. Such a variogram would be very difficult to establish for motion vectors since one simultaneously wishes to obtain a smooth

4. Images come from the following Web site: http://www.fortner.com.



Figure 5.15: Various interpolation techniques: sample data from the original, nearest neighbor, linear interpolation, kernel smoothing, weighted interpolation

Figure 5.16: Comparison between kriging and weighted interpolation: sample data, kriging, weighted interpolation

vector field within objects while allowing motion changes from one object to another.

Inspired by kriging, Decenciere, de Fouquet and Meyer have defined a technique of inverse kriging [22] that has already been successfully



applied to motion vector interpolation problems [21] in the framework of the MORPHECO [84] project.

Intuitively, inverse kriging solves the problem in a roundabout way: the criterion used to determine the unknown values of the searched samples is that, if those were in turn interpolated, they would have to reproduce the known values at the known points.

For this purpose, it uses a kriging operator that maps the unknown vector field onto the known one. In the present situation, this operator is directly offered by the affine transform implicitly implemented by the mesh structure. If one considers a known vector located at a position $X$ of figure 2.19, whose value is $\vec{V}_{exp}(X)$ (with $\vec{V}_{exp}(X) = X' - X$), a direct relation with the unknown motion field $\vec{W}$ of the vertices of the surrounding triangle $ABC$ can be found thanks to equation (2.35):

$$\vec{W}(A) + p\,(\vec{W}(B) - \vec{W}(A)) + q\,(\vec{W}(C) - \vec{W}(A)) = \vec{V}_{exp}(X) \qquad (5.9)$$
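For illustration, the barycentric weights p and q of equation (5.9) follow from a 2×2 linear system; a minimal sketch with hypothetical names:

```python
import numpy as np

def barycentric_pq(A, B, C, X):
    """Solve X = A + p (B - A) + q (C - A) for (p, q), i.e. the weights
    that one inverse kriging constraint (5.9) attaches to the triangle."""
    A, B, C, X = (np.asarray(v, dtype=float) for v in (A, B, C, X))
    M = np.column_stack((B - A, C - A))   # 2x2 matrix [B-A | C-A]
    p, q = np.linalg.solve(M, X - A)      # fails only for degenerate triangles
    return p, q
```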

As every vector of the (reversed) BMA motion field belongs to one triangle of the mesh, such an equation can be iterated. According to the relative numbers of BMA vectors and mesh vertices, and also to the pertinence of all inverse kriging equations, the system may be under-, over- or simply determined. In the two first cases, no exact solution exists, and one faces a linear least-squares problem that may be solved using Singular Value Decomposition (SVD [107]). Nevertheless, one has to be aware that the resolution of a least-squares problem through SVD will be much more coherent and stable if the number of constraints sufficiently exceeds the number of unknowns.

Although the number of equations may be quite large (99 when the BMA is performed on every 16×16 block of a QCIF image), it has to be noted that only three parameters are nonzero in every equation. The massive presence of null parameters in the system matrix speeds up the so-called QR decomposition with column pivoting involved by the SVD algorithm.
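The assembly and resolution of this system can be sketched as follows (hypothetical data layout; numpy's lstsq relies internally on an SVD-based LAPACK routine, in line with the SVD resolution mentioned above):

```python
import numpy as np

def solve_inverse_kriging(n_vertices, constraints):
    """Least-squares solution of the stacked equations (5.9).

    constraints : list of ((ia, ib, ic), (p, q), (vx, vy)) tuples, one per
        reversed BMA vector: the vertex indices of the enclosing triangle,
        its barycentric coordinates, and the expected vector V_exp(X).
    Every equation contributes one sparse row whose only nonzero
    coefficients are (1 - p - q), p and q.
    """
    m = len(constraints)
    A = np.zeros((m, n_vertices))
    bx, by = np.zeros(m), np.zeros(m)
    for row, ((ia, ib, ic), (p, q), (vx, vy)) in enumerate(constraints):
        A[row, ia] = 1.0 - p - q
        A[row, ib] = p
        A[row, ic] = q
        bx[row], by[row] = vx, vy
    wx = np.linalg.lstsq(A, bx, rcond=None)[0]   # SVD-based least squares
    wy = np.linalg.lstsq(A, by, rcond=None)[0]
    return np.column_stack((wx, wy))             # one vector per mesh vertex
```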

5.3.3 Mesh Connectivity

Finally, it must be checked that the motion vectors do not force the mesh to violate its connectivity constraint (cf. figure 2.21). The offending motion vectors are rectified by interpolating the motion vectors of the neighboring nodes (equations (47) and (48) of [3]).
vectors of the neighboring nodes (equations (47) <strong>and</strong> (48) of [3]).



5.3.4 First Results

Figure 5.17 presents the compensated image and the DFD resulting from the inverse kriging vectors determined for the mesh of figure 5.12 (b), in comparison with the result of a classical BMA compensation.

Figure 5.17: Result of inverse kriging for the example images (estimation with 16×16 blocks): (top) BMA compensation (bottom) Asymmetric compensation

One can directly notice that the asymmetric compensation provides an image which is visually more pleasant. However, if one has a close look at the DFD of the asymmetric scheme, one can see that the square rotation is only partially compensated. This apparently surprising result was expected: the global interpolation provided by inverse kriging has smoothed the vector field so as to avoid the local instabilities of figures 5.13 and 5.14. Yet, the



major point is that all blocking artifacts have been suppressed.

So as to restrict this smoothing effect, a solution could be to use pairs of corners, as provided by the half-boundaries. The Delaunay triangulation will then be modified and result in the generation of small triangles along all object borders, as depicted on figure 5.18(a). As it is not very likely that a BMA motion vector is located inside such a small triangle, one may hope to obtain more or less two distinct systems (here one for the square and one for the background) that will allow every part to follow its own motion. In fact, even if the compensated image of figure 5.18(b) is already somehow improved with respect to the square rotation, the system is still not able to provide a perfect solution because the SVD algorithm is applied at once to the whole system. One should consider supervising the asymmetric compensation with a Markov process. This suggestion is detailed at the end of our conclusion (Section 5.5).

5.4 Results

In order to demonstrate the validity of the proposed scheme (see footnote 5), its results will be compared to those of a classical BMA compensation. Figure 5.19 summarizes the various steps of the process with the Akiyo images: the original images 27 (a) and 30 (b) are motion estimated via a classical "backward" BMA. Instead of merely applying the resulting motion field (c), which would result in image (d), an adaptive mesh is superimposed (e) on the original image (a). Thanks to the inverse kriging equations, this mesh is warped in order to obtain the final compensation (f). Images (d) and (f) result from the compensation stage. No residues have been added to them.

In terms of Peak Signal-to-Noise Ratio (PSNR), the proposed reconstruction is 1 dB inferior to the BMA one. The increase of complexity introduced by the asymmetric reconstruction is therefore mainly justified by its ability to produce a more pleasant reconstruction (without blocking discontinuities) and by the possibility to directly interact with the content of the picture. The latter point is of major importance for the new framework offered by the MPEG-4 standard. In this framework, the mesh-based compensation could benefit from the decomposition of the scenes into several AVOs (cf. Section 1.4.3.2) and the coding of residuals with matching pursuits [90, 42]. One separate mesh could be adapted

5. Details of the implementation of the scheme are to be found in Appendix D.



Figure 5.18: Asymmetric compensation with double corners: (a) Delaunay triangulation and zoom (b) Resulting compensation



Figure 5.19: Some steps of the asymmetric process on the Akiyo sequence: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



to every VOC in order to perform the compensation, while matching pursuits would be better suited to encode the residuals (which are no longer located according to a priori blocks).

Figure 5.20 zooms on the results of figure 5.19 in order to highlight some improvements brought by the wireframe reconstruction: the BMA block on the right part of the chin is suppressed and the upper lip of the mouth is no longer doubled. The first correction reflects the significance of taking into account some a priori spatial information, while the second results from the mesh connectivity that prevents duplicating information.

Figure 5.20: Comparative zoom between BMA and the proposed scheme: (a) Zoom on the BMA compensation (b) Zoom on the asymmetric reconstruction

Figure 5.21 presents two other Akiyo images (10 and 30). As more time separates the images, the motion field that links them is sparser and the displacements are larger. Consequently, the quality of the compensation is directly altered. This experiment is intended to demonstrate once again the advantage of using an a priori mesh: its connectivity prevents the image content from being degraded too much (left eye and mouth). The asymmetric reconstruction (figure 5.21 (f)) is of course not able to interpolate all the lost information. In the worst cases, annoying blocking artifacts are replaced by "funny" deformations. In this case, one may wonder about the pertinence of the connectivity constraint imposed by the mesh: the motion vectors to be applied to both lips are so contradictory that it would probably be better to break this connectivity (provided that an edge of the mesh passes through the lips). A hole would then appear in the mouth and would have to be filled with some residual information. A way of managing such a split of the mesh structure is proposed at the very end of this section.
section.<br />

Since the aim of the asymmetric reconstruction is to al<strong>low</strong> further manipulation<br />

of the images while al<strong>low</strong>ing a better subjective reconstruction, it<br />

is non-sense to propose any table of objective numbers, like the PSNR.<br />

However, in order to demonstrate the capabilities of the scheme with various types of content, some other sequences are demonstrated: figure 5.22 presents images 50 and 53 of Hall Monitor, figure 5.23 presents images 40 and 42 of Silent and figure 5.24 presents images 1 and 3 of Table Tennis. One can once more see the cancelling of the blocking artifacts. Of course, in the Hall Monitor case, one can notice that some moving objects (the leg of the man) have been immobilized. This is the drawback of suppressing all blocking artifacts, and it is not always acceptable. Moreover, jerky effects would arise if the scheme were inserted in a decoding loop.

The result of inserting the scheme in a (de)coding loop is presented on figure 5.25. Of course, a sequence with low motion activity (Akiyo) has been chosen in accordance with the previous comments. The lowest possible bitrate is used, i.e. only the motion information is exploited and no residues are transmitted. One can directly notice an absence of blocking artifacts on the mesh-compensated image. The drawback is that successive warping operations strongly interpolate the image and result, after a certain time, in a relatively blurred image. Some residual information should easily correct this effect but, once again, an adaptive technique such as matching pursuits should be chosen.

One last, but important, comment is that all images of the present chapter have been generated with the same set of parameters: no particular



Figure 5.21: Results with a sparse motion field due to more unpredictable movement: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.22: Asymmetric process on Hall Monitor: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.23: Asymmetric process on Silent: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.24: Asymmetric process on Table Tennis: (a) Original image at time t-1 (b) Original image at time t (c) Backward motion field from BMA (d) BMA compensation (e) Automatic mesh design (f) Asymmetric compensation using mesh



Figure 5.25: Comparison of results within a decoding loop: originals #31 and #61, BMA in the loop #31 and #61, mesh in the loop #31 and #61



tuning has been achieved on the algorithms according to the type of input images. These parameters are provided in Appendix D.

5.5 Conclusion

The aim of this chapter was to combine the advantages of two separate techniques used for motion estimation & compensation. On the one hand, the Block Matching Algorithm has been selected for its efficient estimation as well as for its compact representation of the motion field. On the other hand, affine models offer a more sophisticated representation of the motion that avoids blocking artifacts on the object contours. When affine models are implemented via an active mesh that establishes an a priori link with spatial information, they provide the user with new options: 3-D modeling, video editing, transfiguration, augmented reality, etc.

An asymmetric scheme has thus been proposed: after a classical BMA estimation of the motion field, the compensation stage is implemented via mesh warping. In so doing, solutions to several problems have been offered: a fast and accurate corner finder has been set up, the motion information has been reversed, and it has been interpolated thanks to the inverse kriging technique. All these innovations allow the representation of the motion information to be changed automatically, from block-based rigid motion fields to active ones designed on an adaptive mesh.

The proposed scheme has been claimed to be compatible with the bitstream of existing compression standards: the bitstream structure does not have to be modified because a BMA estimation is still used. Of course, if one wants to insert such a scheme in the decoding loop, the asymmetric process should also be included in the coder, which would in turn modify the values of the residual coding. Nevertheless, the scheme could for instance be used to edit and manipulate MPEG video sequences stored in a database without having to recalculate all the motion information.

In addition to the new options the scheme offers, it has been shown to somehow improve the subjective quality of the compensated images. As a summary, table 5.1 compares the performances of the proposed scheme with those of Block Matching (cf. Section 2.5.1), Hexagonal Matching (cf. Section 2.5.1) and Adaptive Hexagonal Matching (cf. Section 2.5.1). What directly emerges from this table is the increase of computational burden at the decoder side. This goes against the



                          BMA       HMA       AHMA           Asymmetric
PSNR                      +         ++        +++            +
Interactivity             -         +         ++             ++
Compliance with
standard bitstream        Yes       Possible  No             Yes
Coder complexity          low       high      high           low
                          (fixed)   (fixed)   (∝ contents)   (fixed)
Decoder complexity        low       medium    medium         high
                          (fixed)   (fixed)   (∝ contents)   (∝ contents)

Table 5.1: Comparison between the different schemes

philosophy of many video algorithms and standards but is necessary to offer more functionalities (in terms of manipulation) to the user.

Further improvements

In addition to the prospects already mentioned with the results, an improvement one may directly think of is to refine the corner detection by taking into account the information of the previous frames of the sequence: previous vertices location, spatio-temporal activity, ...

A second improvement could be to use a constrained triangulation instead of a Delaunay one. Since object edges (half-boundaries) have been detected, prior to corners, they could be exploited by forcing the triangles to coincide with them. This would ensure a better link with the spatial contents of the image. A practical solution is to consider a graph made out of segments corresponding to strong edges, which can then be extended to a triangulation thanks to the geometrical Delaunay criterion (cf. Appendix D).

Another improvement is related to figure 5.18(a), which presents a wireframe generated with double corners along object boundaries. The result of applying the asymmetric compensation to this wireframe is not as good as expected. We think it could be possible to profit more from this specific wireframe structure if one allows different parts of the wireframe to undergo totally different movements. This could be introduced thanks to a Markov model (cf. appendix C): a Markovian wireframe would be authorized to "split" itself into several sub-meshes if some contradictory motion arises along an edge between two vertices. This is

intended to improve the quality, segment the motion field and automatically detect the areas with discovered background. The Markov model could be the following (a minimal sketch of the regularization potential is given after the list):

- the sites are the wireframe vertices;
- the dual lattice is the triangulation (set of wireframe edges) that links the sites;
- the neighborhood system is defined around every vertex: two edges that are related to a same vertex are neighbors;
- a clique is made of any possible combination of such neighbor edges around a vertex;
- the aim of the Markovian process is to determine a line process, i.e. to know which edges must be "broken";
- two potential functions ensure the link with the data:
  - the change of the image gradient intensity along the edge, because a discontinuity is more probable along object edges, and
  - the divergence of the motion constraint at the two extreme vertices of the edge;
- one potential function introduces a regularization term. The two most probable configurations are: no discontinuity at all around the vertex, or two (surrounding of an object corner). Only one discontinuity is highly unlikely. Three, four, five, ... discontinuities (corresponding to a corner common to three, four, five, ... objects) are also less probable.
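A minimal sketch of such a regularization potential; the numeric penalties are purely illustrative and do not come from the text:

```python
def line_process_potential(n_broken):
    """Cost of a configuration with n_broken discontinuities around a vertex:
    0 or 2 broken edges are the two most probable configurations, a single
    discontinuity is highly unlikely, and the cost grows again for 3, 4,
    5, ... discontinuities (corners common to several objects)."""
    if n_broken in (0, 2):
        return 0.0
    if n_broken == 1:
        return 10.0
    return 2.0 * n_broken
```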

Inside the different areas determined by the Markov model, the motion values could be independently computed by inverse kriging. However, one major problem related to mesh connectivity arises: as the mesh is not unique anymore, its global connectivity cannot be ensured, but only the connectivity of every individual sub-mesh. Thereafter, "holes" as well as multiple predictions will appear in the compensated image. A way of dealing with these artifacts should be designed.

Yet another improvement, in terms of complexity, would be to restrict the use of the asymmetric reconstruction to areas where complex movement arises: the effect of the wireframe is effectively useless (except for further manipulation of the images) in flat areas that obey the translational model of the BMA.


Conclusion

With the emergence on the market of low-cost cards allowing customers to decode the ISO MPEG-1 and MPEG-2 standards, digital video coding is becoming a very common functionality of multimedia personal computers. For several years, video coding has moved towards very low bitrates (under 64 kbit/s). A major result of this trend is the ITU H.263 standard, which also has the advantage that the software version of its decoder performs in real-time on a PC. The future ISO MPEG-4 standard goes one step further as it offers the user the possibility to somehow interact with the contents of the pictured scene. Because of the convergence of the target bitrates with the ever-increasing capabilities of existing networks, some people already claim that these two standards are putting an end to research in "pure" video coding.

Among the tools used to achieve compression of the video signal, motion estimation is probably the one offering the greatest compression ratio. A lot of research has been devoted to it during the last two decades. It not only resulted in great outcomes in video coding but also in other fields where motion analysis plays a key role, like semantic interpretation or target tracking. The exploitation of motion in a video coding scheme typically involves three steps: estimation, transmission and compensation. The video coding context and the state of the art in related motion estimation techniques have respectively been presented in Chapter One and Chapter Two. Some emphasis was put on the very low bitrate environment and on the most used motion estimation techniques (BMA and warping techniques). In this sense, the present thesis contributes to the investigation of the possibility to enhance the various steps of the motion chain by somehow taking into account the spatial contents of the image(s).

Chapter Three modifies the estimation performed by a classical BMA so as to adapt the size of the blocks to the object boundaries. It results in a split-and-merge procedure that manages a multiscale algorithm. The so-called Adaptive BMA outputs a quad-tree structure which enables condensing the motion information in spatial areas where complex movements arise. The adaptation to the contents turns out to be very beneficial for the motion estimation as it allows obtaining a much more reliable motion field. Unfortunately, such an adaptation increases the computational burden. Chapter Three therefore proposes a model to distribute the load among several processors. Preliminary results demonstrate a linear speed-up thanks to the distribution using a "master-slave" structure. However, it would be very interesting to pursue the effort and implement the parallel software on an architecture dedicated to Digital Signal Processing (DSP), like the Texas Instruments TMS C80. One could then see whether real-time software performance, as it was more or less the case for the BMA, is possible or not.

Chapter Four evaluates the impact of image pre-treatment on the coding process. Although the theoretical model developed in the Rate-Distortion framework announces a possible reduction of the residual error and hence of the bitrate, experimentation proves that it directly results in a drastic loss of quality. Pre-treatment thus seems to be at a deadlock. However, prospective experiments demonstrate that some gain can be obtained for specific parts of the video signal. Those parts generally belong to moving backgrounds. They are made out of textures that are irrelevant to the human eye, but relevant enough to be considered by the quantizer of the coder. If one wants to further investigate the impact of pre-treatment on video coding, one should probably first tackle the problem of automatic segmentation and tracking of objects (and especially background) in a video sequence. Then, a psycho-visual analysis of the texture could be used to determine which parts of the signal can be pre-treated without altering the overall subjective quality. In relation with Chapter Four, one could also consider developing the Rate-Distortion framework while introducing a more explicit model as far as motion estimation is concerned. The model could then be reversed so as to theoretically determine which pre-treatment would achieve the best results.

Finally, Chapter Five focuses on compensation: it presents an asymmetric scheme whose aim is to achieve the compensation of a BMA motion field thanks to a contents-adapted mesh. The solution includes a novel corner detector for the mesh design and the use of inverse kriging for



interpolation purposes. The scheme provides a first demonstration of the subjective improvement offered by such an asymmetric scheme. However, the scheme would be worth a lot of additional research. Of course, it would be interesting to direct the scheme with a Markov process as suggested in the conclusion of the chapter. Though the initial aim is to improve the subjective quality of the reconstructed pictures, the insertion of the scheme in a complete codec would enable objectively quantifying its performance. In this case, we would suggest using Matching Pursuits as the residual coding method because of their localized (and subjective) nature, which is in concordance with our scheme.

The signal-adapted compensation offered by the mesh warping could also be used to reconstruct motion fields estimated by methods other than the BMA, e.g. parametric models. In this case, one has to define the kriging operator that maps the information from one structure to another.

The present thesis has thus analyzed three possibilities of improving the quality of existing video schemes. The chosen approach was to inject as much signal-adapted information as possible inside existing algorithms, like the BMA. If the experiments indicate that this idea is not always applicable as such, they also highlight the possible gain in specific situations. Every user might expect future systems to be as interactive as possible. Since one usually wants to interact with depicted objects that have a semantic meaning to him/her, it appears important that existing systems should take into account the contents of the data they are manipulating. Bits and bytes are indifferent to the user.

By way of final conclusion, an attempt is made to answer the question raised in the introduction, namely: for which part(s) of the motion exploitation chain (estimation - transmission - compensation) is it useful to take the spatial contents of the images into account? It is obviously very useful in the estimation phase, and it is probably also the place where it is the easiest to achieve. New algorithms in motion estimation, segmentation and tracking should allow coders to selectively transmit only the "interesting" parts of the signal. In this respect, a lot of research should be carried out in order to stabilize the existing techniques and to further investigate the human perception of images. One could then conceive algorithms that are able to automatically characterize the content of images. The challenge of the new ISO MPEG-7 work item is here of great interest. As far as compensation is concerned, quality



gain is not as obvious as the added value in terms of editing and interaction. Nevertheless, emphasis was put on the possibility for the decoder to exploit the information in a way that is different from the one usually applied. This can be of great relevance in a context of indexing of large visual databases: for instance to perform a global analysis of coded material without having to strictly decode it.

My very last word is that I share the general feeling that research in "pure" compression is not a very "hot" research topic anymore. I do believe that research is deeply necessary in signal analysis in order to offer cleverer services to the user. One may even hope that one day terms like "multimedia" or "interactivity" will not be hackneyed anymore but will really meet the user's expectations.


Appendices


Appendix A

VLBR-like video sequences

This appendix briefly presents the video sequences that are used to demonstrate the designed algorithms and illustrate the text. These sequences have been distributed in the framework of MPEG-4, for the tests held in Dallas, in November 1995 [97].

Eleven different sequences are used, coming out of three classes of sequences. Class A (figure A.1) addresses sequences with low motion and low spatial texture: "Akiyo" is a speaker presenting the news; "Container" is a short film showing a container boat sailing slowly through the screen; "Hall Monitor" is a security camera controlling what happens in a corridor and "Mother & Daughter" shows two people having a video phone conversation.

Class B (figure A.2) is made of sequences including either a larger amount of movement or more textures. The "Coastguard" watches the activity on a river; the "Foreman" speaks in front of a building with a very unstable mobile camera; two speakers present the "News" with a movie in the background and "Silent Voice" shows a disabled woman telling her friends in sign language that she is going to Paris.

Class C (figure A.3) is normally intended to be coded at higher bitrates but is used here because of its pertinence for testing motion algorithms. All the sequences contain intensive motion and high spatial activity. "Mobile & Calendar" is a child's room scene, while "Stefan" and "Table Tennis" are two sport movies.

All these images are presented and used (sometimes not, but this is then mentioned within the text) in QCIF format (176×144 pels).



This quarter-CIF format has been specifically designed for very low bitrate communication (cf. table 1.1).

Figure A.1: The MPEG-4 "Class A" test sequences: Akiyo, Container, Hall Monitor, Mother & Daughter



Figure A.2: The MPEG-4 "Class B" test sequences: Coastguard, Foreman, News, Silent Voice



Figure A.3: Some MPEG-4 "Class C" test sequences: Mobile & Calendar, Stefan, Table Tennis


Appendix B

Rate Distortion theory

As the title of Berger's book announces, Rate Distortion theory offers a mathematical basis for data compression [6]. It is indeed defined as:

Rate Distortion Theory: The theoretical discipline that treats data compression from the point of view of information theory.

Information Theory: Mathematical theory dealing with the more fundamental aspects of communication systems.

In practice, Rate Distortion theory aims at addressing two problems: what information should be transmitted, and how should it be transmitted?

From the scheme of figure B.1, it is clear that Rate Distortion theory is concerned with the relation between the channel capacity C and the distortion of Y with regards to X. The problem may then be formulated as follows: "given the source, the user and the channel, under what conditions is it possible to design a system that reproduces the source output for the user with an average distortion that does not exceed some specified upper limit D?" The answer is a curve like the one of figure B.2, which guarantees a quality superior or equal to D if and only if C > R(D).

In order to account for the appearance and the properties of the R(D) curve, appendix B will first revise some basic notions of Information Theory [38]. Then, the Rate Distortion function will be introduced in the context of discrete memoryless sources and single-letter distortion, according to Berger's book [6]. Some properties of the curve will also be discussed.



Figure B.1: Typical transmission chain: the source output X passes through a source encoder and a channel encoder, crosses a channel of capacity C, and is reconstructed as Y by the channel decoder and the source decoder before reaching the user.

Figure B.2: Typical shape of the rate distortion function R(D), decreasing from R(0) at D = 0.



B.1 Information Theory

Let us consider an alphabet $A_M$ of size M, with letters $a_j$:

$$A_M = \{a_0, a_1, \ldots, a_{M-1}\} = \{0, 1, \ldots, M-1\} \qquad (B.1)$$

The probability distribution of this alphabet is a function $p(\cdot)$:

$$p(\cdot) : A_M \rightarrow [0, 1], \qquad \sum_{j=0}^{M-1} p(j) = 1 \qquad (B.2)$$

Based on the finite ensemble $(A_M, p)$, a random variable (r.v.) may be defined: the identity r.v. $X(\cdot)$ ($X(j) = j$ assumes the letter j with a probability p(j)). A r.v. is a real r.v. if it ranges in (a subset of) the real line. The expected value of a real r.v. $f(\cdot)$ is defined as:

$$E[f] = \sum_{j=0}^{M-1} p(j)\, f(j) \qquad (B.3)$$

The information contained in a message is related to the probability of occurrence of this particular message. More precisely, the information is proportional to the inverse of the message probability. For instance, to hear a neighbor telling you that the moon has fallen last night brings much more information than to hear him/her saying "Hello". The probability of the first event, namely the falling of the moon, is indeed very low, and it offers a lot of (surprising because improbable) information in comparison to the other one. The self-information I of an event j is thus defined as:

$$I(j) = \log \frac{1}{p(j)} = -\log p(j), \qquad (B.4)$$

and is expressed in nat if the natural logarithm is used, or in Shannon if $\log_2$ is used. In the latter case, the bit is also commonly used. The logarithm is used to make the information provided by two events inversely proportional to the product of their probabilities.

The expected value of the information is defined as the entropy of the source:

$$H(X) = -\sum_{j=0}^{M-1} p(j) \log p(j) \qquad (B.5)$$



measures the average a priori uncertainty on X.

Considering two alphabets $A_M$ and $A_N$, one can define a product space $A_{MN}$:

$$A_{MN} = \{(j,k) \mid j \in A_M,\ k \in A_N\} \qquad (B.6)$$

The joint distribution p(j,k) and the joint ensemble $(A_{MN}, p(j,k))$ help defining the marginal distributions:

$$p(j) = \sum_{k=0}^{N-1} p(j,k), \qquad q(k) = \sum_{j=0}^{M-1} p(j,k), \qquad (B.7)$$

and the conditional distributions:

$$p(j|k) = \frac{p(j,k)}{q(k)}, \qquad q(k|j) = \frac{p(j,k)}{p(j)} \qquad (B.8)$$

A r.v. Z that assumes the event (j,k) with a probability p(j,k) may be defined as $Z = (X, Y)$ with $X : p(j)$ and $Y : q(k)$. The conditional self-information is the information one receives when told that the event X = j has occurred if one already knows the occurrence of the event Y = k:

$$I(j|k) = -\log p(j|k), \qquad (B.9)$$

and the conditional entropy is:

$$H(X|Y) = -\sum_{j,k} p(j,k) \log p(j|k) \qquad (B.10)$$

The mutual information of both r.v.'s X and Y is thus:

$$I(j;k) = I(j) - I(j|k) = I(k;j), \qquad (B.11)$$

and the average mutual information or average amount of transmitted information:

$$H(X;Y) = H(X) - H(X|Y) = \sum_{j,k} p(j,k) \log \frac{p(j,k)}{p(j)\,q(k)} \qquad (B.12)$$
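For illustration, the entropy of equation (B.5) and the average mutual information of equation (B.12) can be computed numerically as follows (minimal sketch):

```python
import numpy as np

def entropy(p):
    """Entropy of a distribution p(j), equation (B.5), in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                             # convention: 0 log 0 = 0
    return -(p[nz] * np.log2(p[nz])).sum()

def average_mutual_information(p_jk):
    """Average transmitted information of equation (B.12), in bits."""
    p_jk = np.asarray(p_jk, dtype=float)
    p_j = p_jk.sum(axis=1, keepdims=True)  # marginal p(j) of equation (B.7)
    q_k = p_jk.sum(axis=0, keepdims=True)  # marginal q(k) of equation (B.7)
    nz = p_jk > 0
    return (p_jk[nz] * np.log2(p_jk[nz] / (p_j * q_k)[nz])).sum()

# A noiseless binary channel transmits the full bit of source entropy:
p_jk = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
assert abs(average_mutual_information(p_jk) - 1.0) < 1e-12
```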



B.2 Discrete Memoryless Sources and Single-Letter Distortion

Let us consider a discrete parameter family of r.v.'s $\{X_t;\ t = 0, 1, 2, \ldots\}$ and its probability distribution $p_t(\cdot)$. Its entropy is

$$H(X_t) = -\sum_j p_t(j) \log p_t(j) \qquad (B.13)$$

If t becomes a time index $t = (t_1, t_2, t_3, \ldots, t_n)$, then $p_t(x)$ denotes the probability of the vector r.v. $X_t = (X_{t_1}, X_{t_2}, \ldots)$ to equal the vector $x = (x_1, x_2, \ldots)$. The corresponding entropy is:

$$H(X_t) = -\sum_{\text{all } x} p_t(x) \log p_t(x) \qquad (B.14)$$

The random sequence $\{X_t\}$ is called a discrete source and $A_M$ is called the source alphabet. $x$ ($x \in A_M^n$) is a source word of length n. Such a source is known to be stationary if $\forall k, t, x, n : p_{t+k}(x) = p_t(x)$. The probability may then be simply written p(x) and the entropy H(X). The entropy rate of a stationary source is defined as:

$$H = \lim_{n \to \infty} n^{-1} H(X) = \lim_{n \to \infty} n^{-1} H(X_1, \ldots, X_n) \qquad (B.15)$$

The sequence of r.v.'s $\{Z_t\} = \{X_t, Y_t\}$ also constitutes a discrete source. If such a joint source is assumed stationary, then its entropy rate is:

$$H' = \lim_{n \to \infty} n^{-1} H(Y) + \lim_{n \to \infty} n^{-1} H(X|Y), \qquad (B.16)$$

where H(X|Y) is the average uncertainty, and $\lim_{n \to \infty} n^{-1} H(X|Y)$ the equivocation or average rate at which the information is lost while going through the system.
going through the system.<br />

With regards to figure B.1, a discrete memoryless channel (d.m.c.) with input alphabet $A_{\tilde M}$ and output alphabet $A_{\tilde N}$ is described completely by specifying for every ordered pair (j,k) the conditional or transition probability $\tilde q(k|j)$ that the letter k appears at the channel output when the input letter is j. A channel is said to be memoryless if it processes the successive letters of an input word $\tilde x = (\tilde x_1, \ldots, \tilde x_n)$ independently from one another:

$$\tilde q(\tilde y|\tilde x) = \prod_{t=1}^{n} \tilde q(\tilde y_t|\tilde x_t) \qquad (B.17)$$

Every probability distribution $\tilde p(\cdot)$ of the input alphabet defines a joint input-output distribution $\tilde p(j,k) = \tilde p(j)\,\tilde q(k|j)$ and also an output probability distribution $\tilde q(k) = \sum_{j=0}^{\tilde M - 1} \tilde p(j,k)$. The capacity of the channel is defined by the relation

$$C = \max H(\tilde X; \tilde Y) = \max \sum_{j,k} \tilde p(j,k) \log \frac{\tilde p(j,k)}{\tilde p(j)\,\tilde q(k)}, \qquad (B.18)$$

where the maximum is taken with respect to all possible choices of the input distribution $\tilde p(\cdot)$. Shannon's "noisy coding theorem" states that "for a discrete memoryless channel of capacity C and a discrete stationary source with entropy H, the source may be encoded over the channel with an arbitrarily small number of errors if $H \le C$; while if $H > C$, it is impossible to encode it with an equivocation less than $H - C$".

A discrete memoryless source (d.m.s.) is a stationary source that satisfies one additional requirement:

$$p(x) = \prod_{t=1}^{n} p(x_t) \qquad (B.19)$$

It means that the successive letters generated by a d.m.s. are independent and identically distributed (i.i.d.) r.v.'s. To evaluate the quality of the reconstruction of the r.v. $\{X_t; p\}$, a word distortion measure $\rho_n(x,y)$ that specifies the penalty charged for reproducing the source word x by the output vector y must be designed. $\rho_n(x,y)$ is a non-negative cost function. A fidelity criterion is a sequence of word distortion measures:

$$F = \{\rho_n(x,y);\ 1 \le n \le \infty\} \qquad (B.20)$$

A fidelity criterion is called a single-letter fidelity criterion if

$$\rho_n(x,y) = \frac{1}{n} \sum_{t=1}^{n} \rho(x_t, y_t) \qquad (B.21)$$



B.3 Rate Distortion Function

The Rate Distortion function R(D) (cf. figure B.2) specifies the minimum rate that enables one to reproduce the source with an average distortion that does not exceed D. To design the R(D) of a d.m.s. with respect to a single-letter fidelity criterion F, one needs to know the average distortion d(q) associated with the transition probabilities q(k|j) of the channel:

$$d(q) = \sum_{j,k} p(j)\, q(k|j)\, \rho(j,k) \qquad (B.22)$$

The channel defined by q(k|j) is called D-admissible if and only if $d(q) \le D$. $Q_D$ denotes the set of all D-admissible transition probability assignments: $Q_D = \{q(k|j) \mid d(q) \le D\}$. In parallel, every assignment gives rise to an average mutual information:

$$H(q) = \sum_{j,k} p(j)\, q(k|j) \log \frac{q(k|j)}{q(k)} \qquad (B.23)$$

The rate distortion function R(D) of $\{X_t; p\}$ with respect to F is then defined by:

$$R(D) = \min_{q \in Q_D} H(q), \qquad (B.24)$$

with $D \in [0, \infty)$. Since the source is given and not the channel, such an equation determines the minimum rate at which the information about the source must be conveyed to the user in order to achieve a prescribed fidelity (the reverse problem - given channel - generates a distortion rate function). It means that $R(D) \le C$ is a necessary condition for the existence of a communication system that operates with fidelity D.
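As an illustration, points of the curve defined by equation (B.24) can be computed numerically with the classical Blahut-Arimoto iteration; this algorithm is not discussed in the text and is given here only as a minimal sketch:

```python
import numpy as np

def rate_distortion_point(p, rho, s, n_iter=200):
    """One (D, R) point of the R(D) curve for a d.m.s. p(j), a single-letter
    distortion matrix rho[j, k] and a slope parameter s < 0; sweeping s
    traces the whole convex curve. R is returned in bits."""
    M, N = rho.shape
    q_k = np.full(N, 1.0 / N)                  # initial output marginal
    for _ in range(n_iter):
        q_kj = q_k[None, :] * np.exp(s * rho)  # candidate channel q(k|j)
        q_kj /= q_kj.sum(axis=1, keepdims=True)
        q_k = p @ q_kj                         # updated output marginal
    D = (p[:, None] * q_kj * rho).sum()                           # eq. (B.22)
    R = (p[:, None] * q_kj * np.log2(q_kj / q_k[None, :])).sum()  # eq. (B.23)
    return D, R

# Binary symmetric source with Hamming distortion: the points obtained
# this way follow the known closed form R(D) = H(p) - H(D).
p = np.array([0.5, 0.5])
rho = np.array([[0.0, 1.0],
                [1.0, 0.0]])
D, R = rate_distortion_point(p, rho, s=-3.0)
```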

As illustrated on figure B.2, R(D) is a monotonic, decreasing, convex-U function in the interval of interest from D = 0 to $D = D_{max}$, and $R(D) = 0$ for $D = D_{max}$. $D_{max}$ is the average distortion achieved when only the statistics of the source are known but nothing at all about the particular realization of the source has been transmitted. In this case, the decoder can only guess what the source could have been.



B.4 Extension to Moving Pictures

Figure B.3: Basic predictor: the prediction of the signal s is subtracted from it to yield the error e.

The Rate Distortion function that has just been presented may be extended to various sources [6]. Since it is our interest here, one will detail how and under which assumptions it is possible to apply this theory to (moving) images. But let us begin with a note on the basic structure of predictive coding and its main properties (following Tziritas and Labit [127]).

B.4.1 Note on Predictive Coding

A very simple predictor, like the one of figure B.3, offers a prediction gain

$$G_p = \frac{\sigma_s^2}{\sigma_e^2}, \qquad (B.25)$$

where $\sigma_s^2$ (resp. $\sigma_e^2$) is the variance of the signal s (resp. the error signal e). If the predictor is optimum, such a gain is always greater than or equal to 1. The entropy gain for the prediction error e with respect to the entropy of signal s is $\frac{1}{2} \log G_p$ (because $\frac{1}{2} \log \sigma^2$ is commonly admitted as an approximation for the entropy H).
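As a numeric illustration of equation (B.25) (not taken from the text), a first-order autoregressive signal with correlation rho and its optimal one-tap predictor yield a gain $G_p = 1/(1 - \rho^2)$:

```python
import numpy as np

def prediction_gain_ar1(rho=0.95, n=100_000, seed=0):
    """Empirical G_p for s_t = rho * s_{t-1} + w_t predicted by rho * s_{t-1};
    the returned value approaches 1 / (1 - rho**2), about 10.3 for rho=0.95."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    s = np.zeros(n)
    for t in range(1, n):
        s[t] = rho * s[t - 1] + w[t]
    e = s[1:] - rho * s[:-1]      # prediction error (here simply w_t)
    return s.var() / e.var()      # equation (B.25)
```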

In order to obtain a higher compression ratio, a quantizer must be included. As quantization introduces distortion, the decoder will not be able to reconstruct the signal s, but a signal $\bar s$ instead. The quantizer is thereafter included in the prediction loop so as to avoid error propagation and to obtain the same prediction at the coder and the decoder (Figure B.4).

The distortion induced by the quantization is equal to

$$\varepsilon = s - \bar s = e + \hat s - (\bar e + \hat s) = e - \bar e \qquad (B.26)$$


Figure B.4: Predictive coder with quantization (the quantizer and the coder process the error e; the predictor operates inside the loop)

This equation means that the coding distortion is equal to the quantization error. Quantizing with N quantization levels a variable whose variance is \sigma^2 achieves a mean quadratic distortion

D = \frac{A\,\sigma^2}{N^2}, \qquad (B.27)

with A a constant parameter depending on the source probability distribution. With the approximation that A_s \approx A_e, and for a specific bitrate (i.e. N_s = N_e), the distortion ratio is equal to the prediction gain:

\frac{D_s}{D_e} \approx \frac{\sigma_s^2}{\sigma_e^2} = G_p. \qquad (B.28)

Predictive coding reduces the distortion by a factor equal to the prediction gain.
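As a quick numerical illustration (not in the original text): a prediction gain G_p = 10 divides the distortion by ten at a fixed bitrate or, equivalently, saves about \frac{1}{2}\log_2 10 \approx 1.66 bit per sample at a fixed distortion.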

Considering now two successive images I(x, y, t-1) and I(x, y, t), one can advance the hypotheses that the image signal is stationary and zero-mean, and that the displacement vector (u, v) is a constant parameter over the whole image. One assumes that the covariance function of I(x, y, t) is separable in time and space. Figure B.5 introduces the generic structure of a temporal prediction, where the filter h(x, y) is linear and shift-invariant.

Figure B.5: Generic structure of temporal prediction (I(x, y, t-1) is shifted by \delta(x-u, y-v), filtered by h(x, y) and subtracted from I(x, y, t) to produce e(x, y))

\Phi_I(f_x, f_y) is the power spectral density of the spatial part of I(x, y, t). A single temporal correlation coefficient exists then between the two images, and is equal to:

c = \frac{E[I(x,y,t)\, I(x-u, y-v, t-1)]}{E[I^2(x,y,t)]}. \qquad (B.29)

The power spectral density function of e(x, y) is

\Phi_e(f_x, f_y) = \left(1 + |H(f_x, f_y)|^2 - 2c\, \mathrm{Re}\{H(f_x, f_y)\, e^{j2\pi(f_x u + f_y v)}\}\right) \Phi_I(f_x, f_y). \qquad (B.30)



B.4.2 Intra Images

Considering memoryless coding, i.e. without spatial processing, and a mean quadratic criterion, the rate distortion function of an image coded with only intra quantization is given by:

R = \frac{1}{2}\log_2\frac{\sigma_I^2}{D} \quad \text{for } 0 \le D \le \sigma_I^2, \qquad R = 0 \quad \text{otherwise}. \qquad (B.34)

B.4.3 Inter Images without Motion Compensation

In the same framework of memoryless coding, the function becomes

R = \frac{1}{2}\log_2\left(\frac{2\sigma_I^2\left(1 - e^{-2\pi f_0 \sqrt{u^2+v^2}}\right)}{D} + 1\right), \qquad (B.35)

where f_0 \approx 0.05 (equation (4.23) out of Labit and Tziritas [127]). One can notice that the prediction gain is greater than 1 only if

2\,\Gamma_I(u, v) > \sigma_I^2, \qquad (B.36)

where \Gamma_I(x, y) is the spatial covariance of I(x, y, t).

B.4.4 Inter Images with Motion Compensation

If motion compensation is used, the rate distortion function is (equation (4.30) out of Labit and Tziritas [127])

R = \int_0^{1/\sqrt{\pi}} f \log_2\left(\frac{2\left(1 - \rho(f_x, f_y)\right)\Phi_I(f_x, f_y)}{D} + 1\right) df, \qquad (B.37)

where f = \sqrt{f_x^2 + f_y^2}, and

\rho(f_x, f_y) = \frac{1}{\left((\sigma_d f)^2 + 1\right)^{3/2}}. \qquad (B.38)

Such a characteristic function is obtained by assuming an isotropic probability density function of the motion estimation error (d_x, d_y):

p(d_x, d_y) = \frac{2\pi}{\sigma_d^2}\, e^{-\frac{2\pi}{\sigma_d}\sqrt{d_x^2 + d_y^2}}. \qquad (B.39)
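Equations (B.37)-(B.38) can be evaluated by straightforward numerical integration. The sketch below is an added illustration (not thesis code): it integrates (B.37) with the trapezoidal rule, assumes an isotropic \Phi_I (a function of f only), and uses a flat spectrum \Phi_I \equiv 1 as an arbitrary stand-in for the true image spectrum:

#include <cmath>
#include <cstdio>

static const double PI = 3.14159265358979;

// Characteristic function of the motion estimation error, equation (B.38).
static double rho(double f, double sigma_d)
{
    double a = sigma_d * f;
    return 1.0 / std::pow(a * a + 1.0, 1.5);
}

// Rate of equation (B.37), by trapezoidal integration over f in [0, 1/sqrt(pi)].
// The spectrum phiI is assumed isotropic, which keeps the radial integral
// one-dimensional.
static double rate(double D, double sigma_d, double (*phiI)(double))
{
    const int n = 1000;
    const double fmax = 1.0 / std::sqrt(PI);
    double sum = 0.0;
    for (int i = 0; i <= n; i++) {
        double f = fmax * i / n;
        double g = f * std::log2(2.0 * (1.0 - rho(f, sigma_d)) * phiI(f) / D + 1.0);
        sum += (i == 0 || i == n) ? 0.5 * g : g;
    }
    return sum * fmax / n;
}

static double flatSpectrum(double) { return 1.0; }   // arbitrary stand-in

int main()
{
    // The achievable rate decreases as the motion estimation error decreases.
    double sigmas[] = { 4.0, 2.0, 1.0, 0.5 };
    for (int i = 0; i < 4; i++)
        std::printf("sigma_d = %.1f -> R = %.4f bit\n",
                    sigmas[i], rate(0.01, sigmas[i], flatSpectrum));
}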


Appendix C

Markov Random Fields

The use of Markovian techniques in digital image processing is based on the possibility of modeling the image texture, the noise, and even some other criteria, as stochastic processes. Since their remarkable use in image restoration by Geman and Geman [39], the theory of Markovian processes [99] has been extended to Markov Random Fields (MRF) [40], and their field of application extended to images [137] and motion fields [37, 109].

The present appendix aims at giving a quick overview of MRF and the different concepts involved in their use. It first presents Bayesian modeling, which is often related to MRF. Then, image description and neighborhood relationships introduce the definition of MRF. The link with the Gibbs distribution is also tackled. Finally, the possible algorithms to solve the overall minimization problem are referred to.

C.1 Bayesian Model

Bayesian modeling assumes an a priori statistical distribution for the solution of estimation problems. This a priori model (or prior model), p(u), is a probabilistic description of the solution u, or of its properties, before any sensor data is collected.

A second component, the sensor model (or likelihood model), p(d|u), is a description of the noisy or stochastic processes that connect the original (unknown) state u to the sampled input image or sensor values, the data d. Using Bayes' rule, both can be combined in order to obtain an a posteriori model (or posterior model), p(u|d), which describes the estimate of the solution u according to the collected data d:

p(u|d) = \frac{p(d|u)\, p(u)}{p(d)}, \qquad (C.1)

where p(d) = \sum_u p(d|u)\, p(u).

Bayesian modeling is often used to determine the Maximum A Posteriori (MAP) estimate, i.e. the value of the solution u that maximizes the conditional probability p(u|d). The Maximum Likelihood (ML) estimate does not statistically describe u: it considers it as a vector of parameters. The ML is a special case of the MAP in which the a priori model p(u) is constant (uniform distribution); the aim then becomes to maximize the conditional probability p(d|u).

Working with the logarithm of the posterior density, one obtains:

\log p(u|d) = \log p(d|u) + \log p(u) - \log p(d), \qquad (C.2)

where the last term does not depend on u and can be neglected in maximization processes with respect to u. The MAP estimate is thus given by

\left[\frac{\partial \log p(d|u)}{\partial u} + \frac{\partial \log p(u)}{\partial u}\right]_{u = \hat{u}_{MAP}} = 0, \qquad (C.3)

where u = \hat{u}_{MAP} is the solution at the maximum. Similarly, the ML estimate satisfies, with u = \hat{u}_{ML} the solution at the maximum:

\left[\frac{\partial \log p(d|u)}{\partial u}\right]_{u = \hat{u}_{ML}} = 0. \qquad (C.4)
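A minimal scalar sketch contrasting the two estimates (an added illustration; the Gaussian prior and the additive Gaussian noise model are assumptions chosen for this example, not taken from the text):

#include <cstdio>

int main()
{
    // Model: d = u + n, with noise n ~ N(0, s2) and prior u ~ N(mu0, s02).
    // Setting the derivatives of (C.3) and (C.4) to zero gives closed forms.
    double d = 3.0;                     // observed datum
    double s2 = 1.0;                    // noise variance
    double mu0 = 0.0, s02 = 0.25;       // prior mean and variance

    double u_ml  = d;                                   // from (C.4)
    double u_map = (s02 * d + s2 * mu0) / (s02 + s2);   // from (C.3)

    // The prior pulls the MAP estimate towards mu0; with a flat prior
    // (s02 -> infinity) the MAP estimate tends to the ML one.
    std::printf("u_ML = %.3f, u_MAP = %.3f\n", u_ml, u_map);
}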

C.2 Markov Random Fields


Let us consider an image I, in which pixels define a lattice S of sites s:

S = \{\, s = (i, j) \,\}. \qquad (C.5)

To every site s is associated a random variable [99] A_s, whose values a_s belong to a set \Lambda. For instance, \Lambda = \{0, ..., 255\} represents the possible luminance values of a black and white picture, and \Lambda = \{0, ..., 255\}^q the values that can be associated with any pixel of a multispectral image with q channels.


Figure C.1: Homogeneous neighborhoods of order n = 1, 2, 3

The image can then be considered as a random vector A = (A_s, s \in S), of which the vector a = (a_s, s \in S) is a particular realization.

p(A = a) = p(a)

is the probability of configuration a. In fact, it is a joint probability p(A_s = a_s, s \in S).

p(A_s = a_s) = p(a_s)

is the marginal law of A_s.

A neighborhood system N = (N_s, s \in S) is made out of parts N_s of S with the following properties:

s \notin N_s, \qquad (C.6)

s \in N_t \Leftrightarrow t \in N_s. \qquad (C.7)

The set N_s is called the neighborhood of s; t is said to be a neighbor of s if t \in N_s. Among the different types of neighborhoods, the homogeneous ones (figure C.1) are characterized by their order n:

N_s^n = \{\, t \mid \|s - t\|^2 \le k_n,\; t \ne s \,\}, \qquad (C.8)

with k_n taking the values 1, 2, 4, 5, 8, ... for n = 1, 2, 3, 4, 5, ...
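A short added sketch of how (C.8) generates these neighborhoods (illustration only):

#include <cstdio>

int main()
{
    // Offsets (di, dj) belonging to the order-n homogeneous neighborhood
    // of a site, according to (C.8): 0 < di*di + dj*dj <= k_n.
    const int kn[] = { 1, 2, 4, 5, 8 };          // k_n for n = 1..5
    for (int n = 1; n <= 5; n++) {
        int count = 0;
        for (int di = -3; di <= 3; di++)
            for (int dj = -3; dj <= 3; dj++) {
                int d2 = di * di + dj * dj;
                if (d2 > 0 && d2 <= kn[n - 1]) count++;
            }
        // n = 1 gives the 4-connected and n = 2 the 8-connected neighbors.
        std::printf("order %d: %d neighbors\n", n, count);
    }
}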

A random field A is a Markov Random Field (MRF) associated with the system N if and only if:

p(a) > 0, \qquad (C.9)

p(a_s \mid a_t,\; t \in S - \{s\}) = p(a_s \mid a_t,\; t \in N_s). \qquad (C.10)

This means that the conditional law of any site depends only on the small number of its neighbors.

A clique c is a subset of S which is related to the neighborhood system N according to the following properties:

- c is a singleton (pixel site), or
- any two pixels of c are neighbors with respect to N (\forall s, t \in c : s \ne t \Rightarrow s \in N_t).

C is the set of all cliques of S. The order of a clique depends on the number of its elements. The cliques for homogeneous neighborhoods of order 1 and 2 are depicted on figure C.2.

Figure C.2: Cliques for homogeneous neighborhoods of order 1 and 2

C.3 Gibbs measure and Markov fields

Considering the neighborhood system N = \{N_s, s \in S, N_s \subset S\} and the set C of cliques defined on N, the Hammersley-Clifford theorem [7] demonstrates that a random field A is a Markov field associated with the neighborhood system N if and only if its probability distribution p(A = a) is a Gibbs measure defined by:

\forall a \in \Lambda^S, \quad p(a) = \frac{e^{-E_p(a)/T_p}}{Z_p}, \qquad (C.11)

where T_p is a temperature that controls the degree of "peaking", Z_p is a normalization constant (the "partition function", Z_p = \sum_a e^{-E_p(a)/T_p}), and E_p a function called "energy", defined by:

E_p(a) = \sum_{c \in C} V_c(a/c). \qquad (C.12)
c2C<br />

The potential functions Vc are arbitrary functions which only depend<br />

on the elements a belonging to the clique c (which is referred to as the<br />

notation a=c). They help the system designer to clearly de ne its a priori<br />

knowledge about the neighborhood of the Markovian model. Generally,<br />

there are two kinds of potential functions combined so as to obtain a<br />

good model <strong>for</strong> the problem to solve:<br />

Potential functions that ensures the link with the data. They<br />

de ne the properties that the pixels of a same clique must have.<br />

Potential functions that introduce regularization constraints inside<br />

cliques in order to obtain a smooth <strong>and</strong> coherent nal solution.<br />
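As an added sketch of (C.12) combining these two kinds of potentials (the quadratic data term and the Potts-like smoothness penalty are common choices, not taken from the thesis):

#include <vector>

// Energy E_p(a) of (C.12) on a 4-connected lattice (order-1 cliques):
// singleton cliques carry the data term, pair cliques the smoothness term.
static double energy(const std::vector<int>& a,     // current labeling
                     const std::vector<int>& d,     // observed data
                     int w, int h, double lambda)   // lambda: regularization
{
    double E = 0.0;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++) {
            int s = i * w + j;
            double diff = a[s] - d[s];
            E += diff * diff;                                 // link with the data
            if (j + 1 < w) E += lambda * (a[s] != a[s + 1]);  // right pair clique
            if (i + 1 < h) E += lambda * (a[s] != a[s + w]);  // bottom pair clique
        }
    return E;
}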

C.4 System solution

Several methods to obtain the MAP estimate exist. They are only referred to here; the problem of choosing the right one has to be tackled on concrete problems. Simulation methods include the Gibbs sampler introduced by Geman and Geman [39] and the Metropolis algorithm [81], while optimization methods include simulated annealing (SA, [55]), the Iterated Conditional Modes (ICM, [8]) and the High Confidence First (HCF, [15]). Deterministic methods (like the two last ones) are often preferred because of their fast computation. Nevertheless, one has to be aware that they do not guarantee to reach the absolute maximum, but only some local maximum.
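As an added sketch of the deterministic flavour, here is an ICM-style [8] greedy sweep over an energy of the kind sketched above (an illustration, not the thesis implementation):

#include <vector>

// One ICM sweep: at each site, keep the label in {0..L-1} that minimizes
// the local energy; iterating sweeps until no label changes yields a local
// maximum of the posterior (local minimum of E_p).
static bool icm_sweep(std::vector<int>& a, const std::vector<int>& d,
                      int w, int h, int L, double lambda)
{
    bool changed = false;
    for (int s = 0; s < w * h; s++) {
        int best = a[s];
        double bestE = 1e30;
        for (int label = 0; label < L; label++) {
            double diff = label - d[s];
            double E = diff * diff;                 // data term
            int i = s / w, j = s % w;               // smoothness term (4-conn.)
            if (j > 0)     E += lambda * (label != a[s - 1]);
            if (j + 1 < w) E += lambda * (label != a[s + 1]);
            if (i > 0)     E += lambda * (label != a[s - w]);
            if (i + 1 < h) E += lambda * (label != a[s + w]);
            if (E < bestE) { bestE = E; best = label; }
        }
        if (best != a[s]) { a[s] = best; changed = true; }
    }
    return changed;
}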


Appendix D

Complements to Chapter 5

D.1 Triangulation

A triangulation process allows one to obtain the partition of an image from a given set of points. Nevertheless, several different triangulations can be produced from the same set of points. Some criteria thus have to be chosen in order to define the triangulation in a unique and optimal way. These criteria can be purely geometrical, like in the Delaunay case, or take some other initial data into account (reconstruction error, surface energy, convexity, ...).

This section aims at briefly introducing the basic definitions of triangulation, as well as the properties of the Delaunay triangulation that is used in the present work.

D.1.1 Definitions

Definition 1  G = (V, S) designates a graph made of the sets:

- V = \{v_i \mid 1 \le i \le N\}, which is the set of points (or nodes, or vertices);
- S = \{s_i \mid 1 \le i \le m\}, which contains a set of edges such that s_i \cap s_j \in \{V, \emptyset\}.

Definition 2  A triangulation T(G) of a given graph G = (V, S) is a graph G' = (V, S'), where S \subset S'.

Lemma 1  The triangulation T(G) of a given graph G = (V, S) includes 2(N - 1) - B triangles and 3(N - 1) - B edges, where B is the number of vertices of the convex envelope of the set of points V.
of vertices of the convex envelope of the set of points V .


D.1.2 Delaunay Triangulation

A Delaunay triangulation D of V is the geometric dual of the Dirichlet (or Voronoi, or Thiessen) tessellation P constructed on V. Such a tessellation divides the plane into polygonal regions, called tiles. Each tile P_i contains all the points of the plane closer to the tile generating point v_i than to the other generating points v_k (in the sense of the Euclidean distance):

P_i = \{\, N \in \mathbb{R}^2 \mid \forall k \ne i,\; d(N, v_i) \le d(N, v_k) \,\}. \qquad (D.1)

Figure D.1: Graph and associated (constrained) Delaunay triangulation

Delaunay triangulation is optimal from the interpolation point of view, for its triangles are as equiangular as possible ("locally equiangular"). It avoids having "flat" triangles, which are not good for spline interpolation, where the approximation error depends on the triangle "thickness". The Delaunay triangulation is defined as:

Definition 3  The generalized Delaunay triangulation of G = (V, S) is a triangulation T(G) = (V, S') for which the circumscribed circle of every triangle \triangle v_i v_j v_k does not contain any other vertex visible from v_i, v_j or v_k. The edges of the set S' - S are called the Delaunay edges. A vertex u is visible from a vertex v if the segment [u, v] does not cut any edge of the set S on an interior point.

Based on this property of the circumscribed circle, and on the fact that a locally optimal triangulation in the sense of Delaunay is globally a Delaunay triangulation, several implementations have been designed (see for instance [17, 119]).

Figure D.1 presents a graph G and its (constrained) Delaunay triangulation. Dashed lines represent the Delaunay edges.

D.2 Pseudo-code of the implementation

The aim of the present section is to provide implementation details of the system discussed in Chapter 5. It consists of a kind of pseudo-code of the main routines, along with the typical thresholds and other parameters. Our implementation uses C++ classes to describe images (class picture), motion vector fields (class motion) and mesh structures (class wireframe). The latter classes are derived from the first one. In the pseudo-code presented here, only a few very intuitive member functions have been kept (height, width, (i,j)). A lot of declarations have also been omitted. We hope this code will help interested people.

D.2.1 Compensation scheme

The overall scheme is made out of five consecutive steps:

- corner detection, whose code is provided in Section D.2.2;
- Delaunay triangulation, which is implemented according to [119];
- establishment of the equation system for inverse kriging interpolation, presented in Section D.2.3;
- resolution of the system with SVD, using the routines provided by [107];
- image warping, which involves some very basic Computer Graphics to determine the value of the pixels of the new image after warping.

Here is the pseudo-code of the scheme. A small trick is to perform all operations on an extended picture, in order to easily manage vectors that point out of the picture.
point out of the picture.


/* Routine that implements the mesh-based compensation of
   a reference image from a block-based motion field */
picture mesh_compensation(picture pReference, motion mField)
{
    /* So as to deal with vectors pointing out of the picture
       (as authorized within the H.263 and MPEG-4 standards),
       the reference image is appropriately extended.
       The rest of the algorithm will always manipulate
       images of this size */
    // extend picture
    mField.give_limits(MinVector, MaxVector);
    if ((MinVector ...))
        ...;

    /* Steps 1 to 3 take place here: corner detection (Section D.2.2),
       Delaunay triangulation [119], and establishment of the equation
       system for inverse kriging interpolation (Section D.2.3) */
    ...;

    /* Solve the system with SVD; routines svdcmp and svbksb are
       provided in [107] */
    svdcmp(A, #vector*2, #corner*2, W, V);
    Wmax = 0.0;
    for (i = 0; i < #corner*2+1; i++)
    {
        if (W[i] < 1.0e-6)
            W[i] = 0;
        if (W[i] > Wmax)
            Wmax = W[i];
    }
    Wmin = Wmax*1.0e-6;
    for (i = 0; i < #corner*2+1; i++)
        if (W[i] < Wmin)
            W[i] = Wmin;
    svbksb(A, W, V, #vector*2, #corner*2, B, x);

    /* Displace vertices while ensuring the connectivity of the mesh.
       It just involves some basic Computer Graphics to implement the
       affine transform and interpolate the new pixel values. Cubic
       spline interpolation [54] has been used in the present work */
    displace(wF, x);
    // extract image from wireframe
    pExtendedOutput = (picture) wF;
    // return to the normal picture size
    pOutput = extract(pExtendedOutput, MaxVector);
    return pOutput;
}

D.2.2 Corner detection

The corner detector of Section 5.2 involves many different routines; only the main ones are presented hereunder. The implementation of Mathematical Morphology operators is best described in the literature [130]. The section first presents the data structures used and the overall routine for corner extraction. It then introduces the core of the edge detection, along with the recursive function for tracking according to the twenty templates.

One important comment is that the order in which the various configurations of the templates of figures 5.7 and 5.9 are searched for is important. According to the type of tracked half-boundary (positive or negative), it should be made in accordance with figure D.2.

/* define types of border points */
#define B 255 // Background
#define N 0   // Negative
#define P 100 // Positive
#define R 200 // inteRnal


Figure D.2: Order of use of the tracking templates for the negative and positive half-boundaries

/* define types of edge feature */
#define O 0   // bOundary
#define G 50  // Gap
#define S 100 // Self-intersection
#define I 150 // Intersection
#define C 200 // Closed loop

/* define the structure of a point on a half-boundary */
struct point;
typedef point* Point;
struct point {
    short contour;   // contour number
    short vpos;      // vertical position in the image
    short hpos;      // horizontal position in the image
    Point next;      // pointer to the next point of the half-b.
    Point previous;  // pointer to the previous point of the half-b.
    double angle;    // for angle measurement
    short href;      // for extraction of highest curvature points
    short feature;   // type of point: O, G, S, I, C
};

/* define the structure of a half-boundary */
struct contour {
    short number;    // reference number of the half-boundary
    short type;      // type (P or N) of the half-boundary
    Point first;     // pointer to the first point of the contour
};

/* Routine that implements the corner detector based on the
   description of Chapter 5, i.e. a modified version of Noble's
   edge detector combined with Najman and Vaillant's measure of
   angles. Typical parameters are given in the calling routine. */


picture detect_corner(picture pImage, short Threshold1, short Threshold2,
                      short MinLength, short MaxLength, short Ag, short Step)
{
    /* initialize images for the morphological operators, the candidate
       points and the detected corners */
    picture pFDER  (pImage.height(), pImage.width());
    picture pCandi (pImage.height(), pImage.width());
    picture pCorner(pImage.height(), pImage.width());
    /* define a structure for memorizing edges */
    contour *edge;
    edge = (contour*) malloc(0);
    // angle threshold: degrees to radians, then cosine
    double Angle = cos(((double)Ag)*PI/180);
    /* Compute the Morphological Erosion-Dilation residue
       as expressed in [94]; place the absolute value in pFDER,
       and achieve a preliminary classification into positive P,
       negative N or internal (ramp) R points in pCorner. */
    Erosion_Dilation_Residue(pImage, pFDER, pCorner);

    // select candidate points according to Threshold1
    for (all pixels at position (i,j))
    {
        if ((pCorner(i,j) == N) && (pFDER(i,j) > Threshold1))
        {
            // then check for a valid P point or R point in the neighborhood
            test = 0;
            for (every neighbor (k,l) of (i,j))
                if (((pCorner(k,l) == P) || (pCorner(k,l) == R)) &&
                    (pFDER(k,l) > Threshold1))
                    test = 1;
            // if no valid neighbor, the point becomes part of the
            // background points
            if (test)
                pCandi(i,j) = N;
            else
                pCandi(i,j) = B;
        }
        else if ((pCorner(i,j) == P) && (pFDER(i,j) > Threshold1))
        {
            // then check for a valid N point or R point in the neighborhood
            test = 0;
            for (every neighbor (k,l) of (i,j))
                if (((pCorner(k,l) == N) || (pCorner(k,l) == R)) &&
                    (pFDER(k,l) > Threshold1))
                    test = 1;
            // if no valid neighbor, the point becomes part of the
            // background points
            if (test)
                pCandi(i,j) = P;
            else
                pCandi(i,j) = B;
        }
        else if ((pCorner(i,j) == R) && (pFDER(i,j) > Threshold1))
            pCandi(i,j) = R;
        else
            pCandi(i,j) = B;
    }

    /* Then, the edge detection proposed by Noble has to be
       performed, with the 20 template configurations for tracking.
       The code of this routine is provided below. */
    track_contours(pCandi, pFDER, Threshold2, &edge);
    /* A measure of angles has to be performed along edges, and
       the highest curvature point must be retained. This is strictly
       based on Najman and Vaillant's [87] technique, with the only
       addition of a MinLength parameter. */
    pCorner = determine_highCurvature(edge, MinLength, MaxLength, Angle);
    /* According to the application, one may then choose to automatically
       add a corner every Step pixels along an edge. At the present stage
       of the algorithm, corners are detected on both the positive and
       the negative half-boundaries. If wished, only pairs of corners may
       be retained and, among pairs, only the positive or the negative
       corners finally used. */
    return pCorner;
}

/* Routine that implements the edge tracking according to the
   12 original templates of Noble [94] + 8 additional ones */
track_contours(picture pCandi, picture pFDER, short Threshold2,
               contour **edge)
{
    /* Extend the image and the associated structure by a factor 2 so
       as to improve the tracking */
    picture pDouble = upsize(pCandi, 2);
    contour *edged;
    edged = (contour*) malloc(0);
    // images to memorize points that are already tracked
    picture pTrace (2*pCandi.height(), 2*pCandi.width());
    picture pTraced(2*pCandi.height(), 2*pCandi.width());
    // contour number
    short num;
    // no points traced yet
    pTrace = 0;
    pTraced = 0;
    num = 1;
    // selecting starting points for tracking
    for (all pixels (i,j))
        // look for points that are not yet traced
        if ((!pTrace(i,j)) && ((pCandi(i,j) == N) || (pCandi(i,j) == P)))
        {
{


            // define the contour type and the opposite type
            type = pCandi(i,j);
            if (type == N)
                typer = P;
            else
                typer = N;
            test = 0;
            // CRITERION 1: value > Threshold2 ?
            if (pFDER(i,j) > Threshold2)
                test = 1;
            // CRITERION 2: no neighbor of same type already traced
            for (every neighbor (k,l) of (i,j))
                if (test)
                    if ((pTrace(k,l)) && (pCandi(k,l) == type))
                        test = 0;

            // CRITERION 3: at least 1 valid neighbor (> Threshold2)
            // of the opposite type?
            if (test)
            {
                test = 0;
                for (every neighbor (k,l) of (i,j))
                    if ((!test) && (pFDER(k,l) > Threshold2) &&
                        (pCandi(k,l) == typer))
                    {
                        test = 1;
                        neighbi = k;
                        neighbj = l;
                    }
            }

            // tracking of this contour can start
            if (test)
            {
                // locate the init point on the double-size image according
                // to the location of the point of opposite type in the
                // normal image
                if (neighbi == i+1)
                    k = 2*i+1;
                else
                    k = 2*i;
                if (neighbj == j+1)
                    l = 2*j+1;
                else
                    l = 2*j;
                // register the init point in the contour structure
                pTraced(k,l) = num;
                edged = (contour *) realloc(edged, num*sizeof(contour));
                edged[num-1].number = num;
                edged[num-1].type = type;
                edged[num-1].first = new point;
                (*edged[num-1].first).contour = num;
                (*edged[num-1].first).vpos = k;
                (*edged[num-1].first).hpos = l;
                (*edged[num-1].first).next = NULL;
                (*edged[num-1].first).previous = NULL;
(*edged[num-1].first).previous = NULL;


                // start tracking on the double-size image:
                // find the next and preceding points of the first one;
                // look for configurations in a precise order!
                // type N
                if (type == N)
                {
                    test = 0;
                    // 1. check for turn-left configurations
                    if (...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 2. check for 45-degree-left configurations
                    if ((!test) && ...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 3. check for straight-line configurations
                    if ((!test) && ...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 4. check for turn-right configurations
                    if ((!test) && ...) {test = 1; fea=O; posvNext=.;
                              poshNext=.; posvPrev=.; poshPrev=.;}
                    // 5. if no configuration has been found yet, it means
                    // that the init point is a GAP -> look for GAPs (only
                    // one neighbor)
                    if ((!test) && ...) {test = 1; fea=G; posvNext=-1;
                              poshNext=-1; posvPrev=.; poshPrev=.;}
                    if ((!test) && ...) {test = 1; fea=G; posvNext=.;
                              poshNext=.; posvPrev=-1; poshPrev=-1;}
                }
                // type P
                if (type == P)
                {
                    // ibidem, with the other order of search: 4,3,2,1,5
                }

                // update point information
                // the point itself
                (*edged[num-1].first).feature = fea;
                // next point
                if (posvNext != -1)
                {
                    (*edged[num-1].first).next = new point;
                    (*(*edged[num-1].first).next).contour = num;
                    (*(*edged[num-1].first).next).vpos = posvNext;
                    (*(*edged[num-1].first).next).hpos = poshNext;
                    (*(*edged[num-1].first).next).next = NULL;
                    (*(*edged[num-1].first).next).previous = edged[num-1].first;
                }
                // previous point
                if (posvPrev != -1)
                {
                    (*edged[num-1].first).previous = new point;
                    (*(*edged[num-1].first).previous).contour = num;
                    (*(*edged[num-1].first).previous).vpos = posvPrev;
                    (*(*edged[num-1].first).previous).hpos = poshPrev;
                    (*(*edged[num-1].first).previous).next = edged[num-1].first;
                    (*(*edged[num-1].first).previous).previous = NULL;
                }
}


            }
            // loop for the next points: this function is provided below
            if ((*edged[num-1].first).next != NULL)
                test = track_forward(edged, type, typer, num, pDouble,
                                     pTraced, (*edged[num-1].first).next);
            // loop for the previous points if not a closed loop
            if (((*edged[num-1].first).previous != NULL) && (test != C))
                track_backward(edged, type, typer, num, pDouble, pTraced,
                               (*edged[num-1].first).previous);
            // go back to the normal-dimension image and contours
            reduction_contour(edge, edged, num, pTrace, pTraced);
            // increment the contour number
            num++;
        }
    // free the whole 'edged' structure
    ...;
    return;
}
}<br />

/* Recursive tracking of half-boundaries in the forward direction */
short track_forward(contour *edged, short type, short typer, short num,
                    picture pDouble, picture pTraced, Point current)
{
    short i, j, k, l;
    short posv, posh;
    short test;
    /* Determine where the contour comes from, in order to
       test only the adequate templates */
    k = (*current).vpos;
    l = (*current).hpos;
    i = k - (*(*current).previous).vpos;
    j = l - (*(*current).previous).hpos;
    test = 0;
    // track the contour according to the templates (types on pDouble)
    case i = ..
    case j = ..
    {
        // type N
        if (type == N)
        {
            // turn left?
            if (...) {test=1; posv=.; posh=.;}
            // 45 degrees left?
            if ((!test) && ...) {test=1; posv=.; posh=.;}
            // straight on?
            if ((!test) && ...) {test=1; posv=.; posh=.;}
            // turn right?
            if ((!test) && ...) {test=1; posv=.; posh=.;}
        }
        else
        // type P: ibidem, from right to left
        {}
    }

    // the point is now traced
    pTraced(k,l) = num;
    if (test)
    {
        // self-intersection?
        if (pTraced(posv,posh) == num)
        {
            // closed loop?
            if (((*edged[num-1].first).vpos == posv) &&
                ((*edged[num-1].first).hpos == posh))
            {
                (*current).feature = C;
                // the next point is the first one
                (*current).next = edged[num-1].first;
                // update the previous point of the first one
                delete (*edged[num-1].first).previous;
                (*edged[num-1].first).previous = current;
                return C;
            }

            else
            {
                (*current).feature = S;
                // the next point is already in the contour
                (*current).next = new point;
                (*(*current).next).contour = num;
                (*(*current).next).vpos = posv;
                (*(*current).next).hpos = posh;
                (*(*current).next).next = NULL;
                (*(*current).next).previous = current;
                return S;
            }
        }
        // intersection
        else if (pTraced(posv,posh))
        {
            (*current).feature = I;
            // the next point belongs to another contour
            (*current).next = new point;
            (*(*current).next).contour = pTraced(posv,posh);
            (*(*current).next).vpos = posv;
            (*(*current).next).hpos = posh;
            (*(*current).next).next = NULL;
            (*(*current).next).previous = current;
            return I;
        }


        // normal case
        else
        {
            (*current).feature = O;
            // next point
            (*current).next = new point;
            (*(*current).next).contour = num;
            (*(*current).next).vpos = posv;
            (*(*current).next).hpos = posh;
            (*(*current).next).next = NULL;
            (*(*current).next).previous = current;
            return track_forward(edged, type, typer, num, pDouble, pTraced,
                                 (*current).next);
        }
    }
    else
    // the point is a gap
    {
        (*current).feature = G;
        // no next point
        (*current).next = NULL;
        return G;
    }
    return 0;
}

/* Recursive tracking of half-boundaries in the backward direction */
short track_backward(contour *edged, short type, short typer, short num,
                     picture pDouble, picture pTraced, Point current)
{
    // similar to the track_forward routine, but for the
    // (*current).previous point
}

D.2.3 Inverse Kriging System

/* Determine the system matrix A and the solution vector B
   for the Inverse Kriging Interpolation, as exposed in Chapter 5 */
determine_system(wireframe wF, motion mField, double **A, double *B)
{
    // initialization
    A = 0;
    for (every block #i of the motion vector field)
    {
        /* The backwards motion field is to be reversed.
           Determine the origin of the motion vector according to the
           center position (posv[i],posh[i]) of the block */
        VertOrigin = MaxVector+posv[i]+vertical_component(mField,i);
        HoriOrigin = MaxVector+posh[i]+horizontal_component(mField,i);
        /* Determine in which triangle, made out of the vertices
           (Top1V,Top1H), (Top2V,Top2H) and (Top3V,Top3H), the origin
           of the motion vector is located */
        reference_triangle(VertOrigin, HoriOrigin, wF, Top1V, Top1H,
                           Top2V, Top2H, Top3V, Top3H);
        /* Compute the invariants p and q of the related affine
           transform */
        Height_12 = Top2V - Top1V;
        Width_12  = Top2H - Top1H;
        Height_13 = Top3V - Top1V;
        Width_13  = Top3H - Top1H;
        denominator = Height_12 * Width_13 - Width_12 * Height_13;
        p = Height_13*(Top1H-HoriOrigin)+Width_13*(VertOrigin-Top1V);
        q = Height_12*(HoriOrigin-Top1H)+Width_12*(Top1V-VertOrigin);
        p /= denominator;
        q /= denominator;
        /* Insert the equation in the system */
        // vertical component of the vector
        A[i*2+1][indice_of_top1] = (double)1-p-q;
        A[i*2+1][indice_of_top2] = p;
        A[i*2+1][indice_of_top3] = q;
        // horizontal component of the vector
        A[i*2+1+1][indice_of_top1] = (double)1-p-q;
        A[i*2+1+1][indice_of_top2] = p;
        A[i*2+1+1][indice_of_top3] = q;
        // independent term
        B[i*2+1]   = -vertical_component(mField, i);
        B[i*2+1+1] = -horizontal_component(mField, i);
    }
}
}


Bibliography

[1] COST 211ter Simulation Subgroup. Focus document. Ankara, October 1996. COST 211ter Workshop. SIM(96)41.

[2] N. Ahmed, T. Natarajan, and K.R. Rao. Discrete cosine transform. IEEE Transactions on Computers, 23:90-93, January 1974.

[3] Y. Altunbasak and M. Tekalp. Closed-form connectivity-preserving solutions for motion compensation using 2-d meshes. IEEE Transactions on Image Processing, 6(9):1255-1269, September 1997.

[4] C. Auyeung, J. Kosmach, M. Orchard, and T. Kalafatis. Overlapped block motion compensation. volume 1818, pages 561-572, Boston, November 1992. SPIE Visual Communications and Image Processing.

[5] P.R. Beaudet. Rotational invariant image operators. pages 579-583. Int. Conf. Pattern Recognition, 1978.

[6] T. Berger. Rate Distortion Theory. A mathematical basis for data compression. Prentice Hall, Englewood Cliffs, N.J., 1971.

[7] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, 2:192-236, 1974.

[8] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48:259-302, 1986.

[9] G. Bjontegaard. Very low bitrate video coding using h.263 and foreseen extensions. pages 825-838, Louvain-la-Neuve, May 1996. European Conference on Multimedia Applications, Services and Techniques (ECMAST).

[10] G. Bozdagi, M. Tekalp, and L. Onural. 3-d motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences. IEEE Transactions on Circuits and Systems for Video Technology, 4(3):246-256, June 1994.

[11] O. Bruyndonckx, E. Hanssens, B. Macq, X. Marichal, M.P. Queluz, and B. Simon. Coding on multigrids image sequences. page section B1, Berlin, October 1994. WIASIC.

[12] Rec. ITU-R BT.601-5. Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. Technical report, ITU-R, Geneva, Switzerland.

[13] C. Chen and T.R. Hsing. Review: Digital coding techniques for visual communications. pages 1-15, 1991.

[14] C.F. Chen and K. Pang. The optimal transform of motion-compensated frame difference images in a hybrid coder. Journal of the Royal Statistical Society, 40(6):393-397, June 1993.

[15] P.B. Chou and C.M. Brown. The theory and practice of bayesian image labeling. International Journal of Computer Vision, 4:185-210, 1990.

[16] S. Comes. Les traitements perceptifs d'images numerisees. PhD thesis, Universite catholique de Louvain, June 1995.

[17] Y. Correc and E. Chapuis. Fast computation of delaunay triangulations. Adv. Eng. Software, 9(2):77-83, 1987.

[18] F. Davoine. Compression d'images par fractales basee sur la triangulation de Delaunay. PhD thesis, Institut National Polytechnique de Grenoble, December 1995.

[19] C. De Vleeschouwer, T. Delmot, X. Marichal, and B. Macq. A fuzzy logic system for content-based bitrate allocation. Signal Processing: Image Communication, pages 115-142, July 1997.

[20] C. De Vleeschouwer, X. Marichal, T. Delmot, and B. Macq. A fuzzy logic system able to automatically detect interesting areas. volume 3016, pages 234-245, San Jose, February 1997. SPIE Human Vision and Electronic Imaging II.

[21] E. Decenciere and P. Salembier. Morpheco deliverable, chapter 3: Application of kriging to image coding. Morpheco Consortium, R2053/UPC/GPS/DS/R/016/b1, January 1996.

[22] E. Decenciere Ferrandiere, C. de Fouquet, and F. Meyer. Applications of kriging to image sequence coding. Accepted for publication in Signal Processing: Image Communication, 1998.

[23] T. Delmot, C. De Vleeschouwer, B. Macq, and X. Marichal. The comis scheme: an approach towards very-low bit-rate image coding. pages 883-902, Louvain-la-Neuve, May 1996. European Conference on Multimedia Applications, Services and Techniques (ECMAST).

[24] R. Deriche and O. Faugeras. 2-d curve matching using high curvature points: Application to stereovision. pages 567-576. Int. Conf. Pattern Recognition, 1990.

[25] R. Deriche and G. Giraudon. A computational approach for corner and vertex detection. International Journal of Computer Vision, 10:101-124, 1993.

[26] P.K. Doenges, T.K. Capin, F. Lavagetto, J. Ostermann, I.S. Pandzic, and E.D. Petajan. Mpeg-4: Audio/video and synthetic graphics/audio for mixed media. Signal Processing: Image Communication, 9(4):433-464, May 1997.

[27] L. Dreschler and H.H. Nagel. On the selection of critical points and local curvature extrema of region boundaries for interframe matching. pages 542-544. Int. Conf. Pattern Recognition, 1981.

[28] M. Dudon. Modelisation du mouvement par Treillis Actifs et methodes d'estimation associees. Application au codage de sequences d'images. PhD thesis, Universite de Rennes I, December 1996.

[29] M. Dudon, O. Avaro, and G. Eude. Object-oriented motion estimation. pages 284-287, Sacramento, September 1994. Picture Coding Symposium (PCS).

[30] M. Dudon, G. Eude, and C. Roux. Motion estimation and triangular active mesh. Revue HF, (4):47-53, December 1995.

[31] F. Dufaux and F. Moscheni. Motion estimation techniques for digital tv: a review and a new contribution. Proceedings of IEEE, 83(6):858-876, June 1995.

[32] F. Dufaux. Multigrid Block Matching Motion Estimation for Generic Video Coding. PhD thesis, Ecole Polytechnique Federale de Lausanne, 1994.

[33] D.P. Elias and K.K. Pang. Obtaining a coherent motion field for motion-based segmentation. pages 541-546, Melbourne, March 1996. Picture Coding Symposium (PCS).

[34] M. Etoh, C.S. Boon, and S. Kadono. An object-based image coding scheme using alpha channel and multiple templates. pages 853-871, Louvain-la-Neuve, May 1996. European Conference on Multimedia Applications, Services and Techniques (ECMAST).

[35] R.W. Floyd and R. Beigel. The Language of Machines - An Introduction to Computability and Formal Languages. Computer Science Press, 1994.

[36] C.-S. Fuh and P. Maragos. Affine models for image matching and motion detection. pages 2409-2412, Toronto, Canada, May 1991. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP).

[37] T. Gaidon. Quantification vectorielle algebrique et ondelettes pour la compression de sequences d'images. PhD thesis, Universite de Nice-Sophia Antipolis, Ecole Doctorale Sciences Pour l'Ingenieur, December 1993.

[38] R.G. Gallager. Information Theory and Reliable Communication. John Wiley & Sons, Inc., New York, 1968.

[39] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721-741, November 1984.

[40] C. Graffigne. Approche region: methodes markoviennes. In J.P. Cocquerez and S. Philipp, editors, Analyse d'images: filtrage et segmentation, chapter XI, pages 281-304. Masson, Paris, October 1995.

[41] R.M. Gray. Vector quantization. IEEE ASSP Magazine, pages 4-29, April 1984.

[42] MPEG Video group. MPEG-4 Video Verification Model, version 10.1, ISO/IEC SC29/WG11 (M3464). Tokyo, March 1998.

[43] E. Hanssens, B. Chupeau, J.D. Legat, and B. Macq. Selective prediction of error transmission using motion information. Signal Processing: Image Communication, 12(1):71-81, March 1998.

[44] R.M. Haralick and L.G. Shapiro. Glossary of computer vision terms. In Computer and Robot Vision, volume 2, chapter 21, pages 571-614. Addison-Wesley Publishing Company, 1993.

[45] R.M. Haralick, S.R. Sternberg, and X. Zhuang. Image analysis using mathematical morphology. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4):532-550, July 1987.

[46] C. Harris and M. Stephens. A combined corner and edge detector. pages 147-151, Manchester, 1988. Proc. Alvey Vision Conference.

[47] J.F. Hayes, A. Habibi, and P.A. Wintz. Rate distortion function for a gaussian source model of images. IEEE Transactions on Information Theory, IT-16(4):507-509, July 1970.

[48] B. Horn. Robot Vision. MIT Press, Cambridge, Massachusetts, The MIT Electrical Engineering and Computer Science Series, 1986.

[49] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, (17):185-203, 1981.

[50] J.R. Jain and A.K. Jain. Displacement measurement and its application in interframe image coding. IEEE Transactions on Communications, 29(12):1799-1806, December 1981.

[51] R. Jain. Direct computation of the focus of expansion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(1):58-64, January 1983.

[52] A.J. Jerri. The shannon sampling theorem - its various extensions and applications: A tutorial review. Proceedings of IEEE, 65(11):1565-1596, November 1977.

[53] H. Jozawa. Motion compensated video coding using rotation and scaling models. volume 1, pages 309-312, Melbourne, March 1996. Picture Coding Symposium (PCS).

[54] R.G. Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1153-1160, December 1981.

[55] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220, May 1983.

[56] L. Kitchen and A. Rosenfeld. Gray-level corner detection. Pattern Recognition Letters, 1:95-102, December 1982.

[57] R. Koenen, F. Pereira, and L. Chiariglione. Mpeg-4: Context and objectives. Signal Processing: Image Communication, 9(4):295-304, May 1997.

[58] D.G. Krige. A statistical approach to some mine valuations and allied problems on the Witwatersrand, 1951.

[59] M. Kunt, M. Benard, and R. Leonardi. Recent results in high-compression image coding. IEEE Transactions on Circuits and Systems, 34(11):1306-1336, November 1987.

[60] G.G. Langdon Jr. An introduction to arithmetic coding. IBM Journal of Research and Development, 28(2):135-149, March 1984.

[61] F. Lavagetto and S. Curinga. Object-oriented scene modeling for interpersonal video communication at very low bit-rate. Signal Processing: Image Communication, (6):379-395, June 1994.

[62] D. Le Gall. Mpeg: A video compression standard for multimedia applications. Communications of the ACM, 34(4):46-58, April 1991.

[63] H. Li, A. Lundmark, and R. Forchheimer. Image sequence coding at very low bitrates: A review. IEEE Transactions on Image Processing, 3(5):589-609, September 1994.

[64] B. Macq. A universal entropy coder for transform or hybrid coding. pages 12.1.1-12.1.2, Boston, 1990. Picture Coding Symposium (PCS).

[65] B. Macq, M.P. Queluz, and B. Simon. Very low bit-rate image coding on adaptive multigrids. Signal Processing: Image Communication, 7(4-6):313-331, November 1995.

[66] S.G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-693, July 1989.

[67] X. Marichal. Universal entropy coding of segmentation trees, June 1994.

[68] X. Marichal. An original approach towards very-low bit-rate image coding. Revue HF, (4):29-46, December 1995.

[69] X. Marichal, C. De Vleeschouwer, T. Delmot, and B. Macq. Automatic detection of interest areas of an image or of a sequence of images. volume III, pages 371-374, Lausanne, September 1996. Int. Conf. on Image Processing (ICIP).

[70] X. Marichal, C. De Vleeschouwer, T. Delmot, and B. Macq. Object based coding through multigrid representation. volume 2668, pages 6-17, San Jose, January 1996. SPIE Digital Visual Communications: algorithms and technologies.

[71] X. Marichal, C. De Vleeschouwer, and B. Macq. Towards visual search engine based on fuzzy logic. pages 135-141, Louvain-la-Neuve, June 1997. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'97).

[72] X. Marichal, T. Delmot, C. De Vleeschouwer, and B. Macq. Determination automatique des regions d'interet d'une image ou d'une sequence d'images. pages 228-234, Grenoble, February 1996. Journees d'etudes et d'echanges du CNET.

[73] X. Marichal, T. Delmot, B. Macq, F. Oger, V. Warscotte, J.P. Thiran, and B. Simon. Towards object-based coding through multigrid representation. pages H-1, Tokyo, November 1995. International Workshop on Coding Techniques for Very Low Bit-rate Video (VLBV).

[74] X. Marichal and B. Macq. Asymmetric motion estimation/compensation. volume III, pages 775-778, Lausanne, September 1996. Int. Conf. on Image Processing (ICIP).

[75] X. Marichal and B. Macq. Motion reconstruction with wireframe structures. pages 633-637, Melbourne, March 1996. Picture Coding Symposium (PCS).

[76] X. Marichal and B. Macq. Active mesh reconstruction of block-based motion information. volume V, pages 2605-2608, Seattle, May 1998. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP).

[77] X. Marichal, B. Macq, and M.P. Queluz. Generic coder for binary sources: the m-coder. IEE Electronics Letters, 31(7):544-545, March 1995.

[78] G. Matheron. Estimating and Choosing. Springer-Verlag, Berlin, 1989.

[79] G. Medioni and Y. Yasumuto. Corner detection and curve representation using cubic b-splines. Computer Vision, Graphics, and Image Processing, 39:267-278, 1987.

[80] Jerry M. Mendel. Fuzzy logic systems for engineering: A tutorial. Proceedings of IEEE, 83(3):345-377, March 1995.

[81] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equations of state calculations by fast computing machines. Journal Chem. Phys., 21:1087-1091, 1953.

[82] D.P. Mitchell and A.N. Netravali. Reconstruction filters in computer graphics. Computer Graphics, 22(4):221-228, August 1988.

[83] F. Mokhtarian and A.K. Mackworth. Scale-based description and recognition of planar curves and 2d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(1):34-43, 1986.

[84] MORPHECO. Morphological segmentation-based coding of image sequences. pages 2.2.1-2.2.8, Hannover, December 1993. COST 211ter European Workshop on New Techniques for Coding of Video Signals at Very Low Bitrates.

[85] H.G. Musmann, M. Hotter, and J. Ostermann. Object-oriented analysis-synthesis coding of moving images. Image Communication, 1(2):117-138, October 1989.

[86] H.G. Musmann, P. Pirsch, and H.-J. Grallert. Advances in picture coding. Proceedings of the IEEE, 73(4):523–548, April 1985.

[87] L. Najman and R. Vaillant. Topological and geometrical corners by watershed. volume LNCS 970, pages 262–269, Prague, September 1995. Computer Analysis of Images and Patterns.

[88] Y. Nakaya and H. Harashima. An iterative motion estimation method using triangular patches for motion compensation. pages 546–557. Society of Photo-Instrumentation Engineers, November 1991.

[89] N.M. Nasrabadi and R.A. King. Image coding using vector quantization: a review. IEEE Transactions on Communications, 36(8):81–95, August 1988.

[90] R. Neff and A. Zakhor. Very low bit rate video coding based on matching pursuits. IEEE Transactions on Circuits and Systems for Video Technology, 7(1):158–171, February 1997.

[91] A.N. Netravali and J.D. Robbins. Motion-compensated television coding: Part I. Bell System Technical Journal, 58(3):631–670, March 1979.

[92] H. Nicolas and C. Labit. Motion and illumination variation estimation using a hierarchy of models: application to image sequence coding. Technical report, IRISA, Rennes, June 1993.

[93] J. Nieweglowski, T. Campbell, and H. Haavisto. A novel video coding scheme based on temporal prediction using digital image warping. IEEE Transactions on Consumer Electronics, 39:141–150, August 1993.

[94] J.A. Noble. Finding half boundaries and junctions in images. Image and Vision Computing, 10(4):219–232, May 1992.

[95] Telecommunication Standardization Sector of ITU. ITU-T Recommendation H.261 - Video codec for audiovisual services at p × 64 kbit/s. ITU Recommendations, March 1993.

[96] Telecommunication Standardization Sector of ITU. Draft ITU-T Recommendation H.263. ITU Recommendations, July 1995.



[97] Ad Hoc Group on MPEG-4 Test Procedures. MPEG-4 test/evaluation procedures document - revision 2.0. MPEG-4 meeting, May 1995.

[98] J.B. O'Neal and T.R. Natarajan. Coding isotropic images. IEEE Transactions on Information Theory, 23(6):697–707, November 1977.

[99] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 3rd edition, 1991.

[100] F. Pereira. MPEG-4: a new challenge for the representation of audiovisual information. pages 7–16, Melbourne, March 1996. Picture Coding Symposium (PCS).

[101] F. Pereira. First proposals for MPEG-7 visual requirements. Bristol meeting, April 1997.

[102] F. Pereira and P. Geada. Sketch-based database searching: a demonstration of an MPEG-7 application. Bristol meeting, April 1997.

[103] F. Pereira and R. Koenen. Very low bit-rate audio-visual applications. Signal Processing: Image Communication, 9(1):55–77, November 1996.

[104] F. Pereira, K. O'Connell, R. Koenen, and M. Etoh. Special issue on MPEG-4, part 1: invited papers. Signal Processing: Image Communication, 9(4), May 1997.

[105] W.K. Pratt. Digital Image Processing. John Wiley & Sons, New York, 1978.

[106] W.K. Pratt. Photometry and colorimetry. In Digital Image Processing [105], chapter 3, pages 50–90.

[107] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes in C - The Art of Scientific Computing. Cambridge University Press, Cambridge, 2nd edition, 1992.

[108] M.P. Queluz. Motion estimation for video coding: a review. Revue HF, (4):5–28, December 1995.

[109] M.P. Queluz. Multiscale Motion Estimation and Video Compression. PhD thesis, Université catholique de Louvain, April 1996.



[110] M.P. Queluz and B. Macq. Signal-adapted motion compensation for video compression. Paris, September 1992. ISSES, URSI.

[111] M.P. Queluz, X. Marichal, and B. Macq. Efficient entropy coding of tree data structures. pages 478–481, Sacramento, September 1994. Picture Coding Symposium (PCS).

[112] M.P. Queluz, B. Simon, and B. Macq. Towards a spatio-temporal segmentation technique for very-low bitrate image coding. Hannover, December 1993. COST 211ter European Workshop on New Techniques for Coding of Video Signals at Very Low Bitrates.

[113] M. Rabbani and P.W. Jones. Digital Image Compression Techniques. SPIE Optical Engineering Press, Georgia Institute of Technology, 2nd edition, 1991.

[114] K. Rohr. Localization properties of direct corner detectors. Journal of Mathematical Imaging and Vision, 4:139–150, 1994.

[115] M. Rydfalk. CANDIDE: a parametrised face. Technical report, Dept. Electr. Eng., Linköping University, Linköping, Sweden, 1987.

[116] D.G. Sampson, E.A. da Silva, and M. Ghanbari. Interframe image sequence coding using overlapped motion estimation and wavelet lattice quantisation. pages 16–20, Edinburgh, July 1995. Fifth International Conference on Image Processing and its Applications.

[117] H. Sanson. Motion affine models identification and application to television image coding. volume 1605-2, pages 570–581, Boston, November 1991. Visual Communication and Image Processing.

[118] J. Serra. Image Analysis and Mathematical Morphology. Academic Press, London, 1982.

[119] S.W. Sloan. A fast algorithm for constructing Delaunay triangulations in the plane. Adv. Eng. Software, 9(1):34–55, 1987.

[120] P. Strobach. Tree-structured scene adaptive coder. IEEE Transactions on Communications, 38(4):477–486, April 1990.

[121] M. Tekalp et al. The status of core experiment M2. Technical report, MPEG 96/1102, July 1996.



[122] C. Toklu, T. Erdem, I. Sezan, and M. Tekalp. Tracking motion and intensity variations using hierarchical 2-D mesh modeling for synthetic object transfiguration. Computer Vision, Graphics, and Image Processing: Graphical Models and Image Processing, 58(6):553–573, November 1996.

[123] R.Y. Tsai and T.S. Huang. Estimating three-dimensional motion parameters of a rigid planar patch. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(6):1147–1152, December 1981.

[124] R.Y. Tsai and T.S. Huang. Estimating three-dimensional motion parameters of a rigid planar patch, III: finite point correspondences and the three-view problem. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(2):213–220, April 1984.

[125] R.Y. Tsai, T.S. Huang, and W.L. Zhu. Estimating three-dimensional motion parameters of a rigid planar patch, II: singular value decomposition. IEEE Transactions on Acoustics, Speech and Signal Processing, 30(4):525–534, August 1982.

[126] Y. Tse and R. Baker. Global zoom/pan estimation and compensation for video compression. pages 2725–2728. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 1991.

[127] G. Tziritas and C. Labit. Motion Analysis for Image Sequence Coding. Elsevier, Amsterdam, 1994.

[128] J. Vaisey and A. Gersho. Image compression with variable block size segmentation. IEEE Transactions on Acoustics, Speech and Signal Processing, 40(8):2040–2060, August 1992.

[129] C. van den Branden Lambrecht and O. Verscheure. Perceptual quality measure using a spatio-temporal model of the human visual system. volume 2668, pages 450–461, San Jose, February 1996. SPIE Electronic Imaging: Science and Technology.

[130] M. Van Droogenbroeck and H. Talbot. Fast computation of morphological operations with arbitrary structuring elements. Pattern Recognition Letters, 17:1451–1460, 1996.

[131] F. Vermaut. Un algorithme distribué pour la compensation de mouvements: Distributed ABMA, June 1997.



[132] F. Vermaut, Y. Deville, B. Macq, and X. Marichal. A distributed adaptive block matching algorithm: DIS-ABMA. Accepted for publication, Island of Rhodes, September 1998. EUSIPCO'98.

[133] A. Verri and T. Poggio. Motion field and optical flow: qualitative properties. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:490–498, May 1989.

[134] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, June 1991.

[135] G.K. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34(4):30–44, April 1991.

[136] Y. Wang and O. Lee. Active mesh - a feature seeking and tracking image sequence representation scheme. IEEE Transactions on Image Processing, 3(5):610–624, September 1994.

[137] J.W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23(5):846–850, October 1978.

[138] Y. Yokoyama, Y. Miyamoto, and M. Ohta. Very low bitrate video coding using warping prediction adaptive to object contours. pages M-4, Tokyo, November 1995. International Workshop on Coding Techniques for Very Low Bit-rate Video (VLBV).

[139] O.A. Zuniga and R.M. Haralick. Corner detection using the facet model. pages 30–37. Int. Conf. Pattern Recognition, 1983.
