
Novel Methods in H.264/AVC
Inter Prediction, Data Hiding, Bit Rate Transcoding

SPYRIDON K. KAPOTAS
June 25, 2011

HELLENIC OPEN UNIVERSITY
School of Science and Technology
Digital Systems & Media Computing Laboratory



Abstract

H.264 Advanced Video Coding became the dominant video coding standard in the market within a few years after the first version of the standard was completed by the ISO/IEC MPEG and the ITU-T VCEG groups in May 2003. That happened mainly due to the great coding efficiency of H.264: compared to MPEG-2, the previous dominant standard, the H.264 compression ratio is about twice as high for the same video quality. That makes H.264 ideal for numerous applications, such as video broadcasting, video streaming and video conferencing. However, the H.264 efficiency is achieved at the expense of the codec's complexity, which is about four times that of MPEG-2. As a consequence, many video coding issues which had been addressed in previous standards need to be reconsidered. For example, encoding a video in real time with H.264 is now an open issue. Re-applying older solutions is feasible but insufficient, because the new H.264 characteristics are not taken into account and thus the problems caused by these characteristics are not properly addressed. On the other hand, these characteristics make possible a series of applications that either were not possible or showed inferior results prior to the H.264 era.

This dissertation investigates novel methods which take advantage of the new characteristics introduced by H.264. These methods fall into two categories, namely enhancements and applied methods. The goal of the enhancements is to improve the performance of the H.264 encoder by reducing its complexity. We focused on the inter prediction part of the encoder. Three representative methods of this category are introduced: a fast full search algorithm, which reduces the motion estimation time by 53.7%; a predictor, which optimizes the search area during the motion estimation; and a reference frame selector, which reduces the motion estimation time by 80% by reducing the number of reference frames used during the motion estimation. The applied methods, on the other hand, exploit the special H.264 characteristics in order to improve their performance. Two data hiding methods are introduced, which result in a high capacity of hidden data, e.g. 18 Kbits of data in 10 sec (30 fps) of video.
In particular, the data hiding methods opened new directions in the research of data hiding in video, not only because of their unique capabilities (high data capacity, real-time operation, reusability of the marked streams, etc.) but also because they moved the cost of the hidden data from the PSNR to the bit rate, in contrast to all previously existing methods. In addition to the data hiding methods, a bit rate transcoder, which controls the bit rate directly in the compressed domain, is also introduced. Finally, a moving object detection method and a scene change detection method complete the repertoire of the applied methods.


Περίληψη (Abstract in Greek)

The H.264 video coding standard dominated the market within a few years after its first version was completed by the MPEG and VCEG working groups of ISO and ITU respectively, in May 2003. This is mainly due to the effectiveness of H.264 in video coding. Characteristically, compared to MPEG-2, the previous dominant standard, the compression ratio achieved by H.264 is double for the same video quality. This makes H.264 ideal for many applications, such as television broadcasts, video streaming and video conferencing. However, the effectiveness of H.264 comes at the expense of the complexity of the encoder. The complexity of the H.264 encoder is about four times that of MPEG-2. Consequently, many problems in encoding, which had been dealt with in previous standards, must be reconsidered. For example, encoding a video in real time is now an open issue. Older solutions are feasible but insufficient, because the new H.264 characteristics are not taken into account and thus the problems caused by these characteristics are not dealt with effectively. On the other hand, these characteristics make possible applications which either were not feasible or showed poor results before the advent of H.264.

This dissertation aims at the investigation of new methods which benefit from the new characteristics of H.264. These methods are divided into two categories: enhancements and applied methods. The goal of the enhancements is to improve the performance of the H.264 encoder by reducing its complexity. We focused our attention on the part of the encoder concerned with temporal (inter) prediction. Three representative methods of this category were developed: a fast full search algorithm, which reduces the motion estimation time by 53.7%; a method which optimizes the search area during motion estimation; and a reference frame selector, which reduces the motion estimation time by 80% by reducing the number of reference frames used during motion estimation. The applied methods, on the other hand, exploit the special H.264 characteristics in order to improve their performance. Two data hiding methods were developed, which lead to a high capacity of hidden data, e.g. 18 Kbits of data in 10 seconds (30 fps) of video. In particular, the data hiding methods open new directions in the research field of data hiding in video, not only because of their unique capabilities (high data capacity, reusability of the bitstreams, real-time operation, etc.), but also because they moved the cost of data hiding from the PSNR to the bit rate, in contrast to the already existing methods. A transcoding technique was also developed, which controls the bit rate directly in the compressed domain. Finally, a moving object detection method and a scene change detection method complete the repertoire of the applied methods.


Submitted in total fulfillment of the requirements of the degree of
Doctor of Philosophy
June 25, 2011


Examination Committee

Athanassios Skodras*, Professor of Hellenic Open University, Greece.
Athanassios Stouraitis*, Professor of University of Patras, Greece.
Stefanos Kollias*, Professor of National Technical University of Athens, Greece.
Konstantinos Berberidis, Professor of University of Patras, Greece.
Vassilios Verykios, Associate Professor of Hellenic Open University, Greece.
George Economou, Associate Professor of University of Patras, Greece.
Emmanouil Psarakis, Assistant Professor of University of Patras, Greece.

* Member of the Advisory Committee


To Dora


Declaration

This is to certify that:
(i) the dissertation comprises only my original work towards the PhD, except where indicated;
(ii) due acknowledgement has been made in the text to all other material used.


Acknowledgements

I owe my deepest gratitude to my supervisor, Professor Skodras, a truly inspired teacher, whose encouragement and support enabled me to complete my research. I am also grateful to my family for their support and for being patient with me over the last five years.


CONTENTS

GLOSSARY
PUBLICATIONS

1 INTRODUCTION
  1.1 Motivation and goals
  1.2 Structure of the dissertation

2 OVERVIEW OF H.264
  2.1 Introduction
  2.2 Terminology
  2.3 Profiles and levels
  2.4 Coded Data Format
  2.5 Reference Pictures
  2.6 Slices
  2.7 Macroblocks
  2.8 Technical overview
    2.8.1 Encoder (forward path)
    2.8.2 Encoder (reconstruction path)
    2.8.3 Decoder

3 INTER PREDICTION
  3.1 Introduction
  3.2 Problem formulation
    3.2.1 Inter prediction complexity
    3.2.2 Special video applications
  3.3 Solutions
  3.4 Fast Successive Elimination Algorithm
    3.4.1 Literature review
    3.4.2 Full search in H.264 reference encoder
    3.4.3 Fast SEA
    3.4.4 Two-level motion estimation
    3.4.5 Simulation results
    3.4.6 Conclusions
  3.5 Spatio-Temporal Predictor for Motion Estimation
    3.5.1 Literature review
    3.5.2 Effectiveness of the EPZS predictors
    3.5.3 Spatio-temporal predictor
    3.5.4 Simulation results
    3.5.5 Conclusions
  3.6 Fast Multiple Reference Frame Selection
    3.6.1 Literature review
    3.6.2 Multiple Reference Frame in H.264
    3.6.3 Frame selection method
    3.6.4 Simulation results
    3.6.5 Conclusions
  3.7 Moving Object Detection in the Compressed Domain
    3.7.1 Literature review
    3.7.2 Moving object detection in the compressed domain
    3.7.3 Simulation results
    3.7.4 Further improvements
    3.7.5 Conclusions

4 DATA HIDING
  4.1 Introduction
  4.2 Problem formulation
  4.3 Solutions
  4.4 Data Hiding during the inter-prediction
    4.4.1 Literature review
    4.4.2 Data hiding method
    4.4.3 Simulation results
    4.4.4 Message extractor
    4.4.5 Further improvements
    4.4.6 Conclusions
    4.4.7 Application based on this method: A Data Hiding Scheme for Scene Change Detection
  4.5 Real Time Data Hiding by Exploiting the I_PCM Macroblocks
    4.5.1 Literature review
    4.5.2 Intra mode prediction in H.264
    4.5.3 Real time Data Hiding
    4.5.4 Simulation results
    4.5.5 Message extractor
    4.5.6 Further improvements
    4.5.7 Conclusions

5 BITRATE TRANSCODING
  5.1 Introduction
  5.2 Problem formulation
  5.3 Solution
  5.4 Bit Rate Transcoding by Dropping Frames in the Compressed Domain
    5.4.1 Literature review
    5.4.2 Main concepts
    5.4.3 Bit Rate Transcoder
    5.4.4 Simulation results
    5.4.5 Further improvements
    5.4.6 Conclusions

6 EPILOGUE
  6.1 Contribution
    6.1.1 Inter prediction
    6.1.2 Data hiding
    6.1.3 Bitrate transcoding
  6.2 Further improvements

REFERENCES
APPENDICES


Glossary

4:2:0 (sampling): Sampling method in which chrominance components have half the horizontal and vertical resolution of the luminance component
Arithmetic coding: Coding method to reduce statistical redundancy
Artifact: Visual distortion in an image
Block: Region of a macroblock (8x8 or 4x4) for transform purposes
Block matching: Motion estimation carried out on rectangular picture areas
B-picture (slice): Coded picture (slice) predicted using bidirectional motion compensation
CABAC: Context-based Adaptive Binary Arithmetic Coding
CAVLC: Context Adaptive Variable Length Coding
CCTV: Closed-circuit television
Chrominance: Color difference component
CIF: Common Intermediate Format, a color image format
CODEC: COder / DECoder pair
Color space: Method of representing color images
DCT: Discrete Cosine Transform
DFT: Discrete Fourier Transform
DWT: Discrete Wavelet Transform
Entropy coding: Coding method to reduce redundancy
Error concealment: Post-processing of a decoded image to remove or reduce visible error effects
Field: Odd- or even-numbered lines from an interlaced video sequence
FMO: Flexible Macroblock Order, in which macroblocks may be coded out of raster sequence
FPS: Frame rate (Frames Per Second)
Full Search: A motion estimation algorithm
GOP: Group Of Pictures, a set of coded video images
H.261: A video coding standard
H.263: A video coding standard
H.264: A video coding standard
HDTV: High Definition Television
Huffman coding: Coding method to reduce redundancy
HVS: Human Visual System, the system by which humans perceive and interpret visual images
Hybrid (CODEC): CODEC model featuring motion compensation and transform
IEC: International Electrotechnical Commission, a standards body
IDR: Instantaneous Decoding Refresh, a picture which causes the decoding process to mark all reference pictures as "unused for reference" immediately after the decoding of the IDR picture
Inter (coding): Coding of video frames using temporal prediction or compensation
Interlaced (video): Video data represented as a series of fields
Intra (coding): Coding of video frames without temporal prediction
IPCM or I_PCM: A macroblock which is neither predicted nor quantized; it is subject only to entropy encoding
I-picture (slice): Picture (or slice) coded without reference to any other frame
ISO: International Standards Organization, a standards body
ITU: International Telecommunication Union, a standards body
JPEG: Joint Photographic Experts Group, a committee of ISO (also an image coding standard)
JPEG2000: An image coding standard
JVT: Joint Video Team, consisting of experts from VCEG and MPEG
Loop filter: Spatial filter placed within the encoding or decoding feedback loop
Macroblock (MB): Region of a frame coded as a unit (usually 16x16 pixels in the original frame)
Macroblock partition: Region of a macroblock with its own motion vector (H.264)
Macroblock sub-partition: Region of a macroblock partition with its own motion vector (H.264)
Media processor: Processor with features specific to multimedia coding and processing
MOS: Mean Opinion Score, a subjective quality metric
Motion compensation: Prediction of a video frame with modeling of motion
Motion estimation: Estimation of relative motion between two or more video frames
Motion vector (MV): Vector indicating a displaced block or region to be used for motion compensation
MPEG: Motion Picture Experts Group, a committee of ISO/IEC
MPEG-1: A multimedia coding standard
MPEG-2: A multimedia coding standard
MPEG-4: A multimedia coding standard
MSEA: Multilevel Successive Elimination Algorithm, a full search algorithm
NAL: Network Abstraction Layer
Objective quality: Visual quality measured by algorithm(s)
Picture (coded): Coded (compressed) video frame
POC: Picture Order Count, a number that keeps the ordering of the pictures and the values of samples in the decoded pictures isolated from timing information
P-picture (slice): Coded picture (or slice) using motion-compensated prediction from one reference frame
Profile: A set of functional capabilities (of a video CODEC)
Progressive (video): Video data represented as a series of complete frames
PSNR: Peak Signal to Noise Ratio, an objective quality measure
QCIF: Quarter Common Intermediate Format, a color image format
Quantize: Reduce the precision of a scalar or vector quantity
QP: Quantization Parameter
Rate control: Control of the bit rate of an encoded video signal
Rate-distortion: Measure of CODEC performance (distortion at a range of coded bit rates)
RBSP: Raw Byte Sequence Payload
RGB: Red/Green/Blue color space
RTP: Real Time Protocol, a transport protocol for real-time data
RVLC: Reversible Variable Length Code
SEA: Successive Elimination Algorithm, a full search algorithm
SI slice: Intra-coded slice used for switching between coded bitstreams (H.264)
SIF: Source Input Format, a color image format
Slice: A region of a coded picture
SP slice: Inter-coded slice used for switching between coded bitstreams (H.264)
VCEG: Video Coding Experts Group of ITU-T


Publications

In the context of our research we published eight papers in journals and international conferences:

[1] S. Kapotas and A.N. Skodras, "Bit Rate Transcoding of H.264 Encoded Movies by Dropping Frames in the Compressed Domain", IEEE Transactions on Consumer Electronics, vol. 56, no. 3, pp. 1593-1601, 2010.
[2] S. Kapotas and A.N. Skodras, "Rate Control of H.264 Encoded Sequences by Dropping Frames in the Compressed Domain", 20th Int. Conference on Pattern Recognition (ICPR 2010), Istanbul, Turkey, 23-26 Aug. 2010.
[3] S. Kapotas and A.N. Skodras, "Moving Object Detection in the H.264 Compressed Domain", IEEE International Conference on Imaging Systems and Techniques (IST 2010), Thessaloniki, Greece, 1-2 July 2010.
[4] S. Kapotas and A.N. Skodras, "Real Time Data Hiding by Exploiting the IPCM Macroblocks in H.264/AVC Streams", Journal of Real-Time Image Processing, vol. 4, no. 1, pp. 33-41, Mar. 2009.
[5] S. Kapotas and A.N. Skodras, "A New Data Hiding Scheme for Scene Change Detection in H.264 Encoded Video Sequences", IEEE International Conference on Multimedia & Expo (ICME 2008), Hannover, Germany, 23-26 June 2008.
[6] S. Kapotas and A.N. Skodras, "Fast Multiple Reference Frame Selection Method in H.264 Video Encoding", 26th Picture Coding Symposium (PCS 2007), Lisbon, Portugal, 7-9 Nov. 2007.
[7] S. Kapotas, E.E. Varsaki and A.N. Skodras, "Data Hiding in H.264 Encoded Video Sequences", 2007 IEEE Int. Workshop on Multimedia Signal Processing, Chania, Greece, 1-3 Oct. 2007.
[8] S. Kapotas and A.N. Skodras, "A New Spatio-Temporal Predictor for Motion Estimation in H.264 Video Coding", 8th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2007), Santorini, Greece, 6-8 June 2007.


1 Introduction

1.1 MOTIVATION AND GOALS

H.264/AVC, the latest standard for video coding, is the result of the collaboration between the ISO/IEC Moving Picture Experts Group and the ITU-T Video Coding Experts Group. The goals of this standardization effort were enhanced compression efficiency and a network-friendly video representation for interactive (video telephony) and non-interactive applications (broadcast, streaming, storage, video on demand). H.264/AVC provides gains in compression efficiency of up to 50% over a wide range of bit rates and video resolutions compared to previous standards. However, the H.264/AVC complexity is about four times that of MPEG-2. As a consequence, video coding issues which were considered to have been resolved by previous standards (H.263 and MPEG-2), such as the encoding speed, need to be reconsidered in the case of H.264/AVC. Re-applying older solutions is feasible but insufficient, because the special H.264/AVC characteristics are not taken into account and thus the problems caused by these characteristics are not properly addressed. On the other hand, the new H.264/AVC characteristics, also referred to as coding tools, make possible a series of applications that either were not possible or showed inferior results prior to the H.264/AVC era.

This dissertation aims at investigating methods which take advantage of the special characteristics introduced by H.264/AVC. We classify the methods in two categories, namely enhancements and applied methods. The goal of the enhancements is to improve the performance of the H.264/AVC encoder by reducing its complexity. We focused on the inter prediction part of the encoder with respect to its complexity, i.e. the proposed methods reduce the time that the inter prediction takes. The applied methods propose techniques, such as data hiding and bit rate transcoding techniques, which exploit the special H.264/AVC characteristics in order to improve their performance. Finally, this work shows how some of the new H.264/AVC characteristics can be exploited on behalf of the consumer.

1.2 STRUCTURE OF THE DISSERTATION

This dissertation is mainly separated into three research parts, namely Inter Prediction, Data Hiding and Bitrate Transcoding, which are described in three chapters. Each chapter is treated independently, i.e. it has its own literature review and research sections. As a matter of fact, each chapter is a set of methods which justify our motivations. The inter prediction methods fall into the enhancements category, whilst the data hiding and the bit rate transcoding methods fall into the applied methods category. More specifically, the dissertation has the following structure:

In Chapter 2 we give a brief overview of the H.264/AVC standard. In Chapters 3, 4 and 5 we present the Inter Prediction, the Data Hiding and the Bit Rate Transcoding research parts respectively. Each chapter begins by describing the problem that we are going to deal with (Problem Formulation). Then we enumerate the proposed solution(s)/method(s), which can be integrated into the H.264/AVC codec. The methods are described in the sections and sub-sections that follow. Each section begins by reviewing the literature and then continues with the description of the proposed method. Whenever needed, a sub-section describes a part of interest of the H.264/AVC reference encoder, e.g. the full search method, reference frame selection, etc. Chapter 6 is the epilogue. There, we give a brief analysis of the methods and their achievements, as well as the potential improvements. Finally, we evaluate the methods with respect to their contribution to the H.264/AVC field.

In Appendix I we present the metrics that we use in our simulation results, whilst in Appendix II we describe the simulation environment and the methodology that we followed during our tests.


2 Overview of H.264 (1)

2.1 INTRODUCTION

International study groups, VCEG (Video Coding Experts Group) of ITU-T (International Telecommunication Union, Telecommunication sector) and MPEG (Moving Picture Experts Group) of ISO/IEC, have researched video coding techniques for various applications of moving pictures since the early 1990s. ITU-T developed H.261 as the first video coding standard, for videoconferencing applications. The MPEG-1 video coding standard was established for storage on compact disc, and the MPEG-2 standard (adopted by ITU-T as H.262) for digital TV and HDTV, as an extension of MPEG-1. Also, to cover a very wide range of applications, with shaped regions of video objects as well as rectangular pictures, the MPEG-4 part 2 standard was developed. This also includes natural and synthetic video/audio combinations with interactivity built in. On the other hand, ITU-T developed H.263 in order to improve the compression performance of H.261, and the base coding model of H.263 was adopted as the core of some parts of MPEG-4 part 2. MPEG-1, 2 and 4 also cover audio coding. To provide better compression of video compared to previous standards, the H.264/MPEG-4 part 10 video coding standard, also known as H.264/AVC, was developed by the JVT (Joint Video Team), consisting of experts from VCEG and MPEG, in 2003.

Table 2-1 compares the compression ratio of H.264/AVC against older compression standards.

(1) Some text, figures and tables of Chapter 2 are copied from Chapter 6 of Iain Richardson's book "H.264 and MPEG-4 Video Compression" [15]. Courtesy of Prof. Richardson.


Table 2-1: Compression ratios to maintain excellent quality.

    Standard             Compression ratio
    JPEG                 10:1
    MPEG-2 / H.263       30:1
    MPEG-4 / H.264 AVC   50:1

H.264/AVC, hereafter H.264, achieves significant coding efficiency, simple syntax specifications and seamless integration of video coding into all current protocols and multiplex architectures. Thus, H.264 can support various applications like video broadcasting, video streaming and video conferencing over fixed and wireless networks and over different transport protocols. H.264 has the same basic functional elements as previous standards (MPEG-1, MPEG-2, MPEG-4 part 2, H.261 and H.263), i.e. transform for reduction of spatial correlation, quantization for bitrate control, motion compensated prediction for reduction of temporal correlation and entropy encoding for reduction of statistical correlation. However, to achieve better coding performance, the important changes in H.264 occur in the details of each functional element, by including intra-picture prediction, a new 4x4 integer transform, multiple reference pictures, variable block sizes and quarter-pel precision for motion compensation, a de-blocking filter and improved entropy coding. Improved coding efficiency comes at the expense of added complexity in the coder/decoder. Therefore, H.264 utilizes some coding tools (methods) to reduce the implementation complexity; e.g. a multiplier-free integer transform is introduced, and the multiplication operation for the exact transform is combined with the multiplication of quantization. Noisy channel conditions, as in wireless networks, obstruct the perfect reception of the coded video bitstream at the decoder. Incorrect decoding caused by the lost data degrades the subjective picture quality and propagates to the subsequent blocks or pictures. So, H.264 utilizes some tools for error resilience against network noise: the parameter setting, flexible macroblock ordering, switched slice and redundant slice methods are added to the data partitioning used in previous standards. For particular applications, H.264 defines Profiles and Levels specifying restrictions on bitstreams, like some of the previous video standards. Three Profiles are defined to cover the various applications from wireless networks to digital cinema. These are described in Section 2.3 in detail.
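As an aside, the multiplier-free 4x4 integer transform mentioned above can be written as Y = C X C^T, where every entry of C is +/-1 or +/-2, so it can be realized with additions and one-bit shifts only. The following is a minimal illustrative sketch, not the reference implementation; the post-scaling that H.264 folds into quantization is omitted here.

```python
# Sketch of the H.264 4x4 forward core transform, Y = C.X.C^T.
# Plain matrix products are used for clarity; real encoders replace
# them with add/shift networks since C contains only +/-1 and +/-2.
import numpy as np

C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward_core_transform(residual_4x4):
    X = np.asarray(residual_4x4)
    return C @ X @ C.T          # integer arithmetic throughout

# Toy 4x4 residual block:
print(forward_core_transform(np.arange(16).reshape(4, 4) - 8))
```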


2.2 TERMINOLOGY

Some of the important terminology adopted in the H.264 standard is as follows. A field (of interlaced video) or a frame (of progressive or interlaced video) is encoded to produce a coded picture. A coded frame has a frame number (signaled in the bitstream), which is not necessarily related to decoding order, and each coded field of a progressive or interlaced frame has an associated picture order count, which defines the decoding order of fields. Previously coded pictures (reference pictures) may be used for inter prediction of further coded pictures. Reference pictures are organized into one or two lists (sets of numbers corresponding to reference pictures), described as list 0 and list 1. A coded picture consists of a number of macroblocks, each containing 16x16 luma samples and associated chroma samples (8x8 Cb and 8x8 Cr samples in the current standard). Within each picture, macroblocks are arranged in slices, where a slice is a set of macroblocks in raster scan order. An I slice may contain only I macroblock types (see below), a P slice may contain P and I macroblock types, and a B slice may contain B and I macroblock types. (There are two further slice types, SI and SP, which are not in the scope of this dissertation.)

I macroblocks are predicted using intra prediction from decoded samples in the current slice. A prediction is formed either (a) for the complete macroblock or (b) for each 4x4 block of luma samples (and associated chroma samples) in the macroblock. (An alternative to intra prediction, I_PCM mode, is described in Section 4.5.2.)

P macroblocks are predicted using inter prediction from reference picture(s). An inter coded macroblock may be divided into macroblock partitions, i.e. blocks of size 16x16, 16x8, 8x16 or 8x8 luma samples (and associated chroma samples). If the 8x8 partition size is chosen, each 8x8 sub-macroblock may be further divided into sub-macroblock partitions of size 8x8, 8x4, 4x8 or 4x4 luma samples (and associated chroma samples). Each macroblock partition may be predicted from one picture in list 0. If present, every sub-macroblock partition in a sub-macroblock is predicted from the same picture in list 0.

B macroblocks are predicted using inter prediction from reference picture(s). Each macroblock partition may be predicted from one or two reference pictures, one picture in list 0 and/or one picture in list 1. If present, every sub-macroblock partition in a sub-macroblock is predicted from (the same) one or two reference pictures, one picture in list 0 and/or one picture in list 1.

Figure 2-1: H.264 Baseline, Main and Extended profiles.

2.3 PROFILES AND LEVELS

H.264 [1] defines a set of three Profiles (2), each supporting a particular set of coding functions and each specifying what is required of an encoder or decoder that complies with the Profile. The Baseline Profile supports intra and inter-coding (using I-slices and P-slices) and entropy coding with context-adaptive variable-length codes (CAVLC). The Main Profile includes support for interlaced video, inter-coding using B-slices, inter-coding using weighted prediction and entropy coding using context-based adaptive binary arithmetic coding (CABAC). The Extended Profile does not support interlaced video or CABAC but adds modes to enable efficient switching between coded bitstreams (SP and SI slices) and improved error resilience (Data Partitioning). Potential applications of the Baseline Profile include videotelephony, videoconferencing and wireless communications; potential applications of the Main Profile include television broadcasting and video storage; and the Extended Profile may be particularly useful for streaming media applications. However, each Profile has sufficient flexibility to support a wide range of applications, and so these examples of applications should not be considered definitive. Figure 2-1 shows the relationship between the three Profiles and the coding tools supported by the standard. It is clear from this figure that the Baseline Profile is a subset of the Extended Profile, but not of the Main Profile.

(2) The latest draft, ITU-T Rec. (03/2010), defines a set of 17 profiles.

2.4 CODED DATA FORMAT

H.264 [1] makes a distinction between a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The output of the encoding process is VCL data, a sequence of bits representing the coded video data, which are mapped to NAL units prior to transmission or storage. Each NAL unit contains a Raw Byte Sequence Payload (RBSP), a set of data corresponding to coded video data or header information. A coded video sequence is represented by a sequence of NAL units (Figure 2-2) that can be transmitted over a packet-based network or a bitstream transmission link, or stored in a file. The purpose of separately specifying the VCL and NAL is to distinguish between coding-specific features (at the VCL) and transport-specific features.

Figure 2-2: Sequence of NAL units.

2.5 REFERENCE PICTURES

An H.264 encoder may use one or two (or even more, in the reference H.264 encoder) previously encoded pictures as a reference for motion-compensated prediction of each inter coded macroblock or macroblock partition. This enables the encoder to search for the best 'match' for the current macroblock partition from a wider set of pictures than just the previously encoded picture. Multiple reference frames result in significant compression efficiency, especially when the motion is periodic by nature, as is illustrated in Figure 2-3.


Figure 2-3: Multiple reference frames.

The encoder and decoder each maintain one or two lists of reference pictures, containing pictures that have previously been encoded and decoded (occurring before and/or after the current picture in display order). Inter coded macroblocks and macroblock partitions in P slices (see below) are predicted from pictures in a single list, list 0. Inter coded macroblocks and macroblock partitions in a B slice may be predicted from two lists, list 0 and list 1.

Table 2-2: H.264 slice modes.
- I (Intra): Contains only I macroblocks (each block or macroblock is predicted from previously coded data within the same slice). Profiles: All.
- P (Predicted): Contains P macroblocks (each macroblock or macroblock partition is predicted from one list 0 reference picture) and/or I macroblocks. Profiles: All.
- B (Bi-predictive): Contains B macroblocks (each macroblock or macroblock partition is predicted from a list 0 and/or a list 1 reference picture) and/or I macroblocks. Profiles: Extended and Main.
- SP (Switching P): Facilitates switching between coded streams; contains P and/or I macroblocks. Profile: Extended.
- SI (Switching I): Facilitates switching between coded streams; contains SI macroblocks (a special type of intra coded macroblock). Profile: Extended.

2.6 SLICES

A video picture (frame) is coded as one or more slices, each containing an integral number of macroblocks, from one (one MB per slice) to the total number of macroblocks in a picture (one slice per picture). The number of macroblocks per slice need not be constant within a picture. There is minimal inter-dependency between coded slices, which can help to limit the propagation of errors. There are five types of coded slice (Table 2-2) and a coded picture may be composed of different types of slices. For example, a Baseline Profile coded picture may contain a mixture of I and P slices, and a Main or Extended Profile picture may contain a mixture of I, P and B slices.
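Each coded slice is carried in its own NAL unit (Section 2.4). As a rough illustration of how a byte stream decomposes into such units, the sketch below splits an Annex-B formatted stream on its start codes and reads the one-byte NAL header; the input file name is a hypothetical placeholder.

```python
# Hedged sketch: split an H.264 Annex-B byte stream into NAL units.
# Start codes are 0x000001 (or 0x00000001); the first payload byte is
# the NAL header: forbidden_zero_bit (1 bit), nal_ref_idc (2 bits)
# and nal_unit_type (5 bits). Types 1 and 5 are non-IDR and IDR
# coded slices respectively.

def split_nal_units(stream: bytes):
    units, pos = [], 0
    while (start := stream.find(b"\x00\x00\x01", pos)) != -1:
        start += 3
        end = stream.find(b"\x00\x00\x01", start)
        payload = stream[start:] if end == -1 else stream[start:end]
        units.append(payload.rstrip(b"\x00"))  # drop trailing zeros (e.g. of a 4-byte start code)
        if end == -1:
            break
        pos = end
    return units

with open("clip.264", "rb") as f:              # hypothetical input file
    for nal in split_nal_units(f.read()):
        print("nal_ref_idc =", (nal[0] >> 5) & 0x3,
              "nal_unit_type =", nal[0] & 0x1F)
```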


Figure 2-4 shows a simplified illustration of the syntax of a coded slice. The slice header defines (among other things) the slice type and the coded picture that the slice 'belongs' to, and may contain instructions related to reference picture management. The slice data consists of a series of coded macroblocks and/or an indication of skipped (not coded) macroblocks. Each MB contains a series of header elements and coded residual data.

Figure 2-4: Slice syntax.

Table 2-3: Macroblock syntax elements.
- mb_type: Determines whether the macroblock is coded in intra or inter (P or B) mode; determines the macroblock partition size.
- mb_pred: Determines intra prediction modes (intra macroblocks); determines list 0 and/or list 1 references and differentially coded motion vectors for each macroblock partition (inter macroblocks, except for inter MBs with 8x8 macroblock partition size).
- sub_mb_pred: (Inter MBs with 8x8 macroblock partition size only.) Determines sub-macroblock partition size for each sub-macroblock; list 0 and/or list 1 references for each macroblock partition; differentially coded motion vectors for each macroblock sub-partition.
- coded_block_pattern: Identifies which 8x8 blocks (luma and chroma) contain coded transform coefficients.
- mb_qp_delta: Changes the quantizer parameter.
- residual: Coded transform coefficients corresponding to the residual image samples after prediction.

2.7 MACROBLOCKS

A macroblock contains coded data corresponding to a 16x16 sample region of the video frame (16x16 luma samples, 8x8 Cb and 8x8 Cr samples) and contains the syntax elements described in Table 2-3. Macroblocks are numbered (addressed) in raster scan order within a frame.
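To make the raster-scan addressing concrete, here is a tiny helper (an illustration, not part of the standard text) that maps a macroblock address to the position of its top-left luma sample.

```python
# Sketch: macroblock address -> top-left luma sample position, for a
# frame whose width is a multiple of 16 (e.g. CIF, 352x288).
MB_SIZE = 16

def mb_position(addr, frame_width):
    mbs_per_row = frame_width // MB_SIZE
    return (addr % mbs_per_row) * MB_SIZE, (addr // mbs_per_row) * MB_SIZE

print(mb_position(25, 352))   # -> (48, 16): 4th MB of the 2nd MB row
```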


2.8 TECHNICAL OVERVIEW

In common with earlier standards (such as MPEG-1, MPEG-2 and MPEG-4), the H.264 draft standard does not explicitly define a CODEC (enCOder/DECoder pair). Rather, the standard defines the syntax of an encoded video bitstream together with the method of decoding this bitstream. In practice, however, a compliant encoder and decoder are likely to include the functional elements shown in Figure 2-5 and Figure 2-6. Whilst the functions shown in these figures are likely to be necessary for compliance, there is scope for considerable variation in the structure of the CODEC. The basic functional elements (prediction, transform, quantization, entropy encoding) are little different from previous standards (MPEG-1, MPEG-2, MPEG-4, H.261, H.263); the important changes in H.264 occur in the details of each functional element.

The Encoder (Figure 2-5) includes two dataflow paths, a "forward" path (left to right, shown in blue) and a "reconstruction" path (right to left, shown in magenta). The dataflow path in the Decoder (Figure 2-6) is shown from right to left to illustrate the similarities between Encoder and Decoder.

2.8.1 Encoder (forward path)

An input frame F_n is presented for encoding. The frame is processed in units of a macroblock (corresponding to 16x16 pixels in the original image). Each macroblock is encoded in intra or inter mode. In either case, a prediction macroblock P is formed based on a reconstructed frame. In intra mode, P is formed from samples in the current frame n that have previously been encoded, decoded and reconstructed (uF'_n in the figures; note that the unfiltered samples are used to form P). In inter mode, P is formed by motion-compensated prediction from one or more reference frame(s). In the figures, the reference frame is shown as the previous encoded frame F'_{n-1}; however, the prediction for each macroblock may be formed from one or two past or future frames (in time order) that have already been encoded and reconstructed. The prediction P is subtracted from the current macroblock F_n to produce a residual or difference macroblock D_n. This is transformed (using a block transform) and quantized to give X, a set of quantized transform coefficients. These coefficients are re-ordered and entropy encoded. The entropy encoded coefficients, together with side information required to decode the macroblock (such as the macroblock prediction mode, quantizer step size, motion vector information describing how the macroblock was motion-compensated, etc.), form the compressed bitstream. This is passed to a Network Abstraction Layer (NAL) for transmission or storage.

Figure 2-5: H.264 Encoder.

2.8.2 Encoder (reconstruction path)

The quantized macroblock coefficients X are decoded in order to reconstruct a frame for encoding of further macroblocks. The coefficients X are re-scaled (Q^-1) and inverse transformed (T^-1) to produce a difference macroblock D'_n. This is not identical to the original difference macroblock D_n; the quantization process introduces losses, and so D'_n is a distorted version of D_n. The prediction macroblock P is added to D'_n to create a reconstructed macroblock uF'_n (a distorted version of the original macroblock). A filter is applied to reduce the effects of blocking distortion, and a reconstructed reference frame is created from a series of macroblocks uF'_n.

2.8.3 Decoder

The decoder receives a compressed bitstream from the NAL. The data elements are entropy decoded and reordered to produce a set of quantized coefficients X. These are rescaled and inverse transformed to give D'_n (this is identical to the D'_n shown in the Encoder). Using the header information decoded from the bitstream, the decoder creates a prediction macroblock P, identical to the original prediction P formed in the encoder. P is added to D'_n in order to produce uF'_n, which is filtered to create the decoded macroblock F'_n. It should be clear from the figures and from the discussion above that the purpose of the reconstruction path in the encoder is to ensure that both encoder and decoder use identical reference frames to create the prediction P. If this is not the case, then the predictions P in encoder and decoder will not be identical, leading to an increasing error or "drift" between the encoder and decoder.

Figure 2-6: H.264 Decoder.
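The role of the reconstruction path can be demonstrated with a deliberately simplified sketch: a scalar quantizer stands in for the transform/quantization pair, and the previous reconstructed frame stands in for motion-compensated prediction. The structural point, that both sides must predict from the same reconstructed data or drift accumulates, carries over to the real codec.

```python
# Toy sketch of the dataflow in Figures 2-5 and 2-6 (illustrative
# stand-ins, not the H.264 tools themselves).
import numpy as np

QSTEP = 8
quantize   = lambda D: np.round(D / QSTEP).astype(int)   # stands in for T + Q
dequantize = lambda X: X * QSTEP                          # stands in for Q^-1 + T^-1

def encode_frame(F_n, ref_recon):
    P = ref_recon                    # stand-in for motion-compensated prediction
    X = quantize(F_n - P)            # forward path: residual D_n -> coefficients X
    recon = P + dequantize(X)        # reconstruction path: uF'_n (filtering omitted)
    return X, recon

def decode_frame(X, ref_recon):
    P = ref_recon                    # same prediction as the encoder
    return P + dequantize(X)         # decoded frame F'_n

frames = [np.full((4, 4), v, dtype=float) for v in (100.0, 103.0, 109.0)]
enc_ref = dec_ref = np.zeros((4, 4))
for F in frames:
    X, enc_ref = encode_frame(F, enc_ref)
    dec_ref = decode_frame(X, dec_ref)
    assert np.array_equal(enc_ref, dec_ref)   # identical references: no drift
```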


3 Inter Prediction

3.1 INTRODUCTION

The goal of the inter prediction is to reduce redundancy between transmitted frames by forming a predicted frame and subtracting this from the current frame. The output of this process is a residual (difference) frame, and the more accurate the prediction process, the less energy is contained in the residual frame. The residual frame is encoded and sent to the decoder, which re-creates the predicted frame, adds the decoded residual and reconstructs the current frame. The key part of the inter prediction is the block based motion estimation-compensation. The motion estimation deals with finding the best match of the current block (sub-block), whilst the motion compensation refers to the predicted block (sub-block), which is the result (residuals) of the subtraction of the original block (sub-block) from its best match. The residuals are encoded and transmitted together with a motion vector describing the position of the best matching block (sub-block), relative to the current macroblock position.

As specified in H.264 [1], there are 7 different block sizes, also known as modes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4), that can be used in motion estimation-compensation. These different block sizes actually form a two-level hierarchy inside a macroblock. The first level comprises block sizes of 16x16, 16x8 or 8x16. In the second level, the macroblock is specified as P8x8 type, of which each 8x8 block can be one of the subtypes 8x8, 8x4, 4x8 or 4x4. This macroblock partitioning is depicted in Figure 3-1.

Figure 3-1: Macroblock partitions.
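Counting the blocks produced by this two-level hierarchy shows how the number of motion vectors per macroblock can grow; the sketch below is illustrative only.

```python
# Sketch of the two-level partitioning: level one splits the 16x16
# macroblock; in a P8x8 macroblock each 8x8 block may be split again.
LEVEL1   = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUBMODES = [(8, 8), (8, 4), (4, 8), (4, 4)]

def blocks_in_mb(mode, submode=None):
    w, h = mode
    n = (16 // w) * (16 // h)
    if mode == (8, 8) and submode is not None:
        sw, sh = submode
        n *= (8 // sw) * (8 // sh)      # each 8x8 block split further
    return n

for mode in LEVEL1:
    print(mode, "->", blocks_in_mb(mode), "motion vector(s)")
print("worst case (P8x8, all 4x4):", blocks_in_mb((8, 8), (4, 4)))  # 16
```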


In order to choose the best block size for a macroblock, the H.264 reference code makes use of a computationally intensive Lagrangian Rate-Distortion (RD) optimization, the general form of which is:

J_{mode} = SSD + \lambda_{mode} \times R    (3-1)

where \lambda_{mode} is the Lagrange multiplier used in mode decision, and R reflects the number of bits associated with choosing the mode and the macroblock quantizer value Qp, including the bits for the macroblock header, the motion vector(s) and all the DCT residue blocks. SSD is the sum of the squared differences. For a block of M x N, the SSD is calculated as:

SSD(s, c(m)) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} ( s(x, y) - c(x - m_x, y - m_y) )^2    (3-2)

where s is the pixel value of the current block, c is the value of the reconstructed reference block and m is the motion vector.

In the H.264 reference code, the motion estimation and the inter mode decision are executed together. For each mode, motion estimation is done first and the resulting motion cost is used for the mode decision. The inter mode decision is therefore an extremely time consuming process. For each position in the search window, motion estimation has to be performed in order to find the motion vector that minimizes eq. (3-3):

J_{motion} = SAD + \lambda_{motion} \times R    (3-3)


where $\lambda_{motion}$ is the Lagrange multiplier used in motion estimation, R is the number of bits associated with the motion vectors and SAD is the sum of the absolute differences. For a block of M × N, the SAD is calculated as:

$SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| C(i, j) - S(i + m, j + n) \right|$   (3-4)

where C(i, j) and S(i + m, j + n) represent the pixels (i, j) in the current luma block and the candidate luma block, respectively. Then eq. (3-2) is calculated, using the best match resulting from eq. (3-4), for every mode, in order to choose the best mode.

3.2 PROBLEM FORMULATION

3.2.1 Inter prediction complexity

As shown in Section 3.1, H.264 [1] has various motion estimation-compensation units (inter prediction modes) of sizes 16×16, 16×8, 8×16, 8×8 and sub8×8. For sub8×8 there are four further sub-partitions: sub8×8, sub8×4, sub4×8 and sub4×4. Moreover, quarter-pixel motion compensation can be applied. Such a wide choice of block sizes greatly improves coding efficiency, but at the expense of a largely increased inter prediction time. The computational complexity becomes even higher when larger search ranges, bi-directional prediction and multiple reference frames are used. It has been observed that, in the case of an exhaustive search over all candidate blocks, up to 80% of the computational power of the encoder is consumed by motion estimation. Such high computational complexity is often a bottleneck for real-time applications.

Various motion estimation methods, legacies of previous standards as well as new ones, have been applied to H.264 in an attempt to reduce the inter prediction complexity. However, these methods have had limited success, mainly due to the new inter prediction scheme introduced by H.264. For example, no matter how effective a motion estimation algorithm is, it needs to be executed for every single mode (16×16, 16×8, 8×16, etc.). Moreover, it must be executed for every reference frame, backwards and forwards. These requirements increase the complexity and eventually the time of the inter prediction. In this dissertation we propose various methods which exploit the new inter prediction scheme of H.264 and, combined with existing motion estimation methods, reduce the inter prediction complexity drastically.
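To make the cost computation concrete, here is a minimal Python sketch of eq. (3-4) and eq. (3-3) for a single candidate displacement. It is an illustration rather than the reference encoder's code: the blocks are NumPy arrays, bounds checking is omitted, and lam and mv_bits stand in for $\lambda_{motion}$ and the rate term R.

```python
import numpy as np

def sad(current, reference, m, n):
    """Sum of absolute differences, eq. (3-4), between the current block and
    the candidate block displaced by (m, n) in the reference frame."""
    h, w = current.shape
    candidate = reference[m:m + h, n:n + w].astype(int)
    return int(np.abs(current.astype(int) - candidate).sum())

def motion_cost(current, reference, m, n, lam, mv_bits):
    """Lagrangian motion cost J = SAD + lambda * R, eq. (3-3); mv_bits is a
    stand-in for the bits needed to code the candidate motion vector."""
    return sad(current, reference, m, n) + lam * mv_bits
```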


3.2.2 Special video applications

The new inter prediction scheme introduced by H.264 increases the encoder's complexity on the one hand, but on the other hand it makes possible the development of various methods applied to special video applications. In this dissertation we present an object detection method which exploits the motion vectors generated by the inter prediction. Previous standards could not make such use of the motion vectors, because the vectors were few and applied only to 16×16 blocks within a frame.

3.3 SOLUTIONS

In the following sections we present four novel methods. Three of them, the Fast Successive Elimination (Section 3.4), the Spatio-Temporal Predictor (Section 3.5) and the Fast Multiple Reference Frame Selector (Section 3.6), interfere with the existing inter prediction process of the H.264 reference encoder and aim at reducing the encoding time. Their major advantage is that they can easily be combined with many other inter prediction techniques and make them more effective. These methods are considered enhancements, according to the definition of the term given in the Introduction. The fourth method (Section 3.7) detects a moving object within an H.264 video sequence. The method works directly in the compressed domain and is thus suitable for real-time applications. It is possible only due to the nature of the H.264 inter prediction, which generates a sufficient amount of motion vectors. This method falls into the category of applied methods, as explained in the Introduction. In this way we demonstrate how various applications can take advantage of the new characteristics of the H.264 encoder.


3.4 FAST SUCCESSIVE ELIMINATION ALGORITHM

3.4.1 Literature review

A common technique to speed up the full search is the successive elimination technique proposed by Li and Salari [2]. Its basic idea is to obtain the best estimate of the motion vectors by successively eliminating search positions in the search window, thus decreasing the number of matching evaluations. Variations of this technique have been proposed in [3-7]. To decrease the amount of computation of the full search algorithm, Jong-Nam Kim and Tae-Sun Choi [8] propose a fast block-matching algorithm based on an adaptive matching scan and representative pixels. Chen-Fu Lin and Jin-Jang Leou [9] propose a fast full search method which reduces the sum of absolute differences (SAD) computations. The algorithm of Ahmad et al. [10] takes advantage of the correlation between motion vectors, uses controls to curb the search, avoids searching stationary regions and uses switchable shape search patterns to accelerate the motion search. Yan-Ho Kam and Wan-Chi Siu [11] propose two new fast full search (FFS) methods. The first combines the concepts of the conventional SAD-reuse method and the row-based partial distortion search (PDS), performing considerably better than either conventional method for variable block size motion estimation. The second speeds up multi-frame motion estimation by using the results already obtained in previously considered reference frames to set extra thresholds for rejecting search points earlier in subsequent reference frames. Xuan Jing and Lap-Pui Chau [12] propose a fast full search method using a predictive search area. Lung-Chun Chang et al. [13] propose a fast full search method using an adaptive search order, so that the best-matched block is found at an early search stage. Tian Song et al. [14] propose a fast full search method using an adaptive search range.

In this section we propose a new fast full search algorithm, which reduces the computations during the full search by applying a two-level search range adaptation technique. The proposed algorithm achieves a considerable speed-up of the full search motion estimation process by applying a fast Successive Elimination Algorithm (SEA) and by reducing the search area around an adaptive search center. In the following paragraphs we shall first describe the fast SEA that we developed and then give an overall description of the proposed algorithm.


3.4.2 Full search in the H.264 reference encoder

The motion estimation in H.264 is performed over the inter prediction modes (Figure 3-1). Furthermore, rate-constrained motion estimation is utilized, where the criterion for finding the optimum motion vector is the minimization of a Lagrangian cost functional (eq. (3-3)). The full search motion estimation is performed within a search area around a search center. The center is placed on the position pointed to by the motion vector (MV) prediction used for conducting the differential coding of the MVs [1], because the true motion vectors generally have a high correlation with the predicted MV. Furthermore, the full search algorithm scans the search area in a spiral fashion. Typical values of the search range are 8×8, 16×16 and 32×32. The H.264 reference encoder incorporates two full search algorithms. The first is a typical full search, which calculates the SAD for each block size separately and finally finds the best block match by minimizing eq. (3-3). The second is a fast full search, which calculates only the SADs of the 4×4 sub-blocks of every block; the SADs of the other block sizes are obtained by merging the 4×4 SADs.

3.4.3 Fast SEA

SEA [2] is based on the following inequality:

$|a + b| \le |a| + |b|$   (3-5)

If we apply eq. (3-5) to eq. (3-4), it can be shown that:

$SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| C(i, j) - S(i + m, j + n) \right| \ge \left| C_0 - S_0 \right| \equiv sea(m, n)$   (3-6)

where $C_0$ and $S_0$ are the sum norms of the current block and the candidate block, respectively.


$C_0 = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} C(i, j), \quad S_0 = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} S(i + m, j + n)$   (3-7)

The workflow of the SEA algorithm is as follows:

Step 1: Calculate the SAD of the first search point. This is considered the current minimum SAD ($SAD_{min}$).
Step 2: Calculate the sea value for the next search point. If it is larger than $SAD_{min}$, skip this point. Otherwise calculate the new SAD and update $SAD_{min}$.
Step 3: Proceed with the next search point and repeat Step 2 until all of the search points have been examined.

Apparently, the speed of the SEA highly depends on the fast calculation of the sum norms $C_0$ and $S_0$. $C_0$ is calculated once, while $S_0$ is traditionally calculated using the frame method, first described by Li and Salari [2]. However, this method can only be applied to blocks of fixed size, usually N×N, whereas the motion estimation in H.264 is performed over a variety of block sizes (prediction modes 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4). A variation of the SEA, denoted the Multilevel Successive Elimination Algorithm (MSEA) [6,7], has been proposed by many authors, partly to increase the efficiency of the SEA but mostly to overcome the problem of the variable block sizes in H.264. In the MSEA the N×M blocks are divided into K sub-blocks, and the sum norms of the sub-blocks are accumulated to obtain the msea value, as shown in (3-8):

$SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| C(i, j) - S(i + m, j + n) \right| \ge \sum_{k=0}^{K-1} \left| C_k - S_k \right| \equiv msea(m, n)$   (3-8)

where $C_k$ and $S_k$ are the sum norms of the k-th sub-block of the current block C and of the candidate block S, respectively.
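The following Python fragment sketches the basic SEA pruning loop of the workflow above. It is illustrative only: the sum norms are computed naively here, whereas the method described next computes them through a summed area table, and the rate term of eq. (3-3) is ignored.

```python
import numpy as np

def sea_search(current, reference, points):
    """Successive elimination over candidate displacements, after [2].
    'points' yields (m, n) displacements; the first one seeds SAD_min."""
    h, w = current.shape
    c0 = int(current.sum())                        # sum norm of the current block
    best, sad_min = None, None
    for m, n in points:
        cand = reference[m:m + h, n:n + w].astype(int)
        # eq. (3-6): if |C0 - S0| >= SAD_min, this point cannot improve, skip it
        if sad_min is not None and abs(c0 - int(cand.sum())) >= sad_min:
            continue
        sad = int(np.abs(current.astype(int) - cand).sum())
        if sad_min is None or sad < sad_min:
            best, sad_min = (m, n), sad
    return best, sad_min
```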


In many cases, however, this approach increases the complexity of the motion estimation. The reason is that the SEA/MSEA bound is not close enough to the true SAD, so quite often the SAD for a given search point must be calculated in addition to the SEA/MSEA value; in that case the SEA/MSEA is clearly an overhead. We therefore conclude that there are two requirements for an SEA method to be efficient:

1. The minimum SAD must be found as close as possible to the initial search point, so that the inequality (3-6) rejects candidates more often.
2. The SEA must be calculated as fast as possible, so that even when the SAD must also be calculated, the impact of the SEA calculation is negligible.

In order to satisfy the first requirement we propose a new method, which adapts the initial search point and the search range according to the results of the motion estimation for the 16×16 block, as we shall show later. In order to satisfy the second requirement we adopted the method proposed by Franklin Crow [16] in 1984. This method was originally used for texture mapping, but it can also be used for calculating the sum norms. Its basic idea is to map a frame to a "Summed Area Table (SAT)", as described below (the description follows [17]).

Given a frame, let g(m, n) be the pixel intensity at (m, n). The SAT value at pixel (m, n), denoted $I_g(m, n)$, is defined as the sum of the values g(x, y) over the region above and to the left of pixel (m, n), inclusive (Figure 3-2). That is:

$I_g(m, n) = \sum_{x=0}^{m} \sum_{y=0}^{n} g(x, y)$   (3-9)

Let $R_g(m, n)$ denote the cumulative row sum of the pixel intensities, defined as:

$R_g(m, n) = \sum_{x=0}^{m} g(x, n)$   (3-10)

Assuming $R_g(-1, n) = 0$ and $I_g(m, -1) = 0$, one can compute $I_g(m, n)$ in one pass by using two recursive formulas:

$R_g(m, n) = R_g(m - 1, n) + g(m, n), \quad I_g(m, n) = I_g(m, n - 1) + R_g(m, n)$   (3-11)

Hence, for a frame of W × H pixels, only 2 × W × H additions are required to compute the whole Summed Area Table (SAT). Using the SAT, the sum norm of any rectangular block can be computed with three arithmetic operations: one addition and two subtractions. This can be seen in Figure 3-3, where the sum norm (SN) of block D is computed from the four corresponding SAT values at the corners of the block as follows:

$SN_D = \sum_{x=r+1}^{m} \sum_{y=s+1}^{n} g(x, y) = I_g(m, n) - I_g(r, n) - I_g(m, s) + I_g(r, s)$   (3-12)
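The following Python sketch shows the SAT construction of eq. (3-11) and the three-operation block-sum query of eq. (3-12). It is an illustration under the indexing assumptions stated in the comments, not the thesis implementation.

```python
import numpy as np

def summed_area_table(frame):
    """I_g of eq. (3-11): two cumulative-sum passes over the frame."""
    return frame.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def block_sum_norm(sat, r, s, m, n):
    """Sum norm of block D spanning rows r+1..m and columns s+1..n,
    eq. (3-12): one addition and two subtractions per query.
    r = -1 or s = -1 means the block touches the frame border."""
    total = sat[m, n]
    if r >= 0:
        total -= sat[r, n]
    if s >= 0:
        total -= sat[m, s]
    if r >= 0 and s >= 0:
        total += sat[r, s]
    return int(total)
```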


Figure 3-2: The value g(m, n) is the pixel intensity at (m, n), while $I_g(m, n)$ is the sum of the values g(x, y) over the region above and to the left of pixel (m, n), inclusive.

Figure 3-3: The sum norms of block D can be computed by using the four SATs at the block boundaries.

3.4.3.1 Cost analysis

To facilitate the analysis, we assume that each addition, subtraction and conditional operation, including the calculation of an absolute value, costs one operation. According to [2], the total number of computations required to obtain the sum norms of all the blocks in the reference frame is:


$T = 4 \times W \times H - (H - N) \times (N + 3) - 3 \times W \times (N + 1)$   (3-13)

If we consider a QCIF video frame (W = 176, H = 144) divided into 16×16 blocks (N = 16), T equals 89968 operations. If the 16×16 blocks are further divided into 4×4 sub-blocks (N = 4), as required by the MSEA methods, then T equals 99716 operations. The calculation of each MSEA value in eq. (3-8) additionally costs 48 operations (K = 16 for N = 4). If we consider the motion estimation of the 16×16 blocks only, the computation overhead for each reference 16×16 block is:

$T / (\text{total blocks per frame}) + 48 = 99716 / 99 + 48 \approx 1055$   (3-14)

On the other hand, using the proposed SAT method, it can be shown from eq. (3-11) and eq. (3-12) that the computation overhead for each reference 16×16 block is:

$2 \times W \times H / (\text{total blocks per frame}) + 3 = 50688 / 99 + 3 = 515$   (3-15)

The great advantage of the SAT method is that the overhead of eq. (3-15) remains the same for every block size, whilst the overhead of eq. (3-14) varies according to the current block size and to the chosen number K of sub-blocks in eq. (3-8). The proposed method uses the SAT for calculating both of the sum norms $C_0$ and $S_0$, in order to save as many computations as possible.

3.4.4 Two-level motion estimation

The proposed method performs a two-level motion estimation. At the first level, it performs the motion estimation for the 16×16 blocks exactly as the conventional full search algorithm does (Section 3.4.2): the motion estimation of the 16×16 macroblock is performed over a number of search points within a search area (typically of 16×16 points) in order to find its best match in the reference frame. At the second level, the method moves the search center to the best matching block position found at the first level. Moreover, it reduces the search range according to the distance between the initial search center and the best matching block position, as illustrated in Figure 3-4.


Figure 3-4: The two-level motion estimation.

The workflow of the proposed method is as follows:

Step 1: Perform the motion estimation and find the best match for the 16×16 mode around the MV-predicted search center. Let the initial search center be at position 1, (0, 0), and the best match at position 2, (x, y), as shown in Figure 3-4 (Level A).
Step 2: Move the search center to position 2, (x, y).
Step 3: Reduce the search range according to formula (3-16).
Step 4: Perform the motion estimation for the remaining modes (16×8, 8×16, 8×8, 8×4, 4×8 and 4×4), as shown in Figure 3-4 (Level B).

$SearchRange = \max(x, y, 5)$   (3-16)

Notice that the adapted search range of eq. (3-16) cannot be less than 5 positions. The reason is that max(x, y) may be too small, in which case the best match might be located outside the search range. Besides, it has been reported that about 93% of the best matching points are located in the 5×5 area near the search center [18].
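A one-line sketch of the second-level range adaptation (hypothetical helper; the displacement components are taken in absolute value, which eq. (3-16) leaves implicit):

```python
def adapted_search_range(x, y, minimum=5):
    """Second-level search range of eq. (3-16): shrink the window to the
    distance of the 16x16 best match, but never below 5 positions."""
    return max(abs(x), abs(y), minimum)
```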


3.4.5 Simulation Results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software are shown in Table 3-1; the rest of the parameters retained their default values. The proposed algorithm was tested against the conventional full search and the fast full search algorithms used by the reference encoder (JM12.0).

In our tests we enabled the bit rate control mechanism of the encoder and set a 45 kbps bit rate constraint. With bit rate control enabled, the quantization parameters were automatically controlled by the encoder, which had to generate a bit rate lower than or equal to the constraint. For our tests we used 200 frames of three well-known representative video sequences in QCIF format (YUV 4:2:0): mother&daughter (Class A), foreman (Class B) and mobile (Class C). The testing procedure (ref. Appendix II.4) was to run the reference encoder with and without our algorithm and then compare the results with respect to the bit rate, the PSNR and the encoding time. We used the bit rate, encoding time and PSNR variations as comparative metrics, calculated as in eq. (I-1), (I-2) and (I-5), respectively. The results are shown in Table 3-2. The variations are between the proposed method and the Full Search (FS) algorithm, and between the proposed method and the Fast Full Search (FFS) algorithm, both used by the reference H.264 encoder (JM12.0).

From the results we see that the proposed algorithm achieves a significant average reduction of 53.57% in motion estimation time compared to the full search algorithm, at the cost of a 0.01 dB average loss in PSNR. Moreover, it results in a small bit rate reduction (0.087% on average). It also achieves an average reduction of 32.34% in motion estimation time compared to the fast full search algorithm. The PSNR and bit rate variations obviously remain the same in this comparison, since the full search and fast full search algorithms produce identical PSNR and bit rate outcomes.


3.4.6 Conclusions

The proposed method achieves a considerable speed-up of the full search motion estimation process by applying a fast Successive Elimination Algorithm (SEA) and by reducing the search area around an adaptive search center. Moreover, it leaves the PSNR and the bit rate practically unaffected.

Table 3-1: Configuration parameters of the encoder.

Profile: Baseline
Number of Frames: 200
Frame Rate: 30 fps
Reference frames: 5
RD Optimization: Fast High Complexity Mode
Motion Estimation: Full Search & Fast Full Search
Intra Period: 0 (only the first frame is intra)
Symbol Mode: UVLC
Bit Rate Control: Enabled
Bit Rate Constraint: 45000 b/s

Table 3-2: Simulation results.

Sequence            PSNR Var. (dB)   Bit Rate Var. (%)   Enc. Time Var. vs FS (%)   Enc. Time Var. vs FFS (%)
Mother & Daughter   -0.01            -0.17               -56.56                     -42.13
Foreman              0.00             0.01               -49.40                     -25.61
Mobile              -0.02            -0.10               -54.75                     -29.26
Average             -0.01            -0.087              -53.57                     -32.34


3.5 SPATIO-TEMPORAL PREDICTOR FOR MOTION ESTIMATION

3.5.1 Literature review

Many fast motion estimation (FME) techniques have been proposed in the literature [19-24]. Two popular approaches are used to reduce the computation of block matching motion estimation. The first reduces the number of candidate blocks in the search window (fast searching techniques); these algorithms usually show a good speed gain but relatively larger rate-distortion (R-D) performance degradation. The second reduces the complexity of the SAD computation (fast matching techniques); these algorithms often achieve good coding efficiency but limited speed-up. Other techniques include predictive spatio-temporal search, adaptive early termination and dynamic search range adjustment. It is possible to combine several of the above techniques into a hybrid search method. For example, PMVFAST [25] and UMHexagonS [26] utilize prediction, diamond search, hexagon search, partial distortion and adaptive early termination, and have proven more robust than any single search strategy.

Reference software is often optimized for coding efficiency rather than encoding speed, because R-D performance is the paramount concern during the standardization process. The reference H.264 encoder adopted three FME algorithms due to their competitive R-D performance relative to Full Search: UMHexagonS [26], simplified UMHexagonS [27] and EPZS [28]. The first two make use of the well-known median predictor, described in the standard, in order to find a better search center, and then perform a limited search around this center. The EPZS algorithm, on the other hand, defines sets of predicted search points which are likely to give the best match; for that purpose it uses various predictors, such as the median predictor and temporal predictor(s).

In the following section we study the effectiveness of the different predictors used by the EPZS algorithm, in order to verify that their use is justified. In addition, we take the results of the EPZS study into account in order to form a new predictor, which may substitute the median predictor used in [26] and [27], or may also be included in the set of predictors used by the EPZS.


3.5.2 Effectiveness of the EPZS predictors

The EPZS [28] is considered the most advanced of the three fast motion estimation algorithms used by the H.264 reference code. The basic idea of EPZS is to reduce the candidate search points by predicting points which are likely to give good results. For that purpose EPZS uses various search point predictors, such as the well-known median predictor, the (0, 0) position, the motion vectors of the adjacent blocks in the current frame, the motion vectors of the collocated block and of its adjacent blocks in the reference frame, and many others. In particular, the motion vector predictor, which plays a key role in the motion estimation, is calculated in the following way [15].

Let E be the current macroblock, macroblock partition or sub-macroblock partition. Let A be the partition or sub-partition immediately to the left of E, B the partition or sub-partition immediately above E, and C the partition or sub-macroblock partition above and to the right of E. If there is more than one partition immediately to the left of E, the topmost of these partitions is chosen as A. If there is more than one partition immediately above E, the leftmost of these is chosen as B. Figure 3-5 illustrates the choice of neighboring partitions when all the partitions have the same size. Figure 3-6 shows an example of the choice of prediction partitions when the neighboring partitions have different sizes from the current partition E.

Figure 3-5: Current and neighboring macroblocks.


Figure 3-6: Current and neighboring macroblock partitions.

The motion vector predictor $MV_p$ is calculated as follows (a small sketch of rule 1 follows the list):

1. For transmitted partitions, excluding the 16×8 and 8×16 partition sizes, $MV_p$ is the median of the motion vectors of partitions A, B and C.
2. For 16×8 partitions, $MV_p$ for the upper 16×8 partition is predicted from B, and $MV_p$ for the lower 16×8 partition is predicted from A.
3. For 8×16 partitions, $MV_p$ for the left 8×16 partition is predicted from A, and $MV_p$ for the right 8×16 partition is predicted from C.
4. For skipped macroblocks, a 16×16 vector $MV_p$ is generated as in case (1), i.e. as if the block were encoded in 16×16 inter mode.
5. If one or more of the previously transmitted blocks is not available (e.g. if it is outside the current slice), the choice of $MV_p$ is modified accordingly.
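As an illustration of rule (1), a minimal component-wise median in Python (a hypothetical helper, not the reference code; each motion vector is an (x, y) tuple):

```python
def median_predictor(mv_a, mv_b, mv_c):
    """Component-wise median of the motion vectors of partitions A, B and C."""
    def med3(a, b, c):
        return sorted((a, b, c))[1]
    return (med3(mv_a[0], mv_b[0], mv_c[0]),
            med3(mv_a[1], mv_b[1], mv_c[1]))
```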


We examined 11 video sequences in QCIF format. The H.264 encoder was configured with the default parameters of the baseline profile, and the results are shown in Figure 3-7, which plots the percentage (%) contribution of each predictor over the different video sequences. It is clear that the median predictor is the dominant predictor; the second best appears to be the (0, 0) position. Moreover, the motion vectors of the adjacent blocks in the current frame (Left, Up, UpRight, UpLeft, Mem Left, Mem Up, Mem UpRight) make a significant contribution. The contribution of the motion vector of the collocated block and the contributions of the motion vectors of its adjacent blocks in the reference frame are all summed under the label "Collocate" for simplicity; in practice this number is spread over 9 different predictors. The "Block Type" predictors also make a considerable contribution. Finally, the "Window type" predictors seem to contribute negligibly to the motion estimation and might have been skipped.

Figure 3-7: Effectiveness of the EPZS predictors.

3.5.3 Spatio-temporal predictor

It was found in the previous section that the median predictor is the most reliable and has the highest probability of being the true predictor, especially for nonzero-biased sequences. On the other hand, the collocated (0, 0) prediction is more suitable for sequences which contain a lot of stationary data, i.e. where a block is exactly the same as the one at the same position in the previous frame. Finally, the prediction based on the motion vector of the collocated macroblock is better in a number of cases. Apparently, each predictor by itself performs well for specific sequences and not so well for others. The proposed predictor combines the aforementioned predictors in order to form a new predictor which covers a wider range of video sequences.

Let mv be the desired predictor, col_mv the motion vector of the collocated macroblock and med_mv the median predictor, as illustrated in Figure 3-8. We distinguish the following cases:


Figure 3-8: Spatio-temporal predictor.

3.5.3.1 Stationary block

Condition: Both of the coordinates x, y of col_mv are zero.

Choice:

$mv(x, y) = (0, 0)$   (3-17)

3.5.3.2 Vertical movement

Condition: The x coordinates of both col_mv and med_mv are zero.

Choice: If $col\_mv_y > 2$ we consider the movement to be fast and we set

$mv(x, y) = (0, \max(col\_mv_y, med\_mv_y))$   (3-18)

Otherwise we set

$mv(x, y) = (0, \min(col\_mv_y, med\_mv_y))$   (3-19)

3.5.3.3 Horizontal movement

Condition: The y coordinates of both col_mv and med_mv are zero.


Choice: If $col\_mv_x > 2$ we consider the movement to be fast and we set

$mv(x, y) = (\max(col\_mv_x, med\_mv_x), 0)$   (3-20)

Otherwise we set

$mv(x, y) = (\min(col\_mv_x, med\_mv_x), 0)$   (3-21)

3.5.3.4 The current block is moving in the same direction and at the same speed as the collocated block

Conditions:

Same direction: both col_mv and med_mv lie in the same quadrant, as in Figure 3-8.

Same speed:

$med\_mv_x - 2 \le col\_mv_x \le med\_mv_x + 2$ and $med\_mv_y - 2 \le col\_mv_y \le med\_mv_y + 2$   (3-22)

Choice:

$mv(x, y) = (col\_mv_x, col\_mv_y)$   (3-23)

3.5.3.5 The current block is moving in the same direction as the collocated block but at a different speed

Conditions:

Same direction: both col_mv and med_mv lie in the same quadrant, as in Figure 3-8.

Different speed: the inequality (3-22) does not hold.

Choice:

$mv(x, y) = \left( \dfrac{col\_mv_x + med\_mv_x}{2}, \dfrac{col\_mv_y + med\_mv_y}{2} \right)$   (3-24)

3.5.3.6 All other cases

Choice:

$mv(x, y) = (med\_mv_x, med\_mv_y)$   (3-25)
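Putting the six cases together, the following Python sketch mirrors Sections 3.5.3.1 to 3.5.3.6. It is illustrative: the function name and the strict-sign test for "same quadrant" are assumptions (zero components are already consumed by the earlier cases), and the integer averaging for case 3.5.3.5 picks one possible rounding of eq. (3-24).

```python
def spatio_temporal_predictor(col_mv, med_mv):
    """Combine the collocated (col_mv) and median (med_mv) predictors
    according to the case analysis of Sections 3.5.3.1-3.5.3.6."""
    cx, cy = col_mv
    mx, my = med_mv
    if cx == 0 and cy == 0:                       # 3.5.3.1 stationary block
        return (0, 0)
    if cx == 0 and mx == 0:                       # 3.5.3.2 vertical movement
        pick = max if cy > 2 else min             # fast vs. slow motion
        return (0, pick(cy, my))
    if cy == 0 and my == 0:                       # 3.5.3.3 horizontal movement
        pick = max if cx > 2 else min
        return (pick(cx, mx), 0)
    same_quadrant = cx * mx > 0 and cy * my > 0   # same sign on both axes
    same_speed = abs(cx - mx) <= 2 and abs(cy - my) <= 2   # eq. (3-22)
    if same_quadrant and same_speed:              # 3.5.3.4 same direction/speed
        return (cx, cy)
    if same_quadrant:                             # 3.5.3.5 same direction only
        return ((cx + mx) // 2, (cy + my) // 2)   # eq. (3-24), integer average
    return (mx, my)                               # 3.5.3.6 all other cases
```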


3.5.4 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM11.0) are shown in Table 3-3; the rest of the parameters retained their default values.

The reference code uses three fast motion estimation algorithms, those of [26], [27] and [28], as described in Section 3.5.1. All of these algorithms take the median predictor as the initial search point and then perform a fast search around it. In our tests we substituted our spatio-temporal predictor for the median predictor and then let the three algorithms do their fast search around the predicted point. We tested the proposed scheme on different video sequences; the results are shown in Table 3-4. We used the encoding time and PSNR variations as comparative metrics, calculated as in eq. (I-2) and (I-5), respectively. From the results we observe that the proposed scheme does not actually affect the PSNR. This was expected, since the PSNR is affected mainly by the search pattern of the FME algorithm rather than by its initial search point. We also observe that the proposed scheme speeds up the motion estimation in most of the test cases: in the vast majority a speed-up was observed, varying from 0.6% to 7.3%. This is a considerable improvement of the existing FME algorithms [26], [27] and [28], taking into account that the proposed scheme leaves the main core of the FME algorithms as is and simply modifies the initial search point. However, in some cases the proposed scheme proved ineffective, since it increased the motion estimation time.

3.5.5 Conclusions

The proposed predictor may be used as the initial search center by [26] and [27] and by other FME algorithms of this type. The predictor effectively defines an optimized search area, in which the best match during the motion estimation is likely to be close to the center. Moreover, it may be used as an additional search candidate by [28].


The proposed scheme, in conjunction with the study of the EPZS predictors, shows that it is possible to combine different spatial and temporal predictors in order to form a new, better predictor.

Table 3-3: Configuration parameters of the encoder.

Profile: Baseline
Number of Frames: 100
Number of reference frames: 5
Motion Estimation Algorithm: HEX, SHEX, EPZS
RD Optimization: High Complexity
Rate Control: Enabled
Bit Rate: 45000 bps

Table 3-4: Evaluation of the predictor.

Sequence       Format   FME    PSNR Y Var. (dB)   Speed-up (%)
bridge-close   QCIF     HEX    0.00                1.4
                        SHEX   0.00                4.7
                        EPZS   0.00                1.9
bridge-far     QCIF     HEX    0.00                1.0
                        SHEX   0.00                6.7
                        EPZS   0.00               -0.7
highway        QCIF     HEX    0.10                4.4
                        SHEX   0.10                7.3
                        EPZS   0.00                6.4
salesman       QCIF     HEX    0.00                1.9
                        SHEX   0.00                3.4
                        EPZS   0.00                2.5
carphone       QCIF     HEX    0.20                1.0
                        SHEX   0.00                6.4
                        EPZS   0.00                5.0
news           QCIF     HEX    0.00                0.6
                        SHEX   0.10                1.8
                        EPZS   0.00               -3.0
grandma        QCIF     HEX    0.00                3.5
                        SHEX   0.00                3.3
                        EPZS   0.00                1.5
container      QCIF     HEX    0.10                1.3
                        SHEX   0.00                2.9
                        EPZS   0.00               -1.7
claire         QCIF     HEX    0.14               -3.0
                        SHEX   0.00                4.5
                        EPZS   0.10                6.1
silent         QCIF     HEX    0.18                3.4
                        SHEX   0.00                3.2
                        EPZS   0.00                1.4
foreman        QCIF     HEX    0.00               -1.0
                        SHEX   0.00               -0.7
                        EPZS   0.20                3.9


3.6 FAST MULTIPLE REFERENCE FRAME SELECTION

3.6.1 Literature review

Several methods for reducing the number of reference frames have been proposed over the past years. In [29] a method is proposed which employs the best reference frames of neighboring blocks in order to determine the best reference frame of the current block. In [30] the frame selection is based on the sub-pixel movement across the reference frames. In [31] the number of reference frames is reduced by using the correlation of difference values between the block of the current frame and that of the previous frame. In [32] a method is proposed which applies some well-known fast motion estimation methods to each reference frame; the local minimum SAD (eq. (3-4)) found along the selection path is used as the indicator of the final reference frame. In the following sections we propose an approach based on the same concept as [32], i.e. it performs a SAD test across the reference frames in order to reveal the optimal reference frame. However, our method outperforms [32], since it uses a significantly smaller number of test points, takes the Lagrangian reference cost into account and leads to better results with respect to both video quality and reduction of the motion estimation time.

3.6.2 Multiple reference frames in H.264

H.264 uses multiple reference frames to achieve better motion prediction performance. The encoder performs the motion estimation on every reference frame for every macroblock of the current frame. Rate-distortion optimization is the criterion for selecting the best coding mode. The rate-distortion algorithm evaluates the cost of every possible reference frame, balancing the distortion against the number of bits consumed [34]. The reference frame which results in the smallest cost is then considered the optimum choice. The cost function for the selection of the optimal reference frame is calculated as follows:

$J(REF \mid \lambda_{motion}) = SAD(s, c(REF, m(REF))) + \lambda_{motion} \left( R(m(REF) - p(REF)) + R(REF) \right)$   (3-26)

where $\lambda_{motion}$ is the Lagrange operator used in motion estimation, R(REF) is the number of bits consumed for coding the index of the reference frame, computed by table look-up, m(REF) is the motion vector to be decided, p(REF) is the prediction of the motion vector and R(m(REF) − p(REF)) is the size of the bit stream of the motion vector after entropy coding. SAD (Sum of Absolute Differences) is calculated as in eq. (3-4).


From the above, it is clear that we can achieve a significant reduction of the encoding time if we reduce the number of the reference frames.

3.6.3 Frame selection method

In the previous section we saw that the motion vector predictor and the center position (0, 0) of the collocated macroblock in the reference frame are by far the most accurate predictors of the search center. The proposed method takes advantage of the great prediction accuracy of both the motion vector predictor and the collocated macroblock's center predictor in order to find the optimal reference frame. This frame is found by minimizing a cost function. Apparently, we cannot use eq. (3-26) as the cost function, since the value R(m(REF) − p(REF)) is not known prior to the motion estimation. We therefore take only the frame cost into account, and the cost function (3-26) is modified as follows:

$J(REF \mid \lambda_{motion}) = SAD(s, c(REF, m(REF))) + \lambda_{motion} \cdot R(REF)$   (3-27)

It has been found that applying the cost function (3-27) only to the 16×16 blocks is sufficient to yield the optimal reference frame. The proposed approach is as follows, for every macroblock in the current frame (a sketch follows the list):

Step 1: Get the 16×16 luma block.
Step 2: Get the next frame from the frame list.
Step 3: Calculate the cost function (3-27) for the 16×16 luma block of the macroblock pointed to by the motion vector predictor in the reference frame.
Step 4: Calculate the cost function (3-27) for the 16×16 luma block of the collocated macroblock in the reference frame.
Step 5: Compare the two results and keep the minimum.
Step 6: Compare this minimum with the one from the previous frame and keep the global minimum.
Step 7: Repeat Steps 2 to 6 until all of the reference frames have been examined.
Step 8: The remaining global minimum denotes the optimal reference frame for the macroblock under test.
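A compact Python sketch of this selection loop; the function signature, the callable sad and the table ref_bits are illustrative assumptions, not the JM interface.

```python
def select_reference_frame(block, ref_frames, mv_pred, pos, lam, ref_bits, sad):
    """Pick the reference frame minimizing eq. (3-27) over two probe points per
    frame: the MV-predictor position and the collocated (0, 0) position.
    'sad(block, frame, point)' returns the 16x16 SAD at 'point' in 'frame';
    'ref_bits[i]' is the table look-up value R(REF) for frame index i."""
    best_index, best_cost = 0, float("inf")
    for i, ref in enumerate(ref_frames):
        probes = ((pos[0] + mv_pred[0], pos[1] + mv_pred[1]),  # MV predictor
                  pos)                                         # collocated (0, 0)
        cost = min(sad(block, ref, p) for p in probes) + lam * ref_bits[i]
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index
```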


The method also incorporates a simple early-stop criterion in order to speed up the SAD calculation: the encoder compares the partial SAD with the previous minimum and, if it already exceeds that minimum, the calculation is terminated.

The proposed method is quite simple and very easy to implement. There is no calculation overhead, since the position of the collocated block is known and the motion vector predictor is calculated by the encoder anyway, according to the H.264 standard. In addition, our method can be combined with any existing motion estimation algorithm, either the full search or any other fast motion estimation algorithm.

3.6.4 Simulation Results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM11.0) are shown in Table 3-5; the rest of the parameters retained their default values.

First we conducted some quick tests, in which we let our algorithm run without interfering with the reference frame selection process. Our purpose was to find out how often our algorithm finds the same optimal reference frame as the one found by the full search algorithm in normal operation mode. The results are shown in Table 3-6, and they indicate that the proposed method can adequately replace the normal frame selection procedure.

Several video sequences in QCIF format were tested. The results are shown in Table 3-7 and Table 3-8. The bit rate variations are negligible, since we set a very low bit rate constraint of 45000 bps; the comparisons can therefore be made by examining the variations of the motion estimation time (eq. (I-2)) and the PSNR (eq. (I-5)).


From the comparisons we see that the proposed method results in a significant reduction of the motion estimation time, while the degradation of the PSNR does not exceed 0.5 dB. Moreover, the proposed method behaves uniformly across all of the test sequences, no matter which class they belong to.

3.6.5 Conclusions

The method performs a simple and fast test prior to the motion estimation in order to choose the best reference frame among a number of candidates, which are usually more than two (typically 5). In this way the method reduces the number of reference frames to one, so the motion estimation process is performed against only one frame. Experimental results showed that this frame is, in most cases, the frame that the H.264 encoder would choose anyway if it performed the motion estimation for every reference frame. Thus, the method decreases the motion estimation time without considerably affecting the video quality.

Table 3-5: Configuration parameters of the encoder.

Profile: Baseline
Number of Frames: 100
Number of reference frames: 5
Motion Estimation Algorithm: Full Search
RD Optimization: High Complexity
Rate Control: Enabled
Bit Rate: 45000 bps

Table 3-6: Successful matches between (3-26) and (3-27).

Video Sequence (QCIF, 50 frames)   Success (%)
Foreman                            73.6
Salesman                           95.4
News                               94.8
Carphone                           71.5
Grandma                            94.0


Table 3-7: Motion estimation time variation.

Video Sequence   Encoding Time, Reference (ms)   Encoding Time, Proposed (ms)   Variation (%)
bridge-close     68664                           14182                          -79.345
bridge-far       68980                           13974                          -79.741
highway          71306                           14665                          -79.433
salesman         69236                           13814                          -80.047
carphone         72852                           14942                          -79.489
news             69410                           14170                          -79.585
grandma          69785                           14341                          -79.449
container        69505                           13785                          -80.166
claire           69850                           14410                          -79.370
silent           69520                           13735                          -80.243
foreman          73965                           14218                          -80.777
akiyo            70853                           13665                          -80.71
mobile           79776                           15155                          -80.81

Average variation of motion estimation time: -79.935%

Table 3-8: PSNR variation.

Video Sequence   PSNR, Reference (dB)   PSNR, Proposed (dB)   Variation (dB)
bridge-close     32.733                 32.774                 0.041
bridge-far       39.216                 39.239                 0.023
highway          36.092                 35.885                -0.207
salesman         33.057                 32.988                -0.069
carphone         33.459                 33.100                -0.359
news             33.590                 33.530                -0.060
grandma          36.799                 36.683                -0.116
container        36.410                 35.981                -0.429
claire           41.126                 40.893                -0.233
silent           32.427                 32.389                -0.038
foreman          31.827                 31.331                -0.496
akiyo            39.096                 38.901                -0.195
mobile           23.742                 23.312                -0.430

Average variation of PSNR: -0.197 dB


3.7 MOVING OBJECT DETECTION IN THE COMPRESSED DOMAIN

3.7.1 Literature review

Moving object detection algorithms in the compressed domain traditionally rely on two types of macroblock (MB) features: motion vectors (MV) and DCT coefficients. For instance, Ahmad et al. [35] have analyzed the performance of MV smoothing with different spatial filters. Sukmarg and Rao [36] propose a region segmentation and clustering based algorithm to detect objects in MPEG compressed video. Wang et al. [37] suggest several confidence measures to improve motion layer separation. Babu and Ramakrishnan [38] use only aggregated motion vectors. Wei Zeng et al. [39] employ a Markov Random Field (MRF) to segment moving objects from the sparse motion vector field obtained directly from the bit stream. Ibrahim and Rao [40] propose the use of a spatio-temporal filter for filtering the motion vectors, together with a hybrid approach exploiting both compressed-domain and spatial-domain processing. Babu et al. [41] have proposed an automatic video object segmentation algorithm for MPEG video: they first estimate the number of independently moving objects in the scene using a block-based affine clustering method, and the object segmentation is then obtained by the expectation-maximization (EM) clustering algorithm.

Most of the above methods are designed to work in the MPEG compressed domain. However, H.264 employs several new coding tools. In H.264 an intra-coded block is spatially intra-predicted from its neighboring pixels, so the transform coefficients now carry the spatial prediction residues of the blocks. Moreover, H.264 supports variable block-size motion compensation: a macroblock may be partitioned into several blocks and carry several motion vectors. As a result, the motion vector field of an H.264 compressed video consists of motion vectors of variable block size, quite unlike the former MPEG standards with their regular block-size motion vectors. The proposed method is specially designed to work in the H.264 compressed domain. It mainly takes advantage of the variable block sizes in order to detect a moving object, and is thus very fast and easy to implement.


3.7.2 Moving object detection in the compressed domain

The proposed method can detect a moving object by simply examining the motion information of a macroblock directly in the compressed domain, as shown in the block diagram of Figure 3-9. The block size (inter mode) of a block and its corresponding motion vector (MV) are obtained by entropy-decoding the H.264 bitstream. The proposed method consists of three phases:

• Classification,
• Merging and
• Refinement.

Figure 3-9: Block diagram of the proposed method (entropy-decode the header information of each macroblock in the frame, calculate the MV threshold, classify the pixels as moving or static, merge the classified pixels across frames, and take actions once a moving object has been detected).


3.7.2.1 Classification

The proposed method is suitable for CCTV-based video surveillance. Hence, we assume that the video source is a static camera, so that each pixel of a frame belongs either to the static background or to a moving object. In this phase we classify the pixels of a block as static or moving by comparing the block's motion vector MV with a threshold $MV_{TH}$: if the MV of the block is less than $MV_{TH}$, its pixels are classified as static; otherwise they are classified as moving. We prefer to apply the classification to the pixels of a block rather than to the block itself, because in this way we can merge the pixels of different frames in order to obtain the whole moving object; this procedure is further analyzed in Section 3.7.2.2. Furthermore, we examine only the pixels of the macroblocks which have been encoded in modes other than 16×16 (e.g. 16×8, 8×16, 8×8, etc.), because the presence of such sub-blocks usually denotes motion. In this way we also avoid taking into account false motion vectors produced by the inter prediction.

The threshold $MV_{TH}$ cannot be zero, because even static blocks may have non-zero motion vectors. This is due to the nature of the inter prediction, where every block is motion-compensated with regard to previous and/or next reconstructed reference frames, which have suffered quantization. Moreover, the threshold cannot be set to a fixed pre-defined value, because this behavior is highly related to the quantization parameter (QP) used by the encoder: the higher the QP, the more zero motion vectors. The proposed method therefore uses a dynamic threshold per frame, which is the mean value of the motion vectors of all of the blocks of the frame, calculated as in eq. (3-28):

$MV_{TH} = \dfrac{1}{N} \sum_{b=0}^{N-1} MV_b$   (3-28)

where $MV_{TH}$ is the threshold, $MV_b$ is the motion vector of the b-th block and N is the total number of blocks/sub-blocks of the (current) frame.

3.7.2.2 Merging

In this phase we accumulate successive inter frames (P/B) and merge their moving pixels in order to obtain the complete contour of the moving object.
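A small Python sketch of the classification step; the L1 magnitude used to compare a motion vector against $MV_{TH}$ is an assumption, since eq. (3-28) does not fix the norm, and the function names are illustrative.

```python
def mv_threshold(motion_vectors):
    """Dynamic per-frame threshold of eq. (3-28): the mean motion vector
    magnitude over all N blocks/sub-blocks of the current frame."""
    mags = [abs(x) + abs(y) for x, y in motion_vectors]  # |MV| as L1 magnitude
    return sum(mags) / len(mags)

def classify_block_pixels(mv, mv_th, mode):
    """Pixels of a sub-partitioned macroblock are 'moving' when its motion
    vector magnitude reaches MV_TH; 16x16-coded blocks are treated as static."""
    if mode == "16x16":
        return "static"
    x, y = mv
    return "moving" if abs(x) + abs(y) >= mv_th else "static"
```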


3.7.2.2 Merging

In this phase we accumulate successive inter frames (P/B) and merge their moving pixels in order to obtain the complete contour of the moving object. This process enables us to capture objects with slow motion. In that case some pixels of the object are static in one frame whereas they are moving in the next frames. Another case handled by the merging phase is the one where the moving object is occluded by a static object and its moving parts are uncovered in successive frames.

3.7.2.3 Refinement

In the refinement phase we can obtain the details of the moving object by fully decoding only the blocks which contain moving pixels. The typical usage of this phase is to allow the supervisor of a surveillance system to take a quick look at the object which has intruded into the surveyed area.

3.7.3 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM 14.1) are shown in Table 3-9. The rest of the parameters retained their default values. The tests were performed using various video sequences in different formats. Here we present the simulation results for the tennis video sequence in SIF format. The results are shown in Figure 3-10 and demonstrate the three phases of Section 3.7.2: classification, merging and refinement. Figure 3-10.a shows the 6th frame of the tennis sequence. Figure 3-10.b shows the classification of the pixels; black denotes moving pixels while white denotes static pixels. Figure 3-10.c shows the effect of merging the pixels of the different frames. More specifically, the classified moving pixels of the 4th, 5th and 6th frames were merged into one frame. In that way the contour of the moving object was revealed. Figure 3-10.d shows the final stage of the algorithm, the refinement. In that stage the blocks which compose the moving object are fully decoded in order to present the real moving object in detail.

3.7.4 Further improvements

The proposed method presents good results, but under conditions. As a matter of fact, it has two major disadvantages, which should be handled in the future. First of all, the accuracy of the method, especially the detection of an object's contour, heavily depends on the number of sub-blocks produced during the motion estimation. That means that the lack of a sufficient number of sub-blocks, due to either a high QP or slow motion, may lead to rather crude object detection.


Moreover, the method cannot handle complex motions, such as the overlapping motions of two or more moving objects.

3.7.5 Conclusions

The method is specially designed to work in the H.264 compressed domain, as it exploits the variable block sizes used by the H.264 encoder during the motion estimation as well as the generated motion vectors. The proposed method works in the compressed domain because it requires only the entropy decoding of the H.264 bitstream in order to obtain the inter prediction modes and the associated motion vectors. Moreover, once the inter information has been obtained, the method is able to detect a moving object by combining the modes with the motion vectors and applying some thresholds. This makes the method simple as well as fast. It can therefore be used in real time applications such as video surveillance. Future work must be done to either eliminate or limit the drawbacks described in the previous section.

Table 3-9: Configuration parameters of the encoder.

    Parameter                 Value
    Profile                   Baseline
    Quantization Parameter    28
    Frame Rate                30 fps
    RD Optimization           High Complexity Mode
    Motion Estimation         Full Search
    Intra Period              0 (only the first frame is intra)
    Symbol Mode               UVLC
    Rate Control              Disabled


[Figure panels: a. Original Frame; b. Classification; c. Merging; d. Refinement.]
Figure 3-10: The three phases of the proposed algorithm applied to the 6th frame of the tennis video sequence. The frame has been merged with the 4th and the 5th frames during the merging phase.


4 Data Hiding

4.1 INTRODUCTION

Data hiding can be considered as a communication problem where the embedded data is the signal to be transmitted. A typical data hiding framework starts with an original digital medium, known as the host media or cover media. The data hiding module inserts into it a set of secondary data, known as the embedded data or watermark, to obtain the marked media. The insertion or embedding is done in such a way that the marked media is perceptually identical to the original media. In most cases, the embedded data is a collection of bits, which may come from an encoded character string, from a pattern, or from some executable agents, depending on the application. The embedded data is extracted from the marked media by an extractor, which performs the inverse of the embedding process.

Data hiding techniques are categorized as non-blind or blind. In non-blind data hiding systems, it is assumed that the original host or cover is available at the decoder. In blind information hiding systems, the decoder does not have access to the original cover signal. Data hiding techniques are also categorized as robust, semi-fragile and fragile, based on their robustness against attacks. Robust techniques are those in which the embedded data can be extracted even if the marked media has suffered various attacks. In fragile data hiding techniques the embedded data cannot be recovered when compression or another small alteration is applied to the marked media. In semi-fragile techniques the embedded data can be extracted if the marked media has gone through compression or other alterations to some extent.


4.2 PROBLEM FORMULATION

Traditionally, data hiding in video uses legacy techniques which were designed for static images. These techniques can also be applied to videos if each video frame is treated as a static image. Most of these techniques modify the coefficients generated by some transformation in the frequency domain in order to hide the desired data. This results in degradation of the video quality when the video stream is decoded. Other techniques modify the motion vectors generated by the motion estimation process in order to hide data. These techniques cause drift errors, which also degrade the video quality. The goal of all of these techniques is for the hidden data not to cause perceptible errors to the viewer. In the context of our research we followed a completely different approach. We moved the cost of the hidden data from the visual quality to the bit rate, i.e. the hidden data do not degrade the visual quality but increase, although slightly, the bit rate.

4.3 SOLUTIONS

In the following sections we present two novel data hiding methods. The data hiding during the inter prediction method (Section 4.4) demonstrates how the inter prediction process can be exploited for hiding data. The main advantage of the method is that it does not affect the visual quality of the video. Similarly, the second data hiding method (Section 4.5) does not affect the visual quality of the video while it is capable of hiding a large amount of data. Moreover, it has the unique capability of re-using the video directly in the compressed domain numerous times. Finally, a scene change detection method (Section 4.4.7) is also presented. This method makes use of the data hiding during the inter prediction. All of the aforementioned methods fall into the category of the applied methods, as described in the Introduction.


4.4 DATA HIDING DURING THE INTER PREDICTION

4.4.1 Literature review

Early video data hiding approaches proposed still image watermarking techniques extended to video by hiding the message in each frame independently [42]. Methods such as spread spectrum are used, where the basic idea is to distribute the message over a wide range of frequencies of the host data. The transform domain is generally preferred for hiding data since, for the same robustness as in the spatial domain, the result is more pleasant to the Human Visual System (HVS). For this purpose the DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) and DWT (Discrete Wavelet Transform) domains are usually employed [43-45].

Recent video data hiding techniques focus on the characteristics generated by video compression standards. Motion vector based schemes have been proposed for the MPEG algorithms [46-48]. Motion vectors are calculated by the video encoder in order to remove the temporal redundancies between frames. In these methods the original motion vector is replaced by another, locally optimal, motion vector to embed data. Only a few data hiding algorithms considering the properties of the H.264 standard [49-51] have recently appeared in the open literature. In [49] a subset of the 4×4 DCT coefficients is modified in order to achieve a robust watermarking algorithm for H.264. In [50] a blind algorithm for copyright protection is based on the intra prediction modes of H.264. In [51] some skipped macroblocks are used to embed data.

In the following sections we propose a new data hiding scheme, which takes advantage of the different block sizes (16×16, 16×8, 8×16, 8×8, etc.) used by the H.264 encoder during the inter prediction in order to hide the desired data. The message can be extracted directly from the encoded stream without knowledge of the original host video. This method is best suited for content-based authentication and covert communication applications.

4.4.2 Data hiding method

The main blocks of the H.264 video encoder are depicted in Figure 4-1. The Temporal Prediction block is responsible for the inter prediction of each inter frame. Our scheme intervenes in the inter prediction process in order to hide the data.


The most important part of inter prediction is the motion estimation process, which aims at finding the "closest" macroblock (best match) in the previously coded frame for every macroblock of the current input frame. Each macroblock within the current frame is then motion compensated, i.e. its best match is subtracted from it, and the residual macroblock is coded. In order to increase the coding efficiency, the H.264 standard, as already described in previous sections, has adopted seven different block types (16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4), and the motion estimation is applied to each of these types. The block type which results in the best coding is selected in the end. The basic idea of the proposed scheme is to force the encoder to choose a block type not in terms of coding efficiency, but according to our data hiding requirements. This can be done in a seamless way, so that the whole encoding process is not disturbed. The procedure is described below in detail.

First we assign a binary code to every block type according to Table 4-1. For simplicity we use only 4 block types. That gives us 2 bits per block. Then we convert the message to be embedded into a binary number and separate the bits into pairs. These pairs are mapped onto the macroblocks which are going to be motion compensated, using the chosen block types, as illustrated in Figure 4-2.

Figure 4-1: Data hiding module within the H.264 encoder.


[Figure: the ASCII message "...6H..." is converted into binary (...0011011001001000...), the bits are separated into pairs (00 11 01 10 01 00 10 00) and the pairs are mapped into block types (16x16 8x8 16x8 8x16 16x8 16x16 8x16 16x16).]
Figure 4-2: Message mapping into block types.

Table 4-1: Binary codes of the block types.

    Block type    Binary code
    16×16         00
    16×8          01
    8×16          10
    8×8           11
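The mapping of Figure 4-2 and Table 4-1 can be sketched in a few lines of C. The program below is self-contained and illustrative (the enum and names are ours); each message byte is split into 2-bit pairs, most significant pair first, and each pair selects the block type to be forced on one macroblock.

    #include <stdio.h>

    typedef enum { B16x16 = 0, B16x8 = 1, B8x16 = 2, B8x8 = 3 } BlockType;

    static const char *name[] = { "16x16", "16x8", "8x16", "8x8" };

    /* emit one block type per 2-bit pair of the message */
    static void map_message(const unsigned char *msg, int len)
    {
        int i, shift;
        for (i = 0; i < len; i++)
            for (shift = 6; shift >= 0; shift -= 2) {
                BlockType t = (BlockType)((msg[i] >> shift) & 0x3);
                printf("%s ", name[t]);
            }
        printf("\n");
    }

    int main(void)
    {
        /* reproduces the "6H" example of Figure 4-2:
         * 16x16 8x8 16x8 8x16 16x8 16x16 8x16 16x16 */
        map_message((const unsigned char *)"6H", 2);
        return 0;
    }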


It is also important to define the data hiding parameters (a small sketch of how they gate the embedding is given at the end of this subsection):

1. Starting frame: It indicates the frame from which the algorithm starts message embedding.

2. Starting macroblock: It indicates the macroblock, within the chosen frame, from which the algorithm starts message embedding.

3. Number of macroblocks: It indicates how many macroblocks within a frame are going to be used for data hiding. These macroblocks may be consecutive or, even better, they may be spread within the frame according to a predefined pattern. Apparently, the more macroblocks we use, the higher the embedding capacity we get. Moreover, if the size of the message is fixed, this number will be fixed, too. Otherwise it can be changed dynamically.

4. Frame period: It indicates the number of inter frames which must pass before the algorithm repeats the embedding. This parameter is very important since it increases the possibility of extracting the message even if some parts of the video sequence are missing. However, if the frame period is too small and the algorithm repeats the message very often, that might have an impact on the coding efficiency of the encoder. Apparently, if the video sequence is long enough, the frame period can be accordingly large.

The encoder reads these parameters from a file. The same file is read by the software that extracts the message, so that the two programs are synchronized.

Figure 4-3 shows the block diagram of the proposed embedding algorithm. As an inter frame enters the Temporal Prediction module, the algorithm decides whether to use it for hiding a message or not, according to the hiding parameters. If the algorithm decides to use the frame for hiding data, it chooses the macroblock candidates and performs the motion estimation on them, forcing the encoder to choose a specific block type according to the message mapping (Figure 4-2). Then it lets the encoder proceed with the encoding as in normal operation. In other words, the algorithm fakes the motion estimation process which the encoder would normally perform.

[Figure: the hiding parameters drive the embedding decision inside the Temporal Prediction module.]
Figure 4-3: Block diagram of the proposed scheme.

The proposed scheme may result in a very high capacity, proportional to the host video sequence size. Its major advantage is that it does not affect the visual quality of the video sequence and, if the hiding parameters are properly controlled, it does not affect the coding efficiency either. In addition to that, it is extremely difficult for the decoder to detect the data hiding interference, and this increases the invisibility of the hidden message. Finally, the message can be extracted directly from the encoded video stream without the need of the original host video sequence.
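The four parameters above could gate the embedding as in the following hedged sketch; the structure and its field names are illustrative assumptions, not taken from the reference encoder, and the modulo test is one possible reading of the frame period.

    typedef struct {
        int start_frame;    /* first inter frame used for embedding     */
        int start_mb;       /* first macroblock within that frame       */
        int num_mbs;        /* macroblocks used for hiding per frame    */
        int frame_period;   /* inter frames between message repetitions */
    } HidingParams;

    /* returns 1 when the given inter frame/macroblock should be forced
     * to a message-selected block type */
    static int carries_message(const HidingParams *p, int frame, int mb)
    {
        if (frame < p->start_frame)
            return 0;
        if ((frame - p->start_frame) % p->frame_period != 0)
            return 0;
        return mb >= p->start_mb && mb < p->start_mb + p->num_mbs;
    }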


4.4.3 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM 11.0) are shown in Table 4-2. The rest of the parameters retained their default values. Note that the inter prediction optimizing parameters were disabled in order to simplify the algorithm implementation.

Several video sequences in QCIF format were tested. Figure 4-4 shows the PSNR (eq. (I-3)) results of each luma inter frame for the foreman sequence. We refer to the inter frames since the message is inserted into these frames only. By default, the H.264 encoder regards only the first frame as intra and the rest as inter frames. The first intra frame has been excluded from Figure 4-4.

From the results we observe that the proposed scheme does not actually affect the PSNR of the inter frames. This was expected, since there is no bit rate constraint and thus our scheme does not provoke any loss of information. We would rather expect to see differences in the total bit rate of the inter frames, due to the fact that the scheme interferes with the optimizing part of the inter prediction. Figure 4-5 shows the bit rate variations (eq. (I-1)) of the inter frames between the original sequences and the marked ones. The bit rate generally increases proportionally to the capacity size.

Based on Figure 4-4 and Figure 4-5, we can assume that if we put a bit rate constraint on the encoder we should expect a PSNR decrease. Figure 4-6 shows the PSNR variations (eq. (I-5)) of the inter frames between the original sequences and the marked ones when we enforce a 40 kbps bit rate constraint on the encoder. A maximum difference of 1.4 dB is experienced.

The small bit rate reduction and the PSNR increase that we see in some cases in Figure 4-5 and Figure 4-6, respectively, are partly due to the stochastic choice of the message and mainly due to the fact that the optimizing parameters of the encoder were disabled, in the sense that the encoder was not able to perform the best possible inter prediction during its normal operation.


The proposed scheme should ideally affect both the PSNR and the bit rate as little as possible. A few approaches, which may result in a great improvement of the proposed scheme, are discussed in Section 4.4.5.

4.4.4 Message Extractor

The message extractor is a piece of software, not necessarily an H.264 decoder, which extracts the hidden message from the marked H.264 bitstream. The message extractor needs to partially decode the bitstream in order to discover the chosen block type of each macroblock of each inter frame. Then, it can form the hidden message according to Table 4-1. Apparently, the message extractor must be aware of the hiding parameters which were used by the encoder.
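The extractor side is the inverse of the mapping sketch of Section 4.4.2. The self-contained C program below is illustrative: a real extractor would obtain the block types from the partially decoded stream, whereas here they are given as an array of Table 4-1 codes and packed back into message bytes.

    #include <stdio.h>

    /* types[i] is 0..3 per Table 4-1 (00, 01, 10, 11) */
    static void extract(const int *types, int n, unsigned char *msg)
    {
        int i;
        for (i = 0; i < n; i++) {
            int byte = i / 4, shift = 6 - 2 * (i % 4);
            if (shift == 6)
                msg[byte] = 0;                       /* start a new byte */
            msg[byte] |= (unsigned char)((types[i] & 0x3) << shift);
        }
    }

    int main(void)
    {
        int types[8] = { 0, 3, 1, 2, 1, 0, 2, 0 };   /* "6H" of Figure 4-2 */
        unsigned char msg[2];
        extract(types, 8, msg);
        printf("%c%c\n", msg[0], msg[1]);            /* prints 6H */
        return 0;
    }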


4.4.5 Further improvements

In our current scheme we used only 4 different block types, namely 16×16, 16×8, 8×16 and 8×8. However, the scheme can also use the sub-partitions of the 8×8 type (8×4, 4×8, 4×4), thus increasing the available bits for coding to 8. Apparently, the additional bits will increase the data capacity while decreasing the number of the "tweaked" macroblocks at the same time. Moreover, the scheme used consecutive macroblocks within a single frame in order to hide the data. An improvement would be to spread the macroblocks within the frame or, even better, within multiple frames. This approach would improve the coding efficiency, since the "motion error" produced by the scheme would not accumulate in one place. In addition to that, the assignment of the binary codes in Table 4-1 could be modified so as to take into account some video statistics. For example, the 16×16 block type appears more often than the other types. The message could therefore be coded using Huffman coding, and the Huffman code with the highest probability could be assigned to the 16×16 block type. The gain of this approach would be that our scheme would most likely choose the block type which would have been chosen by the encoder in normal operation, without our interference.

4.4.6 Conclusions

The proposed method embeds the data during the encoding process and utilizes the advanced inter prediction features of the H.264 encoder. Its main advantage is that it is a blind scheme and its impact on video quality or coding efficiency is almost negligible. It is highly configurable, thus it may result in high data capacities. Finally, it can be easily extended, resulting in better robustness, better data security and higher embedding capacity.

Table 4-2: Configuration parameters of the encoder.

    Parameter                      Value
    Profile                        Baseline
    Frames                         100
    Frame Rate                     30 fps
    Number of reference frames     10
    Motion Estimation Algorithm    Full Search
    RD Optimization                Disabled
    8x8 Sub-blocks                 Disabled
    Rate Control                   Disabled

Figure 4-4: Foreman, PSNR of the luma inter frames.

Figure 4-5: Bit rate variations of the luma inter frames.


Figure 4-6: PSNR variations of the luma inter frames.

4.4.7 Application based on this method: A Data Hiding Scheme for Scene Change Detection

4.4.7.1 Introduction

Video data can be divided into different shots. A shot is a video sequence that consists of continuous video frames for one action. Scene change detection is an operation that divides video data into physical shots. Scene change detection is an important means for video editing, video indexing, error resilience, etc., and it has been recognized as one of the significant research areas in recent years.

Many scene change detection methods can be found in the literature, but some of them are either computationally expensive or ineffective. In the uncompressed domain, the major techniques are based on pixels, histogram comparisons and/or edge difference calculations [52-54]. Recent trends focus on developing scene change detection algorithms directly in the compressed domain, especially for MPEG compressed videos [55-58]. In general, the methods which work in the uncompressed domain are more effective, but not as useful as the methods that work in the compressed domain.

The proposed data hiding scheme can be combined with any existing uncompressed domain scene change detection method in order to enable real time scene change detection in the compressed domain. The idea is to detect the scene change during the H.264 encoding process and hide this information as metadata in the encoded sequence.


It is then easy for a metadata-aware application to detect the scene change and possibly extract other useful information about the scene in the compressed domain. The scheme is based on the technique described in Section 4.4.2 in order to hide the metadata. The data can then be extracted directly from the encoded stream without knowledge of the original host video.

The idea of using metadata hiding in video, audio and images in order to create data channels is not new. Metadata hiding has also been used for error correction [59] and for content adaptation [60].

4.4.7.2 The Proposed Scheme

The proposed scheme is combined with one or more well-known scene change detection techniques. As soon as a scene change has been detected, the scheme inserts an additional inter frame into the video sequence. This extra frame is inter-encoded in such a way that:

1. It marks the end of the scene,
2. It hides useful information about the detected scene, such as the number of the scene frames and the key frames,
3. It does not considerably affect the bit rate and the PSNR of the encoded sequence.

In general, the proposed algorithm consists of two phases, namely the scene change detection and the metadata hiding. These two phases are presented below.

4.4.7.3 Scene Change Detection

As scene change detection is out of the scope of this dissertation, we only present some basic principles of it. Apparently, a scene change detection method which works in the uncompressed domain must be used. Scene detection is based on shot grouping. Shots provide users with better access than an unstructured raw video stream. However, the granularity of the shot is too small for accessing and thus not so useful. Working on a higher level unit of video content, such as a scene, i.e. a group of shots sharing similar visual content, is beneficial to human perception and substantially reduces the data that needs to be handled compared to the shot level structure.


The outcome of the scene change detection contains, among others, the following information:

1. The type of the scene change (cut, dissolve, fade, etc.),
2. The scene duration, measured in seconds or frames,
3. The key frame(s), which can be used to represent the salient content of the scene.

4.4.7.4 Metadata Hiding

The main blocks of the H.264 video encoder are depicted in Figure 4-1. The Temporal Prediction block is responsible for the inter prediction of each inter frame. The H.264 reference encoder, in its default configuration, encodes the first frame as an intra frame and considers the remaining frames to be inter frames. Therefore, our scheme intervenes in the inter prediction process in order to hide the metadata. The basic idea is to insert an extra inter frame (P_X) whenever a scene change is detected. The P_X is exactly the same as the current reconstructed frame (P_C), which is meant to be the reference frame for the next frames. The encoder treats P_X as a normal inter frame and inter encodes it using itself as a reference. Hence, the inter encoding of each macroblock of P_X results in both zero motion vectors and zero residuals. The only difference is that we force the encoder to inter encode the P_X macroblocks choosing the block types not in terms of coding efficiency, but according to our data hiding requirements. As an additional optimization, we do not allow the encoder to use P_X as a reference frame for the next frames. In the end, the overhead of the extra frame's insertion is some header information bits to denote the chosen block types and some payload bits to entropy encode the zero coefficients produced by the quantization stage. The scene change detection information is hidden as metadata in the extra frame using the data hiding method described in Section 4.4.2.

Figure 4-7 shows the block diagram of the proposed data hiding algorithm. As the inter frames enter the Temporal Prediction module, the algorithm decides whether there is a scene change or not. If there is indeed a scene change, the algorithm copies the reconstructed current frame. Then, it allows the inter prediction of this copy using itself as a reference, forcing the encoder to choose specific block types according to the message mapping (Figure 4-2). In the end, it lets the encoder proceed with the encoding as in normal operation. In other words, the algorithm emulates the inter mode decision process which the encoder would normally perform.


There are two remaining issues to be discussed: the metadata capacity and the metadata format.

[Flowchart: when a scene change is detected, the reconstructed current frame acts as its own reference frame; the motion estimation is emulated and the data hiding is applied before the motion compensation.]
Figure 4-7: Block diagram of the proposed scheme.

4.4.7.5 Metadata Capacity

The metadata capacity (MDC) in bits for a single scene change detection is calculated as in eq. (4-1):

$MDC = M_f \times B_p$    (4-1)

where $M_f$ is the number of macroblocks per frame and $B_p$ is the number of available bits per macroblock according to Table 4-1. For example, a QCIF frame (176×144) gives:

$MDC = \frac{176 \times 144}{256} \times 2 = 198$ bits    (4-2)

The metadata capacity per scene change can be increased if we insert more than one extra frame for every scene change detection. However, this may affect the PSNR and the bit rate.
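Eq. (4-1) is easy to verify in code. The following minimal C sketch (the function name is ours) computes the capacity of one extra frame from the frame dimensions and the 2 bits per macroblock of Table 4-1.

    #include <stdio.h>

    static int metadata_capacity(int width, int height, int bits_per_mb)
    {
        int mbs_per_frame = (width / 16) * (height / 16);  /* M_f */
        return mbs_per_frame * bits_per_mb;                /* MDC = M_f x B_p */
    }

    int main(void)
    {
        /* QCIF: (176/16) x (144/16) = 99 MBs -> 99 x 2 = 198 bits, eq. (4-2) */
        printf("%d bits\n", metadata_capacity(176, 144, 2));
        return 0;
    }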


4.4.7.6 Metadata Format

The metadata format is highly related to the metadata capacity, in the sense that the higher the capacity, the more metadata can be hidden in the extra frame. The general metadata format is depicted in Figure 4-8.

Figure 4-8: The metadata format.

Magic String (98 bits): A unique string that identifies the extra frame which marks the scene change. This extra frame is located immediately after the last frame of the scene change.

Start Scene Change (8 bits): A number which indicates the starting frame of the scene change. It is a zero-indexed number which counts backwards starting from the extra frame. In the case of a sudden scene change this number is equal to 1.

Scene Duration (16 bits): The scene duration in frames.

Key Frame(s) (16 bits): One or more numbers, separated by commas, which indicate the key frame(s). These are zero-indexed numbers which count backwards starting from the extra frame.

Other (MDC-138 bits): Other useful scene information.

4.4.7.7 Metadata Extraction

The metadata are extracted from the marked video as described in Section 4.4.4. The extractor needs to entropy decode the stream in order to discover the magic string which indicates a scene change. Then, it can recover the metadata according to the Table 4-1 mapping. The whole process takes place in real time, since the extractor does not need to completely decode the H.264 stream.
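One possible in-memory layout for the record of Figure 4-8 is sketched below. It is purely illustrative: the field widths come from the text above (98 + 8 + 16 + 16 bits plus MDC-138 free-form bits), but the type and member names are our assumptions, and the 98-bit magic string is simply rounded up to 13 bytes.

    #include <stdint.h>

    #define OTHER_BITS(mdc) ((mdc) - 138)   /* free-form payload, in bits */

    typedef struct {
        uint8_t  magic[13];   /* 98-bit magic string (top 6 bits of the
                                 last byte unused)                         */
        uint8_t  start_scene; /* frames back from the extra frame; 1 means
                                 a sudden scene change                     */
        uint16_t duration;    /* scene duration in frames                  */
        uint16_t key_frames;  /* key frame indices, counted backwards,
                                 packed into the 16-bit field              */
        /* OTHER_BITS(MDC) bits of additional scene information follow */
    } SceneMetadata;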


4.4.7.8 Simulation Results

The proposed scheme is in effect a supplement to an existing scene change detection method. In fact, it turns an effective scene change detection method working in the uncompressed domain into an effective scene change detection method working in the compressed domain. Thus, we did not focus on the effectiveness of the scene change detection method itself. Our intention was rather to measure the bit rate and the PSNR distortion due to the insertion of the metadata. This is the reason why the video sequences which were tested did not necessarily have to contain scene changes. Four video sequences in QCIF format were tested. The most important configuration parameters of the reference software are shown in Table 4-3. The rest of the parameters retained their default values.

During our experiments we inserted 5, 10, 15 and 20 extra frames into the testing sequences, which corresponded to 5, 10, 15 and 20 scene changes, respectively. Then we measured the variation of the PSNR (eq. (I-5)) of the luma samples and the bit rate variation (eq. (I-1)) caused by these extra frames. The results are shown in Table 4-4, Table 4-5, Table 4-6 and Table 4-7.

From the results we see that the PSNR was not affected. This was expected, since we disabled the bit rate control and the encoder produced the best video quality at the expense of the bit rate. Indeed, the bit rate showed an increase proportional to the number of scene changes, as expected. However, this increase did not exceed 0.85% on average.

4.4.7.9 Conclusions

The proposed scheme is quite simple and it can be combined with any existing scene change detection method which works in the uncompressed domain, enabling fast scene detection in the compressed domain. In addition, it does not substantially affect either the video quality or the bit rate. The scheme is suitable for video indexing and retrieval.

Table 4-3: Configuration parameters of the encoder.

    Parameter                      Value
    Profile                        Baseline
    Frames                         800
    Q parameter for I & P frames   28
    Number of reference frames     2
    Motion Estimation Algorithm    EPZS
    RD Optimization                Low Complexity
    Rate Control                   Disabled


Table 4-4: PSNR and bit rate variations for 5 scene changes (990 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close             -0.00                   0.32
    highway                   0.00                  -0.04
    grandma                   0.01                   0.03
    mother & daughter        -0.01                   0.23
    Average                   0.00                   0.13

Table 4-5: PSNR and bit rate variations for 10 scene changes (1980 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close              0.01                   0.43
    highway                  -0.00                   0.37
    grandma                   0.01                   0.09
    mother & daughter        -0.00                   0.21
    Average                   0.00                   0.27

Table 4-6: PSNR and bit rate variations for 15 scene changes (2970 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close              0.01                   0.56
    highway                  -0.00                   0.25
    grandma                   0.01                   0.36
    mother & daughter        -0.00                   0.45
    Average                   0.00                   0.41

Table 4-7: PSNR and bit rate variations for 20 scene changes (3960 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close              0.01                   0.79
    highway                  -0.01                   0.75
    grandma                  -0.00                   1.19
    mother & daughter        -0.02                   0.66
    Average                  -0.01                   0.85


4.5 REAL TIME DATA HIDING BY EXPLOITING THE I_PCM MACROBLOCKS

4.5.1 Literature review

There are only a few H.264 data hiding techniques which can work in real time, such as [61], which embeds the watermark bit into the sign bit of the Trailing Ones in the Context Adaptive Variable Length Coding (CAVLC) of H.264. Most of the known data hiding techniques take place during the H.264 encoding [43-48, 63].

The proposed technique, however, can embed the data during the encoding process as well as in the compressed domain. It exploits the I_PCM mode used by the H.264 encoder during the intra prediction in order to hide the desired data. The data can then be extracted directly from the encoded stream without knowledge of the original host video. This method is best suited for content-based authentication and covert communication applications.

4.5.2 Intra mode prediction in H.264

In intra mode, a prediction block P is formed based on previously encoded and reconstructed blocks and is subtracted from the current block prior to encoding. Two primary types of intra coding are supported: Intra_4×4 and Intra_16×16 prediction. Chroma intra prediction is the same in both cases. A third type of intra coding, called I_PCM (or IPCM), is also provided for use in unusual situations. The encoder typically selects, for each block, the prediction mode that minimizes the difference between P and the block to be encoded.

The Intra_4×4 mode is based on predicting each 4×4 luma block separately and is well suited for coding parts of a picture with significant detail. The Intra_16×16 mode, on the other hand, performs prediction and residual coding on the entire 16×16 luma block and is more suited for coding very smooth areas of a picture. In addition to these two types of luma prediction, a separate chroma prediction is conducted. In contrast to previous video coding standards (esp. H.263+ and MPEG-4 Visual), where intra prediction has been conducted in the transform domain, intra prediction in H.264 is always conducted in the spatial domain, by referring to neighboring samples of previously decoded blocks that are to the left of and/or above the block to be predicted.


Since this can result in spatio-temporal error propagation when inter prediction has been used for neighboring macroblocks, a constrained intra coding mode can alternatively be selected, which allows prediction only from intra-coded neighboring macroblocks. In Intra_4×4 mode, each 4×4 luma block is predicted from spatially neighboring samples.

When the fidelity of the coded video is high (i.e., when the quantization step size is very small), it is possible, in certain very rare instances of input picture content, for the encoding process to actually cause data expansion rather than compression. Furthermore, it is convenient for implementation reasons to have a reasonably low, identifiable limit on the number of bits a decoder must process in order to decode a single macroblock. To address these issues, the standard includes an I_PCM macroblock mode, in which the values of the samples are sent directly, without prediction, transformation or quantization. An additional motivation for the support of this macroblock mode is to allow regions of the picture to be represented without any loss of fidelity. However, the I_PCM mode is clearly not efficient. Indeed, it is not intended to be efficient. Rather, it is intended to be simple and to impose a minimum upper bound on the number of bits that can be used to represent a macroblock with sufficient accuracy. If one considers the bits necessary to indicate which mode has been selected for the macroblock, the use of the I_PCM mode actually results in a minor degree of data expansion.

4.5.3 Real time Data Hiding

As explained in Section 4.5.2, an I_PCM macroblock is a macroblock in which the values of the samples are sent directly, without prediction, transformation or quantization. The concept behind our method is to hide the desired data in the low bits of both the luma and the chroma samples of an I_PCM macroblock. Eventually, the hidden data will be embedded into the compressed H.264 stream intact. Simple and straightforward though it is, the proposed method has to face two practical obstacles: the rareness of the I_PCM macroblocks during the encoding and the low efficiency of the I_PCM mode in terms of compression. The latter turns out to be a trade-off between the generated bit rate and the capacity of the hidden data. This issue is discussed in Section 4.5.4 with the help of some experimental results.

Regarding the rareness of the I_PCM macroblocks, we conducted several tests with many well-known video sequences. All of the tests resulted in no I_PCM macroblocks, no matter how low the quantization parameter was set. We therefore concluded that the only safe way to produce I_PCM macroblocks during the encoding is to force the encoder to regard specific macroblocks as I_PCM macroblocks.


The simplified block diagram of the proposed method, integrated into the H.264 encoder, is illustrated in Figure 4-9.

[Figure: the IPCM Decision block intervenes in the mode decision between the intra prediction (I4x4, I16x16, IPCM) and the inter prediction (16x16, 16x8, 8x16, 8x8, P8x8: 8x4, 4x8, 4x4), and the Data Embedding block operates on the selected I_PCM macroblocks.]
Figure 4-9: Simplified block diagram of the proposed method integrated into H.264.

The proposed method adds two new blocks to the H.264 encoder: the IPCM Decision block and the Data Embedding block.

The IPCM Decision block intervenes in the Mode Decision process of the H.264 encoding and forces certain macroblocks to be encoded as I_PCM macroblocks. The decision on which macroblocks are going to be encoded as I_PCM depends on the length of the data to be hidden. The rule is that the I_PCM macroblocks must be enough to cover the hidden data, and they must be spread, to the extent possible, within the H.264 stream. In the rare situation where the encoder decides to encode a macroblock as I_PCM without our intervention, the Data Embedding block will also use this macroblock to hide data.

The Data Embedding block takes action after the IPCM Decision and modifies the low bits of the values of the aforementioned macroblocks in such a way that the modified bits form the hidden data. The "tweaked" I_PCM macroblocks then undergo the lossless entropy encoding, and the hidden data are eventually inserted intact into the generated H.264 stream.


The proposed method is characterized by three features, namely ease of implementation, high data capacity and reusability. The latter allows data hiding in real time directly in the compressed domain. All of these features are described below.

4.5.3.1 Ease of Implementation

The proposed method can be easily integrated into the reference H.264 encoder. It takes place at a very early stage of the encoding process, before any spatial or temporal predictions and before the transformation and the quantization. Therefore, the impact of the proposed method on the encoding process is minimized. Some hints on how to implement the proposed algorithm using the reference H.264 encoder version 14.0 are given below:

1. Add the following snippet just before the "compute_mode_RD_cost" function is called. This code will force the encoder to encode the Nth macroblock of every P slice in I_PCM mode.

    if (img->current_mb_nr == N && img->type == P_SLICE)
        for (i = 0; i < 11; i++)
            enc_mb.valid[i] = 0;

2. Modify the low bits of the values of this macroblock. This is done under the "IPCM" case inside the "RDCost_for_macroblocks" function.

4.5.3.2 High Data Capacity

The data capacity of a video sequence in YUV 4:2:0 format is calculated in accordance with eq. (4-3):

$DataCapacity = LumaCapacity + 2 \times ChromaCapacity$    (4-3)

where

$LumaCapacity = 256 \times N_{IPCM} \times L_{bits}$    (4-4)

$ChromaCapacity = 64 \times N_{IPCM} \times C_{bits}$    (4-5)

where $N_{IPCM}$ is the number of I_PCM macroblocks used for data hiding, $L_{bits}$ is the number of low bits per I_PCM luma sample used for data hiding and $C_{bits}$ is the number of low bits per I_PCM chroma sample used for data hiding. Each macroblock contains 256 luma samples and 64 samples per chroma component.
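Eqs. (4-3) to (4-5) translate directly into a small C function, a minimal sketch with our own names:

    static long data_capacity(long n_ipcm, int luma_bits, int chroma_bits)
    {
        long luma   = 256L * n_ipcm * luma_bits;    /* eq. (4-4) */
        long chroma =  64L * n_ipcm * chroma_bits;  /* eq. (4-5) */
        return luma + 2 * chroma;                   /* eq. (4-3) */
    }

    /* one I_PCM macroblock with 4 low bits everywhere:
     * data_capacity(1, 4, 4) = 1024 + 2 x 256 = 1536 bits */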


According to [62], up to 3 low bits of an 8-bit sample can be modified without causing any visual distortion. However, our experiments showed that even if the 4 low bits are modified, the distortion is imperceptible. This is explained by the fact that we are not dealing with static images but with moving frames at a frame rate of 30 fps. Moreover, we embed no more than one I_PCM macroblock per frame and not in successive frames. Finally, the rest of the non-I_PCM macroblocks have possibly suffered greater distortion due to the intra/inter prediction and to the quantization during the encoding. In order to prove the above, we zeroed the 4 low bits of every luma and chroma sample of the 49th macroblock of the 9th frame of the mobile sequence. The sequence was encoded (QP=28, CABAC) and decoded with the H.264 JM 14.0 codec. Figure 4-10 shows the visual result. An interesting approach would be for the modifiable bits $L_{bits}$ and $C_{bits}$ to be mathematically related to the quantization parameter (QP). For example, for a high QP (>28) we could modify 4 bits, while for a lower QP we could modify 3 bits or fewer. Other combinations are also applicable, such as the use of 3 bits for the luma blocks and 4 bits for the chroma blocks. In the current implementation of the proposed method we modify the 4 low bits of both the I_PCM luma and chroma samples in order to hide the data. Hence, from eq. (4-3), a single I_PCM macroblock ($N_{IPCM} = 1$) with $L_{bits} = C_{bits} = 4$ gives a capacity of 1536 bits. This may be regarded as the upper limit of the capacity per I_PCM macroblock.

[Figure panels: a. Non-marked frame; b. Marked frame.]
Figure 4-10: Comparison of the visual results between a non-marked frame and a marked frame of the mobile sequence.
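The embedding step itself is a plain low-bit substitution. The following hedged C sketch replaces the low bits of each 8-bit I_PCM sample with the next message bits; the buffer layout and names are ours, not the JM encoder's, and next_bits() is an assumed message source declared extern.

    #include <stddef.h>
    #include <stdint.h>

    /* assumed to return the next n message bits */
    extern unsigned next_bits(int n);

    static void embed_in_ipcm(uint8_t *samples, size_t count, int low_bits)
    {
        size_t i;
        uint8_t mask = (uint8_t)((1u << low_bits) - 1u);
        for (i = 0; i < count; i++)
            /* keep the high bits, overwrite the low bits with message bits */
            samples[i] = (uint8_t)((samples[i] & ~mask)
                                   | (next_bits(low_bits) & mask));
    }

    /* for one I_PCM macroblock of a 4:2:0 frame: 256 luma samples plus
     * 2 x 64 chroma samples, i.e. embed_in_ipcm() over 384 samples */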


4.5.3.3 Reusability and Real Time Data Hiding

Most data hiding methods hide the data during the encoding process; thus they are slow and ineffective for real time applications such as covert mobile communication. The proposed method, as explained above, encodes some macroblocks in I_PCM mode and hides the data within their values. After the first pass, once the I_PCM macroblocks have been encoded, the same I_PCM macroblocks can be reused to hide new data, directly in the compressed domain, numerous times and in real time. The reusing process needs neither the original video sequence nor the original encoded stream. Furthermore, it does not cause any significant PSNR or bit rate distortions other than those introduced by the initial data hiding. This is because the method increases the bit rate in a deterministic way and by a fixed amount dependent only on the data capacity (the likely bit rate increase due to the entropy encoding is quite small). After the initial bit rate increment, the bit rate is not significantly affected any further by reusing the compressed I_PCM blocks. Moreover, modifying the low I_PCM bits differently does not cause any perceptible distortions, as explained in Section 4.5.3.2 above. To our knowledge, the proposed method is the only method which exhibits such a property. The I_PCM macroblock reuse works as described below.

The H.264 bitstream is organized in discrete packets, called "NAL units". NAL units are classified into VCL and non-VCL NAL units. The VCL NAL units contain the data that represent the values of the samples in the video pictures, and the non-VCL NAL units contain any associated additional information. The contents of the NAL units are entropy encoded. The reusing process is performed in four steps, as follows (a sketch of the loop follows the list):

    Step 1: Get a NAL unit from the H.264 stream.
    Step 2: Entropy decode the NAL unit and check whether it contains an
            I_PCM macroblock.
    Step 3: In the case of an I_PCM macroblock:
            a. entropy decode the macroblock,
            b. hide the new data in the low bits of the macroblock's values,
            c. entropy re-encode the macroblock.
    Step 4: Go to step 1.

The real time data hiding is achieved by the fact that the method only needs to entropy decode and re-encode the compressed I_PCM macroblocks, thus avoiding the time-consuming normal encoding process.
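The four-step loop above can be sketched in C as follows. The NAL access and entropy-(de)coding helpers are assumptions standing in for a real H.264 parser, declared extern as placeholders; only the control flow mirrors the steps, and embed_in_ipcm() is the sketch from Section 4.5.3.2.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Nal Nal;   /* opaque NAL unit */

    /* assumed helpers standing in for a real H.264 parser */
    extern Nal     *get_next_nal(void);                             /* step 1  */
    extern int      nal_has_ipcm_mb(Nal *nal);                      /* step 2  */
    extern uint8_t *entropy_decode_ipcm_mb(Nal *nal);               /* step 3a */
    extern void     entropy_reencode_ipcm_mb(Nal *nal,
                                             const uint8_t *s);     /* step 3c */
    extern void     embed_in_ipcm(uint8_t *samples, size_t count,
                                  int low_bits);                    /* step 3b */

    void rehide(void)
    {
        Nal *nal;
        while ((nal = get_next_nal()) != NULL) {                    /* step 4 */
            if (!nal_has_ipcm_mb(nal))
                continue;
            uint8_t *s = entropy_decode_ipcm_mb(nal);               /* step 3a */
            embed_in_ipcm(s, 384, 4);   /* step 3b: 256 luma + 2x64 chroma */
            entropy_reencode_ipcm_mb(nal, s);                       /* step 3c */
        }
    }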


Figure 4-11 shows the block diagram of the real time data hiding process.

Figure 4-11: Block diagram of the real time data hiding process.

4.5.4 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software are shown in Table 4-8. The rest of the parameters retained their default values.

The I_PCM macroblocks are expected, by nature, to have a negative impact on the produced bit rate. We conducted several tests in order to investigate this impact. For that purpose we used 300 frames, or 10 sec, of three well-known representative video sequences in QCIF format (YUV 4:2:0): akiyo (Class A), foreman (Class B) and mobile (Class C). The QCIF format (176x144) was chosen because it is very common in mobile applications, where the demand for real time operation is always high. Refer to Appendix II.3 for more details about the testing sequences.

The hidden message was generated by the pseudorandom integer generator function rand, which is provided by the standard C library. The testing procedure, also described in Appendix II.4, was to run the reference encoder with and without our algorithm and then compare the results with respect to the bit rate and the PSNR. We used the bit rate and PSNR variations as comparative metrics, calculated as in eq. (I-1) and eq. (I-5), respectively.
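As a side note on the test setup, the message generation is trivial to reproduce; a sketch using the standard C rand() follows. Only the use of rand() comes from the text; the seed parameter is our assumption for reproducibility, since the thesis does not state one.

#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* Fill 'buf' with a pseudorandom hidden message using the standard
 * C library's rand(), as in the tests. The seed is an assumption. */
static void make_message(uint8_t *buf, size_t nbytes, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < nbytes; ++i)
        buf[i] = (uint8_t)(rand() & 0xFF);
}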


In the first series of tests we ran the encoder for different, very common Quantization Parameters (10, 20 and 30) and for different data capacities (3072 - 18432 bits). Moreover, we did not apply any bit rate constraints. In this way the PSNR remained practically unaffected, with the exception of the QP = 10 test case, where the PSNR showed some degradation of less than 0.1 dB. The results are shown in Figure 4-12, Figure 4-13, Figure 4-14, Figure 4-15, Figure 4-16 and Figure 4-17.

From the results we see that the bit rate increase is proportional to the data capacity and to the quantization parameter. This is expected because the higher the quantization parameter is, the lower the bit rate produced by the normal reference encoding. Our method, on the other hand, increases the bit rate proportionally to the number of I_PCM macroblocks, i.e. to the data capacity. For QP=20 the method results in a continuous bit rate increase. It is notable that the akiyo sequence presents much higher bit rate variations than the other two sequences. This is explained by the fact that akiyo is a class "A" sequence, i.e. it has low spatial detail and a low amount of movement. That means that both the intra and the inter prediction leave small residuals during the normal encoding by the reference encoder, which eventually results in a very low bit rate. On the other hand, our modified encoder always produces the required I_PCM macroblocks, which increase the bit rate. Therefore, we conclude that for class "A" sequences and for high QPs (>20) the method cannot hide more than 5000 bits efficiently.

In the second series of tests we enabled the bit rate control mechanism of the encoder and set bit rate constraints of 60 kbps and 50 kbps, which are considered low bit rates by current Internet standards. Our purpose was to investigate the performance of our method when the marked H.264 bitstream has to be transmitted over a channel with limited bandwidth. With the bit rate control enabled, the quantization parameters were automatically adjusted by the encoder, which had to generate a bit rate lower than or equal to the constraint. In this way the bit rate was practically unaffected, as shown in Table 4-9. Apparently, the cost of doing so was put on the PSNR.


Figure 4-18, Figure 4-19 and Figure 4-20 show the PSNR variations when the bit rate control was enabled.

From the results we see that the overall performance became smoother when we enabled the bit rate control. The PSNR decreases proportionally to the data capacity, but the maximum decrement does not exceed 0.43 dB and 0.44 dB for the 60 and the 50 kbps constraints, respectively, at a capacity of 18432 bits. These decrements were observed in the akiyo sequence, as expected. The result is regarded as an acceptable trade-off, taking into account the low bit rate constraints and the small number of frames that were used (300).

In the third series of tests we compared our method with the method proposed by Y. Hu et al. [63]. Four QCIF sequences were used (bridge-close, grandma, news and silent; 199 frames each). The tests were performed using the H.264 Main Profile configuration (with RDO, CABAC, QP=28 and 30 frames/sec) and with a GOP structure of "IBPBPBPBPB". The results are shown in Table 4-10. PMC denotes the maximum capacity of the proposed method, while 11MC denotes the maximum capacity of the method of [63], which was used in the comparison tests.

Finally, our experiments showed that the proposed algorithm did not introduce serious delays in the encoding process. On the contrary, in most cases the encoding became faster because the encoder did not have to fully encode the I_PCM macroblocks.

Based on all of the above results, the overall conclusion is that the proposed method manages to hide 18 Kbits of data in just 300 frames, or 10 sec, of a wide range of video sequences in real time. It works better for bit rates around 60 kbps and higher, where the maximum PSNR degradation does not exceed 0.43 dB for the highest capacities. For that rate and for lower capacities, up to 10 Kbits, the proposed method has a very small impact on the PSNR.

4.5.5 Message extractor

The message extractor is a software tool, not necessarily an H.264 decoder, which extracts the hidden message from the marked H.264 bitstream. The message extractor works in a way similar to the algorithm that reuses the I_PCM macroblocks for data hiding and is described below:


Step  Action
1     Get a NAL unit from the H.264 stream
2     Entropy decode the NAL unit and check whether it contains an I_PCM macroblock
3     In case of an I_PCM macroblock:
      a. Entropy decode the macroblock
      b. Read the low bits of the macroblock's values in order to extract the message
4     Go to step 1

4.5.6 Further improvements

The fact that the proposed method inserts raw information (I_PCM macroblocks) into the H.264 bitstream opens up a lot of potential improvements. The I_PCM macroblock can be regarded as part of a still image. Therefore, many data hiding and watermarking techniques that work in the spatial domain can be applied [64].

4.5.7 Conclusions

The proposed data hiding method takes place during the encoding process and exploits the I_PCM coded macroblocks in order to hide the data. However, the same I_PCM macroblocks can be reused to hide new data, directly in the compressed domain, numerous times and in real time. The method is a blind scheme and it can achieve relatively high data capacities without considerably affecting either the video quality or the coding efficiency.

Table 4-8: Configuration parameters of the encoder.

Parameter           Value
Profile             Main
Number of Frames    300 (10 sec)
Frame Rate          30 fps
RD Optimization     High Complexity Mode
Motion Estimation   Simplified UMHexagonS
Intra Period        0 (only the first frame is intra)
Symbol Mode         CABAC


Table 4-9: Bit rate variations under bit rate control.

Sequence   Average Bit Rate Variation (%)
           50 kbps    60 kbps
Akiyo       0.05       0.06
Foreman    -0.02      -0.28
Mobile      0.03      -0.04

Table 4-10: Comparison results between the proposed method and [63].

Sequence      PMC/11MC      PSNR Variation (dB)    Bit rate Variation (%)
                            Proposed    [63]       Proposed    [63]
Grandma       15360/12352   -0.01       -0.08      2.93        3.72
Bridge-close  15360/11748    0.01       -0.04      1.15        2.90
News          18432/9972    -0.01       -0.01      3.80        3.23
Silent        18432/17368    0.02       -0.02      3.59        4.14

Figure 4-12: Akiyo - bit rate variation vs data capacity for different QPs.

Figure 4-13: Akiyo - PSNR variation vs data capacity for different QPs.


Figure 4-14: Foreman - bit rate variation vs data capacity for different QPs.

Figure 4-15: Foreman - PSNR variation vs data capacity for different QPs.

Figure 4-16: Mobile - bit rate variation vs data capacity for different QPs.


Figure 4-17: Mobile - PSNR variation vs data capacity for different QPs.

Figure 4-18: Akiyo - PSNR variation vs data capacity under bit rate control.

Figure 4-19: Foreman - PSNR variation vs data capacity under bit rate control.


Figure 4-20: Mobile - PSNR variation vs data capacity under bit rate control.


5 Bitrate Transcoding

5.1 INTRODUCTION

Video transcoding [75] performs one or more operations, such as bit rate and format conversions, to transform one compressed video stream into another. Transcoding can enable multimedia devices of diverse capabilities and formats to exchange video content on heterogeneous network platforms such as the Internet. One scenario is delivering a high-quality multimedia source (such as a DVD or HDTV) to various receivers (such as PDAs, Pocket PCs and fast desktop PCs) over wireless and wireline networks. Here, a transcoder (placed at the transmitter, the receiver or somewhere in the network) can generate appropriate bitstreams directly from the original bitstream without having to decode and re-encode. To suit the available network bandwidth, a video transcoder can perform dynamic adjustments of the bit rate of the video bitstream without additional functional requirements on the decoder. Another scenario is a video conferencing system on the Internet in which the participants may be using different terminals. Here, a video transcoder can offer dual functionality: provide video format conversion to enable content exchange and perform dynamic bit rate adjustment to facilitate proper scheduling of network resources. Thus, video transcoding is one of the essential components for current and future multimedia systems that aim to provide universal access.

5.2 PROBLEM FORMULATION

In this section we describe the target application and the issues that we need to address. The application is illustrated in Figure 5-1. Movies are stored in a media-streaming server. The server is connected to a gateway through an error-free high speed channel, e.g. Ethernet. A gateway is a network device that acts as an entrance to another network.


A movie is transmitted to the gateway and then to various devices with wireless capabilities, such as smart phones, PDAs, Tablet PCs and laptops, which may belong to the same or different networks or subnets.

Apparently, the gateway must perform some bit rate control in order to cope both with the bandwidths of the different networks and with network congestion. A universal approach for the gateway is to apply a bit rate transcoding technique. This assumes that the gateway is media aware, i.e. it recognizes that the input data is a media source. However, the gateway may be a standalone device, even an embedded one, with limited CPU and memory capabilities. Therefore, transcoders that are complicated with regard to CPU and buffering requirements cannot be implemented. On the other hand, low complexity transcoders, such as the open-loop transcoders, generate drift errors. Real time operation of the transcoder is, of course, mandatory. The proposed technique addresses all of the above issues.

Figure 5-1: The target application.

5.3 SOLUTION

In the following sections we present a novel bit rate transcoding technique. The basic concept behind our method is to drop frames in order to reduce the bit rate. Frame dropping is the obvious and easiest solution for a bit rate transcoder. As a matter of fact, dropping frames is inevitable when the frames cannot be transmitted (what else can the transmitter do but drop them?).


However, the dropped frames create a gap in the H.264 bitstream and desynchronize the H.264 decoder from the encoder, causing perceptible errors known as drift errors. These errors are accumulated and propagated in time (until the decoder reaches an intra frame), causing further distortions in the video. One way to avoid frame dropping altogether is to reduce the bit rate by applying a bit rate transcoding technique, such as decoding and re-encoding the video with a higher quantization step. Apparently, such techniques are time consuming compared to frame dropping, as we explain in Section 5.4.1.1, and hence they do not serve our target application well. Our method, however, proves that frames can be dropped in a controllable manner, causing imperceptible errors in the transmitted video. The method falls in the applied methods category, as this was defined in the Introduction.


5.4 BIT RATE TRANSCODING BY DROPPING FRAMES IN THE COMPRESSED DOMAIN

5.4.1 Literature review

Due to the nature of the H.264 standard [1] (high compression, low bit rate, etc.), H.264-encoded sequences are well suited to applications such as Video On Demand and video streaming over the Internet or other networks. On the other hand, rate control is an important issue in video streaming applications for both wired and wireless networks. Rate control techniques fall into the video transcoding category when they do not take place during the encoding of the original sequence. A typical scenario is delivering a high quality media source to various receivers (PCs, cell phones, PDAs, etc.) over wireless and wireline networks. The rate controller, hereafter the transcoder, must generate appropriate bitstreams directly from the original bitstream in order to accommodate different network bandwidths. Another scenario is delivering the media source to a receiver that supports only a lower frame rate. In that case the transcoder must reduce the frame rate. In our case the target application is a bit rate transcoder of low complexity and with low memory requirements, which can control the bit rate of H.264-encoded movies in the compressed domain in real time.

Basically, there are two ways of controlling the bit rate in the compressed domain: temporal transcoding, e.g. dropping frames, and bit rate transcoding on a per frame basis. Several bit rate and temporal transcoding techniques have been proposed in the past. An overview of various MPEG transcoding techniques is given in [65, 66]. Most of them were presented many years ago and, although they are applicable to H.264 video sequences, they do not take into account the special characteristics of the H.264 standard. Here, we review some representative bit rate and temporal transcoders, noting their limitations when they are applied to H.264 sequences.

5.4.1.1 Bit Rate Transcoders

In general, there exist four bit rate transcoding categories:

1. Cascaded pixel-domain transcoders: These require decoding and re-encoding of the bitstream.


2. Transform-domain transcoders: These require partial decoding of the bitstream, up to the inverse transformation of the coefficients.

3. Open-loop transcoders: These require entropy decoding and possibly rescaling and re-quantization of the coefficients.

4. A special category, where precautions are taken during the encoding with regard to the subsequent bit rate transcoding, such as hiding of data, detection of regions of interest, extraction of side information, etc.

Lefol et al. [67] evaluate the performance of some known bit rate transcoders when these are applied to H.264 bitstreams. The conclusion is that all of the open-loop transcoders result in severe drift errors. Drift error is defined as the error caused by the encoder-decoder prediction mismatch and is explained in Section 5.4.2.1. In order to avoid such errors, transcoders of the other categories must be used. However, our target application requires low complexity, low memory and real time implementation. Therefore, we exclude the cascaded pixel-domain transcoders since they cannot work in real time. We also exclude the transform-domain transcoders. These are supposed to work in real time when applied to previous standards, e.g. MPEG-2. This may not be true in H.264, especially if the CABAC entropy encoding is used. Besides, their implementation has some complexity, caused mainly by the different inter modes used by the H.264 encoder. Finally, we exclude the fourth category, assuming that the encoder takes no transcoding-related precautions. In conclusion, only the open-loop transcoders meet our requirements, but they cause errors that may lead to unacceptable degradation of the video quality.

5.4.1.2 Temporal Transcoding

The simplest temporal transcoding technique is random frame dropping. This causes severe drift errors, as will be shown in Figure 5-4. Several techniques that try to address this problem have been proposed [65]. The concept behind these techniques is illustrated in Figure 5-2. Consider three consecutive frames at times n-2, n-1 and n. Note that frame n is inter coded and uses n-1 as its reference. For some reason the middle frame n-1 is dropped. As a consequence, the macroblocks (MBs) in frame n will lose their references. Let us examine how a basic technique deals with this problem for a single (current) MB. The best match area referenced by the motion vector a of the current macroblock in frame n overlaps with at most four MBs in its reference frame n-1. Since frame n-1 is dropped, the purpose of this technique is to discover the most suitable motion vector b, which points from the overlapped area in frame n-1 to its best match in the non-dropped frame n-2. Eventually, the composed motion vector c replaces the original vector a.
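A sketch of the vector composition just described follows: vector a anchors the current MB on the dropped frame n-1, vector b carries the dominant overlapped MB of n-1 onward to n-2, and their sum c re-anchors the prediction on the surviving frame. This is our illustrative reading of the technique, not code taken from [65].

/* Quarter-pel motion vector, as used by H.264. */
typedef struct { int x, y; } MV;

/* Compose the current MB's vector a (frame n -> dropped frame n-1)
 * with vector b (overlapped MB in n-1 -> frame n-2); the result c
 * points from the current MB directly to the surviving frame n-2. */
static MV compose_mv(MV a, MV b)
{
    MV c = { a.x + b.x, a.y + b.y };
    return c;
}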


Figure 5-2: The basic concept behind the temporal transcoding.

These techniques work sufficiently well in the previous standards, where the size of the MB is fixed (16×16 for the luma block) during the inter prediction. However, H.264 has various inter modes with sizes of 16×16, 16×8, 8×16, 8×8 and sub8×8. For sub8×8 there are a further four sub-partitions, namely sub8×8, sub8×4, sub4×8 and sub4×4. Such a wide choice of blocks increases the complexity of the aforementioned techniques dramatically. The necessity of keeping the transcoder's complexity low is explained in Section 5.2. Based on the preceding review, we conclude that currently there is no transcoder that serves our target application, as described in Section 5.2, well. In this dissertation we propose a new low complexity bit rate transcoder, which works directly in the compressed domain in real time and either eliminates drift errors or causes only non-perceptible ones.

5.4.2 Main concepts

In this section we describe the main concepts on which the proposed technique is based, such as the H.264 prediction models, the frame types, the Network Abstraction Layer and the shot boundary detection. We also give a detailed description of the drift error, since this is the main problem that the proposed technique addresses.


5.4.2.1 Drift Error

Figure 5-3: The block diagram of the H.264 encoder-decoder.

The block diagram of an H.264 encoder and decoder is illustrated in Figure 5-3. A video inter frame X(n) is predicted from its reference frame and only the prediction differences are coded. As shown in Figure 5-3, the encoder also embeds a decoder in it, except for the entropy decoding part. The reason is that the encoder, in order to perform the motion estimation, must use as a reference the same reconstructed frame that is used by the decoder in order to perform the motion compensation. For example, the current frame X(n) will be predicted from the reconstructed frame X'(n-1) rather than from the original frame X(n-1). The same frame X'(n-1) will be used by the decoder to reconstruct the frame X'(n). If the frame X'(n-1) is either modified or missing from the H.264 bitstream, then drift errors are generated.


These errors accumulate and cause the video quality to deteriorate over time, until an intra frame is reached. The visual side-effect of the drift error is illustrated in Figure 5-4.

Figure 5-4: Drift errors (in circles) when frame 31 of the tennis sequence is missing from the H.264 bitstream. The sequence (in SIF format) was encoded using the JM16.2 reference H.264 encoder (QP = 28). The bitstream was decoded by the JM16.2 reference decoder and the missing frame (31) was concealed by the "Frame Copy" method.

5.4.2.2 Prediction Models

Intra Prediction: H.264 introduces a new model of intra prediction, also known as spatial prediction, where a macroblock is predicted from its neighbors. The macroblock is then subtracted from its prediction. The residuals are transformed using an integer transform and are quantized. An intra prediction is formed either for the complete macroblock or for each 4×4 block of luma samples (and the associated chroma samples) in the macroblock. Refer to Section 4.5.2 for more details.

Inter Prediction: Inter prediction, also known as temporal prediction, creates a prediction model where a macroblock is predicted from a previously encoded video frame using block-based motion compensation. Important differences from earlier standards include the support for a range of block sizes (from 16×16 down to 4×4) and fine sub-sample motion vectors. Refer to Section 3.1 for more details.

5.4.2.3 Frame (and Slice) Types

As explained in Section 2.6, there are three types of frames with regard to the prediction model that is applied to them, namely I, P and B. A brief description of these types is also given below, noting their differences from previous standards. However, there are also two other types, IDR and D (explained below). These two types are not defined by the way they are predicted but rather by the way they are decoded.


I-Frame: (Table 2-2) The macroblocks in an I frame can be predicted only using the intra prediction model.

P-Frame: (Table 2-2) The macroblocks in a P frame are predicted using the inter prediction model. The macroblocks are predicted from one or more (usually up to five) reference frames before the current frame. Another substantial difference from previous standards is that the H.264 encoder also allows the intra prediction of a macroblock in a P frame. The decision on which model will be used is based on the Rate-Distortion Optimization method, meaning that the encoder will choose intra instead of inter prediction if this results in better compression.

B-Frame: (Table 2-2) A B frame is in principle the same as a P frame. However, each inter-predicted macroblock in a B frame may be predicted from one or two reference frames before and after the current frame in temporal order. The difference from previous standards is that the H.264 encoder allows the B frames to be used as reference frames.

IDR-Frame: The H.264 standard introduces the Instantaneous Decoder Refresh (IDR) frame. The IDR is the same as an I frame. However, the subsequent P or B frames of an IDR frame are not allowed to use frames prior to the IDR as references. The first frame in a coded video sequence is always an IDR frame. Besides, the H.264 encoder may inject periodic IDR frames into the bitstream as an error resilience tactic, because the IDR frames stop the accumulation of the temporal prediction errors, such as the drift errors. Of course, the use of IDR frames results in an increased bit rate.

D-Frame: The H.264 standard also introduces the Disposable (D) frame. The D frame is a frame that cannot be used as a reference for other frames. In previous standards the D frame was synonymous with the B frame. Since the H.264 standard allows a B frame to be used as a reference, a distinct D frame had to be defined. The H.264 encoder may generate periodic D frames. However, this affects the inter prediction by limiting the choice of the reference frames. As a result, using D frames penalizes the bit rate. The D frames play a key role in the proposed technique.

5.4.2.4 Network Abstraction Layer (NAL)

H.264 makes a distinction between a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL) [15].


The purpose of specifying the VCL and NAL separately is to distinguish between coding-specific features (at the VCL) and transport-specific features (at the NAL). A coded video sequence is represented by a sequence of NAL units that can be transmitted over a packet-based network or a bitstream transmission link, or stored in a file. Each NAL Unit (NALU) contains a header and a set of data corresponding to coded video data (RBSP), as is shown in Figure 2-2. In the context of this chapter, the term frame often implies a NALU that contains a frame; the opposite also holds, i.e. the term NALU implies a frame. Figure 5-5 shows the first octet of the NALU header.

Figure 5-5: First octet of the NAL Unit (NALU) header.

What is interesting is that the NALU header contains useful information about the video data contained in the NALU, such as:

The F or forbidden_zero_bit was included to support gateways. The H.264 specification declares a value of 1 as a syntax violation.

The NRI or nal_ref_idc signals the relative importance of the NALU. A value of 00 in binary format indicates that the content of the NALU is not used to reconstruct reference frames for inter frame prediction.

The Type or nal_unit_type specifies the NALU payload type as defined in [1] and is also shown in Table 5-1.

The importance of the NALU header is that it reveals information about the video data without requiring it to be decoded. The proposed technique takes advantage of this information in order to decide which frames to drop.
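Since the whole transcoder hinges on this one octet, a small C sketch of its layout may help; the field positions (F in bit 7, NRI in bits 6-5, Type in bits 4-0) follow the H.264 specification.

#include <stdint.h>

/* Decompose the first octet of a NALU header (Figure 5-5):
 * bit 7 = F (forbidden_zero_bit), bits 6-5 = NRI (nal_ref_idc),
 * bits 4-0 = Type (nal_unit_type). */
typedef struct {
    unsigned f;    /* 1 bit  */
    unsigned nri;  /* 2 bits */
    unsigned type; /* 5 bits */
} NaluHeader;

static NaluHeader parse_nalu_header(uint8_t first_octet)
{
    NaluHeader h;
    h.f    = (first_octet >> 7) & 0x01;
    h.nri  = (first_octet >> 5) & 0x03;
    h.type =  first_octet       & 0x1F;
    return h;
}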


Table 5-1: NAL unit type codes.

nal_unit_type   Content of NAL unit
0               Unspecified
1               Coded slice of a non-IDR picture
2               Coded slice data partition A
3               Coded slice data partition B
4               Coded slice data partition C
5               Coded slice of an IDR picture
6               Supplemental enhancement information (SEI)
7               Sequence parameter set
8               Picture parameter set
9               Access unit delimiter
10              End of sequence
11              End of stream
12              Filler data
13..23          Reserved
24..31          Unspecified

5.4.2.5 Shot Boundary Detection

A sequence of frames captured by one camera in a single continuous action in time and space is referred to as a video shot. Shot boundary detection is the automated detection of the transitions between shots in video sequences. The shot is a key element of movies. Usually there are great dissimilarities between two successive shots. The proposed method takes advantage of these dissimilarities in order to decide which frames to drop.

5.4.3 Bit Rate Transcoder

The proposed method is mainly based on the disposable (D) frame concept (see Section 5.4.2.3). A D frame can be dropped without generating drift error because it is not used as a reference for other frames. The problem is that the H.264 encoder does not generate D frames by default. Even if it does, the frequency with which the D frames appear in the bitstream may not be the desirable one. However, many other frames can be regarded as D or "almost D" frames, meaning that only a few macroblocks within these frames are used as references for inter-predicted macroblocks in other frames. These frames can also be dropped, causing a non-perceptible drift error. There are also other frames that can be dropped under certain conditions. All of the aforementioned frames will be referred to collectively as droppable.

The purpose of the proposed transcoder is twofold. First, it must discover these droppable frames and signal them in the NALU header (Figure 5-5). Then it must drop the marked NALUs according to the bit rate requirements of the channel. Figure 5-6 shows the block diagram of the proposed method. There are two distinct components, namely the Droppable Frame Generator (DFG) and the Bit Rate Controller (BRC).


The main advantage of separating the two components is that they can be implemented in different devices, as shown in Figure 5-7. The BRC can be implemented in a gateway of limited capabilities, while the DFG can be implemented in a powerful streaming server. In that way one can correct or implement more advanced techniques for generating droppable frames whilst maintaining the same simple software in the gateway. After all, a gateway cannot be easily upgraded.

Figure 5-6: The block diagram of the proposed bit rate transcoder.

Figure 5-7: The implementation of the bit rate transcoder in separate devices.


5.4.3.1 Droppable Frame Generator (DFG)

The DFG takes into account the special encoding characteristics of the H.264 encoder in order to choose the frames that are candidates for being dropped. When these characteristics do not generate a sufficient number of candidates, the DFG applies several rules in order to increase their number.

Rule 1: If only a few (or, even better, no) macroblocks within a frame are used as references by other frames, then this frame can be dropped, causing a non-perceptible drift error in the decoded video. The proposed technique detects such frames using a shot boundary detection approach. The concept behind these methods is that within an H.264 sequence there is a strong inter-frame correlation unless significant changes occur. As a consequence, the different prediction types and the direction of the reference frames in a frame may indicate severe dissimilarities between consecutive frames, i.e. a shot boundary. A number of shot boundary detection methods for H.264-encoded sequences have been studied [69, 70, 71]. In our work we applied the method described in [69]. This method relies on the MB prediction types, the MB partitions and the display numbers of the reference pictures in order to detect a shot boundary. Figure 5-8 illustrates the possible positions of a shot boundary within an H.264 sequence.

Figure 5-8: Possible positions of a shot boundary.


When a shot boundary occurs at time t and the next frame F(t+1) is intra (I), H.264 will encode the frame's macroblocks exclusively as intra macroblocks, as already explained in Section 2.6. If F(t+1) is an inter frame (P), the H.264 encoder will prefer to spatially predict its macroblocks as if they were intra macroblocks. The reason is that the previous frame F(t) belongs to a different shot and bears little resemblance to F(t+1). Therefore, the spatial prediction is likely to compute smaller residuals than the temporal prediction. If F(t+1) is B, the H.264 encoder will prefer to use the next frame F(t+2) as a reference (Figure 5-8), for the same reason as in the previous (P) case. In any case, the frames that precede the shot boundary are not used as references by the frames that follow. Thus they can be dropped. The advantages of this method, in conjunction with [69], are the following:

- It works in the compressed domain.
- It is fast, because it requires only entropy decoding of the NALU. Moreover, it needs to know only the different macroblock types and the corresponding display numbers of the reference frames.
- It does not require much buffering, because it is applied on a per frame basis.
- A shot boundary leads to high peaks in the bit rate, mainly due to the spatial prediction that takes place at that time, as explained above and illustrated in Figure 5-8. As a matter of fact, there are shot boundary detection methods that examine the bit rate peaks in order to detect a shot boundary. Therefore, the shot boundary detection will probably fire exactly when rate control is actually needed.

Rule 2: If a frame is an IDR, then the previous frames are surely not referenced by the frames that follow. These frames are considered to be D frames and they can be safely dropped without affecting the visual quality of the decoded video.

Rule 3: If a frame is I and the number of the reference frames is one, then again the previous frames are D frames, since there is no way for them to be used as references.

The flow chart of the DFG for rules 1, 2 and 3 is shown in Figure 5-9; a sketch of the corresponding checks follows below.
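As promised above, a sketch of the rule 2 and rule 3 checks: both decisions need nothing more than the NALU type (Table 5-1) and, for rule 3, two fields obtained by entropy decoding the slice header. The function and parameter names are hypothetical.

enum { NAL_SLICE_IDR = 5 };  /* nal_unit_type 5, Table 5-1 */

/* Nonzero when the frame preceding the current NALU may be signalled
 * as droppable: rule 2 fires on an IDR picture, rule 3 on an I slice
 * encoded with a single reference frame
 * (num_ref_idx_l0_active_minus1 == 0). */
static int rule2_or_rule3(unsigned nal_unit_type,
                          int slice_is_intra,
                          int num_ref_idx_l0_active_minus1)
{
    if (nal_unit_type == NAL_SLICE_IDR)
        return 1;                                      /* rule 2 */
    return slice_is_intra
        && num_ref_idx_l0_active_minus1 == 0;          /* rule 3 */
}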


Figure 5-9: Flow chart of rules 1, 2 and 3, as applied by the Droppable Frame Generator (DFG).

Rule 4: The frame X(k-r) is a D-like frame if X(k-r+1) is an I frame and, for every frame X(k-n) with 0 <= n <= r-2, either X(k-n) is an I frame or

$$T_{ref} \;\le\; \sum_{i=0}^{r-(n+1)} \sum_{j=0}^{N_{k-n}} idx(i,j) \;\le\; N_{k-n} \qquad (5\text{-}1)$$

where I denotes an intra frame, r is the number of reference frames used during the encoding, N_{k-n} is the total number of luma blocks and sub-blocks in frame X(k-n), idx(i,j) is an indicator that equals 1 when the list-0 reference index of block j equals i (and 0 otherwise), and T_ref is a threshold used to characterize a D-like frame.


In this work we use T_ref = N_{k-n}.

The condition of eq. (5-1) is very commonly met when the motion between successive frames is slow and/or smooth. Rule 4 is effectively the same as rule 3 for r = 1. Figure 5-10 illustrates this rule.

Figure 5-10: Example of rule 4. Frame X(k-4) is droppable because it is not used as a reference by the following frames.

Rule 5: Frame X(k) can be dropped if the following condition is met:

$$T_m \;\le\; \sum_{i=0}^{P} MB_P(i) \;+\; \sum_{j=0}^{M} MB_{16 \times 16}(j) \;\le\; N \qquad (5\text{-}2)$$

where N is the total number of inter-predicted MBs in frame X(k), MB_P denotes a skipped MB, MB_{16x16} denotes a 16×16 MB with a (0, 0) motion vector, i.e. a static MB, and T_m is a threshold used to denote a droppable frame. In this work T_m = 3N/4.

The condition of eq. (5-2) holds for frames that have large static areas and almost no motion, such as the frames of a surveillance camera. Rule 5 mostly indicates a frame that is almost identical to the previous one and can thus be replaced by it.

Rule 6: A frame is droppable if the NRI field in its NALU header equals zero. This is actually arranged during the encoding by setting the encoder's parameter DisposableP to one, which generates a bitstream where every second P frame is disposable and thus droppable.
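Of the rules above, rule 5 is the most mechanical to test; the following sketch counts skipped and static 16×16 macroblocks against the threshold T_m = 3N/4. The MbInfo layout is hypothetical, not a JM structure.

/* Per-macroblock summary, assumed to be filled in after entropy
 * decoding the frame's macroblock layer. */
typedef struct {
    int is_skipped;     /* P_Skip macroblock             */
    int is_16x16;       /* 16x16 inter partition         */
    int mvx, mvy;       /* motion vector of the 16x16 MB */
} MbInfo;

/* Rule 5: droppable when skipped MBs plus static 16x16 MBs (zero
 * motion vector) make up at least 3/4 of the N inter-predicted MBs. */
static int rule5_droppable(const MbInfo *mb, int n_inter)
{
    int count = 0;
    for (int i = 0; i < n_inter; ++i) {
        if (mb[i].is_skipped ||
            (mb[i].is_16x16 && mb[i].mvx == 0 && mb[i].mvy == 0))
            ++count;
    }
    return 4 * count >= 3 * n_inter;   /* count >= (3/4) * N */
}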


5.4.3.1.1 Signaling of the droppable frames

It is important that the DFG signals the droppable frames in the NALU header without violating its syntax. The DFG does so by modifying the NALU header (Figure 5-5) as follows:

If the NALU contains a droppable frame:
    Set NRI = 0
    Set F = 1 (rules 1, 2, 3, 4)
    Set F = 0 (rules 5, 6)

Here NRI = 0 means that the current NALU contains a droppable frame, and F = 1 indicates that none of the frames preceding the current frame is used as a reference by the frames that follow. For droppable frames detected by rules 5 and 6 we set F = 0 in order to protect the previous frames, since rules 5 and 6 detect one droppable frame at a time. Note that F = 1 normally denotes a syntax violation within the NALU, but this bit is used only by gateways and has no impact on the decoding. Therefore, we can safely use it for our purposes. The BRC will later reset this flag, although this is not strictly necessary.
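The marking itself touches only the first octet of the NALU header; a minimal sketch, with the bit positions taken from Figure 5-5:

#include <stdint.h>

/* Mark a droppable frame in the first octet of its NALU header:
 * NRI (bits 6-5) is cleared for every droppable frame; F (bit 7) is
 * raised for rules 1-4 and left at 0 for rules 5 and 6, which
 * protect the preceding frames. */
static uint8_t mark_droppable(uint8_t first_octet, int rule)
{
    first_octet &= 0x9F;                  /* NRI = 00          */
    if (rule >= 1 && rule <= 4)
        first_octet |= 0x80;              /* F = 1             */
    else
        first_octet &= 0x7F;              /* F = 0 (rules 5,6) */
    return first_octet;
}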


5.4.3.2 Bit Rate Controller (BRC)

The BRC drops frames based on the bit rate constraints of the transmission channel. Here we focus on the frame dropping itself, assuming that the bit rate constraint is known. The BRC is as simple as a parser of the NALU header. As shown in Figure 5-11, when NRI=0 and F=1 the BRC must drop k frames in order to meet the bit rate constraint. The variable k is crucial for the bit rate reduction as well as for the visual quality of the decoded video. The larger k is, the greater the achieved bit rate reduction. However, this comes at the expense of the video quality; the visual side effect of a large k is usually an abrupt shot change and/or an abnormal object movement due to the k missing frames. The value of k clearly depends on the number of frames between two successive droppable frames: if droppable frames are sufficiently frequent within a sequence, k stays small and close to one. In general, k is defined as in eq. (5-3):

$$1 \le k \le N_D \qquad (5\text{-}3)$$

where N_D is the number of frames between two successive droppable frames. This is the reason why the BRC needs to buffer up to N_D + 1 frames.

Figure 5-11: The flow chart of the Bit Rate Controller (BRC).
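To round off the picture, a sketch of the BRC loop corresponding to Figure 5-11; all stream and buffer helpers are hypothetical placeholders.

#include <stdint.h>

typedef struct NalUnit NalUnit;

extern NalUnit *brc_next_nal(void);                  /* pull from the stream */
extern uint8_t  brc_first_octet(const NalUnit *nal);
extern int      brc_reduction_needed(void);          /* channel feedback     */
extern void     brc_drop_buffered(int k);            /* discard k frames     */
extern void     brc_forward(NalUnit *nal);           /* emit to the channel  */

/* Forward NALUs until a marked droppable frame arrives (NRI == 0 and
 * F == 1) while a bit rate reduction is required; then drop k frames,
 * with 1 <= k <= N_D as in eq. (5-3). */
void brc_loop(int k)
{
    NalUnit *nal;
    while ((nal = brc_next_nal()) != NULL) {
        uint8_t  oct = brc_first_octet(nal);
        unsigned f   = (oct >> 7) & 0x01;
        unsigned nri = (oct >> 5) & 0x03;
        if (nri == 0 && f == 1 && brc_reduction_needed())
            brc_drop_buffered(k);
        else
            brc_forward(nal);
    }
}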


5.4.3.3 Semantics Violation

The gaps left by the dropped frames constitute a semantics violation for the decoder. In practice, decoders are deployed in devices, such as cell phones, which expect to receive a bitstream through an error prone channel. As a matter of fact, the JM16.2 reference H.264 decoder supports two error concealment methods, namely Frame Copy and Motion Copy [68]. In our case, we take no action to correct the semantics violation and rely on the decoder's error concealment instead.

5.4.3.4 Performance Aspects

The proposed technique guarantees fast execution, since it works directly in the compressed domain and requires only entropy decoding. The speed is also increased by the separate implementation of the DFG and the BRC (Figure 5-7). We shall therefore focus on the bit rate reduction and the memory requirements. Clearly, the bit rate reduction depends on the capability of the DFG to detect as many droppable frames as possible. This, however, cannot be guaranteed for every sequence. For example, a video sequence with a GOP of the form IPPP..., where only the first frame is I, which also has complex motion (rapid continuous movement of objects, camera zooming, camera motion, etc.) and a long duration without shot changes, is not friendly to the proposed technique. However, in real life, video sequences that are expected to be transmitted wirelessly have some error resilience provisions, such as periodic I or IDR frames and possibly some Flexible Macroblock Ordering (FMO) scheme. Moreover, if the sequence is a movie or sports news, we can assume that it will have many shots. All of these reasonable assumptions make the proposed technique very efficient with regard to the bit rate reduction, because most of the rules described in Section 5.4.3.1 can be applied. The memory requirements, in turn, are determined by how frequently droppable frames appear in the video sequence, as explained in Section 5.4.3.1: we always need to buffer the NALUs between two successive droppable frames.

5.4.4 Simulation Results

5.4.4.1 Simulation Setup

The simulation setup is described in Appendix II. In order to evaluate the performance of the proposed technique we followed two approaches. First, we measured the impact that each rule (described in Section 5.4.3.1) has on the bit rate and on the video quality separately. Several well-known representative video sequences in CIF and QCIF format were used. Secondly, we applied our technique to a movie. For that experiment we used a chunk of the movie "Bourne Ultimatum".


5.4.4.2 Encoding Aspects

We used the JM16.2 reference encoder to encode the testing sequences so that rules 2, 3, 4, 5 and 6 could be applied. The baseline profile was used and the configuration parameters retained their default values apart from the ones shown in Table 5-2. These were differentiated according to the rule to be applied.

5.4.4.3 Decoding Aspects

We used the JM16.2 reference decoder to decode the testing sequences. The configuration parameters of the decoder retained their default values with the exception of the error concealment parameter, which was set to one, i.e. the Frame Copy error concealment method was used. The error concealment was required in order to apply the metrics of eq. (I-1) and eq. (I-4), i.e. the number of decoded frames of the transcoded bitstream had to equal the number of decoded frames of the original bitstream.

5.4.4.4 Results

The Bit Rate Variation (eq. (I-1)) and the average PSNR (eq. (I-4)) for each rule (2, 3, 4, 5 and 6) and for the sequences of Table 5-2 are shown in Table 5-3. The minus sign in the bit rate variation denotes a bit rate reduction. Moreover, the larger the APSNR, the better the visual quality. Figure 5-12, Figure 5-13, Figure 5-14, Figure 5-15 and Figure 5-16 show the per-frame PSNR between the original and the transcoded sequences. A value of 100 means that a transcoded frame is identical to the original frame; usually a value of more than 35 dB is considered good quality. From Table 5-3 and from the figures one can notice the effect of increasing the value of k (see eq. (5-3)).

We also used movies in our experiments. Movies have many shots, so we could easily apply rule 1 along with the other rules. We played back the transcoded sequence using a well-known video player, "Elecard AVC HD" [73], which simply ignores the skipped frames and decodes the rest. Moreover, we applied the MOS metric (Appendix-Table I-1) in order to evaluate the visual quality of the transcoded sequence. In the depicted experiment we set a bit rate constraint of 15 Mb/s, while the bit rate of the original sequence exceeded 20 Mb/s. Then we applied the rules in order to meet that constraint. Figure 5-17 shows the bit rate of the transcoded sequence vs the original bit rate. The bit rate peak indicates the time when a shot boundary occurred. At that time we applied the shot boundary detection technique (rule 1) in order to decrease the bit rate.


Subjective evaluation was performed in accordance with the double stimulus impairment scale (DSIS) method [74], and the Mean Opinion Score (MOS) (Appendix-Table I-1) was obtained using the MSU tool [72]. The MOS measurement of the transcoded sequence was 4.6 according to Appendix-Table I-1, which means that the subjects could hardly detect any errors.

5.4.5 Further improvements

An H.264 encoder is highly configurable. The reference encoder JM16.2 exposes more than 160 configuration parameters, which result in different encoding schemes. Moreover, many of them can be detected directly or indirectly in the compressed domain by an intelligent algorithm. That makes possible the discovery of many other rules, like those described in Section 5.4.3.1, which can detect droppable frames. As a consequence, the chances of detecting droppable frames would increase and the performance of the proposed technique would be further improved.

5.4.6 Conclusions

A new bit rate transcoding technique, suitably adapted to H.264 encoded sequences, is proposed. It works directly in the compressed domain because it requires only entropy decoding of the H.264 bitstream. The basic concept behind the method is to discover the frames within the H.264 bitstream that are not used as references by other frames and to drop them in order to meet the bandwidth constraints of a communication channel. This is achieved by applying several rules, which take into account a number of parameters, such as the number of reference frames used by the encoder as well as the possible dissimilarities between successive frames. The effectiveness of the method clearly depends on the number of non-reference frames that it can detect. The method proved to be very efficient under certain conditions and works in real time. Extensive objective and subjective tests give excellent results. The technique could be further improved by discovering more rules that lead to more droppable frames.


Table 5-2: Configuration parameters of the encoder.

Rule  Sequence                         Parameters
2     akiyo (qcif, 150 frames)         IDRPeriod=5, NumberReferenceFrames=5
3     bridge-close (qcif, 150 frames)  IntraPeriod=8, NumberReferenceFrames=1
4     grandma (qcif, 150 frames)       IntraPeriod=10, NumberReferenceFrames=5
5     bridge-far (qcif, 150 frames)    IntraPeriod=10, NumberReferenceFrames=5
6     stefan (cif, 89 frames)          DisposableP=1

Table 5-3: Bit rate variation and APSNR.

Rule  k  Bit Rate Var. (%), eq. (I-1)  APSNR (dB), eq. (I-4)
2     1  -6.475694                     88.13477
2     2  -7.040593                     67.87395
2     3  -8.377952                     65.93387
3     1  -6.255422                     92.32405
3     2  -10.296116                    42.00108
3     3  -14.074904                    41.31070
4     1  -3.641593                     95.07396
4     2  -7.493855                     70.67783
4     3  -11.168888                    66.59341
5     1  -26.828371                    62.02823
6     1  -44.526611                    60.36717

Figure 5-12: PSNR results between the original and the transcoded sequence akiyo (QCIF, 150 frames) for k=1, 2 and 3.


[Plot: Rule 3 — PSNR (dB) per Frame for bridge-close, k = 1, 2, 3]
Figure 5-13: PSNR results between the original and the transcoded sequence bridge-close (QCIF, 150 frames) for k=1, 2 and 3.

[Plot: Rule 4 — PSNR (dB) per Frame for grandma, k = 1, 2, 3]
Figure 5-14: PSNR results between the original and the transcoded sequence grandma (QCIF, 150 frames) for k=1, 2 and 3.

Figure 5-15: PSNR results between the original and the transcoded sequence bridge-far (QCIF, 150 frames).


Figure 5-16: PSNR results between the original and the transcoded sequence stefan (CIF, 89 frames).

Figure 5-17: Bitrate of the transcoded movie (Bourne Ultimatum) vs. the original one.


6 Epilogue

In the previous chapters we described our research work in three different areas of H.264 video coding, namely Inter Prediction (Chapter 3), Data Hiding (Chapter 4) and Bit Rate Transcoding (Chapter 5). Moreover, our research covered more than one aspect within each area. More specifically, in the context of Inter Prediction, we developed:

- A new fast full search algorithm, which reduces the complexity of the H.264 encoder by 53.57% and 32.34% compared to the full search and the fast full search algorithms officially adopted by the reference encoder, respectively.

- A new spatio-temporal predictor, akin to the median predictor, which in effect defines a new search area during motion estimation. This may result in a 7.3% reduction of the motion estimation time. This is a considerable improvement over the existing fast motion estimation (FME) algorithms, considering that the proposed scheme leaves the main core of these algorithms as is and simply modifies the initial search point.

- A fast multiple reference frame selector, which reduces the number of reference frames used by the motion estimation process. This may result in an 80% reduction of the motion estimation time on average.


- A moving object detection method, which uses the motion vectors produced by the motion estimation in order to detect a moving object. The method is very fast because it works directly in the compressed domain, and thus it suits well a variety of applications with time constraints, such as CCTV-based video surveillance.

In the context of the Data Hiding research we developed:

- A method which manipulates the modes during inter prediction in order to hide data. The method achieves a high capacity of hidden data: it can hide 1600 bits of data in 3 sec (30 fps) of a wide range of video sequences, and the capacity can be further improved due to the expandable nature of the method. Its main advantage is that it achieves this capacity without affecting the visual quality of the video.

- A scene change detection method, which is based on the previous data hiding method. The method can be combined with any existing scene change detection method that works in the uncompressed domain, enabling fast scene detection in the compressed domain. Hence, the method can work in real time and suits well applications such as video indexing.

- A method which exploits the special I_PCM macroblocks in order to hide data. The method can hide 18 Kbits of data in just 10 sec (30 fps) of a wide range of video sequences without affecting the visual quality of the video. In addition, the method has unique capabilities, not present in earlier methods and not yet surpassed. First of all, it can work in real time, directly in the compressed domain. Secondly, the marked H.264 bitstream can be reused for hiding new data numerous times without the need for the original video, without having to decode and re-encode the bitstream, and without degrading the quality of the video. This is due to the nature of the I_PCM macroblock, which allows pixel values to be inserted into the bitstream intact, i.e. without being predicted.

It must be stressed that the data hiding methods open new directions in data hiding research in video, not only because of their unique capabilities (high data capacity, real-time operation, reusability of the marked streams, etc.) but also because, for the first time, they moved the cost of the hidden data from the PSNR to the bit rate, in contrast to all previously existing methods.
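To make the I_PCM property concrete: an I_PCM macroblock stores its 256 luma samples (plus chroma) verbatim in the bitstream, so a spatial-domain hiding scheme can write into those bytes without any prediction or requantization loop. The C sketch below is an illustrative LSB-substitution embed into one I_PCM luma block; it shows the mechanism only and is not necessarily the exact mapping used by the method of Chapter 4.

    /* Illustrative only: embed `nbits` message bits into the least
     * significant bits of an I_PCM macroblock's 16x16 luma samples.
     * Because I_PCM samples are stored uncoded, the block can be
     * re-embedded later without decoding or re-encoding the stream. */
    void ipcm_embed_lsb(unsigned char luma[16][16],
                        const unsigned char *msg_bits, int nbits)
    {
        for (int k = 0; k < nbits && k < 256; k++) {
            int r = k / 16, c = k % 16;
            luma[r][c] = (unsigned char)((luma[r][c] & 0xFE) | (msg_bits[k] & 1));
        }
    }

Re-marking works because the write neither invalidates any predictor nor changes the bitstream length, which is exactly the reusability property stressed above.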


In the context of the Bit Rate Transcoding we developed:

- A transcoder which controls the bit rate of an H.264 bitstream by dropping frames in the compressed domain. Frame dropping would normally be expected to cause severe visual degradation due to "drift" errors. However, our method controls the frame dropping in such a way that the "drift" errors are either eliminated or made imperceptible. It achieves this by detecting the frames which are not used as references by other frames and can thus be dropped. The droppable frame detection is made possible by analyzing the NAL units which compose the H.264 bitstream.

Looking at the above methods, one may get the impression that there is great diversity and possibly some inconsistency among them. However, a closer look reveals that the common underlying core is the inter prediction scheme of H.264. Apart from the obvious inter prediction methods (fast full search algorithm, spatio-temporal predictor and reference frame selector), the object detection is also based on the motion vectors of the inter prediction. The first data hiding method (and its associated scene change detection method) likewise takes advantage of the inter prediction. Even the I_PCM based data hiding method, which takes place during the intra prediction, actually interferes with the mode decision process while encoding an inter frame. This happens because the H.264 encoder allows a macroblock to be predicted as intra (and as I_PCM) even if it belongs to an inter frame. Finally, the bit rate transcoder is also related to the inter prediction because it deals with drift errors; these errors are generated when frames which the H.264 decoder needs to use as references during motion compensation are dropped.

The above brief analysis makes clear the importance, as well as the key role, of the inter prediction within H.264 encoding. As a matter of fact, the inter prediction affects H.264 in multiple ways: it increases the compression ratio but at the same time increases the complexity of the encoder; it puts time constraints on several applications but at the same time enables other applications to work in real time. It is this contradictory behavior of the inter prediction which makes it an excellent research area. In the following sections, we discuss the contribution of the proposed methods to the H.264 field and their potential improvements.


6.1 CONTRIBUTION

In this section we discuss the overall contribution of our work to the field of H.264 video coding. The section is divided into three parts, namely Inter Prediction, Data Hiding and Bit Rate Transcoding, consistent with the research areas described in Chapters 3, 4 and 5, respectively. The contribution of each part is assessed based on the novelty of the research, as well as on whether our work introduced new aspects to H.264. Other criteria that could also be taken into account, though not necessarily, are the number of publications produced (8) and the number of citations so far (30).

6.1.1 Inter prediction

The contribution of the proposed inter prediction methods is somewhat limited. This is partly due to the nature of the proposed methods, in the sense that they do not cover all of the inter prediction aspects; they are rather designed to work in conjunction with other existing techniques in order to improve their performance. The fact that many similar techniques already exist in the literature also justifies the low rating. Apart from the above, there are a couple of other reasons why the contribution of our methods remained low. First of all, most (if not all) of the inter prediction methods are heuristic and based on experimental observations. This means that they are likely to be sequence-dependent and hence cannot perform well for all sequences. Apparently, this puts some limitations on the performance and makes it very difficult for a new method to make an outstanding contribution. Finally, there is great competition in the area, because many researchers, being familiar with the motion estimation techniques of previous standards (MPEG-2, H.263, etc.), continue their research in H.264 inter prediction. As a consequence, all of the inter prediction aspects are covered, and new methods will eventually be similar or comparable to existing ones. The moving object detection method described in Section 3.7 is an exception: it is a novel method, which works exclusively on H.264 videos, directly in the compressed domain. However, its limitations, described in Section 3.7.4, also constrict the method's value.

6.1.2 Data hiding

The contribution of the proposed data hiding methods is high. The main reason is that we followed a different approach from the mainstream.


We moved the cost of the hidden data from the visual quality to the bit rate, i.e. the hidden data do not degrade the visual quality but increase, although slightly, the bit rate. Previous methods hid data at the expense of the visual quality. Moreover, the method described in Section 4.5 is one of the very few data hiding methods which work in real time. In addition, our method has unique capabilities (Section 4.5.3.3) which, to our knowledge, do not exist in any other data hiding method.

6.1.3 Bitrate transcoding

The contribution of the proposed bit rate transcoding method is medium. The method is based on the well-known technique of dropping frames in order to satisfy the bandwidth constraints of a communication channel. However, frame dropping, although simple, is normally very ineffective because it causes severe distortions in the video. Our method controls the frame dropping in such a way that the distortions are either eliminated or, in the worst case, become imperceptible. To our knowledge, there is no other similar method in the open literature.

6.2 FURTHER IMPROVEMENTS

The proposed methods have been categorized as enhancements and applied methods. The inter prediction methods (fast full search algorithm, spatio-temporal predictor and reference frame selector) fall into the first category, while all of the rest fall into the latter. The role of an enhancement is to improve the performance, here by reducing the inter prediction complexity of the H.264 encoder. As such, an enhancement is not open to drastic improvement. However, the enhancements could be combined, either with each other or with other existing techniques, in order to increase the overall performance. The applied methods, on the other hand, present many potential improvements, which are enumerated below.

The current design of the moving object detection has some limitations. First of all, the accuracy of the method, especially the detection of an object's contour, heavily depends on the number of sub-blocks produced during the motion estimation. This means that the lack of a sufficient number of sub-blocks, due either to a high quantization parameter (QP) or to slow motion, may lead to rather crude object detection. Moreover, the method cannot handle complex motions, such as the overlapping motions of two or more moving objects. The improvement has to do with eliminating these limitations.


The current design of the data hiding method during the inter prediction uses only 4 different block types, namely 16×16, 16×8, 8×16 and 8×8. However, the scheme can also use the sub-partitions of the 8×8 type (8×4, 4×8, 4×4), thus increasing the available bits for coding to 8. Apparently, the additional bits will increase the data capacity while at the same time decreasing the number of "tweaked" macroblocks. Moreover, the scheme used consecutive macroblocks within a single frame in order to hide the data. Another improvement would be to spread the macroblocks across the frame, or better still across multiple frames. This approach would improve the coding efficiency, since the "motion error" produced by the scheme would not accumulate in one place. In addition, the assignment of the binary codes in Table 4-1 could be modified so as to take some video statistics into account. For example, the 16×16 block type appears more often than the other types. The message could therefore be coded using Huffman coding, with the shortest Huffman code assigned to the 16×16 block type (see the sketch at the end of this section). The gain of this approach would be that our scheme would most likely choose the block type which the encoder would have chosen in normal operation, without our interference.

The I_PCM based data hiding method inserts raw information into the H.264 bitstream. This opens up a lot of potential improvements, because the I_PCM macroblock can be regarded as part of a still image. Therefore, many data hiding and watermarking techniques which work in the spatial domain can be applied.

The Bit Rate Transcoder could take advantage of the high configurability of the H.264 encoder in order to detect more droppable frames. The reference encoder presents more than 160 configuration parameters, which result in different encoding schemes. Moreover, many of them can be detected, directly or indirectly, in the compressed domain by an intelligent algorithm. That makes possible the discovery of more droppable frames, increasing the performance of the method.
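To illustrate the statistics-aware assignment suggested above, the sketch below pairs message bits with inter partition modes. The fixed two-bit mapping mirrors the flavor of Table 4-1 (the exact table is defined in Chapter 4), while the variable-length alternative gives the shortest codeword to the most frequent 16×16 mode, Huffman-style; the concrete codewords here are illustrative assumptions, not values from the thesis.

    /* Fixed-length mapping in the spirit of Table 4-1:
     * 2 message bits select one of the four top-level partition modes. */
    enum mode { M16x16, M16x8, M8x16, M8x8 };

    static enum mode embed_fixed(unsigned two_bits)
    {
        static const enum mode map[4] = { M16x16, M16x8, M8x16, M8x8 };
        return map[two_bits & 0x3];
    }

    /* Illustrative variable-length alternative (assumed codewords):
     *   "0"   -> 16x16 (most probable)    "10"  -> 8x8
     *   "110" -> 16x8                     "111" -> 8x16
     * A Huffman-coded message then steers the encoder toward the mode
     * it would usually pick anyway. `bits` holds one 0/1 per byte. */
    static enum mode embed_vlc(const unsigned char *bits, int *pos)
    {
        if (bits[(*pos)++] == 0) return M16x16;      /* "0"   */
        if (bits[(*pos)++] == 0) return M8x8;        /* "10"  */
        return bits[(*pos)++] == 0 ? M16x8 : M8x16;  /* "110" / "111" */
    }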


References

[1] ITU-T Rec. H.264 (05/2003), "Advanced video coding for generic audiovisual services", T-REC-H.264-200903-S.
[2] W. Li and E. Salari, "Successive elimination algorithm for motion estimation", IEEE Transactions on Image Processing, Vol. 4, Issue 1, Jan. 1995.
[3] T. Toivonen and J. Heikkila, "Fast full search block motion estimation for H.264/AVC with multilevel successive elimination algorithm", International Conference on Image Processing (ICIP), Singapore, October 2004.
[4] M. Yang, H. Cui and K. Tang, "Efficient tree structured motion estimation using successive elimination", IEEE Proceedings on Vision, Image and Signal Processing, Vol. 151, Issue 5, 30 Oct. 2004.
[5] Ce Zhu, Wei-Song Qi and W. Ser, "Predictive fine granularity successive elimination for fast optimal block-matching motion estimation", IEEE Transactions on Image Processing, Vol. 14, Issue 2, Feb. 2005.
[6] Tianding Chen and Quan Xue, "Fast Motion Estimation with Multilevel Successive Elimination Algorithm and Early Termination for H.264/AVC Video Coding", International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM), Wuhan, China, Sep. 2006.
[7] Yang Song, Zhenyu Liu, T. Ikenaga and S. Goto, "Enhanced Strict Multilevel Successive Elimination Algorithm for Fast Motion Estimation", IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, USA, May 2007.
[8] Jong-Nam Kim and Tae-Sun Choi, "A fast full-search motion-estimation algorithm using representative pixels and adaptive matching scan", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, Issue 7, Oct. 2000.


[9] Chen-Fu Lin and Jin-Jang Leou, "An adaptive fast full search motion estimation algorithm for H.264", IEEE Int. Symp. on Circuits and Systems (ISCAS), Kobe, Japan, May 2005.
[10] I. Ahmad, Weiguo Zheng, Jiancong Luo and Ming Liou, "A fast adaptive motion estimation algorithm", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 3, 2006.
[11] Yan-Ho Kam and Wan-Chi Siu, "A Fast Full Search Scheme for Rate-Distortion Optimization of Variable Block Size and Multi-frame Motion Estimation", IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Puerto Rico, 2006.
[12] Xuan Jing and Lap-Pui Chau, "Partial Distortion Search Algorithm Using Predictive Search Area for Fast Full-Search Motion Estimation", IEEE Signal Processing Letters, Vol. 14, Nov. 2007.
[13] Lung-Chun Chang, Kuo-Liang Chung and Tsung-Cheng Yang, "An improved search algorithm for motion estimation using adaptive search order", IEEE Signal Processing Letters, Vol. 8, Issue 5, May 2001.
[14] Tian Song, K. Ogata, K. Saito and T. Shimamoto, "Adaptive Search Range Motion Estimation Algorithm for H.264/AVC", IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, USA, May 2007.
[15] I.E.G. Richardson, "H.264 and MPEG-4 Video Compression", John Wiley & Sons Ltd., 2003.
[16] F. Crow, "Summed-area tables for texture mapping", Proceedings of SIGGRAPH, Vol. 18(3), pp. 207-212, 1984.
[17] V.A. Nguyen and Y.P. Tan, "Efficient Block-Matching Motion Estimation Based on Integral Frame Attributes", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 3, March 2006, pp. 375-385.
[18] C.H. Cheung and L.M. Po, "Novel Cross-Diamond-Hexagonal Search Algorithms for Fast Block Motion Estimation", IEEE Trans. Multimedia, Vol. 7, No. 1, pp. 16-22, Feb. 2005.
[19] Y.K. Tu, J.F. Yang, M.T. Sun and Y.T. Tsai, "Fast variable-size block motion estimation for efficient H.264/AVC encoding", Signal Processing: Image Communication, Vol. 20, pp. 595-623, 2005.
[20] L. Yang, K. Yu, J. Li and S. Li, "Prediction-based directional fractional pixel motion estimation for H.264 video coding", Proc. ICASSP, pp. II901-II904, 2005.


[21] J. Stottrup-Andersen, S. Forchhammer and S.M. Aghito, "Rate-distortion-complexity optimization of fast motion estimation in H.264/MPEG-4 AVC", Proc. ICIP 2004, Singapore, Oct. 24-27, 2004.
[22] H.C. Fei, C.J. Chen and S.H. Lai, "Enhanced downhill simplex search for fast video motion estimation", PCM 2005, Part I, LNCS 3767, pp. 512-523, 2005.
[23] X.Q. Banh and Y.P. Tan, "Adaptive dual cross search algorithm for block-matching motion estimation", IEEE Trans. Consumer Electronics, Vol. 50, No. 2, pp. 766-775, May 2004.
[24] P.Y. Burgi, "Motion estimation based on the direction of intensity gradient", Image and Vision Computing, Vol. 22, pp. 637-653, 2004.
[25] A. Tourapis, O.C. Au and M.L. Liou, "Highly efficient predictive zonal algorithm for fast block-matching motion estimation", IEEE Trans. Circuits and Systems for Video Technology, Vol. 12, pp. 934-947, Oct. 2002.
[26] Z. Chen, P. Zhou and Y. He, "Fast integer and fractional pel motion estimation for JVT", JVT-F017r.doc, Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, 6th Meeting, Awaji Island, Japan, Dec. 5-13, 2002.
[27] X. Yi, J. Zhang, N. Ling and W. Shang, "Improved and simplified fast motion estimation for JM", Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 16th Meeting, Poznan, Poland, 24-29 July 2005.
[28] P. Yin, A.M. Tourapis and J.M. Boyce, "Fast mode decision and motion estimation for JVT/H.264", in Proc. ICIP (3), 2003, pp. 853-856.
[29] Hung-Ju Li, Ching-Ting Hsu and Mei-Juan Chen, "Fast Multiple Reference Frame Selection Method for Motion Estimation in JVT/H.264", IEEE Asia-Pacific Conference on Circuits and Systems, 6-9 Dec. 2004.
[30] A. Chang, O.C. Au and Y.M. Yeung, "A Novel Approach To Fast Multi-Frame Selection For H.264 Video Coding", 2003 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 3, 6-10 April 2003.
[31] J-S Sohn and D-G Kim, "Fast Multiple Reference Frame Selection Method Using Correlation of Sequence in JVT/H.264", IEICE Trans. Fundamentals, Vol. E89-A, No. 3, March 2006.
[32] C.W. Ting, L.M. Po and C.H. Cheung, "Center-biased frame selection algorithms for fast multi-frame motion estimation in H.264", Proc. of the 2003 Int. Conf. on Neural Networks and Signal Processing, Vol. 2, 14-17 Dec. 2003.


[33] S.K. Kapotas and A.N. Skodras, "A New Spatio-Temporal Predictor for Motion Estimation in H.264 Video Coding", 8th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2007), Santorini, Greece, 6-8 June 2007.
[34] Z. Chen and K.N. Ngan, "Recent Advances in Rate Control for Video Coding", Signal Processing: Image Communication, Elsevier, Vol. 22, pp. 19-38, 2007.
[35] A. Ahmad, D. Chen and S. Lee, "Robust compressed domain object detection in MPEG videos", Proceedings of Internet and Multimedia Systems and Applications, August 2003.
[36] O. Sukmarg and K. Rao, "Fast object detection and segmentation in MPEG compressed domain", Proceedings of the IEEE Region 10 Technical Conference, September 2000.
[37] R. Wang, H. Zhang and Y. Zhang, "A confidence measure based moving object extraction system for compressed domain", IEEE International Symposium on Circuits and Systems, pp. 21-24, May 2000.
[38] R. Venkatesh Babu and K. Ramakrishnan, "Compressed domain motion segmentation for video object extraction", IEEE International Conference on Acoustics, Speech and Signal Processing, 4:3788-3791, May 2002.
[39] Z. Wei, D. Jun, G. Wen and H. Qingming, "Robust moving object segmentation on H.264/AVC compressed video using the block-based MRF model", Real-Time Imaging, 11(4):290-299, 2005.
[40] M. Ibrahim and S. Rao, "Motion Analysis in Compressed Video - A Hybrid Approach", IEEE Workshop on Motion and Video Computing (WMVC), February 2007.
[41] R. Babu, K.R. Ramakrishnan and S.H. Srinivasan, "Video object segmentation: a compressed domain approach", IEEE Transactions on Circuits and Systems for Video Technology, 14(4):462-474, 2004.
[42] J.J. Chae and B.S. Manjunath, "Data Hiding in Video", IEEE Proc. Int. Conf. on Image Processing, pp. 243-246, 1999.
[43] V. Fotopoulos and A.N. Skodras, "Transform Domain Watermarking: Adaptive Selection of the Watermark's Position and Length", Proc. Visual Communications and Image Processing, VCIP 2003, July 2003.
[44] A. Sarkar, U. Madhow, S. Chandrasekaran and B.S. Manjunath, "Adaptive MPEG-2 Video Data Hiding Scheme", Proc. SPIE Security, Steganography and Watermarking of Multimedia Contents IX, Jan. 2007.


[45] H. Liu, J. Huang and Y.Q. Shi, "DWT-Based Video Data Hiding Robust to MPEG Compression and Frame Loss", Int. Journal of Image and Graphics, Vol. 5, No. 1, pp. 111-134, Jan. 2005.
[46] J. Zhang, J. Li and L. Zhang, "Video Watermark Technique in Motion Vector", Proc. of XIV Symposium on Computer Graphics and Image Processing, pp. 179-182, Oct. 2001.
[47] Y. Bodo, N. Laurent and J.L. Dugelay, "Watermarking Video, Hierarchical Embedding in Motion Vectors", Proc. Int. Conference on Image Processing, Sept. 2003.
[48] D.Y. Fang and L.W. Chang, "Data Hiding for Digital Video with Phase of Motion Vector", IEEE Proc. Int. Symposium on Circuits and Systems, ISCAS 2006, May 2006.
[49] M. Noorkami and R.M. Mersereau, "Towards Robust Compressed-Domain Video Watermarking for H.264", Proc. SPIE, Vol. 6072, pp. 489-497, 2006.
[50] H. Cao, J. Zhou and S. Yu, "An Implement of Fast Hiding Data into H.264 Bitstream based on Inter-Prediction Coding", Proc. SPIE, Vol. 6043, pp. 123-130, 2005.
[51] D. Proefrock, H. Richter, M. Schlauweg and E. Mueller, "H.264/AVC Video Authentication Using Skipped Macroblocks for an Erasable Watermark", Proc. SPIE, Vol. 5960, pp. 1480-1489, 2005.
[52] S. Chen, M. Shyu, C. Zhang and R.L. Kashyap, "Video scene change detection method using unsupervised segmentation and object tracking", IEEE International Conference on Multimedia and Expo (ICME), 2001, pp. 57-60.
[53] S. Han, "Shot detection combining bayesian and structural information", Storage and Retrieval for Media Databases 2001, Vol. 4315, December 2001, pp. 509-516.
[54] J. Oh, K.A. Hua and N. Liang, "A content-based scene change detection and classification technique using background tracking", IS&T/SPIE Conference on Multimedia Computing and Networking 2000, San Jose, CA, January 2000, pp. 254-265.
[55] J. Bescos, "Real-time shot change detection over online MPEG-2 video", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 4, April 2004, pp. 475-484.


[56] C. Dulaverakis, S. Vagionitis, M. Zervakis and E. Petrakis, "Adaptive methods for motion characterization and segmentation of MPEG compressed frame sequences", ICIAR 2004, Porto, Portugal, September 29 - October 1, 2004, pp. 310-317.
[57] E. Saez, J.I. Benavides and N. Guil, "Reliable real time scene change detection in MPEG compressed video", ICME 2004, Taipei, June 2004, pp. 567-570.
[58] W. Fernando, C. Canagarajah and D. Bull, "Scene change detection algorithms for content-based video indexing and retrieval", Electronics & Communication Engineering Journal, June 2001, pp. 117-126.
[59] D. Robie and R. Mersereau, "Video error correction using steganography", EURASIP Journal on Applied Signal Processing, Feb. 2002, pp. 164-173.
[60] Y.J. Jung, H.K. Kang and Y.M. Ro, "Metadata hiding for content adaptation", Int. Workshop on Digital Watermarking (IWDW), 2003, pp. 456-467.
[61] Sung Min Kim, Sang Beom Kim, Youpyo Hong and Chee Sun Won, "Data hiding on H.264/AVC compressed video", Image Analysis and Recognition, Springer, 2007.
[62] R.C. Gonzalez and R.E. Woods, "Digital Image Processing, Second Edition", Prentice Hall, ISBN 0-201-18075-8.
[63] Y. Hu, C. Zhang and Y. Su, "Information Hiding Based on Intra Prediction Modes for H.264/AVC", IEEE International Conference on Multimedia and Expo (ICME), Beijing, China, July 2-5, 2007.
[64] W. Bender, D. Gruhl and N. Morimoto, "Techniques for data hiding", Technical Report, Massachusetts Institute of Technology Media Lab, 1994.
[65] J. Xin, C.W. Lin and M.T. Sun, "Digital video transcoding", Proceedings of the IEEE, Vol. 93, Issue 1, pp. 84-97, January 2005.
[66] A. Vetro, C. Christopoulos and H. Sun, "Video transcoding architectures and techniques: An overview", IEEE Signal Processing Magazine, Vol. 20, No. 2, pp. 18-29, March 2003.
[67] D. Lefol, D. Bull and N. Canagarajah, "Performance evaluation of transcoding algorithms for H.264", IEEE Trans. Consumer Electronics, Vol. 52, Issue 1, pp. 215-222, February 2006.
[68] S.K. Bandyopadhyay, Z. Wu, P. Pandit and J.M. Boyce, "An error concealment scheme for entire frame losses for H.264/AVC", Proc. IEEE Sarnoff Symposium, Mar. 2006.


[69] S. De Bruyne, W. De Neve, K. De Wolf, D. De Schrijver, P. Verhoeve and R. Van de Walle, "Temporal video segmentation on H.264/AVC compressed bitstreams", Lecture Notes in Computer Science, Vol. 4351, pp. 1-12, Springer, Berlin, 2007.
[70] S.M. Kim, J. Byun and C. Won, "A scene change detection in H.264/AVC compression domain", Proc. PCM, Korea, 2005.
[71] W. Zeng and W. Gao, "Shot Change Detection on H.264/AVC compressed video", Proc. IEEE ISCAS, Kobe, Japan, 2005.
[72] MSU Video Quality Measurement Tool.
[73] Elecard AVC HD player, http://www.elecard.com/
[74] F. Pereira and T. Ebrahimi, "The MPEG-4 Book", IMSC Press, Prentice Hall PTR, 2002.
[75] I. Ahmad, X. Wei, Y. Sun and Y. Zhang, "Video Transcoding: An Overview of Various Techniques and Research Issues", IEEE Transactions on Multimedia, Vol. 7, No. 5, October 2005.


Appendices


Appendix I. Metrics

Various metrics, objective and subjective, were used in order to evaluate the performance of the proposed methods. The evaluation was done by comparing the reference encoder provided by the JVT with a modified one. Refer to Appendix II for more details about the methodology that was followed.

I.I. OBJECTIVE METRICS

I.i.i. Bit Rate (bits/sec)

The bit rate refers to the generated bit rate after H.264 compression has been applied to a raw video sequence.

I.i.ii. Bit Rate Variation (%)

This is a comparative metric, which compares the reference H.264 encoder with a modified H.264 encoder in terms of the generated bit rate. The Bit Rate Variation is calculated as follows:

    \Delta R = \frac{R' - R}{R} \times 100\ (\%)    (I-1)

where R is the bit rate of the reference encoder and R' is the bit rate of the modified encoder. A negative \Delta R means that the modified encoder generates fewer bits in the output, i.e. it outperforms the reference encoder. For example, R = 150 kb/s and R' = 140 kb/s give \Delta R \approx -6.7\%.

I.i.iii. Encoding Time (sec)

The Encoding Time refers either to the total encoding time or only to the motion estimation time. Throughout the document, the Encoding Time is assumed to be the total encoding time unless it is explicitly stated to be the motion estimation time.


I.i.iv. Encoding Time Variation (%)

This is a comparative metric, which compares the reference H.264 encoder with a modified H.264 encoder in terms of the Encoding Time. The Encoding Time Variation is calculated as follows:

    \Delta T = \frac{T' - T}{T} \times 100\ (\%)    (I-2)

where T is the Encoding Time of the reference encoder and T' is the Encoding Time of the modified encoder. A negative \Delta T means that the modified encoder takes less time to encode a sequence than the reference encoder, i.e. it outperforms the reference encoder.

I.i.v. PSNR/APSNR (dB)

The PSNR/APSNR metrics are used to evaluate the visual quality of the compressed video. The PSNR is calculated as follows:

    PSNR(n) = 10 \log_{10} \frac{MaxError^2 \times w \times h}{\sum_{i=0}^{w-1} \sum_{j=0}^{h-1} (x_{i,j} - y_{i,j})^2}    (I-3)

where MaxError is the maximum possible absolute value of the color component difference (255 for 8-bit color components), w and h are the width and height of the video frames, and x, y are the pixel luma values of the decoded frames of the original bitstream and of the modified bitstream, respectively.

The APSNR is calculated as follows:

    APSNR = \frac{\sum_{n=1}^{N_t} PSNR(n)}{N_t}    (I-4)

where PSNR(n) is the outcome of eq. (I-3) and N_t is the total number of frames in the video sequence.
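A minimal C sketch of eqs. (I-3) and (I-4) for 8-bit luma frames follows; the frame buffers and dimensions are caller-supplied assumptions, and the 100 dB cap for identical frames mirrors the convention used in Section 5.4.4.4.

    #include <math.h>

    /* PSNR of one frame per eq. (I-3): x = original decode, y = modified
     * decode, both w*h 8-bit luma planes (MaxError = 255). */
    double psnr_frame(const unsigned char *x, const unsigned char *y,
                      int w, int h)
    {
        double sse = 0.0;
        for (int k = 0; k < w * h; k++) {
            double d = (double)x[k] - (double)y[k];
            sse += d * d;
        }
        if (sse == 0.0)
            return 100.0;  /* cap for identical frames */
        return 10.0 * log10(255.0 * 255.0 * w * h / sse);
    }

    /* APSNR per eq. (I-4): average of the per-frame PSNR values. */
    double apsnr(const double *psnr_values, int n_frames)
    {
        double sum = 0.0;
        for (int n = 0; n < n_frames; n++)
            sum += psnr_values[n];
        return sum / n_frames;
    }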


I.i.vi. PSNR Variation (dB)

The PSNR Variation is calculated as follows:

    \Delta PSNR = APSNR' - APSNR    (I-5)

where APSNR is the outcome of eq. (I-4) for the reference encoder, whilst APSNR' refers to the modified encoder. A positive \Delta PSNR means that the modified encoder results in better visual quality than the reference encoder, i.e. it outperforms the reference encoder.

I.II. SUBJECTIVE METRICS

I.ii.i. Mean Opinion Score (MOS)

The perception of visual quality is influenced by spatial fidelity (how clearly parts of the scene can be seen, whether there is any obvious distortion) and temporal fidelity (whether motion appears natural and 'smooth'). However, a viewer's opinion of 'quality' is also affected by other, subjective, factors such as the viewing environment, the observer's state of mind and the extent to which the observer interacts with the visual scene.

As a subjective quality metric, the mean opinion score (MOS) was employed, which provides a numerical indication of the perceived quality of the video stream. MOS is expressed as a single number in the range 1 to 5, where 1 is the lowest perceived video quality and 5 is the highest. MOS is generated by averaging the results of a set of subjective tests in which a number of viewers rate the transcoded video using the rating scheme shown in Appendix-Table I-1.

Appendix-Table I-1: Mean opinion score.

MOS  Quality    Impairment
5    Excellent  Imperceptible
4    Good       Perceptible but not annoying
3    Fair       Slightly annoying
2    Poor       Annoying
1    Bad        Very annoying


Appendix II. Simulation environment

II.I. HARDWARE

The simulation tests were executed under Windows XP on an Intel T2400 CPU at 1.83 GHz with 1.50 GB RAM.

II.II. SOFTWARE

All of the proposed methods were implemented in standard C and incorporated into the reference H.264 code provided by the JVT.

II.III. TESTING SEQUENCES

Several video sequences in YUV 4:2:0 format and of various resolutions, QCIF (176x144), CIF (352x288) and SIF (352x240), were tested. The testing sequences are also separated into the following classes:

Class A  Low spatial detail and low amount of movement
Class B  Medium spatial detail and low amount of movement or vice versa
Class C  High spatial detail and medium amount of movement or vice versa
Class D  Stereoscopic (out of scope)
Class E  Hybrid natural and synthetic (out of scope)

Using a combination of sequences from all of the above classes gives an indication of the generality of the method under test. This indication is important because, in many cases, a method which is good for a Class A sequence gives poor results for a Class C sequence, and vice versa. Appendix-Table II-1 presents the classification of some of the best-known video sequences.

Appendix-Table II-1: Classification of video sequences.

Class A            Class B  Class C            Class D
Mother & daughter  Foreman  Table Tennis       Children
Akiyo              News     Stefan             Bream
Hall Monitor       Silent   Mobile & Calendar  Weather
Container Ship     Paris    Tempete


II.IV. METHODOLOGY

The reference H.264 codec was used as a basis. Each method was embedded into the reference H.264 code seamlessly; the data flow was directed to the embedded code in such a way that no other part of the codec was affected. Then both the reference H.264 codec and the modified one were run successively using the same configuration. The performance of each method was evaluated by comparing the results of the two runs, using the metrics described in Appendix I. The metrics' input parameters, Bit Rate, Encoding Time and PSNR, were obtained from the intrinsic logging mechanism of the H.264 codec. Appendix-Table II-2 presents the log file (log.dat) generated by the H.264 encoder.

Appendix-Table II-2: H.264 reference encoder's log file.

Name        Format                 Purpose
Ver         W.X/Y.Z                Encoder Version
Date        MM/DD                  Simulation End Date
Time        HH:MM                  Simulation End Time
Sequence    %30.30s                Sequence Name
#Img        %5d                    Coded Primary Frames
P/MbInt     %d/%d                  Picture level / Macroblock level
QPI         %-3d                   I slice Quantizer
QPP         %-3d                   P slice Quantizer
QPB         %-3d                   B slice Quantizer
Format      %4dx%4d                Width x Height
Iperiod     %3d                    Intra Period
#B          %3d                    Number of B coded frames
FMES        FS|FFS|HEX|SHEX|EPZS   Fast Motion Estimation usage
Hdmd        %1d%1d%1d              Distortion functions for Motion estimation
S.R         %3d                    Maximum Search Range
#Ref        %2d                    Maximum number of references
Freq        %3d                    Coded Video Frame Rate
Coding      CABAC|CAVLC            Entropy Mode Used
RD-opt      %d                     Rate Distortion Optimization Option
Intra upd   ON|OFF                 Use of MbLineIntraUpdate
8x8Tr       %d                     Mode usage of 8x8 transform
SNRY 1      %-5.3f                 Luma PSNR for first frame in sequence
SNRU 1      %-5.3f                 Chroma U PSNR for first frame in sequence
SNRV 1      %-5.3f                 Chroma V PSNR for first frame in sequence
SNRY N      %-5.3f                 Luma PSNR for entire sequence
SNRU N      %-5.3f                 Chroma U PSNR for entire sequence
SNRV N      %-5.3f                 Chroma V PSNR for entire sequence
#Bitr I     %6.0f                  Bitrate assigned to I coded frames
#Bitr P     %6.0f                  Bitrate assigned to P coded frames
#Bitr B     %6.0f                  Bitrate assigned to B coded frames
#Bitr IPB   %6.0f                  Sequence Bitrate including overheads
Total Time  %12d                   Encoding Time in ms
Me Time     %12d                   Motion Estimation only time in ms
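As a worked example of the two-run comparison, the sketch below takes values as they would be read from the two log.dat files (Total Time, #Bitr IPB, SNRY N) and prints the comparative metrics of Appendix I; the variable names and the hard-coded sample values are illustrative assumptions, not measurements from the thesis.

    #include <stdio.h>

    /* Compare one reference run against one modified run using the
     * comparative metrics of eqs. (I-1), (I-2) and (I-5). */
    int main(void)
    {
        /* Sample values, standing in for fields parsed from log.dat. */
        double R = 152.3,  R_mod = 149.1;  /* #Bitr IPB (kb/s)        */
        double T = 85210,  T_mod = 41873;  /* Total Time (ms)         */
        double P = 36.42,  P_mod = 36.38;  /* SNRY N, used as APSNR   */

        printf("Delta R    = %+.2f %%\n", (R_mod - R) / R * 100.0); /* eq. (I-1) */
        printf("Delta T    = %+.2f %%\n", (T_mod - T) / T * 100.0); /* eq. (I-2) */
        printf("Delta PSNR = %+.2f dB\n", P_mod - P);               /* eq. (I-5) */
        return 0;
    }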
