
Novel Methods in H.264/AVC
Inter Prediction, Data Hiding, Bit Rate Transcoding

SPYRIDON K. KAPOTAS
June 25, 2011

HELLENIC OPEN UNIVERSITY
School of Science and Technology
Digital Systems & Media Computing Laboratory



Abstract

H.264 Advanced Video Coding became the dominant video coding standard in the market within a few years after the first version of the standard was completed by the ISO/IEC MPEG and the ITU-T VCEG groups in May 2003. That happened mainly due to the great coding efficiency of H.264: compared to MPEG-2, the previous dominant standard, the H.264 compression ratio is about twice as high for the same video quality. That makes H.264 ideal for numerous applications, such as video broadcasting, video streaming and video conferencing. However, the H.264 efficiency is achieved at the expense of the codec's complexity, which is about four times that of MPEG-2. As a consequence, many video coding issues which had been addressed in previous standards need to be reconsidered. For example, encoding a video in real time with H.264 is now an open issue. Re-applying older solutions is feasible but insufficient, because the new H.264 characteristics are not taken into account and thus the problems caused by these characteristics are not properly addressed. On the other hand, these characteristics make possible a series of applications that either were not possible or showed inferior results prior to the H.264 era.

This dissertation investigates novel methods which take advantage of the new characteristics introduced by H.264. These methods fall into two categories, namely enhancements and applied methods. The goal of the enhancements is to improve the performance of the H.264 encoder by reducing its complexity. We focused on the inter prediction part of the encoder. Three representative methods of this category are introduced: a fast full search algorithm, which reduces the motion estimation time by 53.7%; a predictor, which optimizes the search area during the motion estimation; and a reference frame selector, which reduces the motion estimation time by 80% by reducing the number of reference frames used during the motion estimation. The applied methods, on the other hand, exploit the special H.264 characteristics in order to improve their performance. Two data hiding methods are introduced, which result in a high capacity of hidden data, e.g. 18 Kbits of data in 10 sec (30 fps) of video.
In particular, the data hiding methods opened new directions in the research of data hiding in video, not only because of their unique capabilities (high data capacity, real-time operation, reusability of the marked streams, etc.) but also because they moved the cost of the hidden data from the PSNR to the bit rate, in contrast to all previously existing methods. In addition to the data hiding methods, a bit rate transcoder, which controls the bit rate directly in the compressed domain, is also introduced. Finally, a moving object detection method and a scene change detection method complete the repertoire of the applied methods.


Περίληψη (Abstract in Greek)

The H.264 video coding standard dominated the market within a few years after its first version was completed by the MPEG and VCEG working groups of ISO and ITU respectively, in May 2003. This is mainly due to the effectiveness of H.264 in video coding. Characteristically, compared to MPEG-2, the previous dominant standard, the compression ratio achieved by H.264 is double for the same video quality. This makes H.264 ideal for many applications, such as television broadcasts, video streaming and video conferencing. However, the effectiveness of H.264 comes at the expense of the complexity of the encoder. The complexity of the H.264 encoder is about four times that of MPEG-2. Consequently, many problems in encoding, which had been dealt with in previous standards, must be reconsidered. For example, encoding a video in real time is now an open issue. Older solutions are feasible but insufficient, because the new H.264 characteristics are not taken into account and thus the problems caused by these characteristics are not dealt with effectively. On the other hand, these characteristics make possible applications which either were not feasible or showed poor results before the advent of H.264.

This dissertation aims at the investigation of new methods which benefit from the new characteristics of H.264. These methods are divided into two categories: enhancements and applied methods. The goal of the enhancements is to improve the performance of the H.264 encoder by reducing its complexity. We focused our attention on the part of the encoder concerned with temporal (inter) prediction. Three representative methods of this category were developed: a fast full search algorithm, which reduces the motion estimation time by 53.7%; a method which optimizes the search area during motion estimation; and a reference frame selector, which reduces the motion estimation time by 80% by reducing the number of reference frames used during motion estimation. The applied methods, on the other hand, exploit the special H.264 characteristics in order to improve their performance. Two data hiding methods were developed, which lead to a high capacity of hidden data, e.g. 18 Kbits of data in 10 seconds (30 fps) of video. In particular, the data hiding methods open new directions in the research field of data hiding in video, not only because of their unique capabilities (high data capacity, reusability of the bitstreams, real-time operation, etc.), but also because they moved the cost of data hiding from the PSNR to the bit rate, in contrast to the already existing methods. A transcoding technique was also developed, which controls the bit rate directly in the compressed domain. Finally, a moving object detection method and a scene change detection method complete the repertoire of the applied methods.


Submitted in total fulfillment of the requirements of the degree of
Doctor of Philosophy
June 25, 2011


Examination Committee

Athanassios Skodras*, Professor of Hellenic Open University, Greece.
Athanassios Stouraitis*, Professor of University of Patras, Greece.
Stefanos Kollias*, Professor of National Technical University of Athens, Greece.
Konstantinos Berberidis, Professor of University of Patras, Greece.
Vassilios Verykios, Associate Professor of Hellenic Open University, Greece.
George Economou, Associate Professor of University of Patras, Greece.
Emmanouil Psarakis, Assistant Professor of University of Patras, Greece.

* Member of the Advisory Committee


To Dora


Declaration

This is to certify that:
(i) the dissertation comprises only my original work towards the PhD, except where indicated;
(ii) due acknowledgement has been made in the text to all other material used.


Acknowledgements

I owe my deepest gratitude to my supervisor, Professor Skodras, a truly inspired teacher, whose encouragement and support enabled me to complete my research. I am also grateful to my family for their support and for being patient with me over the last five years.


CONTENTS

GLOSSARY
PUBLICATIONS

1 INTRODUCTION
  1.1 Motivation and goals
  1.2 Structure of the dissertation

2 OVERVIEW OF H.264
  2.1 Introduction
  2.2 Terminology
  2.3 Profiles and levels
  2.4 Coded Data Format
  2.5 Reference Pictures
  2.6 Slices
  2.7 Macroblocks
  2.8 Technical overview
    2.8.1 Encoder (forward path)
    2.8.2 Encoder (reconstruction path)
    2.8.3 Decoder

3 INTER PREDICTION
  3.1 Introduction
  3.2 Problem formulation
    3.2.1 Inter prediction complexity
    3.2.2 Special video applications
  3.3 Solutions
  3.4 Fast Successive Elimination Algorithm
    3.4.1 Literature review
    3.4.2 Full search in H.264 reference encoder
    3.4.3 Fast SEA
    3.4.4 Two-level motion estimation
    3.4.5 Simulation results
    3.4.6 Conclusions
  3.5 Spatio-Temporal Predictor for Motion Estimation
    3.5.1 Literature review
    3.5.2 Effectiveness of the EPZS predictors
    3.5.3 Spatio-temporal predictor
    3.5.4 Simulation results
    3.5.5 Conclusions
  3.6 Fast Multiple Reference Frame Selection
    3.6.1 Literature review
    3.6.2 Multiple Reference Frame in H.264
    3.6.3 Frame selection method
    3.6.4 Simulation results
    3.6.5 Conclusions
  3.7 Moving Object Detection in the Compressed Domain
    3.7.1 Literature review
    3.7.2 Moving object detection in the compressed domain
    3.7.3 Simulation results
    3.7.4 Further improvements
    3.7.5 Conclusions

4 DATA HIDING
  4.1 Introduction
  4.2 Problem formulation
  4.3 Solutions
  4.4 Data Hiding during the inter-prediction
    4.4.1 Literature review
    4.4.2 Data hiding method
    4.4.3 Simulation results
    4.4.4 Message extractor
    4.4.5 Further improvements
    4.4.6 Conclusions
    4.4.7 Application based on this method: A Data Hiding Scheme for Scene Change Detection
  4.5 Real Time Data Hiding by Exploiting the I_PCM Macroblocks
    4.5.1 Literature review
    4.5.2 Intra mode prediction in H.264
    4.5.3 Real time Data Hiding
    4.5.4 Simulation results
    4.5.5 Message extractor
    4.5.6 Further improvements
    4.5.7 Conclusions

5 BITRATE TRANSCODING
  5.1 Introduction
  5.2 Problem formulation
  5.3 Solution
  5.4 Bit Rate Transcoding by Dropping Frames in the Compressed Domain
    5.4.1 Literature review
    5.4.2 Main concepts
    5.4.3 Bit Rate Transcoder
    5.4.4 Simulation results
    5.4.5 Further improvements
    5.4.6 Conclusions

6 EPILOGUE
  6.1 Contribution
    6.1.1 Inter prediction
    6.1.2 Data hiding
    6.1.3 Bitrate transcoding
  6.2 Further improvements

REFERENCES
APPENDICES


Glossary

4:2:0 (sampling): Sampling method in which chrominance components have half the horizontal and vertical resolution of the luminance component
Arithmetic coding: Coding method to reduce statistical redundancy
Artifact: Visual distortion in an image
Block: Region of a macroblock (8x8 or 4x4) for transform purposes
Block matching: Motion estimation carried out on rectangular picture areas
B-picture (slice): Coded picture (slice) predicted using bidirectional motion compensation
CABAC: Context-based Adaptive Binary Arithmetic Coding
CAVLC: Context Adaptive Variable Length Coding
CCTV: Closed-circuit television
Chrominance: Color difference component
CIF: Common Intermediate Format, a color image format
CODEC: COder / DECoder pair
Color space: Method of representing color images
DCT: Discrete Cosine Transform
DFT: Discrete Fourier Transform
DWT: Discrete Wavelet Transform
Entropy coding: Coding method to reduce redundancy
Error concealment: Post-processing of a decoded image to remove or reduce visible error effects
Field: Odd- or even-numbered lines from an interlaced video sequence
FMO: Flexible Macroblock Order, in which macroblocks may be coded out of raster sequence
FPS: Frame rate (Frames Per Second)
Full Search: A motion estimation algorithm
GOP: Group Of Pictures, a set of coded video images
H.261: A video coding standard
H.263: A video coding standard
H.264: A video coding standard
HDTV: High Definition Television
Huffman coding: Coding method to reduce redundancy
HVS: Human Visual System, the system by which humans perceive and interpret visual images
Hybrid (CODEC): CODEC model featuring motion compensation and transform
IEC: International Electrotechnical Commission, a standards body
IDR: Instantaneous Decoding Refresh, a picture which causes the decoding process to mark all reference pictures as "unused for reference" immediately after the decoding of the IDR picture
Inter (coding): Coding of video frames using temporal prediction or compensation
Interlaced (video): Video data represented as a series of fields
Intra (coding): Coding of video frames without temporal prediction
IPCM or I_PCM: A macroblock which is neither predicted nor quantized; it is subject only to entropy encoding
I-picture (slice): Picture (or slice) coded without reference to any other frame
ISO: International Standards Organization, a standards body
ITU: International Telecommunication Union, a standards body
JPEG: Joint Photographic Experts Group, a committee of ISO (also an image coding standard)
JPEG2000: An image coding standard
JVT: Joint Video Team, consisting of experts from VCEG and MPEG
Loop filter: Spatial filter placed within the encoding or decoding feedback loop
Macroblock (MB): Region of a frame coded as a unit (usually 16x16 pixels in the original frame)
Macroblock partition: Region of a macroblock with its own motion vector (H.264)
Macroblock sub-partition: Region of a macroblock partition with its own motion vector (H.264)
Media processor: Processor with features specific to multimedia coding and processing
MOS: Mean Opinion Score, a subjective quality metric
Motion compensation: Prediction of a video frame with modeling of motion
Motion estimation: Estimation of relative motion between two or more video frames
Motion vector (MV): Vector indicating a displaced block or region to be used for motion compensation
MPEG: Motion Picture Experts Group, a committee of ISO/IEC
MPEG-1: A multimedia coding standard
MPEG-2: A multimedia coding standard
MPEG-4: A multimedia coding standard
MSEA: Multilevel Successive Elimination Algorithm, a full search algorithm
NAL: Network Abstraction Layer
Objective quality: Visual quality measured by algorithm(s)
Picture (coded): Coded (compressed) video frame
POC: Picture Order Count, a number that keeps the ordering of the pictures and the values of samples in the decoded pictures isolated from timing information
P-picture (slice): Coded picture (or slice) using motion-compensated prediction from one reference frame
Profile: A set of functional capabilities (of a video CODEC)
Progressive (video): Video data represented as a series of complete frames
PSNR: Peak Signal to Noise Ratio, an objective quality measure
QCIF: Quarter Common Intermediate Format, a color image format
Quantize: Reduce the precision of a scalar or vector quantity
QP: Quantization Parameter
Rate control: Control of the bit rate of an encoded video signal
Rate-distortion: Measure of CODEC performance (distortion at a range of coded bit rates)
RBSP: Raw Byte Sequence Payload
RGB: Red/Green/Blue color space
RTP: Real Time Protocol, a transport protocol for real-time data
RVLC: Reversible Variable Length Code
SEA: Successive Elimination Algorithm, a full search algorithm
SI slice: Intra-coded slice used for switching between coded bitstreams (H.264)
SIF: Source Input Format, a color image format
Slice: A region of a coded picture
SP slice: Inter-coded slice used for switching between coded bitstreams (H.264)
VCEG: Video Coding Experts Group of ITU-T


Publications

In the context of our research we published eight papers in journals and international conferences:

[1] S. Kapotas and A.N. Skodras, "Bit Rate Transcoding of H.264 Encoded Movies by Dropping Frames in the Compressed Domain", IEEE Transactions on Consumer Electronics, vol. 56, no. 3, pp. 1593-1601, 2010.
[2] S. Kapotas and A.N. Skodras, "Rate Control of H.264 Encoded Sequences by Dropping Frames in the Compressed Domain", 20th Int. Conference on Pattern Recognition (ICPR 2010), Istanbul, Turkey, 23-26 Aug. 2010.
[3] S. Kapotas and A.N. Skodras, "Moving Object Detection in the H.264 Compressed Domain", IEEE International Conference on Imaging Systems and Techniques (IST 2010), Thessaloniki, Greece, 1-2 July 2010.
[4] S. Kapotas and A.N. Skodras, "Real Time Data Hiding by Exploiting the IPCM Macroblocks in H.264/AVC Streams", Journal of Real-Time Image Processing, vol. 4, no. 1, pp. 33-41, Mar. 2009.
[5] S. Kapotas and A.N. Skodras, "A New Data Hiding Scheme for Scene Change Detection in H.264 Encoded Video Sequences", IEEE International Conference on Multimedia & Expo (ICME 2008), Hannover, Germany, 23-26 June 2008.
[6] S. Kapotas and A.N. Skodras, "Fast Multiple Reference Frame Selection Method in H.264 Video Encoding", 26th Picture Coding Symposium (PCS 2007), Lisbon, Portugal, 7-9 Nov. 2007.
[7] S. Kapotas, E.E. Varsaki and A.N. Skodras, "Data Hiding in H.264 Encoded Video Sequences", 2007 IEEE Int. Workshop on Multimedia Signal Processing, Chania, Greece, 1-3 Oct. 2007.
[8] S. Kapotas and A.N. Skodras, "A New Spatio-Temporal Predictor for Motion Estimation in H.264 Video Coding", 8th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2007), Santorini, Greece, 6-8 June 2007.


1 Introduction

1.1 MOTIVATION AND GOALS

H.264/AVC, the latest standard for video coding, is the result of the collaboration between the ISO/IEC Moving Picture Experts Group and the ITU-T Video Coding Experts Group. The goals of this standardization effort were enhanced compression efficiency and a network-friendly video representation for interactive (video telephony) and non-interactive applications (broadcast, streaming, storage, video on demand). H.264/AVC provides gains in compression efficiency of up to 50% over a wide range of bit rates and video resolutions compared to previous standards. However, the H.264/AVC complexity is about four times that of MPEG-2. As a consequence, video coding issues which were considered to have been resolved by previous standards (H.263 and MPEG-2), such as the encoding speed, need to be reconsidered in the case of H.264/AVC. Re-applying older solutions is feasible but insufficient, because the special H.264/AVC characteristics are not taken into account and thus the problems caused by these characteristics are not properly addressed. On the other hand, the new H.264/AVC characteristics, also referred to as coding tools, make possible a series of applications that either were not possible or showed inferior results prior to the H.264/AVC era.

This dissertation aims at investigating methods which take advantage of the special characteristics introduced by H.264/AVC. We classify the methods in two categories, namely enhancements and applied methods. The goal of the enhancements is to improve the performance of the H.264/AVC encoder by reducing its complexity. We focused on the inter prediction part of the encoder with respect to its complexity, i.e. the proposed methods reduce the time that the inter prediction takes. The applied methods propose techniques, such as data hiding and bit rate transcoding techniques, which exploit the special H.264/AVC characteristics in order to improve their performance. Finally, this work shows how some of the new H.264/AVC characteristics can be exploited on behalf of the consumer.

1.2 STRUCTURE OF THE DISSERTATION

This dissertation is mainly separated into three research parts, namely Inter Prediction, Data Hiding and Bitrate Transcoding, which are described in three chapters. Each chapter is treated independently, i.e. it has its own literature review and research sections. As a matter of fact, each chapter is a set of methods which justify our motivations. The inter prediction methods fall into the enhancements category, whilst the data hiding and the bit rate transcoding methods fall into the applied methods category. More specifically, the dissertation has the following structure:

In Chapter 2 we give a brief overview of the H.264/AVC standard. In Chapters 3, 4 and 5 we present the Inter Prediction, the Data Hiding and the Bit Rate Transcoding research parts respectively. Each chapter begins by describing the problem that we are going to deal with (Problem Formulation). Then we enumerate the proposed solution(s)/method(s), which can be integrated into the H.264/AVC codec. The methods are described in the sections and sub-sections that follow. Each section begins by reviewing the literature and then continues with the description of the proposed method. Whenever needed, a sub-section describes a part of interest of the H.264/AVC reference encoder, e.g. the full search method, reference frame selection, etc. Chapter 6 is the epilogue. There, we give a brief analysis of the methods and their achievements, as well as the potential improvements. Finally, we evaluate the methods with respect to their contribution to the H.264/AVC field.

In Appendix I we present the metrics that we use in our simulation results, whilst in Appendix II we describe the simulation environment and the methodology that we followed during our tests.


2 Overview of H.264 (1)

2.1 INTRODUCTION

International study groups, VCEG (Video Coding Experts Group) of ITU-T (International Telecommunication Union, Telecommunication sector) and MPEG (Moving Picture Experts Group) of ISO/IEC, have researched video coding techniques for various applications of moving pictures since the early 1990s. ITU-T developed H.261 as the first video coding standard, for videoconferencing applications. The MPEG-1 video coding standard was established for storage on compact disc, and the MPEG-2 standard (adopted by ITU-T as H.262) for digital TV and HDTV, as an extension of MPEG-1. Also, to cover a very wide range of applications, with shaped regions of video objects as well as rectangular pictures, the MPEG-4 part 2 standard was developed. This also includes natural and synthetic video/audio combinations with interactivity built in. On the other hand, ITU-T developed H.263 in order to improve the compression performance of H.261, and the base coding model of H.263 was adopted as the core of some parts of MPEG-4 part 2. MPEG-1, 2 and 4 also cover audio coding. To provide better compression of video compared to previous standards, the H.264/MPEG-4 part 10 video coding standard, also known as H.264/AVC, was developed by the JVT (Joint Video Team), consisting of experts from VCEG and MPEG, in 2003.

Table 2-1 compares the compression ratio of H.264/AVC against older compression standards.

(1) Some text, figures and tables of Chapter 2 are copied from Chapter 6 of Iain Richardson's book "H.264 and MPEG-4 Video Compression" [15]. Courtesy of Prof. Richardson.


Table 2-1: Compression ratios to maintain excellent quality.

    Standard             Compression ratio
    JPEG                 10:1
    MPEG-2 / H.263       30:1
    MPEG-4 / H.264 AVC   50:1

H.264/AVC, hereafter H.264, achieves significant coding efficiency, simple syntax specifications and seamless integration of video coding into all current protocols and multiplex architectures. Thus, H.264 can support various applications like video broadcasting, video streaming and video conferencing over fixed and wireless networks and over different transport protocols. H.264 has the same basic functional elements as previous standards (MPEG-1, MPEG-2, MPEG-4 part 2, H.261 and H.263), i.e. transform for reduction of spatial correlation, quantization for bitrate control, motion compensated prediction for reduction of temporal correlation and entropy encoding for reduction of statistical correlation. However, to achieve better coding performance, the important changes in H.264 occur in the details of each functional element, by including intra-picture prediction, a new 4x4 integer transform, multiple reference pictures, variable block sizes and quarter-pel precision for motion compensation, a de-blocking filter and improved entropy coding. Improved coding efficiency comes at the expense of added complexity in the coder/decoder. Therefore, H.264 utilizes some coding tools (methods) to reduce the implementation complexity; e.g. a multiplier-free integer transform is introduced, and the multiplication operation for the exact transform is combined with the multiplication of quantization. Noisy channel conditions, as in wireless networks, obstruct the perfect reception of the coded video bitstream at the decoder. Incorrect decoding caused by the lost data degrades the subjective picture quality and propagates to the subsequent blocks or pictures. So, H.264 utilizes some tools for error resilience against network noise: the parameter setting, flexible macroblock ordering, switched slice and redundant slice methods are added to the data partitioning used in previous standards. For particular applications, H.264 defines Profiles and Levels specifying restrictions on bitstreams, like some of the previous video standards. Three Profiles are defined to cover the various applications from wireless networks to digital cinema. These are described in Section 2.3 in detail.
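As an aside, the multiplier-free 4x4 integer transform mentioned above can be written as Y = C X C^T, where every entry of C is +/-1 or +/-2, so it can be realized with additions and one-bit shifts only. The following is a minimal illustrative sketch, not the reference implementation; the post-scaling that H.264 folds into quantization is omitted here.

```python
# Sketch of the H.264 4x4 forward core transform, Y = C.X.C^T.
# Plain matrix products are used for clarity; real encoders replace
# them with add/shift networks since C contains only +/-1 and +/-2.
import numpy as np

C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward_core_transform(residual_4x4):
    X = np.asarray(residual_4x4)
    return C @ X @ C.T          # integer arithmetic throughout

# Toy 4x4 residual block:
print(forward_core_transform(np.arange(16).reshape(4, 4) - 8))
```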


2.2 TERMINOLOGY

Some of the important terminology adopted in the H.264 standard is as follows. A field (of interlaced video) or a frame (of progressive or interlaced video) is encoded to produce a coded picture. A coded frame has a frame number (signaled in the bitstream), which is not necessarily related to decoding order, and each coded field of a progressive or interlaced frame has an associated picture order count, which defines the decoding order of fields. Previously coded pictures (reference pictures) may be used for inter prediction of further coded pictures. Reference pictures are organized into one or two lists (sets of numbers corresponding to reference pictures), described as list 0 and list 1. A coded picture consists of a number of macroblocks, each containing 16x16 luma samples and associated chroma samples (8x8 Cb and 8x8 Cr samples in the current standard). Within each picture, macroblocks are arranged in slices, where a slice is a set of macroblocks in raster scan order. An I slice may contain only I macroblock types (see below), a P slice may contain P and I macroblock types, and a B slice may contain B and I macroblock types. (There are two further slice types, SI and SP, which are not in the scope of this dissertation.)

I macroblocks are predicted using intra prediction from decoded samples in the current slice. A prediction is formed either (a) for the complete macroblock or (b) for each 4x4 block of luma samples (and associated chroma samples) in the macroblock. (An alternative to intra prediction, I_PCM mode, is described in Section 4.5.2.)

P macroblocks are predicted using inter prediction from reference picture(s). An inter coded macroblock may be divided into macroblock partitions, i.e. blocks of size 16x16, 16x8, 8x16 or 8x8 luma samples (and associated chroma samples). If the 8x8 partition size is chosen, each 8x8 sub-macroblock may be further divided into sub-macroblock partitions of size 8x8, 8x4, 4x8 or 4x4 luma samples (and associated chroma samples). Each macroblock partition may be predicted from one picture in list 0. If present, every sub-macroblock partition in a sub-macroblock is predicted from the same picture in list 0.

B macroblocks are predicted using inter prediction from reference picture(s). Each macroblock partition may be predicted from one or two reference pictures, one picture in list 0 and/or one picture in list 1. If present, every sub-macroblock partition in a sub-macroblock is predicted from (the same) one or two reference pictures, one picture in list 0 and/or one picture in list 1.

Figure 2-1: H.264 Baseline, Main and Extended profiles.

2.3 PROFILES AND LEVELS

H.264 [1] defines a set of three Profiles (2), each supporting a particular set of coding functions and each specifying what is required of an encoder or decoder that complies with the Profile. The Baseline Profile supports intra and inter-coding (using I-slices and P-slices) and entropy coding with context-adaptive variable-length codes (CAVLC). The Main Profile includes support for interlaced video, inter-coding using B-slices, inter-coding using weighted prediction and entropy coding using context-based adaptive binary arithmetic coding (CABAC). The Extended Profile does not support interlaced video or CABAC but adds modes to enable efficient switching between coded bitstreams (SP and SI slices) and improved error resilience (Data Partitioning). Potential applications of the Baseline Profile include videotelephony, videoconferencing and wireless communications; potential applications of the Main Profile include television broadcasting and video storage; and the Extended Profile may be particularly useful for streaming media applications. However, each Profile has sufficient flexibility to support a wide range of applications, and so these examples of applications should not be considered definitive. Figure 2-1 shows the relationship between the three Profiles and the coding tools supported by the standard. It is clear from this figure that the Baseline Profile is a subset of the Extended Profile, but not of the Main Profile.

(2) The latest draft, ITU-T Rec. (03/2010), defines a set of 17 profiles.

2.4 CODED DATA FORMAT

H.264 [1] makes a distinction between a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The output of the encoding process is VCL data, a sequence of bits representing the coded video data, which are mapped to NAL units prior to transmission or storage. Each NAL unit contains a Raw Byte Sequence Payload (RBSP), a set of data corresponding to coded video data or header information. A coded video sequence is represented by a sequence of NAL units (Figure 2-2) that can be transmitted over a packet-based network or a bitstream transmission link, or stored in a file. The purpose of separately specifying the VCL and NAL is to distinguish between coding-specific features (at the VCL) and transport-specific features.

Figure 2-2: Sequence of NAL units.

2.5 REFERENCE PICTURES

An H.264 encoder may use one or two (or even more, in the reference H.264 encoder) previously encoded pictures as a reference for motion-compensated prediction of each inter coded macroblock or macroblock partition. This enables the encoder to search for the best 'match' for the current macroblock partition from a wider set of pictures than just the previously encoded picture. Multiple reference frames result in significant compression efficiency, especially when the motion is periodic by nature, as is illustrated in Figure 2-3.


Figure 2-3: Multiple reference frames.

The encoder and decoder each maintain one or two lists of reference pictures, containing pictures that have previously been encoded and decoded (occurring before and/or after the current picture in display order). Inter coded macroblocks and macroblock partitions in P slices (see below) are predicted from pictures in a single list, list 0. Inter coded macroblocks and macroblock partitions in a B slice may be predicted from two lists, list 0 and list 1.

Table 2-2: H.264 slice modes.
- I (Intra): Contains only I macroblocks (each block or macroblock is predicted from previously coded data within the same slice). Profiles: All.
- P (Predicted): Contains P macroblocks (each macroblock or macroblock partition is predicted from one list 0 reference picture) and/or I macroblocks. Profiles: All.
- B (Bi-predictive): Contains B macroblocks (each macroblock or macroblock partition is predicted from a list 0 and/or a list 1 reference picture) and/or I macroblocks. Profiles: Extended and Main.
- SP (Switching P): Facilitates switching between coded streams; contains P and/or I macroblocks. Profile: Extended.
- SI (Switching I): Facilitates switching between coded streams; contains SI macroblocks (a special type of intra coded macroblock). Profile: Extended.

2.6 SLICES

A video picture (frame) is coded as one or more slices, each containing an integral number of macroblocks, from one (one MB per slice) to the total number of macroblocks in a picture (one slice per picture). The number of macroblocks per slice need not be constant within a picture. There is minimal inter-dependency between coded slices, which can help to limit the propagation of errors. There are five types of coded slice (Table 2-2) and a coded picture may be composed of different types of slices. For example, a Baseline Profile coded picture may contain a mixture of I and P slices, and a Main or Extended Profile picture may contain a mixture of I, P and B slices.
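Each coded slice is carried in its own NAL unit (Section 2.4). As a rough illustration of how a byte stream decomposes into such units, the sketch below splits an Annex-B formatted stream on its start codes and reads the one-byte NAL header; the input file name is a hypothetical placeholder.

```python
# Hedged sketch: split an H.264 Annex-B byte stream into NAL units.
# Start codes are 0x000001 (or 0x00000001); the first payload byte is
# the NAL header: forbidden_zero_bit (1 bit), nal_ref_idc (2 bits)
# and nal_unit_type (5 bits). Types 1 and 5 are non-IDR and IDR
# coded slices respectively.

def split_nal_units(stream: bytes):
    units, pos = [], 0
    while (start := stream.find(b"\x00\x00\x01", pos)) != -1:
        start += 3
        end = stream.find(b"\x00\x00\x01", start)
        payload = stream[start:] if end == -1 else stream[start:end]
        units.append(payload.rstrip(b"\x00"))  # drop trailing zeros (e.g. of a 4-byte start code)
        if end == -1:
            break
        pos = end
    return units

with open("clip.264", "rb") as f:              # hypothetical input file
    for nal in split_nal_units(f.read()):
        print("nal_ref_idc =", (nal[0] >> 5) & 0x3,
              "nal_unit_type =", nal[0] & 0x1F)
```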


Figure 2-4 shows a simplified illustration of the syntax of a coded slice. The slice header defines (among other things) the slice type and the coded picture that the slice 'belongs' to, and may contain instructions related to reference picture management. The slice data consists of a series of coded macroblocks and/or an indication of skipped (not coded) macroblocks. Each MB contains a series of header elements and coded residual data.

Figure 2-4: Slice syntax.

Table 2-3: Macroblock syntax elements.
- mb_type: Determines whether the macroblock is coded in intra or inter (P or B) mode; determines the macroblock partition size.
- mb_pred: Determines intra prediction modes (intra macroblocks); determines list 0 and/or list 1 references and differentially coded motion vectors for each macroblock partition (inter macroblocks, except for inter MBs with 8x8 macroblock partition size).
- sub_mb_pred: (Inter MBs with 8x8 macroblock partition size only.) Determines sub-macroblock partition size for each sub-macroblock; list 0 and/or list 1 references for each macroblock partition; differentially coded motion vectors for each macroblock sub-partition.
- coded_block_pattern: Identifies which 8x8 blocks (luma and chroma) contain coded transform coefficients.
- mb_qp_delta: Changes the quantizer parameter.
- residual: Coded transform coefficients corresponding to the residual image samples after prediction.

2.7 MACROBLOCKS

A macroblock contains coded data corresponding to a 16x16 sample region of the video frame (16x16 luma samples, 8x8 Cb and 8x8 Cr samples) and contains the syntax elements described in Table 2-3. Macroblocks are numbered (addressed) in raster scan order within a frame.
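To make the raster-scan addressing concrete, here is a tiny helper (an illustration, not part of the standard text) that maps a macroblock address to the position of its top-left luma sample.

```python
# Sketch: macroblock address -> top-left luma sample position, for a
# frame whose width is a multiple of 16 (e.g. CIF, 352x288).
MB_SIZE = 16

def mb_position(addr, frame_width):
    mbs_per_row = frame_width // MB_SIZE
    return (addr % mbs_per_row) * MB_SIZE, (addr // mbs_per_row) * MB_SIZE

print(mb_position(25, 352))   # -> (48, 16): 4th MB of the 2nd MB row
```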


2.8 TECHNICAL OVERVIEW

In common with earlier standards (such as MPEG-1, MPEG-2 and MPEG-4), the H.264 draft standard does not explicitly define a CODEC (enCOder/DECoder pair). Rather, the standard defines the syntax of an encoded video bitstream together with the method of decoding this bitstream. In practice, however, a compliant encoder and decoder are likely to include the functional elements shown in Figure 2-5 and Figure 2-6. Whilst the functions shown in these figures are likely to be necessary for compliance, there is scope for considerable variation in the structure of the CODEC. The basic functional elements (prediction, transform, quantization, entropy encoding) are little different from previous standards (MPEG-1, MPEG-2, MPEG-4, H.261, H.263); the important changes in H.264 occur in the details of each functional element.

The Encoder (Figure 2-5) includes two dataflow paths, a "forward" path (left to right, shown in blue) and a "reconstruction" path (right to left, shown in magenta). The dataflow path in the Decoder (Figure 2-6) is shown from right to left to illustrate the similarities between Encoder and Decoder.

2.8.1 Encoder (forward path)

An input frame F_n is presented for encoding. The frame is processed in units of a macroblock (corresponding to 16x16 pixels in the original image). Each macroblock is encoded in intra or inter mode. In either case, a prediction macroblock P is formed based on a reconstructed frame. In intra mode, P is formed from samples in the current frame n that have previously been encoded, decoded and reconstructed (uF'_n in the figures; note that the unfiltered samples are used to form P). In inter mode, P is formed by motion-compensated prediction from one or more reference frame(s). In the figures, the reference frame is shown as the previous encoded frame F'_{n-1}; however, the prediction for each macroblock may be formed from one or two past or future frames (in time order) that have already been encoded and reconstructed. The prediction P is subtracted from the current macroblock F_n to produce a residual or difference macroblock D_n. This is transformed (using a block transform) and quantized to give X, a set of quantized transform coefficients. These coefficients are re-ordered and entropy encoded. The entropy encoded coefficients, together with side information required to decode the macroblock (such as the macroblock prediction mode, quantizer step size, motion vector information describing how the macroblock was motion-compensated, etc.), form the compressed bitstream. This is passed to a Network Abstraction Layer (NAL) for transmission or storage.

Figure 2-5: H.264 Encoder.

2.8.2 Encoder (reconstruction path)

The quantized macroblock coefficients X are decoded in order to reconstruct a frame for encoding of further macroblocks. The coefficients X are re-scaled (Q^-1) and inverse transformed (T^-1) to produce a difference macroblock D'_n. This is not identical to the original difference macroblock D_n; the quantization process introduces losses, and so D'_n is a distorted version of D_n. The prediction macroblock P is added to D'_n to create a reconstructed macroblock uF'_n (a distorted version of the original macroblock). A filter is applied to reduce the effects of blocking distortion, and a reconstructed reference frame is created from a series of macroblocks uF'_n.

2.8.3 Decoder

The decoder receives a compressed bitstream from the NAL. The data elements are entropy decoded and reordered to produce a set of quantized coefficients X. These are rescaled and inverse transformed to give D'_n (this is identical to the D'_n shown in the Encoder). Using the header information decoded from the bitstream, the decoder creates a prediction macroblock P, identical to the original prediction P formed in the encoder. P is added to D'_n in order to produce uF'_n, which is filtered to create the decoded macroblock F'_n. It should be clear from the figures and from the discussion above that the purpose of the reconstruction path in the encoder is to ensure that both encoder and decoder use identical reference frames to create the prediction P. If this is not the case, then the predictions P in encoder and decoder will not be identical, leading to an increasing error or "drift" between the encoder and decoder.

Figure 2-6: H.264 Decoder.
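The role of the reconstruction path can be demonstrated with a deliberately simplified sketch: a scalar quantizer stands in for the transform/quantization pair, and the previous reconstructed frame stands in for motion-compensated prediction. The structural point, that both sides must predict from the same reconstructed data or drift accumulates, carries over to the real codec.

```python
# Toy sketch of the dataflow in Figures 2-5 and 2-6 (illustrative
# stand-ins, not the H.264 tools themselves).
import numpy as np

QSTEP = 8
quantize   = lambda D: np.round(D / QSTEP).astype(int)   # stands in for T + Q
dequantize = lambda X: X * QSTEP                          # stands in for Q^-1 + T^-1

def encode_frame(F_n, ref_recon):
    P = ref_recon                    # stand-in for motion-compensated prediction
    X = quantize(F_n - P)            # forward path: residual D_n -> coefficients X
    recon = P + dequantize(X)        # reconstruction path: uF'_n (filtering omitted)
    return X, recon

def decode_frame(X, ref_recon):
    P = ref_recon                    # same prediction as the encoder
    return P + dequantize(X)         # decoded frame F'_n

frames = [np.full((4, 4), v, dtype=float) for v in (100.0, 103.0, 109.0)]
enc_ref = dec_ref = np.zeros((4, 4))
for F in frames:
    X, enc_ref = encode_frame(F, enc_ref)
    dec_ref = decode_frame(X, dec_ref)
    assert np.array_equal(enc_ref, dec_ref)   # identical references: no drift
```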


3 Inter Prediction

3.1 INTRODUCTION

The goal of the inter prediction is to reduce redundancy between transmitted frames by forming a predicted frame and subtracting this from the current frame. The output of this process is a residual (difference) frame, and the more accurate the prediction process, the less energy is contained in the residual frame. The residual frame is encoded and sent to the decoder, which re-creates the predicted frame, adds the decoded residual and reconstructs the current frame. The key part of the inter prediction is the block based motion estimation-compensation. The motion estimation deals with finding the best match of the current block (sub-block), whilst the motion compensation refers to the predicted block (sub-block), which is the result (residuals) of the subtraction of the original block (sub-block) from its best match. The residuals are encoded and transmitted together with a motion vector describing the position of the best matching block (sub-block), relative to the current macroblock position.

As specified in H.264 [1], there are 7 different block sizes, also known as modes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4), that can be used in motion estimation-compensation. These different block sizes actually form a two-level hierarchy inside a macroblock. The first level comprises block sizes of 16x16, 16x8 or 8x16. In the second level, the macroblock is specified as P8x8 type, of which each 8x8 block can be one of the subtypes 8x8, 8x4, 4x8 or 4x4. This macroblock partitioning is depicted in Figure 3-1.

Figure 3-1: Macroblock partitions.
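Counting the blocks produced by this two-level hierarchy shows how the number of motion vectors per macroblock can grow; the sketch below is illustrative only.

```python
# Sketch of the two-level partitioning: level one splits the 16x16
# macroblock; in a P8x8 macroblock each 8x8 block may be split again.
LEVEL1   = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUBMODES = [(8, 8), (8, 4), (4, 8), (4, 4)]

def blocks_in_mb(mode, submode=None):
    w, h = mode
    n = (16 // w) * (16 // h)
    if mode == (8, 8) and submode is not None:
        sw, sh = submode
        n *= (8 // sw) * (8 // sh)      # each 8x8 block split further
    return n

for mode in LEVEL1:
    print(mode, "->", blocks_in_mb(mode), "motion vector(s)")
print("worst case (P8x8, all 4x4):", blocks_in_mb((8, 8), (4, 4)))  # 16
```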


In order to choose the best block size for a macroblock, the H.264 reference code makes use of a computationally intensive Lagrangian Rate-Distortion (RD) optimization, the general form of which is:

J_{mode} = SSD + \lambda_{mode} \times R    (3-1)

where \lambda_{mode} is the Lagrange multiplier used in mode decision, and R reflects the number of bits associated with choosing the mode and the macroblock quantizer value Qp, including the bits for the macroblock header, the motion vector(s) and all the DCT residue blocks. SSD is the sum of the squared differences. For a block of M x N, the SSD is calculated as:

SSD(s, c(m)) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} ( s(x, y) - c(x - m_x, y - m_y) )^2    (3-2)

where s is the pixel value of the current block, c is the value of the reconstructed reference block and m is the motion vector.

In the H.264 reference code, the motion estimation and the inter mode decision are executed together. For each mode, motion estimation is done first and the resulting motion cost is used for the mode decision. The inter mode decision is therefore an extremely time consuming process. For each position in the search window, motion estimation has to be performed in order to find the motion vector that minimizes eq. (3-3):

J_{motion} = SAD + \lambda_{motion} \times R    (3-3)


where $\lambda_{motion}$ is the Lagrange multiplier used in motion estimation, R is the number of bits associated with the motion vectors and SAD is the sum of the absolute differences. For a block of M × N, the SAD is calculated as:

$SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| C(i, j) - S(i + m, j + n) \right|$   (3-4)

where C(i, j) and S(i + m, j + n) represent the pixels (i, j) in the current luma block and the candidate luma block, respectively. Then eq. (3-2) is calculated, using the best match resulting from eq. (3-4), for every mode, in order to choose the best mode.

3.2 PROBLEM FORMULATION

3.2.1 Inter prediction complexity

As shown in Section 3.1, H.264 [1] has various motion estimation-compensation units (inter prediction modes) of sizes 16×16, 16×8, 8×16, 8×8 and sub8×8. For sub8×8 there are four further sub-partitions: sub8×8, sub8×4, sub4×8 and sub4×4. Moreover, quarter-pixel motion compensation can be applied. Such a wide choice of block sizes greatly improves coding efficiency, but at the expense of a largely increased inter prediction time. The computational complexity becomes even higher when larger search ranges, bi-directional prediction and multiple reference frames are used. It has been observed that, in the case of an exhaustive search over all candidate blocks, up to 80% of the computational power of the encoder is consumed by motion estimation. Such high computational complexity is often a bottleneck for real-time applications.

Various motion estimation methods, legacies of previous standards as well as new ones, have been applied to H.264 in an attempt to reduce the inter prediction complexity. However, these methods have had limited success, mainly due to the new inter prediction scheme introduced by H.264. For example, no matter how effective a motion estimation algorithm is, it needs to be executed for every single mode (16×16, 16×8, 8×16, etc.). Moreover, it must be executed for every reference frame, backwards and forwards. These requirements increase the complexity and eventually the time of the inter prediction. In this dissertation we propose various methods which exploit the new inter prediction scheme of H.264 and, combined with existing motion estimation methods, reduce the inter prediction complexity drastically.
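To make the cost computation concrete, here is a minimal Python sketch of eq. (3-4) and eq. (3-3) for a single candidate displacement. It is an illustration rather than the reference encoder's code: the blocks are NumPy arrays, bounds checking is omitted, and lam and mv_bits stand in for $\lambda_{motion}$ and the rate term R.

```python
import numpy as np

def sad(current, reference, m, n):
    """Sum of absolute differences, eq. (3-4), between the current block and
    the candidate block displaced by (m, n) in the reference frame."""
    h, w = current.shape
    candidate = reference[m:m + h, n:n + w].astype(int)
    return int(np.abs(current.astype(int) - candidate).sum())

def motion_cost(current, reference, m, n, lam, mv_bits):
    """Lagrangian motion cost J = SAD + lambda * R, eq. (3-3); mv_bits is a
    stand-in for the bits needed to code the candidate motion vector."""
    return sad(current, reference, m, n) + lam * mv_bits
```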


3.2.2 Special video applications

The new inter prediction scheme introduced by H.264 increases the encoder's complexity on the one hand, but on the other hand it makes possible the development of various methods applied to special video applications. In this dissertation we present an object detection method which exploits the motion vectors generated by the inter prediction. Previous standards could not make such use of the motion vectors, because the vectors were few and applied only to 16×16 blocks within a frame.

3.3 SOLUTIONS

In the following sections we present four novel methods. Three of them, the Fast Successive Elimination (Section 3.4), the Spatio-Temporal Predictor (Section 3.5) and the Fast Multiple Reference Frame Selector (Section 3.6), interfere with the existing inter prediction process of the H.264 reference encoder and aim at reducing the encoding time. Their major advantage is that they can easily be combined with many other inter prediction techniques and make them more effective. These methods are considered enhancements, according to the definition of the term given in the Introduction. The fourth method (Section 3.7) detects a moving object within an H.264 video sequence. The method works directly in the compressed domain and is thus suitable for real-time applications. It is possible only due to the nature of the H.264 inter prediction, which generates a sufficient amount of motion vectors. This method falls into the category of applied methods, as explained in the Introduction. In this way we demonstrate how various applications can take advantage of the new characteristics of the H.264 encoder.


3.4 FAST SUCCESSIVE ELIMINATION ALGORITHM

3.4.1 Literature review

A common technique to speed up the full search is the successive elimination technique proposed by Li and Salari [2]. Its basic idea is to obtain the best estimate of the motion vectors by successively eliminating search positions in the search window, thus decreasing the number of matching evaluations. Variations of this technique have been proposed in [3-7]. To decrease the amount of computation of the full search algorithm, Jong-Nam Kim and Tae-Sun Choi [8] propose a fast block-matching algorithm based on an adaptive matching scan and representative pixels. Chen-Fu Lin and Jin-Jang Leou [9] propose a fast full search method which reduces the sum of absolute differences (SAD) computations. The algorithm of Ahmad et al. [10] takes advantage of the correlation between motion vectors, uses controls to curb the search, avoids searching stationary regions and uses switchable shape search patterns to accelerate the motion search. Yan-Ho Kam and Wan-Chi Siu [11] propose two new fast full search (FFS) methods. The first combines the concepts of the conventional SAD-reuse method and the row-based partial distortion search (PDS), performing considerably better than either conventional method for variable block size motion estimation. The second speeds up multi-frame motion estimation by using the results already obtained in previously considered reference frames to set extra thresholds for rejecting search points earlier in subsequent reference frames. Xuan Jing and Lap-Pui Chau [12] propose a fast full search method using a predictive search area. Lung-Chun Chang et al. [13] propose a fast full search method using an adaptive search order, so that the best-matched block is found at an early search stage. Tian Song et al. [14] propose a fast full search method using an adaptive search range.

In this section we propose a new fast full search algorithm, which reduces the computations during the full search by applying a two-level search range adaptation technique. The proposed algorithm achieves a considerable speed-up of the full search motion estimation process by applying a fast Successive Elimination Algorithm (SEA) and by reducing the search area around an adaptive search center. In the following paragraphs we shall first describe the fast SEA that we developed and then give an overall description of the proposed algorithm.


3.4.2 Full search in the H.264 reference encoder

The motion estimation in H.264 is performed over the inter prediction modes (Figure 3-1). Furthermore, rate-constrained motion estimation is utilized, where the criterion for finding the optimum motion vector is the minimization of a Lagrangian cost functional (eq. (3-3)). The full search motion estimation is performed within a search area around a search center. The center is placed on the position pointed to by the motion vector (MV) prediction used for conducting the differential coding of the MVs [1], because the true motion vectors generally have a high correlation with the predicted MV. Furthermore, the full search algorithm scans the search area in a spiral fashion. Typical values of the search range are 8×8, 16×16 and 32×32. The H.264 reference encoder incorporates two full search algorithms. The first is a typical full search, which calculates the SAD for each block size separately and finally finds the best block match by minimizing eq. (3-3). The second is a fast full search, which calculates only the SADs of the 4×4 sub-blocks of every block; the SADs of the other block sizes are obtained by merging the 4×4 SADs.

3.4.3 Fast SEA

SEA [2] is based on the following inequality:

$|a + b| \le |a| + |b|$   (3-5)

If we apply eq. (3-5) to eq. (3-4), it can be shown that:

$SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| C(i, j) - S(i + m, j + n) \right| \ge \left| C_0 - S_0 \right| \equiv sea(m, n)$   (3-6)

where $C_0$ and $S_0$ are the sum norms of the current block and the candidate block, respectively.


$C_0 = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} C(i, j), \quad S_0 = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} S(i + m, j + n)$   (3-7)

The workflow of the SEA algorithm is as follows:

Step 1: Calculate the SAD of the first search point. This is considered the current minimum SAD ($SAD_{min}$).
Step 2: Calculate the sea value for the next search point. If it is larger than $SAD_{min}$, skip this point. Otherwise calculate the new SAD and update $SAD_{min}$.
Step 3: Proceed with the next search point and repeat Step 2 until all of the search points have been examined.

Apparently, the speed of the SEA highly depends on the fast calculation of the sum norms $C_0$ and $S_0$. $C_0$ is calculated once, while $S_0$ is traditionally calculated using the frame method, first described by Li and Salari [2]. However, this method can only be applied to blocks of fixed size, usually N×N, whereas the motion estimation in H.264 is performed over a variety of block sizes (prediction modes 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4). A variation of the SEA, denoted the Multilevel Successive Elimination Algorithm (MSEA) [6,7], has been proposed by many authors, partly to increase the efficiency of the SEA but mostly to overcome the problem of the variable block sizes in H.264. In the MSEA the N×M blocks are divided into K sub-blocks, and the sum norms of the sub-blocks are accumulated to obtain the msea value, as shown in (3-8):

$SAD(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| C(i, j) - S(i + m, j + n) \right| \ge \sum_{k=0}^{K-1} \left| C_k - S_k \right| \equiv msea(m, n)$   (3-8)

where $C_k$ and $S_k$ are the sum norms of the k-th sub-block of the current block C and of the candidate block S, respectively.
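The following Python fragment sketches the basic SEA pruning loop of the workflow above. It is illustrative only: the sum norms are computed naively here, whereas the method described next computes them through a summed area table, and the rate term of eq. (3-3) is ignored.

```python
import numpy as np

def sea_search(current, reference, points):
    """Successive elimination over candidate displacements, after [2].
    'points' yields (m, n) displacements; the first one seeds SAD_min."""
    h, w = current.shape
    c0 = int(current.sum())                        # sum norm of the current block
    best, sad_min = None, None
    for m, n in points:
        cand = reference[m:m + h, n:n + w].astype(int)
        # eq. (3-6): if |C0 - S0| >= SAD_min, this point cannot improve, skip it
        if sad_min is not None and abs(c0 - int(cand.sum())) >= sad_min:
            continue
        sad = int(np.abs(current.astype(int) - cand).sum())
        if sad_min is None or sad < sad_min:
            best, sad_min = (m, n), sad
    return best, sad_min
```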


In many cases, however, this approach increases the complexity of the motion estimation. The reason is that the SEA/MSEA bound is not close enough to the true SAD, so quite often the SAD for a given search point must be calculated in addition to the SEA/MSEA value; in that case the SEA/MSEA is clearly an overhead. We therefore conclude that there are two requirements for an SEA method to be efficient:

1. The minimum SAD must be found as close as possible to the initial search point, so that the inequality (3-6) rejects candidates more often.
2. The SEA must be calculated as fast as possible, so that even when the SAD must also be calculated, the impact of the SEA calculation is negligible.

In order to satisfy the first requirement we propose a new method, which adapts the initial search point and the search range according to the results of the motion estimation for the 16×16 block, as we shall show later. In order to satisfy the second requirement we adopted the method proposed by Franklin Crow [16] in 1984. This method was originally used for texture mapping, but it can also be used for calculating the sum norms. Its basic idea is to map a frame to a "Summed Area Table (SAT)", as described below (the description follows [17]).

Given a frame, let g(m, n) be the pixel intensity at (m, n). The SAT value at pixel (m, n), denoted $I_g(m, n)$, is defined as the sum of the values g(x, y) over the region above and to the left of pixel (m, n), inclusive (Figure 3-2). That is:

$I_g(m, n) = \sum_{x=0}^{m} \sum_{y=0}^{n} g(x, y)$   (3-9)

Let $R_g(m, n)$ denote the cumulative row sum of the pixel intensities, defined as:

$R_g(m, n) = \sum_{x=0}^{m} g(x, n)$   (3-10)

Assuming $R_g(-1, n) = 0$ and $I_g(m, -1) = 0$, one can compute $I_g(m, n)$ in one pass by using two recursive formulas:

$R_g(m, n) = R_g(m - 1, n) + g(m, n), \quad I_g(m, n) = I_g(m, n - 1) + R_g(m, n)$   (3-11)

Hence, for a frame of W × H pixels, only 2 × W × H additions are required to compute the whole Summed Area Table (SAT). Using the SAT, the sum norm of any rectangular block can be computed with three arithmetic operations: one addition and two subtractions. This can be seen in Figure 3-3, where the sum norm (SN) of block D is computed from the four corresponding SAT values at the corners of the block as follows:

$SN_D = \sum_{x=r+1}^{m} \sum_{y=s+1}^{n} g(x, y) = I_g(m, n) - I_g(r, n) - I_g(m, s) + I_g(r, s)$   (3-12)
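The following Python sketch shows the SAT construction of eq. (3-11) and the three-operation block-sum query of eq. (3-12). It is an illustration under the indexing assumptions stated in the comments, not the thesis implementation.

```python
import numpy as np

def summed_area_table(frame):
    """I_g of eq. (3-11): two cumulative-sum passes over the frame."""
    return frame.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def block_sum_norm(sat, r, s, m, n):
    """Sum norm of block D spanning rows r+1..m and columns s+1..n,
    eq. (3-12): one addition and two subtractions per query.
    r = -1 or s = -1 means the block touches the frame border."""
    total = sat[m, n]
    if r >= 0:
        total -= sat[r, n]
    if s >= 0:
        total -= sat[m, s]
    if r >= 0 and s >= 0:
        total += sat[r, s]
    return int(total)
```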


Figure 3-2: The value g(m, n) is the pixel intensity at (m, n), while $I_g(m, n)$ is the sum of the values g(x, y) over the region above and to the left of pixel (m, n), inclusive.

Figure 3-3: The sum norms of block D can be computed by using the four SATs at the block boundaries.

3.4.3.1 Cost analysis

To facilitate the analysis, we assume that each addition, subtraction and conditional operation, including the calculation of an absolute value, costs one operation. According to [2], the total number of computations required to obtain the sum norms of all the blocks in the reference frame is:


$T = 4 \times W \times H - (H - N) \times (N + 3) - 3 \times W \times (N + 1)$   (3-13)

If we consider a QCIF video frame (W = 176, H = 144) divided into 16×16 blocks (N = 16), T equals 89968 operations. If the 16×16 blocks are further divided into 4×4 sub-blocks (N = 4), as required by the MSEA methods, then T equals 99716 operations. The calculation of each MSEA value in eq. (3-8) additionally costs 48 operations (K = 16 for N = 4). If we consider the motion estimation of the 16×16 blocks only, the computation overhead for each reference 16×16 block is:

$T / (\text{total blocks per frame}) + 48 = 99716 / 99 + 48 \approx 1055$   (3-14)

On the other hand, using the proposed SAT method, it can be shown from eq. (3-11) and eq. (3-12) that the computation overhead for each reference 16×16 block is:

$2 \times W \times H / (\text{total blocks per frame}) + 3 = 50688 / 99 + 3 = 515$   (3-15)

The great advantage of the SAT method is that the overhead of eq. (3-15) remains the same for every block size, whilst the overhead of eq. (3-14) varies according to the current block size and to the chosen number K of sub-blocks in eq. (3-8). The proposed method uses the SAT for calculating both of the sum norms $C_0$ and $S_0$, in order to save as many computations as possible.

3.4.4 Two-level motion estimation

The proposed method performs a two-level motion estimation. At the first level, it performs the motion estimation for the 16×16 blocks exactly as the conventional full search algorithm does (Section 3.4.2): the motion estimation of the 16×16 macroblock is performed over a number of search points within a search area (typically of 16×16 points) in order to find its best match in the reference frame. At the second level, the method moves the search center to the best matching block position found at the first level. Moreover, it reduces the search range according to the distance between the initial search center and the best matching block position, as illustrated in Figure 3-4.


Figure 3-4: The two-level motion estimation.

The workflow of the proposed method is as follows:

Step 1: Perform the motion estimation and find the best match for the 16×16 mode around the MV-predicted search center. Let the initial search center be at position 1, (0, 0), and the best match at position 2, (x, y), as shown in Figure 3-4 (Level A).
Step 2: Move the search center to position 2, (x, y).
Step 3: Reduce the search range according to formula (3-16).
Step 4: Perform the motion estimation for the remaining modes (16×8, 8×16, 8×8, 8×4, 4×8 and 4×4), as shown in Figure 3-4 (Level B).

$SearchRange = \max(x, y, 5)$   (3-16)

Notice that the adapted search range of eq. (3-16) cannot be less than 5 positions. The reason is that max(x, y) may be too small, in which case the best match might be located outside the search range. Besides, it has been reported that about 93% of the best matching points are located in the 5×5 area near the search center [18].
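A one-line sketch of the second-level range adaptation (hypothetical helper; the displacement components are taken in absolute value, which eq. (3-16) leaves implicit):

```python
def adapted_search_range(x, y, minimum=5):
    """Second-level search range of eq. (3-16): shrink the window to the
    distance of the 16x16 best match, but never below 5 positions."""
    return max(abs(x), abs(y), minimum)
```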


3.4.5 Simulation Results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software are shown in Table 3-1; the rest of the parameters retained their default values. The proposed algorithm was tested against the conventional full search and the fast full search algorithms used by the reference encoder (JM12.0).

In our tests we enabled the bit rate control mechanism of the encoder and set a 45 kbps bit rate constraint. With bit rate control enabled, the quantization parameters were automatically controlled by the encoder, which had to generate a bit rate lower than or equal to the constraint. For our tests we used 200 frames of three well-known representative video sequences in QCIF format (YUV 4:2:0): mother&daughter (Class A), foreman (Class B) and mobile (Class C). The testing procedure (ref. Appendix II.4) was to run the reference encoder with and without our algorithm and then compare the results with respect to the bit rate, the PSNR and the encoding time. We used the bit rate, encoding time and PSNR variations as comparative metrics, calculated as in eq. (I-1), (I-2) and (I-5), respectively. The results are shown in Table 3-2. The variations are between the proposed method and the Full Search (FS) algorithm, and between the proposed method and the Fast Full Search (FFS) algorithm, both used by the reference H.264 encoder (JM12.0).

From the results we see that the proposed algorithm achieves a significant average reduction of 53.57% in motion estimation time compared to the full search algorithm, at the cost of a 0.01 dB average loss in PSNR. Moreover, it results in a small bit rate reduction (0.087% on average). It also achieves an average reduction of 32.34% in motion estimation time compared to the fast full search algorithm. The PSNR and bit rate variations obviously remain the same in this comparison, since the full search and fast full search algorithms produce identical PSNR and bit rate outcomes.


3.4.6 Conclusions

The proposed method achieves a considerable speed-up of the full search motion estimation process by applying a fast Successive Elimination Algorithm (SEA) and by reducing the search area around an adaptive search center. Moreover, it leaves the PSNR and the bit rate practically unaffected.

Table 3-1: Configuration parameters of the encoder.

Profile: Baseline
Number of Frames: 200
Frame Rate: 30 fps
Reference frames: 5
RD Optimization: Fast High Complexity Mode
Motion Estimation: Full Search & Fast Full Search
Intra Period: 0 (only the first frame is intra)
Symbol Mode: UVLC
Bit Rate Control: Enabled
Bit Rate Constraint: 45000 b/s

Table 3-2: Simulation results.

Sequence            PSNR Var. (dB)   Bit Rate Var. (%)   Enc. Time Var. vs FS (%)   Enc. Time Var. vs FFS (%)
Mother & Daughter   -0.01            -0.17               -56.56                     -42.13
Foreman              0.00             0.01               -49.40                     -25.61
Mobile              -0.02            -0.10               -54.75                     -29.26
Average             -0.01            -0.087              -53.57                     -32.34


3.5 SPATIO-TEMPORAL PREDICTOR FOR MOTION ESTIMATION

3.5.1 Literature review

Many fast motion estimation (FME) techniques have been proposed in the literature [19-24]. Two popular approaches are used to reduce the computation of block matching motion estimation. The first reduces the number of candidate blocks in the search window (fast searching techniques); these algorithms usually show a good speed gain but relatively larger rate-distortion (R-D) performance degradation. The second reduces the complexity of the SAD computation (fast matching techniques); these algorithms often achieve good coding efficiency but limited speed-up. Other techniques include predictive spatio-temporal search, adaptive early termination and dynamic search range adjustment. It is possible to combine several of the above techniques into a hybrid search method. For example, PMVFAST [25] and UMHexagonS [26] utilize prediction, diamond search, hexagon search, partial distortion and adaptive early termination, and have proven more robust than any single search strategy.

Reference software is often optimized for coding efficiency rather than encoding speed, because R-D performance is the paramount concern during the standardization process. The reference H.264 encoder adopted three FME algorithms due to their competitive R-D performance relative to Full Search: UMHexagonS [26], simplified UMHexagonS [27] and EPZS [28]. The first two make use of the well-known median predictor, described in the standard, in order to find a better search center, and then perform a limited search around this center. The EPZS algorithm, on the other hand, defines sets of predicted search points which are likely to give the best match; for that purpose it uses various predictors, such as the median predictor and temporal predictor(s).

In the following section we study the effectiveness of the different predictors used by the EPZS algorithm, in order to verify that their use is justified. In addition, we take the results of the EPZS study into account in order to form a new predictor, which may substitute the median predictor used in [26] and [27], or may also be included in the set of predictors used by the EPZS.


3.5.2 Effectiveness of the EPZS predictors

The EPZS [28] is considered the most advanced of the three fast motion estimation algorithms used by the H.264 reference code. The basic idea of EPZS is to reduce the candidate search points by predicting points which are likely to give good results. For that purpose EPZS uses various search point predictors, such as the well-known median predictor, the (0, 0) position, the motion vectors of the adjacent blocks in the current frame, the motion vectors of the collocated block and of its adjacent blocks in the reference frame, and many others. In particular, the motion vector predictor, which plays a key role in the motion estimation, is calculated in the following way [15].

Let E be the current macroblock, macroblock partition or sub-macroblock partition. Let A be the partition or sub-partition immediately to the left of E, B the partition or sub-partition immediately above E, and C the partition or sub-macroblock partition above and to the right of E. If there is more than one partition immediately to the left of E, the topmost of these partitions is chosen as A. If there is more than one partition immediately above E, the leftmost of these is chosen as B. Figure 3-5 illustrates the choice of neighboring partitions when all the partitions have the same size. Figure 3-6 shows an example of the choice of prediction partitions when the neighboring partitions have different sizes from the current partition E.

Figure 3-5: Current and neighboring macroblocks.


Figure 3-6: Current and neighboring macroblock partitions.

The motion vector predictor $MV_p$ is calculated as follows (a small sketch of rule 1 follows the list):

1. For transmitted partitions, excluding the 16×8 and 8×16 partition sizes, $MV_p$ is the median of the motion vectors of partitions A, B and C.
2. For 16×8 partitions, $MV_p$ for the upper 16×8 partition is predicted from B, and $MV_p$ for the lower 16×8 partition is predicted from A.
3. For 8×16 partitions, $MV_p$ for the left 8×16 partition is predicted from A, and $MV_p$ for the right 8×16 partition is predicted from C.
4. For skipped macroblocks, a 16×16 vector $MV_p$ is generated as in case (1), i.e. as if the block were encoded in 16×16 inter mode.
5. If one or more of the previously transmitted blocks is not available (e.g. if it is outside the current slice), the choice of $MV_p$ is modified accordingly.
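As an illustration of rule (1), a minimal component-wise median in Python (a hypothetical helper, not the reference code; each motion vector is an (x, y) tuple):

```python
def median_predictor(mv_a, mv_b, mv_c):
    """Component-wise median of the motion vectors of partitions A, B and C."""
    def med3(a, b, c):
        return sorted((a, b, c))[1]
    return (med3(mv_a[0], mv_b[0], mv_c[0]),
            med3(mv_a[1], mv_b[1], mv_c[1]))
```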


We examined 11 video sequences in QCIF format. The H.264 encoder was configured with the default parameters of the baseline profile, and the results are shown in Figure 3-7, which plots the percentage (%) contribution of each predictor over the different video sequences. It is clear that the median predictor is the dominant predictor; the second best appears to be the (0, 0) position. Moreover, the motion vectors of the adjacent blocks in the current frame (Left, Up, UpRight, UpLeft, Mem Left, Mem Up, Mem UpRight) make a significant contribution. The contribution of the motion vector of the collocated block and the contributions of the motion vectors of its adjacent blocks in the reference frame are all summed under the label "Collocate" for simplicity; in practice this number is spread over 9 different predictors. The "Block Type" predictors also make a considerable contribution. Finally, the "Window type" predictors seem to contribute negligibly to the motion estimation and might have been skipped.

Figure 3-7: Effectiveness of the EPZS predictors.

3.5.3 Spatio-temporal predictor

It was found in the previous section that the median predictor is the most reliable and has the highest probability of being the true predictor, especially for nonzero-biased sequences. On the other hand, the collocated (0, 0) prediction is more suitable for sequences which contain a lot of stationary data, i.e. where a block is exactly the same as the one at the same position in the previous frame. Finally, the prediction based on the motion vector of the collocated macroblock is better in a number of cases. Apparently, each predictor by itself performs well for specific sequences and not so well for others. The proposed predictor combines the aforementioned predictors in order to form a new predictor which covers a wider range of video sequences.

Let mv be the desired predictor, col_mv the motion vector of the collocated macroblock and med_mv the median predictor, as illustrated in Figure 3-8. We distinguish the following cases:


Figure 3-8: Spatio-temporal predictor.

3.5.3.1 Stationary block

Condition: Both of the coordinates x, y of col_mv are zero.

Choice:

$mv(x, y) = (0, 0)$   (3-17)

3.5.3.2 Vertical movement

Condition: The x coordinates of both col_mv and med_mv are zero.

Choice: If $col\_mv_y > 2$ we consider the movement to be fast and we set

$mv(x, y) = (0, \max(col\_mv_y, med\_mv_y))$   (3-18)

Otherwise we set

$mv(x, y) = (0, \min(col\_mv_y, med\_mv_y))$   (3-19)

3.5.3.3 Horizontal movement

Condition: The y coordinates of both col_mv and med_mv are zero.


Choice: If $col\_mv_x > 2$ we consider the movement to be fast and we set

$mv(x, y) = (\max(col\_mv_x, med\_mv_x), 0)$   (3-20)

Otherwise we set

$mv(x, y) = (\min(col\_mv_x, med\_mv_x), 0)$   (3-21)

3.5.3.4 The current block is moving in the same direction and at the same speed as the collocated block

Conditions:

Same direction: both col_mv and med_mv lie in the same quadrant, as in Figure 3-8.

Same speed:

$med\_mv_x - 2 \le col\_mv_x \le med\_mv_x + 2$ and $med\_mv_y - 2 \le col\_mv_y \le med\_mv_y + 2$   (3-22)

Choice:

$mv(x, y) = (col\_mv_x, col\_mv_y)$   (3-23)

3.5.3.5 The current block is moving in the same direction as the collocated block but at a different speed

Conditions:

Same direction: both col_mv and med_mv lie in the same quadrant, as in Figure 3-8.

Different speed: the inequality (3-22) does not hold.

Choice:

$mv(x, y) = \left( \dfrac{col\_mv_x + med\_mv_x}{2}, \dfrac{col\_mv_y + med\_mv_y}{2} \right)$   (3-24)

3.5.3.6 All other cases

Choice:

$mv(x, y) = (med\_mv_x, med\_mv_y)$   (3-25)
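Putting the six cases together, the following Python sketch mirrors Sections 3.5.3.1 to 3.5.3.6. It is illustrative: the function name and the strict-sign test for "same quadrant" are assumptions (zero components are already consumed by the earlier cases), and the integer averaging for case 3.5.3.5 picks one possible rounding of eq. (3-24).

```python
def spatio_temporal_predictor(col_mv, med_mv):
    """Combine the collocated (col_mv) and median (med_mv) predictors
    according to the case analysis of Sections 3.5.3.1-3.5.3.6."""
    cx, cy = col_mv
    mx, my = med_mv
    if cx == 0 and cy == 0:                       # 3.5.3.1 stationary block
        return (0, 0)
    if cx == 0 and mx == 0:                       # 3.5.3.2 vertical movement
        pick = max if cy > 2 else min             # fast vs. slow motion
        return (0, pick(cy, my))
    if cy == 0 and my == 0:                       # 3.5.3.3 horizontal movement
        pick = max if cx > 2 else min
        return (pick(cx, mx), 0)
    same_quadrant = cx * mx > 0 and cy * my > 0   # same sign on both axes
    same_speed = abs(cx - mx) <= 2 and abs(cy - my) <= 2   # eq. (3-22)
    if same_quadrant and same_speed:              # 3.5.3.4 same direction/speed
        return (cx, cy)
    if same_quadrant:                             # 3.5.3.5 same direction only
        return ((cx + mx) // 2, (cy + my) // 2)   # eq. (3-24), integer average
    return (mx, my)                               # 3.5.3.6 all other cases
```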


3.5.4 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM11.0) are shown in Table 3-3; the rest of the parameters retained their default values.

The reference code uses three fast motion estimation algorithms, those of [26], [27] and [28], as described in Section 3.5.1. All of these algorithms take the median predictor as the initial search point and then perform a fast search around it. In our tests we substituted our spatio-temporal predictor for the median predictor and then let the three algorithms do their fast search around the predicted point. We tested the proposed scheme on different video sequences; the results are shown in Table 3-4. We used the encoding time and PSNR variations as comparative metrics, calculated as in eq. (I-2) and (I-5), respectively. From the results we observe that the proposed scheme does not actually affect the PSNR. This was expected, since the PSNR is affected mainly by the search pattern of the FME algorithm rather than by its initial search point. We also observe that the proposed scheme speeds up the motion estimation in most of the test cases: in the vast majority a speed-up was observed, varying from 0.6% to 7.3%. This is a considerable improvement of the existing FME algorithms [26], [27] and [28], taking into account that the proposed scheme leaves the main core of the FME algorithms as is and simply modifies the initial search point. However, in some cases the proposed scheme proved ineffective, since it increased the motion estimation time.

3.5.5 Conclusions

The proposed predictor may be used as the initial search center by [26] and [27] and by other FME algorithms of this type. The predictor effectively defines an optimized search area, in which the best match during the motion estimation is likely to be close to the center. Moreover, it may be used as an additional search candidate by [28].


The proposed scheme, in conjunction with the study of the EPZS predictors, shows that it is possible to combine different spatial and temporal predictors in order to form a new, better predictor.

Table 3-3: Configuration parameters of the encoder.

Profile: Baseline
Number of Frames: 100
Number of reference frames: 5
Motion Estimation Algorithm: HEX, SHEX, EPZS
RD Optimization: High Complexity
Rate Control: Enabled
Bit Rate: 45000 bps

Table 3-4: Evaluation of the predictor.

Sequence       Format   FME    PSNR Y Var. (dB)   Speed-up (%)
bridge-close   QCIF     HEX    0.00                1.4
                        SHEX   0.00                4.7
                        EPZS   0.00                1.9
bridge-far     QCIF     HEX    0.00                1.0
                        SHEX   0.00                6.7
                        EPZS   0.00               -0.7
highway        QCIF     HEX    0.10                4.4
                        SHEX   0.10                7.3
                        EPZS   0.00                6.4
salesman       QCIF     HEX    0.00                1.9
                        SHEX   0.00                3.4
                        EPZS   0.00                2.5
carphone       QCIF     HEX    0.20                1.0
                        SHEX   0.00                6.4
                        EPZS   0.00                5.0
news           QCIF     HEX    0.00                0.6
                        SHEX   0.10                1.8
                        EPZS   0.00               -3.0
grandma        QCIF     HEX    0.00                3.5
                        SHEX   0.00                3.3
                        EPZS   0.00                1.5
container      QCIF     HEX    0.10                1.3
                        SHEX   0.00                2.9
                        EPZS   0.00               -1.7
claire         QCIF     HEX    0.14               -3.0
                        SHEX   0.00                4.5
                        EPZS   0.10                6.1
silent         QCIF     HEX    0.18                3.4
                        SHEX   0.00                3.2
                        EPZS   0.00                1.4
foreman        QCIF     HEX    0.00               -1.0
                        SHEX   0.00               -0.7
                        EPZS   0.20                3.9


3.6 FAST MULTIPLE REFERENCE FRAME SELECTION

3.6.1 Literature review

Several methods for reducing the number of reference frames have been proposed over the past years. In [29] a method is proposed which employs the best reference frames of neighboring blocks in order to determine the best reference frame of the current block. In [30] the frame selection is based on the sub-pixel movement across the reference frames. In [31] the number of reference frames is reduced by using the correlation of difference values between the block of the current frame and that of the previous frame. In [32] a method is proposed which applies some well-known fast motion estimation methods to each reference frame; the local minimum SAD (eq. (3-4)) found along the selection path is used as the indicator of the final reference frame. In the following sections we propose an approach based on the same concept as [32], i.e. it performs a SAD test across the reference frames in order to reveal the optimal reference frame. However, our method outperforms [32], since it uses a significantly smaller number of test points, takes the Lagrangian reference cost into account and leads to better results with respect to both video quality and reduction of the motion estimation time.

3.6.2 Multiple reference frames in H.264

H.264 uses multiple reference frames to achieve better motion prediction performance. The encoder performs the motion estimation on every reference frame for every macroblock of the current frame. Rate-distortion optimization is the criterion for selecting the best coding mode. The rate-distortion algorithm evaluates the cost of every possible reference frame, balancing the distortion against the number of bits consumed [34]. The reference frame which results in the smallest cost is then considered the optimum choice. The cost function for the selection of the optimal reference frame is calculated as follows:

$J(REF \mid \lambda_{motion}) = SAD(s, c(REF, m(REF))) + \lambda_{motion} \left( R(m(REF) - p(REF)) + R(REF) \right)$   (3-26)

where $\lambda_{motion}$ is the Lagrange operator used in motion estimation, R(REF) is the number of bits consumed for coding the index of the reference frame, computed by table look-up, m(REF) is the motion vector to be decided, p(REF) is the prediction of the motion vector and R(m(REF) − p(REF)) is the size of the bit stream of the motion vector after entropy coding. SAD (Sum of Absolute Differences) is calculated as in eq. (3-4).


From the above, it is clear that we can achieve a significant reduction of the encoding time if we reduce the number of the reference frames.

3.6.3 Frame selection method

In the previous section we saw that the motion vector predictor and the center position (0, 0) of the collocated macroblock in the reference frame are by far the most accurate predictors of the search center. The proposed method takes advantage of the great prediction accuracy of both the motion vector predictor and the collocated macroblock's center predictor in order to find the optimal reference frame. This frame is found by minimizing a cost function. Apparently, we cannot use eq. (3-26) as the cost function, since the value R(m(REF) − p(REF)) is not known prior to the motion estimation. We therefore take only the frame cost into account, and the cost function (3-26) is modified as follows:

$J(REF \mid \lambda_{motion}) = SAD(s, c(REF, m(REF))) + \lambda_{motion} \cdot R(REF)$   (3-27)

It has been found that applying the cost function (3-27) only to the 16×16 blocks is sufficient to yield the optimal reference frame. The proposed approach is as follows, for every macroblock in the current frame (a sketch follows the list):

Step 1: Get the 16×16 luma block.
Step 2: Get the next frame from the frame list.
Step 3: Calculate the cost function (3-27) for the 16×16 luma block of the macroblock pointed to by the motion vector predictor in the reference frame.
Step 4: Calculate the cost function (3-27) for the 16×16 luma block of the collocated macroblock in the reference frame.
Step 5: Compare the two results and keep the minimum.
Step 6: Compare this minimum with the one from the previous frame and keep the global minimum.
Step 7: Repeat Steps 2 to 6 until all of the reference frames have been examined.
Step 8: The remaining global minimum denotes the optimal reference frame for the macroblock under test.
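A compact Python sketch of this selection loop; the function signature, the callable sad and the table ref_bits are illustrative assumptions, not the JM interface.

```python
def select_reference_frame(block, ref_frames, mv_pred, pos, lam, ref_bits, sad):
    """Pick the reference frame minimizing eq. (3-27) over two probe points per
    frame: the MV-predictor position and the collocated (0, 0) position.
    'sad(block, frame, point)' returns the 16x16 SAD at 'point' in 'frame';
    'ref_bits[i]' is the table look-up value R(REF) for frame index i."""
    best_index, best_cost = 0, float("inf")
    for i, ref in enumerate(ref_frames):
        probes = ((pos[0] + mv_pred[0], pos[1] + mv_pred[1]),  # MV predictor
                  pos)                                         # collocated (0, 0)
        cost = min(sad(block, ref, p) for p in probes) + lam * ref_bits[i]
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index
```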


The method also incorporates a simple early-stop criterion in order to speed up the SAD calculation: the encoder compares the partial SAD with the previous minimum and, if it already exceeds that minimum, the calculation is terminated.

The proposed method is quite simple and very easy to implement. There is no calculation overhead, since the position of the collocated block is known and the motion vector predictor is calculated by the encoder anyway, according to the H.264 standard. In addition, our method can be combined with any existing motion estimation algorithm, either the full search or any other fast motion estimation algorithm.

3.6.4 Simulation Results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM11.0) are shown in Table 3-5; the rest of the parameters retained their default values.

First we conducted some quick tests, in which we let our algorithm run without interfering with the reference frame selection process. Our purpose was to find out how often our algorithm finds the same optimal reference frame as the one found by the full search algorithm in normal operation mode. The results are shown in Table 3-6, and they indicate that the proposed method can adequately replace the normal frame selection procedure.

Several video sequences in QCIF format were tested. The results are shown in Table 3-7 and Table 3-8. The bit rate variations are negligible, since we set a very low bit rate constraint of 45000 bps; the comparisons can therefore be made by examining the variations of the motion estimation time (eq. (I-2)) and the PSNR (eq. (I-5)).


From the comparisons we see that the proposed method results in a significant reduction of the motion estimation time, while the degradation of the PSNR does not exceed 0.5 dB. Moreover, the proposed method behaves uniformly across all of the test sequences, no matter which class they belong to.

3.6.5 Conclusions

The method performs a simple and fast test prior to the motion estimation in order to choose the best reference frame among a number of candidates, which are usually more than two (typically 5). In this way the method reduces the number of reference frames to one, so the motion estimation process is performed against only one frame. Experimental results showed that this frame is, in most cases, the frame that the H.264 encoder would choose anyway if it performed the motion estimation for every reference frame. Thus, the method decreases the motion estimation time without considerably affecting the video quality.

Table 3-5: Configuration parameters of the encoder.

Profile: Baseline
Number of Frames: 100
Number of reference frames: 5
Motion Estimation Algorithm: Full Search
RD Optimization: High Complexity
Rate Control: Enabled
Bit Rate: 45000 bps

Table 3-6: Successful matches between (3-26) and (3-27).

Video Sequence (QCIF, 50 frames)   Success (%)
Foreman                            73.6
Salesman                           95.4
News                               94.8
Carphone                           71.5
Grandma                            94.0


Table 3-7: Motion estimation time variation.

Video Sequence   Encoding Time, Reference (ms)   Encoding Time, Proposed (ms)   Variation (%)
bridge-close     68664                           14182                          -79.345
bridge-far       68980                           13974                          -79.741
highway          71306                           14665                          -79.433
salesman         69236                           13814                          -80.047
carphone         72852                           14942                          -79.489
news             69410                           14170                          -79.585
grandma          69785                           14341                          -79.449
container        69505                           13785                          -80.166
claire           69850                           14410                          -79.370
silent           69520                           13735                          -80.243
foreman          73965                           14218                          -80.777
akiyo            70853                           13665                          -80.71
mobile           79776                           15155                          -80.81

Average variation of motion estimation time: -79.935%

Table 3-8: PSNR variation.

Video Sequence   PSNR, Reference (dB)   PSNR, Proposed (dB)   Variation (dB)
bridge-close     32.733                 32.774                 0.041
bridge-far       39.216                 39.239                 0.023
highway          36.092                 35.885                -0.207
salesman         33.057                 32.988                -0.069
carphone         33.459                 33.100                -0.359
news             33.590                 33.530                -0.060
grandma          36.799                 36.683                -0.116
container        36.410                 35.981                -0.429
claire           41.126                 40.893                -0.233
silent           32.427                 32.389                -0.038
foreman          31.827                 31.331                -0.496
akiyo            39.096                 38.901                -0.195
mobile           23.742                 23.312                -0.430

Average variation of PSNR: -0.197 dB


3.7 MOVING OBJECT DETECTION IN THE COMPRESSED DOMAIN

3.7.1 Literature review

Moving object detection algorithms in the compressed domain traditionally rely on two types of macroblock (MB) features: motion vectors (MV) and DCT coefficients. For instance, Ahmad et al. [35] have analyzed the performance of MV smoothing with different spatial filters. Sukmarg and Rao [36] propose a region segmentation and clustering based algorithm to detect objects in MPEG compressed video. Wang et al. [37] suggest several confidence measures to improve motion layer separation. Babu and Ramakrishnan [38] use only aggregated motion vectors. Wei Zeng et al. [39] employ a Markov Random Field (MRF) to segment moving objects from the sparse motion vector field obtained directly from the bit stream. Ibrahim and Rao [40] propose the use of a spatio-temporal filter for filtering the motion vectors, together with a hybrid approach exploiting both compressed-domain and spatial-domain processing. Babu et al. [41] have proposed an automatic video object segmentation algorithm for MPEG video: they first estimate the number of independently moving objects in the scene using a block-based affine clustering method, and the object segmentation is then obtained by the expectation-maximization (EM) clustering algorithm.

Most of the above methods are designed to work in the MPEG compressed domain. However, H.264 employs several new coding tools. In H.264 an intra-coded block is spatially intra-predicted from its neighboring pixels, so the transform coefficients now carry the spatial prediction residues of the blocks. Moreover, H.264 supports variable block-size motion compensation: a macroblock may be partitioned into several blocks and carry several motion vectors. As a result, the motion vector field of an H.264 compressed video consists of motion vectors of variable block size, quite unlike the former MPEG standards with their regular block-size motion vectors. The proposed method is specially designed to work in the H.264 compressed domain. It mainly takes advantage of the variable block sizes in order to detect a moving object, and is thus very fast and easy to implement.


3.7.2 Moving object detection in the compressed domain

The proposed method can detect a moving object by simply examining the motion information of a macroblock directly in the compressed domain, as shown in the block diagram of Figure 3-9. The block size (inter mode) of a block and its corresponding motion vector (MV) are obtained by entropy-decoding the H.264 bitstream. The proposed method consists of three phases:

• Classification,
• Merging and
• Refinement.

Figure 3-9: Block diagram of the proposed method (entropy-decode the header information of each macroblock in the frame, calculate the MV threshold, classify the pixels as moving or static, merge the classified pixels across frames, and take actions once a moving object has been detected).


3.7.2.1 Classification

The proposed method is suitable for CCTV-based video surveillance. Hence, we assume that the video source is a static camera, so that each pixel of a frame belongs either to the static background or to a moving object. In this phase we classify the pixels of a block as static or moving by comparing the block's motion vector MV with a threshold $MV_{TH}$: if the MV of the block is less than $MV_{TH}$, its pixels are classified as static; otherwise they are classified as moving. We prefer to apply the classification to the pixels of a block rather than to the block itself, because in this way we can merge the pixels of different frames in order to obtain the whole moving object; this procedure is further analyzed in Section 3.7.2.2. Furthermore, we examine only the pixels of the macroblocks which have been encoded in modes other than 16×16 (e.g. 16×8, 8×16, 8×8, etc.), because the presence of such sub-blocks usually denotes motion. In this way we also avoid taking into account false motion vectors produced by the inter prediction.

The threshold $MV_{TH}$ cannot be zero, because even static blocks may have non-zero motion vectors. This is due to the nature of the inter prediction, where every block is motion-compensated with regard to previous and/or next reconstructed reference frames, which have suffered quantization. Moreover, the threshold cannot be set to a fixed pre-defined value, because this behavior is highly related to the quantization parameter (QP) used by the encoder: the higher the QP, the more zero motion vectors. The proposed method therefore uses a dynamic threshold per frame, which is the mean value of the motion vectors of all of the blocks of the frame, calculated as in eq. (3-28):

$MV_{TH} = \dfrac{1}{N} \sum_{b=0}^{N-1} MV_b$   (3-28)

where $MV_{TH}$ is the threshold, $MV_b$ is the motion vector of the b-th block and N is the total number of blocks/sub-blocks of the (current) frame.

3.7.2.2 Merging

In this phase we accumulate successive inter frames (P/B) and merge their moving pixels in order to obtain the complete contour of the moving object.
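A small Python sketch of the classification step; the L1 magnitude used to compare a motion vector against $MV_{TH}$ is an assumption, since eq. (3-28) does not fix the norm, and the function names are illustrative.

```python
def mv_threshold(motion_vectors):
    """Dynamic per-frame threshold of eq. (3-28): the mean motion vector
    magnitude over all N blocks/sub-blocks of the current frame."""
    mags = [abs(x) + abs(y) for x, y in motion_vectors]  # |MV| as L1 magnitude
    return sum(mags) / len(mags)

def classify_block_pixels(mv, mv_th, mode):
    """Pixels of a sub-partitioned macroblock are 'moving' when its motion
    vector magnitude reaches MV_TH; 16x16-coded blocks are treated as static."""
    if mode == "16x16":
        return "static"
    x, y = mv
    return "moving" if abs(x) + abs(y) >= mv_th else "static"
```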


3.7.2.2 Merging

In this phase we accumulate successive inter frames (P/B) and merge their moving pixels in order to obtain the complete contour of the moving object. This process enables us to capture objects with slow motion. In that case some pixels of the object are static in one frame whereas they are moving in the next frames. Another case handled by the merging phase is the one where the moving object is occluded by a static object and its moving parts are uncovered in successive frames.

3.7.2.3 Refinement

In the refinement phase we can obtain the details of the moving object by fully decoding only the blocks which contain moving pixels. The typical usage of this phase is to allow the supervisor of a surveillance system to take a quick look at the object which has intruded into the surveyed area.

3.7.3 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM 14.1) are shown in Table 3-9. The rest of the parameters retained their default values. The tests were performed using various video sequences in different formats. Here we present the simulation results for the tennis video sequence in SIF format. The results are shown in Figure 3-10 and demonstrate the three phases of Section 3.7.2: classification, merging and refinement. Figure 3-10.a shows the 6th frame of the tennis sequence. Figure 3-10.b shows the classification of the pixels; black denotes moving pixels while white denotes static pixels. Figure 3-10.c shows the effect of merging the pixels of the different frames. More specifically, the classified moving pixels of the 4th, 5th and 6th frames were merged into one frame. In that way the contour of the moving object was revealed. Figure 3-10.d shows the final stage of the algorithm, the refinement. In that stage the blocks which compose the moving object are fully decoded in order to present the real moving object in detail.

3.7.4 Further improvements

The proposed method presents good results, but under conditions. As a matter of fact, it has two major disadvantages, which should be handled in the future. First of all, the accuracy of the method, especially the detection of an object's contour, heavily depends on the number of sub-blocks produced during the motion estimation. That means that the lack of a sufficient number of sub-blocks, due to either a high QP or slow motion, may lead to rather crude object detection.


Moreover, the method cannot handle complex motions, such as the overlapping motions of two or more moving objects.

3.7.5 Conclusions

The method is specially designed to work in the H.264 compressed domain, as it exploits the variable block sizes used by the H.264 encoder during the motion estimation as well as the generated motion vectors. The proposed method works in the compressed domain because it requires only the entropy decoding of the H.264 bitstream in order to obtain the inter prediction modes and the associated motion vectors. Moreover, once the inter information has been obtained, the method is able to detect a moving object by combining the modes with the motion vectors and applying some thresholds. This makes the method simple as well as fast. It can therefore be used in real time applications such as video surveillance. Future work must be done to either eliminate or limit the drawbacks described in the previous section.

Table 3-9: Configuration parameters of the encoder.

    Parameter                 Value
    Profile                   Baseline
    Quantization Parameter    28
    Frame Rate                30 fps
    RD Optimization           High Complexity Mode
    Motion Estimation         Full Search
    Intra Period              0 (only the first frame is intra)
    Symbol Mode               UVLC
    Rate Control              Disabled


[Figure panels: a. Original Frame; b. Classification; c. Merging; d. Refinement.]
Figure 3-10: The three phases of the proposed algorithm applied to the 6th frame of the tennis video sequence. The frame has been merged with the 4th and the 5th frames during the merging phase.


4 Data Hiding

4.1 INTRODUCTION

Data hiding can be considered as a communication problem where the embedded data is the signal to be transmitted. A typical data hiding framework starts with an original digital medium, known as the host media or cover media. The data hiding module inserts into it a set of secondary data, known as the embedded data or watermark, to obtain the marked media. The insertion or embedding is done in such a way that the marked media is perceptually identical to the original media. In most cases, the embedded data is a collection of bits, which may come from an encoded character string, from a pattern, or from some executable agents, depending on the application. The embedded data is extracted from the marked media by an extractor, which performs the inverse of the embedding process.

Data hiding techniques are categorized as non-blind or blind. In non-blind data hiding systems, it is assumed that the original host or cover is available at the decoder. In blind information hiding systems, the decoder does not have access to the original cover signal. Data hiding techniques are also categorized as robust, semi-fragile and fragile, based on their robustness against attacks. Robust techniques are those in which the embedded data can be extracted even if the marked media has suffered various attacks. In fragile data hiding techniques the embedded data cannot be recovered when compression or another small alteration is applied to the marked media. In semi-fragile techniques the embedded data can be extracted if the marked media has gone through compression or other alterations to some extent.


4.2 PROBLEM FORMULATION

Traditionally, data hiding in video uses legacy techniques which were designed for static images. These techniques can also be applied to videos if each video frame is treated as a static image. Most of these techniques modify the coefficients generated by some transformation in the frequency domain in order to hide the desired data. This results in degradation of the video quality when the video stream is decoded. Other techniques modify the motion vectors generated by the motion estimation process in order to hide data. These techniques cause drift errors, which also degrade the video quality. The goal of all of these techniques is for the hidden data not to cause perceptible errors to the viewer. In the context of our research we followed a completely different approach. We moved the cost of the hidden data from the visual quality to the bit rate, i.e. the hidden data do not degrade the visual quality but increase, although slightly, the bit rate.

4.3 SOLUTIONS

In the following sections we present two novel data hiding methods. The data hiding during the inter prediction method (Section 4.4) demonstrates how the inter prediction process can be exploited for hiding data. The main advantage of the method is that it does not affect the visual quality of the video. Similarly, the second data hiding method (Section 4.5) does not affect the visual quality of the video while it is capable of hiding a large amount of data. Moreover, it has the unique capability of re-using the video directly in the compressed domain numerous times. Finally, a scene change detection method (Section 4.4.7) is also presented. This method makes use of the data hiding during the inter prediction. All of the aforementioned methods fall into the category of the applied methods, as described in the Introduction.


4.4 DATA HIDING DURING THE INTER PREDICTION

4.4.1 Literature review

Early video data hiding approaches proposed still image watermarking techniques extended to video by hiding the message in each frame independently [42]. Methods such as spread spectrum are used, where the basic idea is to distribute the message over a wide range of frequencies of the host data. The transform domain is generally preferred for hiding data since, for the same robustness as in the spatial domain, the result is more pleasant to the Human Visual System (HVS). For this purpose the DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) and DWT (Discrete Wavelet Transform) domains are usually employed [43-45].

Recent video data hiding techniques focus on the characteristics generated by video compression standards. Motion vector based schemes have been proposed for the MPEG algorithms [46-48]. Motion vectors are calculated by the video encoder in order to remove the temporal redundancies between frames. In these methods the original motion vector is replaced by another, locally optimal, motion vector to embed data. Only a few data hiding algorithms considering the properties of the H.264 standard [49-51] have recently appeared in the open literature. In [49] a subset of the 4×4 DCT coefficients is modified in order to achieve a robust watermarking algorithm for H.264. In [50] a blind algorithm for copyright protection is based on the intra prediction modes of H.264. In [51] some skipped macroblocks are used to embed data.

In the following sections we propose a new data hiding scheme, which takes advantage of the different block sizes (16×16, 16×8, 8×16, 8×8, etc.) used by the H.264 encoder during the inter prediction in order to hide the desired data. The message can be extracted directly from the encoded stream without knowledge of the original host video. This method is best suited for content-based authentication and covert communication applications.

4.4.2 Data hiding method

The main blocks of the H.264 video encoder are depicted in Figure 4-1. The Temporal Prediction block is responsible for the inter prediction of each inter frame. Our scheme intervenes in the inter prediction process in order to hide the data.


The most important part of inter prediction is the motion estimation process, which aims at finding the "closest" macroblock (best match) in the previously coded frame for every macroblock of the current input frame. Each macroblock within the current frame is then motion compensated, i.e. its best match is subtracted from it, and the residual macroblock is coded. In order to increase the coding efficiency, the H.264 standard, as already described in previous sections, has adopted seven different block types (16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4), and the motion estimation is applied to each of these types. The block type which results in the best coding is selected in the end. The basic idea of the proposed scheme is to force the encoder to choose a block type not in terms of coding efficiency, but according to our data hiding requirements. This can be done in a seamless way, so that the whole encoding process is not disturbed. The procedure is described below in detail.

First we assign a binary code to every block type according to Table 4-1. For simplicity we use only 4 block types. That gives us 2 bits per block. Then we convert the message to be embedded into a binary number and separate the bits into pairs. These pairs are mapped onto the macroblocks which are going to be motion compensated, using the chosen block types, as illustrated in Figure 4-2.

Figure 4-1: Data hiding module within the H.264 encoder.


[Figure: the ASCII message "...6H..." is converted into binary (...0011011001001000...), the bits are separated into pairs (00 11 01 10 01 00 10 00) and the pairs are mapped into block types (16x16 8x8 16x8 8x16 16x8 16x16 8x16 16x16).]
Figure 4-2: Message mapping into block types.

Table 4-1: Binary codes of the block types.

    Block type    Binary code
    16×16         00
    16×8          01
    8×16          10
    8×8           11
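The mapping of Figure 4-2 and Table 4-1 can be sketched in a few lines of C. The program below is self-contained and illustrative (the enum and names are ours); each message byte is split into 2-bit pairs, most significant pair first, and each pair selects the block type to be forced on one macroblock.

    #include <stdio.h>

    typedef enum { B16x16 = 0, B16x8 = 1, B8x16 = 2, B8x8 = 3 } BlockType;

    static const char *name[] = { "16x16", "16x8", "8x16", "8x8" };

    /* emit one block type per 2-bit pair of the message */
    static void map_message(const unsigned char *msg, int len)
    {
        int i, shift;
        for (i = 0; i < len; i++)
            for (shift = 6; shift >= 0; shift -= 2) {
                BlockType t = (BlockType)((msg[i] >> shift) & 0x3);
                printf("%s ", name[t]);
            }
        printf("\n");
    }

    int main(void)
    {
        /* reproduces the "6H" example of Figure 4-2:
         * 16x16 8x8 16x8 8x16 16x8 16x16 8x16 16x16 */
        map_message((const unsigned char *)"6H", 2);
        return 0;
    }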


It is also important to define the data hiding parameters (a small sketch of how they gate the embedding is given at the end of this subsection):

1. Starting frame: It indicates the frame from which the algorithm starts message embedding.

2. Starting macroblock: It indicates the macroblock, within the chosen frame, from which the algorithm starts message embedding.

3. Number of macroblocks: It indicates how many macroblocks within a frame are going to be used for data hiding. These macroblocks may be consecutive or, even better, they may be spread within the frame according to a predefined pattern. Apparently, the more macroblocks we use, the higher the embedding capacity we get. Moreover, if the size of the message is fixed, this number will be fixed, too. Otherwise it can be changed dynamically.

4. Frame period: It indicates the number of inter frames which must pass before the algorithm repeats the embedding. This parameter is very important since it increases the possibility of extracting the message even if some parts of the video sequence are missing. However, if the frame period is too small and the algorithm repeats the message very often, that might have an impact on the coding efficiency of the encoder. Apparently, if the video sequence is long enough, the frame period can be accordingly large.

The encoder reads these parameters from a file. The same file is read by the software that extracts the message, so that the two programs are synchronized.

Figure 4-3 shows the block diagram of the proposed embedding algorithm. As an inter frame enters the Temporal Prediction module, the algorithm decides whether to use it for hiding a message or not, according to the hiding parameters. If the algorithm decides to use the frame for hiding data, it chooses the macroblock candidates and performs the motion estimation on them, forcing the encoder to choose a specific block type according to the message mapping (Figure 4-2). Then it lets the encoder proceed with the encoding as in normal operation. In other words, the algorithm fakes the motion estimation process which the encoder would normally perform.

[Figure: the hiding parameters drive the embedding decision inside the Temporal Prediction module.]
Figure 4-3: Block diagram of the proposed scheme.

The proposed scheme may result in a very high capacity, proportional to the host video sequence size. Its major advantage is that it does not affect the visual quality of the video sequence and, if the hiding parameters are properly controlled, it does not affect the coding efficiency either. In addition to that, it is extremely difficult for the decoder to detect the data hiding interference, and this increases the invisibility of the hidden message. Finally, the message can be extracted directly from the encoded video stream without the need of the original host video sequence.
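The four parameters above could gate the embedding as in the following hedged sketch; the structure and its field names are illustrative assumptions, not taken from the reference encoder, and the modulo test is one possible reading of the frame period.

    typedef struct {
        int start_frame;    /* first inter frame used for embedding     */
        int start_mb;       /* first macroblock within that frame       */
        int num_mbs;        /* macroblocks used for hiding per frame    */
        int frame_period;   /* inter frames between message repetitions */
    } HidingParams;

    /* returns 1 when the given inter frame/macroblock should be forced
     * to a message-selected block type */
    static int carries_message(const HidingParams *p, int frame, int mb)
    {
        if (frame < p->start_frame)
            return 0;
        if ((frame - p->start_frame) % p->frame_period != 0)
            return 0;
        return mb >= p->start_mb && mb < p->start_mb + p->num_mbs;
    }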


4.4.3 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software (JM 11.0) are shown in Table 4-2. The rest of the parameters retained their default values. Note that the inter prediction optimizing parameters were disabled in order to simplify the algorithm implementation.

Several video sequences in QCIF format were tested. Figure 4-4 shows the PSNR (eq. (I-3)) results of each luma inter frame for the foreman sequence. We refer to the inter frames since the message is inserted into these frames only. By default, the H.264 encoder regards only the first frame as intra and the rest as inter frames. The first intra frame has been excluded from Figure 4-4.

From the results we observe that the proposed scheme does not actually affect the PSNR of the inter frames. This was expected, since there is no bit rate constraint and thus our scheme does not provoke any loss of information. We would rather expect to see differences in the total bit rate of the inter frames, due to the fact that the scheme interferes with the optimizing part of the inter prediction. Figure 4-5 shows the bit rate variations (eq. (I-1)) of the inter frames between the original sequences and the marked ones. The bit rate generally increases proportionally to the capacity size.

Based on Figure 4-4 and Figure 4-5, we can assume that if we put a bit rate constraint on the encoder we should expect a PSNR decrease. Figure 4-6 shows the PSNR variations (eq. (I-5)) of the inter frames between the original sequences and the marked ones when we enforce a 40 kbps bit rate constraint on the encoder. A maximum difference of 1.4 dB is experienced.

The small bit rate reduction and the PSNR increase that we see in some cases in Figure 4-5 and Figure 4-6, respectively, are partly due to the stochastic choice of the message and mainly due to the fact that the optimizing parameters of the encoder were disabled, in the sense that the encoder was not able to perform the best possible inter prediction during its normal operation.


The proposed scheme should ideally affect both the PSNR and the bit rate as little as possible. A few approaches, which may result in a great improvement of the proposed scheme, are discussed in Section 4.4.5.

4.4.4 Message Extractor

The message extractor is a piece of software, not necessarily an H.264 decoder, which extracts the hidden message from the marked H.264 bitstream. The message extractor needs to partially decode the bitstream in order to discover the chosen block type of each macroblock of each inter frame. Then, it can form the hidden message according to Table 4-1. Apparently, the message extractor must be aware of the hiding parameters which were used by the encoder.
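The extractor side is the inverse of the mapping sketch of Section 4.4.2. The self-contained C program below is illustrative: a real extractor would obtain the block types from the partially decoded stream, whereas here they are given as an array of Table 4-1 codes and packed back into message bytes.

    #include <stdio.h>

    /* types[i] is 0..3 per Table 4-1 (00, 01, 10, 11) */
    static void extract(const int *types, int n, unsigned char *msg)
    {
        int i;
        for (i = 0; i < n; i++) {
            int byte = i / 4, shift = 6 - 2 * (i % 4);
            if (shift == 6)
                msg[byte] = 0;                       /* start a new byte */
            msg[byte] |= (unsigned char)((types[i] & 0x3) << shift);
        }
    }

    int main(void)
    {
        int types[8] = { 0, 3, 1, 2, 1, 0, 2, 0 };   /* "6H" of Figure 4-2 */
        unsigned char msg[2];
        extract(types, 8, msg);
        printf("%c%c\n", msg[0], msg[1]);            /* prints 6H */
        return 0;
    }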


4.4.5 Further improvements

In our current scheme we used only 4 different block types, namely 16×16, 16×8, 8×16 and 8×8. However, the scheme can also use the sub-partitions of the 8×8 type (8×4, 4×8, 4×4), thus increasing the available bits for coding to 8. Apparently, the additional bits will increase the data capacity while decreasing the number of the "tweaked" macroblocks at the same time. Moreover, the scheme used consecutive macroblocks within a single frame in order to hide the data. An improvement would be to spread the macroblocks within the frame or, even better, within multiple frames. This approach would improve the coding efficiency, since the "motion error" produced by the scheme would not accumulate in one place. In addition to that, the assignment of the binary codes in Table 4-1 could be modified so as to take into account some video statistics. For example, the 16×16 block type appears more often than the other types. The message could therefore be coded using Huffman coding, and the Huffman code with the highest probability could be assigned to the 16×16 block type. The gain of this approach would be that our scheme would most likely choose the block type which would have been chosen by the encoder in normal operation, without our interference.

4.4.6 Conclusions

The proposed method embeds the data during the encoding process and utilizes the advanced inter prediction features of the H.264 encoder. Its main advantage is that it is a blind scheme and its impact on video quality or coding efficiency is almost negligible. It is highly configurable, thus it may result in high data capacities. Finally, it can be easily extended, resulting in better robustness, better data security and higher embedding capacity.

Table 4-2: Configuration parameters of the encoder.

    Parameter                      Value
    Profile                        Baseline
    Frames                         100
    Frame Rate                     30 fps
    Number of reference frames     10
    Motion Estimation Algorithm    Full Search
    RD Optimization                Disabled
    8x8 Sub-blocks                 Disabled
    Rate Control                   Disabled

Figure 4-4: Foreman, PSNR of the luma inter frames.

Figure 4-5: Bit rate variations of the luma inter frames.


Figure 4-6: PSNR variations of the luma inter frames.

4.4.7 Application based on this method: A Data Hiding Scheme for Scene Change Detection

4.4.7.1 Introduction

Video data can be divided into different shots. A shot is a video sequence that consists of continuous video frames for one action. Scene change detection is an operation that divides video data into physical shots. Scene change detection is an important means for video editing, video indexing, error resilience, etc., and it has been recognized as one of the significant research areas in recent years.

Many scene change detection methods can be found in the literature, but some of them are either computationally expensive or ineffective. In the uncompressed domain, the major techniques are based on pixels, histogram comparisons and/or edge difference calculations [52-54]. Recent trends focus on developing scene change detection algorithms directly in the compressed domain, especially for MPEG compressed videos [55-58]. In general, the methods which work in the uncompressed domain are more effective, but not as useful as the methods that work in the compressed domain.

The proposed data hiding scheme can be combined with any existing uncompressed domain scene change detection method in order to enable real time scene change detection in the compressed domain. The idea is to detect the scene change during the H.264 encoding process and hide this information as metadata in the encoded sequence.


It is then easy for a metadata-aware application to detect the scene change and possibly extract other useful information about the scene in the compressed domain. The scheme is based on the technique described in Section 4.4.2 in order to hide the metadata. The data can then be extracted directly from the encoded stream without knowledge of the original host video.

The idea of using metadata hiding in video, audio and images in order to create data channels is not new. Metadata hiding has also been used for error correction [59] and for content adaptation [60].

4.4.7.2 The Proposed Scheme

The proposed scheme is combined with one or more well-known scene change detection techniques. As soon as a scene change has been detected, the scheme inserts an additional inter frame into the video sequence. This extra frame is inter-encoded in such a way that:

1. It marks the end of the scene,
2. It hides useful information about the detected scene, such as the number of the scene frames and the key frames,
3. It does not considerably affect the bit rate and the PSNR of the encoded sequence.

In general, the proposed algorithm consists of two phases, namely the scene change detection and the metadata hiding. These two phases are presented below.

4.4.7.3 Scene Change Detection

As scene change detection is out of the scope of this dissertation, we only present some basic principles of it. Apparently, a scene change detection method which works in the uncompressed domain must be used. Scene detection is based on shot grouping. Shots provide users with better access than an unstructured raw video stream. However, the granularity of the shot is too small for accessing and thus not so useful. Working on a higher level unit of video content, such as a scene, i.e. a group of shots sharing similar visual content, is beneficial to human perception and substantially reduces the data that needs to be handled compared to the shot level structure.


The outcome of the scene change detection contains, among others, the following information:

1. The type of the scene change (cut, dissolve, fade, etc.),
2. The scene duration, measured in seconds or frames,
3. The key frame(s), which can be used to represent the salient content of the scene.

4.4.7.4 Metadata Hiding

The main blocks of the H.264 video encoder are depicted in Figure 4-1. The Temporal Prediction block is responsible for the inter prediction of each inter frame. The H.264 reference encoder, in its default configuration, encodes the first frame as an intra frame and considers the remaining frames to be inter frames. Therefore, our scheme intervenes in the inter prediction process in order to hide the metadata. The basic idea is to insert an extra inter frame (P_X) whenever a scene change is detected. The P_X is exactly the same as the current reconstructed frame (P_C), which is meant to be the reference frame for the next frames. The encoder treats P_X as a normal inter frame and inter encodes it using itself as a reference. Hence, the inter encoding of each macroblock of P_X results in both zero motion vectors and zero residuals. The only difference is that we force the encoder to inter encode the P_X macroblocks choosing the block types not in terms of coding efficiency, but according to our data hiding requirements. As an additional optimization, we do not allow the encoder to use P_X as a reference frame for the next frames. In the end, the overhead of the extra frame's insertion is some header information bits to denote the chosen block types and some payload bits to entropy encode the zero coefficients produced by the quantization stage. The scene change detection information is hidden as metadata in the extra frame using the data hiding method described in Section 4.4.2.

Figure 4-7 shows the block diagram of the proposed data hiding algorithm. As the inter frames enter the Temporal Prediction module, the algorithm decides whether there is a scene change or not. If there is indeed a scene change, the algorithm copies the reconstructed current frame. Then, it allows the inter prediction of this copy using itself as a reference, forcing the encoder to choose specific block types according to the message mapping (Figure 4-2). In the end, it lets the encoder proceed with the encoding as in normal operation. In other words, the algorithm emulates the inter mode decision process which the encoder would normally perform.


There are two remaining issues to be discussed: the metadata capacity and the metadata format.

[Flowchart: when a scene change is detected, the reconstructed current frame acts as its own reference frame; the motion estimation is emulated and the data hiding is applied before the motion compensation.]
Figure 4-7: Block diagram of the proposed scheme.

4.4.7.5 Metadata Capacity

The metadata capacity (MDC) in bits for a single scene change detection is calculated as in eq. (4-1):

$MDC = M_f \times B_p$    (4-1)

where $M_f$ is the number of macroblocks per frame and $B_p$ is the number of available bits per macroblock according to Table 4-1. For example, a QCIF frame (176×144) gives:

$MDC = \frac{176 \times 144}{256} \times 2 = 198$ bits    (4-2)

The metadata capacity per scene change can be increased if we insert more than one extra frame for every scene change detection. However, this may affect the PSNR and the bit rate.
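Eq. (4-1) is easy to verify in code. The following minimal C sketch (the function name is ours) computes the capacity of one extra frame from the frame dimensions and the 2 bits per macroblock of Table 4-1.

    #include <stdio.h>

    static int metadata_capacity(int width, int height, int bits_per_mb)
    {
        int mbs_per_frame = (width / 16) * (height / 16);  /* M_f */
        return mbs_per_frame * bits_per_mb;                /* MDC = M_f x B_p */
    }

    int main(void)
    {
        /* QCIF: (176/16) x (144/16) = 99 MBs -> 99 x 2 = 198 bits, eq. (4-2) */
        printf("%d bits\n", metadata_capacity(176, 144, 2));
        return 0;
    }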


4.4.7.6 Metadata Format

The metadata format is highly related to the metadata capacity, in the sense that the higher the capacity, the more metadata can be hidden in the extra frame. The general metadata format is depicted in Figure 4-8.

Figure 4-8: The metadata format.

Magic String (98 bits): A unique string that identifies the extra frame which marks the scene change. This extra frame is located immediately after the last frame of the scene change.

Start Scene Change (8 bits): A number which indicates the starting frame of the scene change. It is a zero-indexed number which counts backwards starting from the extra frame. In the case of a sudden scene change this number is equal to 1.

Scene Duration (16 bits): The scene duration in frames.

Key Frame(s) (16 bits): One or more numbers, separated by commas, which indicate the key frame(s). These are zero-indexed numbers which count backwards starting from the extra frame.

Other (MDC-138 bits): Other useful scene information.

4.4.7.7 Metadata Extraction

The metadata are extracted from the marked video as described in Section 4.4.4. The extractor needs to entropy decode the stream in order to discover the magic string which indicates a scene change. Then, it can recover the metadata according to the Table 4-1 mapping. The whole process takes place in real time, since the extractor does not need to completely decode the H.264 stream.
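One possible in-memory layout for the record of Figure 4-8 is sketched below. It is purely illustrative: the field widths come from the text above (98 + 8 + 16 + 16 bits plus MDC-138 free-form bits), but the type and member names are our assumptions, and the 98-bit magic string is simply rounded up to 13 bytes.

    #include <stdint.h>

    #define OTHER_BITS(mdc) ((mdc) - 138)   /* free-form payload, in bits */

    typedef struct {
        uint8_t  magic[13];   /* 98-bit magic string (top 6 bits of the
                                 last byte unused)                         */
        uint8_t  start_scene; /* frames back from the extra frame; 1 means
                                 a sudden scene change                     */
        uint16_t duration;    /* scene duration in frames                  */
        uint16_t key_frames;  /* key frame indices, counted backwards,
                                 packed into the 16-bit field              */
        /* OTHER_BITS(MDC) bits of additional scene information follow */
    } SceneMetadata;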


4.4.7.8 Simulation Results

The proposed scheme is in effect a supplement to an existing scene change detection method. In fact, it turns an effective scene change detection method working in the uncompressed domain into an effective scene change detection method working in the compressed domain. Thus, we did not focus on the effectiveness of the scene change detection method itself. Our intention was rather to measure the bit rate and the PSNR distortion due to the insertion of the metadata. This is the reason why the video sequences which were tested did not necessarily have to contain scene changes. Four video sequences in QCIF format were tested. The most important configuration parameters of the reference software are shown in Table 4-3. The rest of the parameters retained their default values.

During our experiments we inserted 5, 10, 15 and 20 extra frames into the testing sequences, which corresponded to 5, 10, 15 and 20 scene changes, respectively. Then we measured the variation of the PSNR (eq. (I-5)) of the luma samples and the bit rate variation (eq. (I-1)) caused by these extra frames. The results are shown in Table 4-4, Table 4-5, Table 4-6 and Table 4-7.

From the results we see that the PSNR was not affected. This was expected, since we disabled the bit rate control and the encoder produced the best video quality at the expense of the bit rate. Indeed, the bit rate showed an increase proportional to the number of scene changes, as expected. However, this increase did not exceed 0.85% on average.

4.4.7.9 Conclusions

The proposed scheme is quite simple and it can be combined with any existing scene change detection method which works in the uncompressed domain, enabling fast scene detection in the compressed domain. In addition, it does not substantially affect either the video quality or the bit rate. The scheme is suitable for video indexing and retrieval.

Table 4-3: Configuration parameters of the encoder.

    Parameter                      Value
    Profile                        Baseline
    Frames                         800
    Q parameter for I & P frames   28
    Number of reference frames     2
    Motion Estimation Algorithm    EPZS
    RD Optimization                Low Complexity
    Rate Control                   Disabled


Table 4-4: PSNR and bit rate variations for 5 scene changes (990 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close             -0.00                   0.32
    highway                   0.00                  -0.04
    grandma                   0.01                   0.03
    mother & daughter        -0.01                   0.23
    Average                   0.00                   0.13

Table 4-5: PSNR and bit rate variations for 10 scene changes (1980 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close              0.01                   0.43
    highway                  -0.00                   0.37
    grandma                   0.01                   0.09
    mother & daughter        -0.00                   0.21
    Average                   0.00                   0.27

Table 4-6: PSNR and bit rate variations for 15 scene changes (2970 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close              0.01                   0.56
    highway                  -0.00                   0.25
    grandma                   0.01                   0.36
    mother & daughter        -0.00                   0.45
    Average                   0.00                   0.41

Table 4-7: PSNR and bit rate variations for 20 scene changes (3960 bits capacity).

    Sequence (800 frames)    PSNR Variation (dB)    Bit Rate Variation (%)
    bridge-close              0.01                   0.79
    highway                  -0.01                   0.75
    grandma                  -0.00                   1.19
    mother & daughter        -0.02                   0.66
    Average                  -0.01                   0.85


4.5 REAL TIME DATA HIDING BY EXPLOITING THE I_PCM MACROBLOCKS

4.5.1 Literature review

There are only a few H.264 data hiding techniques which can work in real time, such as [61], which embeds the watermark bit into the sign bit of the Trailing Ones in the Context Adaptive Variable Length Coding (CAVLC) of H.264. Most of the known data hiding techniques take place during the H.264 encoding [43-48, 63].

The proposed technique, however, can embed the data during the encoding process as well as in the compressed domain. It exploits the I_PCM mode used by the H.264 encoder during the intra prediction in order to hide the desired data. The data can then be extracted directly from the encoded stream without knowledge of the original host video. This method is best suited for content-based authentication and covert communication applications.

4.5.2 Intra mode prediction in H.264

In intra mode, a prediction block P is formed based on previously encoded and reconstructed blocks and is subtracted from the current block prior to encoding. Two primary types of intra coding are supported: Intra_4×4 and Intra_16×16 prediction. Chroma intra prediction is the same in both cases. A third type of intra coding, called I_PCM (or IPCM), is also provided for use in unusual situations. The encoder typically selects, for each block, the prediction mode that minimizes the difference between P and the block to be encoded.

The Intra_4×4 mode is based on predicting each 4×4 luma block separately and is well suited for coding parts of a picture with significant detail. The Intra_16×16 mode, on the other hand, performs prediction and residual coding on the entire 16×16 luma block and is more suited for coding very smooth areas of a picture. In addition to these two types of luma prediction, a separate chroma prediction is conducted. In contrast to previous video coding standards (esp. H.263+ and MPEG-4 Visual), where intra prediction has been conducted in the transform domain, intra prediction in H.264 is always conducted in the spatial domain, by referring to neighboring samples of previously decoded blocks that are to the left of and/or above the block to be predicted.


Since this can result in spatio-temporal error propagation when inter prediction has been used for neighboring macroblocks, a constrained intra coding mode can alternatively be selected, which allows prediction only from intra-coded neighboring macroblocks. In Intra_4×4 mode, each 4×4 luma block is predicted from spatially neighboring samples.

When the fidelity of the coded video is high (i.e., when the quantization step size is very small), it is possible, in certain very rare instances of input picture content, for the encoding process to actually cause data expansion rather than compression. Furthermore, it is convenient for implementation reasons to have a reasonably low, identifiable limit on the number of bits a decoder must process in order to decode a single macroblock. To address these issues, the standard includes an I_PCM macroblock mode, in which the values of the samples are sent directly, without prediction, transformation or quantization. An additional motivation for the support of this macroblock mode is to allow regions of the picture to be represented without any loss of fidelity. However, the I_PCM mode is clearly not efficient. Indeed, it is not intended to be efficient. Rather, it is intended to be simple and to impose a minimum upper bound on the number of bits that can be used to represent a macroblock with sufficient accuracy. If one considers the bits necessary to indicate which mode has been selected for the macroblock, the use of the I_PCM mode actually results in a minor degree of data expansion.

4.5.3 Real time Data Hiding

As explained in Section 4.5.2, an I_PCM macroblock is a macroblock in which the values of the samples are sent directly, without prediction, transformation or quantization. The concept behind our method is to hide the desired data in the low bits of both the luma and the chroma samples of an I_PCM macroblock. Eventually, the hidden data will be embedded into the compressed H.264 stream intact. Simple and straightforward though it is, the proposed method has to face two practical obstacles: the rareness of the I_PCM macroblocks during the encoding and the low efficiency of the I_PCM mode in terms of compression. The latter turns out to be a trade-off between the generated bit rate and the capacity of the hidden data. This issue is discussed in Section 4.5.4 with the help of some experimental results.

Regarding the rareness of the I_PCM macroblocks, we conducted several tests with many well-known video sequences. All of the tests resulted in no I_PCM macroblocks, no matter how low the quantization parameter was set. We therefore concluded that the only safe way to produce I_PCM macroblocks during the encoding is to force the encoder to regard specific macroblocks as I_PCM macroblocks.


The simplified block diagram of the proposed method, integrated into the H.264 encoder, is illustrated in Figure 4-9.

[Figure: the IPCM Decision block intervenes in the mode decision between the intra prediction (I4x4, I16x16, IPCM) and the inter prediction (16x16, 16x8, 8x16, 8x8, P8x8: 8x4, 4x8, 4x4), and the Data Embedding block operates on the selected I_PCM macroblocks.]
Figure 4-9: Simplified block diagram of the proposed method integrated into H.264.

The proposed method adds two new blocks to the H.264 encoder: the IPCM Decision block and the Data Embedding block.

The IPCM Decision block intervenes in the Mode Decision process of the H.264 encoding and forces certain macroblocks to be encoded as I_PCM macroblocks. The decision on which macroblocks are going to be encoded as I_PCM depends on the length of the data to be hidden. The rule is that the I_PCM macroblocks must be enough to cover the hidden data, and they must be spread, to the extent possible, within the H.264 stream. In the rare situation where the encoder decides to encode a macroblock as I_PCM without our intervention, the Data Embedding block will also use this macroblock to hide data.

The Data Embedding block takes action after the IPCM Decision and modifies the low bits of the values of the aforementioned macroblocks in such a way that the modified bits form the hidden data. The "tweaked" I_PCM macroblocks then undergo the lossless entropy encoding, and the hidden data are eventually inserted intact into the generated H.264 stream.


The proposed method is characterized by three features, namely ease of implementation, high data capacity and reusability. The latter allows data hiding in real time directly in the compressed domain. All of these features are described below.

4.5.3.1 Ease of Implementation

The proposed method can be easily integrated into the reference H.264 encoder. It takes place at a very early stage of the encoding process, before any spatial or temporal predictions and before the transformation and the quantization. Therefore, the impact of the proposed method on the encoding process is minimized. Some hints on how to implement the proposed algorithm using the reference H.264 encoder version 14.0 are given below:

1. Add the following snippet just before the "compute_mode_RD_cost" function is called. This code will force the encoder to encode the Nth macroblock of every P slice in I_PCM mode.

    if (img->current_mb_nr == N && img->type == P_SLICE)
        for (i = 0; i < 11; i++)
            enc_mb.valid[i] = 0;

2. Modify the low bits of the values of this macroblock. This is done under the "IPCM" case inside the "RDCost_for_macroblocks" function.

4.5.3.2 High Data Capacity

The data capacity of a video sequence in YUV 4:2:0 format is calculated in accordance with eq. (4-3):

$DataCapacity = LumaCapacity + 2 \times ChromaCapacity$    (4-3)

where

$LumaCapacity = 256 \times N_{IPCM} \times L_{bits}$    (4-4)

$ChromaCapacity = 64 \times N_{IPCM} \times C_{bits}$    (4-5)

where $N_{IPCM}$ is the number of I_PCM macroblocks used for data hiding, $L_{bits}$ is the number of low bits per I_PCM luma sample used for data hiding and $C_{bits}$ is the number of low bits per I_PCM chroma sample used for data hiding. Each macroblock contains 256 luma samples and 64 samples per chroma component.
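Eqs. (4-3) to (4-5) translate directly into a small C function, a minimal sketch with our own names:

    static long data_capacity(long n_ipcm, int luma_bits, int chroma_bits)
    {
        long luma   = 256L * n_ipcm * luma_bits;    /* eq. (4-4) */
        long chroma =  64L * n_ipcm * chroma_bits;  /* eq. (4-5) */
        return luma + 2 * chroma;                   /* eq. (4-3) */
    }

    /* one I_PCM macroblock with 4 low bits everywhere:
     * data_capacity(1, 4, 4) = 1024 + 2 x 256 = 1536 bits */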


According to [62], up to 3 low bits of an 8-bit sample can be modified without causing any visual distortion. However, our experiments showed that even if the 4 low bits are modified, the distortion is imperceptible. This is explained by the fact that we are not dealing with static images but with moving frames at a frame rate of 30 fps. Moreover, we embed no more than one I_PCM macroblock per frame and not in successive frames. Finally, the rest of the non-I_PCM macroblocks have possibly suffered greater distortion due to the intra/inter prediction and to the quantization during the encoding. In order to prove the above, we zeroed the 4 low bits of every luma and chroma sample of the 49th macroblock of the 9th frame of the mobile sequence. The sequence was encoded (QP=28, CABAC) and decoded with the H.264 JM 14.0 codec. Figure 4-10 shows the visual result. An interesting approach would be for the modifiable bits $L_{bits}$ and $C_{bits}$ to be mathematically related to the quantization parameter (QP). For example, for a high QP (>28) we could modify 4 bits, while for a lower QP we could modify 3 bits or fewer. Other combinations are also applicable, such as the use of 3 bits for the luma blocks and 4 bits for the chroma blocks. In the current implementation of the proposed method we modify the 4 low bits of both the I_PCM luma and chroma samples in order to hide the data. Hence, from eq. (4-3), a single I_PCM macroblock ($N_{IPCM} = 1$) with $L_{bits} = C_{bits} = 4$ gives a capacity of 1536 bits. This may be regarded as the upper limit of the capacity per I_PCM macroblock.

[Figure panels: a. Non-marked frame; b. Marked frame.]
Figure 4-10: Comparison of the visual results between a non-marked frame and a marked frame of the mobile sequence.
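The embedding step itself is a plain low-bit substitution. The following hedged C sketch replaces the low bits of each 8-bit I_PCM sample with the next message bits; the buffer layout and names are ours, not the JM encoder's, and next_bits() is an assumed message source declared extern.

    #include <stddef.h>
    #include <stdint.h>

    /* assumed to return the next n message bits */
    extern unsigned next_bits(int n);

    static void embed_in_ipcm(uint8_t *samples, size_t count, int low_bits)
    {
        size_t i;
        uint8_t mask = (uint8_t)((1u << low_bits) - 1u);
        for (i = 0; i < count; i++)
            /* keep the high bits, overwrite the low bits with message bits */
            samples[i] = (uint8_t)((samples[i] & ~mask)
                                   | (next_bits(low_bits) & mask));
    }

    /* for one I_PCM macroblock of a 4:2:0 frame: 256 luma samples plus
     * 2 x 64 chroma samples, i.e. embed_in_ipcm() over 384 samples */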


4.5.3.3 Reusability and Real Time Data Hiding

Most data hiding methods hide the data during the encoding process; thus they are slow and ineffective for real time applications such as covert mobile communication. The proposed method, as explained above, encodes some macroblocks in I_PCM mode and hides the data within their values. After the first pass, once the I_PCM macroblocks have been encoded, the same I_PCM macroblocks can be reused to hide new data, directly in the compressed domain, numerous times and in real time. The reusing process needs neither the original video sequence nor the original encoded stream. Furthermore, it does not cause any significant PSNR or bit rate distortions other than those introduced by the initial data hiding. This is because the method increases the bit rate in a deterministic way and by a fixed amount dependent only on the data capacity (the likely bit rate increase due to the entropy encoding is quite small). After the initial bit rate increment, the bit rate is not significantly affected any further by reusing the compressed I_PCM blocks. Moreover, modifying the low I_PCM bits differently does not cause any perceptible distortions, as explained in Section 4.5.3.2 above. To our knowledge, the proposed method is the only method which exhibits such a property. The I_PCM macroblock reuse works as described below.

The H.264 bitstream is organized in discrete packets, called "NAL units". NAL units are classified into VCL and non-VCL NAL units. The VCL NAL units contain the data that represent the values of the samples in the video pictures, and the non-VCL NAL units contain any associated additional information. The contents of the NAL units are entropy encoded. The reusing process is performed in four steps, as follows (a sketch of the loop follows the list):

    Step 1: Get a NAL unit from the H.264 stream.
    Step 2: Entropy decode the NAL unit and check whether it contains an
            I_PCM macroblock.
    Step 3: In the case of an I_PCM macroblock:
            a. entropy decode the macroblock,
            b. hide the new data in the low bits of the macroblock's values,
            c. entropy re-encode the macroblock.
    Step 4: Go to step 1.

The real time data hiding is achieved by the fact that the method only needs to entropy decode and re-encode the compressed I_PCM macroblocks, thus avoiding the time-consuming normal encoding process.
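The four-step loop above can be sketched in C as follows. The NAL access and entropy-(de)coding helpers are assumptions standing in for a real H.264 parser, declared extern as placeholders; only the control flow mirrors the steps, and embed_in_ipcm() is the sketch from Section 4.5.3.2.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Nal Nal;   /* opaque NAL unit */

    /* assumed helpers standing in for a real H.264 parser */
    extern Nal     *get_next_nal(void);                             /* step 1  */
    extern int      nal_has_ipcm_mb(Nal *nal);                      /* step 2  */
    extern uint8_t *entropy_decode_ipcm_mb(Nal *nal);               /* step 3a */
    extern void     entropy_reencode_ipcm_mb(Nal *nal,
                                             const uint8_t *s);     /* step 3c */
    extern void     embed_in_ipcm(uint8_t *samples, size_t count,
                                  int low_bits);                    /* step 3b */

    void rehide(void)
    {
        Nal *nal;
        while ((nal = get_next_nal()) != NULL) {                    /* step 4 */
            if (!nal_has_ipcm_mb(nal))
                continue;
            uint8_t *s = entropy_decode_ipcm_mb(nal);               /* step 3a */
            embed_in_ipcm(s, 384, 4);   /* step 3b: 256 luma + 2x64 chroma */
            entropy_reencode_ipcm_mb(nal, s);                       /* step 3c */
        }
    }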


Figure 4-11 shows the block diagram of the real time data hiding process.

Figure 4-11: Block diagram of the real time data hiding process.

4.5.4 Simulation results

The simulation tests were executed in the simulation environment described in Appendix II. The most important configuration parameters of the reference software are shown in Table 4-8. The rest of the parameters retained their default values.

The I_PCM macroblocks are expected, by nature, to have a negative impact on the produced bit rate. We conducted several tests in order to investigate this impact. For that purpose we used 300 frames, or 10 sec, of three well-known representative video sequences in QCIF format (YUV 4:2:0): akiyo (Class A), foreman (Class B) and mobile (Class C). The QCIF format (176x144) was chosen because it is very common in mobile applications, where the demand for real time operation is always high. Refer to Appendix II.3 for more details about the testing sequences.

The hidden message was generated by the pseudorandom integer generator function rand, which is provided by the standard C library. The testing procedure, also described in Appendix II.4, was to run the reference encoder with and without our algorithm and then compare the results with respect to the bit rate and the PSNR. We used the bit rate and PSNR variations as comparative metrics, calculated as in eq. (I-1) and eq. (I-5), respectively.
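As a side note on the test setup, the message generation is trivial to reproduce; a sketch using the standard C rand() follows. Only the use of rand() comes from the text; the seed parameter is our assumption for reproducibility, since the thesis does not state one.

#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

/* Fill 'buf' with a pseudorandom hidden message using the standard
 * C library's rand(), as in the tests. The seed is an assumption. */
static void make_message(uint8_t *buf, size_t nbytes, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < nbytes; ++i)
        buf[i] = (uint8_t)(rand() & 0xFF);
}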


In the first series of tests we ran the encoder for different, very common Quantization Parameters (10, 20 and 30) and for different data capacities (3072 - 18432 bits). Moreover, we did not apply any bit rate constraints. In this way the PSNR remained practically unaffected, with the exception of the QP = 10 test case, where the PSNR showed some degradation of less than 0.1 dB. The results are shown in Figure 4-12, Figure 4-13, Figure 4-14, Figure 4-15, Figure 4-16 and Figure 4-17.

From the results we see that the bit rate increase is proportional to the data capacity and to the quantization parameter. This is expected because the higher the quantization parameter is, the lower the bit rate produced by the normal reference encoding. Our method, on the other hand, increases the bit rate proportionally to the number of I_PCM macroblocks, i.e. to the data capacity. For QP=20 the method results in a continuous bit rate increase. It is notable that the akiyo sequence presents much higher bit rate variations than the other two sequences. This is explained by the fact that akiyo is a class "A" sequence, i.e. it has low spatial detail and a low amount of movement. That means that both the intra and the inter prediction leave small residuals during the normal encoding by the reference encoder, which eventually results in a very low bit rate. On the other hand, our modified encoder always produces the required I_PCM macroblocks, which increase the bit rate. Therefore, we conclude that for class "A" sequences and for high QPs (>20) the method cannot hide more than 5000 bits efficiently.

In the second series of tests we enabled the bit rate control mechanism of the encoder and set bit rate constraints of 60 kbps and 50 kbps, which are considered low bit rates by current Internet standards. Our purpose was to investigate the performance of our method when the marked H.264 bitstream has to be transmitted over a channel with limited bandwidth. With the bit rate control enabled, the quantization parameters were automatically adjusted by the encoder, which had to generate a bit rate lower than or equal to the constraint. In this way the bit rate was practically unaffected, as shown in Table 4-9. Apparently, the cost of doing so was put on the PSNR.


Figure 4-18, Figure 4-19 and Figure 4-20 show the PSNR variations when the bit rate control was enabled.

From the results we see that the overall performance became smoother when we enabled the bit rate control. The PSNR decreases proportionally to the data capacity, but the maximum decrement does not exceed 0.43 dB and 0.44 dB for the 60 and the 50 kbps constraints, respectively, at a capacity of 18432 bits. These decrements were observed in the akiyo sequence, as expected. The result is regarded as an acceptable trade-off, taking into account the low bit rate constraints and the small number of frames that were used (300).

In the third series of tests we compared our method with the method proposed by Y. Hu et al. [63]. Four QCIF sequences were used (bridge-close, grandma, news and silent; 199 frames each). The tests were performed using the H.264 Main Profile configuration (with RDO, CABAC, QP=28 and 30 frames/sec) and with a GOP structure of "IBPBPBPBPB". The results are shown in Table 4-10. PMC denotes the maximum capacity of the proposed method, while 11MC denotes the maximum capacity of the method of [63], which was used in the comparison tests.

Finally, our experiments showed that the proposed algorithm did not introduce serious delays in the encoding process. On the contrary, in most cases the encoding became faster because the encoder did not have to fully encode the I_PCM macroblocks.

Based on all of the above results, the overall conclusion is that the proposed method manages to hide 18 Kbits of data in just 300 frames, or 10 sec, of a wide range of video sequences in real time. It works better for bit rates around 60 kbps and higher, where the maximum PSNR degradation does not exceed 0.43 dB for the highest capacities. For that rate and for lower capacities, up to 10 Kbits, the proposed method has a very small impact on the PSNR.

4.5.5 Message extractor

The message extractor is a software tool, not necessarily an H.264 decoder, which extracts the hidden message from the marked H.264 bitstream. The message extractor works in a way similar to the algorithm that reuses the I_PCM macroblocks for data hiding and is described below:


Step  Action
1     Get a NAL unit from the H.264 stream
2     Entropy decode the NAL unit and check whether it contains an I_PCM macroblock
3     In case of an I_PCM macroblock:
      a. Entropy decode the macroblock
      b. Read the low bits of the macroblock's values in order to extract the message
4     Go to step 1

4.5.6 Further improvements

The fact that the proposed method inserts raw information (I_PCM macroblocks) into the H.264 bitstream opens up a lot of potential improvements. The I_PCM macroblock can be regarded as part of a still image. Therefore, many data hiding and watermarking techniques that work in the spatial domain can be applied [64].

4.5.7 Conclusions

The proposed data hiding method takes place during the encoding process and exploits the I_PCM coded macroblocks in order to hide the data. However, the same I_PCM macroblocks can be reused to hide new data, directly in the compressed domain, numerous times and in real time. The method is a blind scheme and it can achieve relatively high data capacities without considerably affecting either the video quality or the coding efficiency.

Table 4-8: Configuration parameters of the encoder.

Parameter           Value
Profile             Main
Number of Frames    300 (10 sec)
Frame Rate          30 fps
RD Optimization     High Complexity Mode
Motion Estimation   Simplified UMHexagonS
Intra Period        0 (only the first frame is intra)
Symbol Mode         CABAC


Table 4-9: Bit rate variations under bit rate control.

Sequence   Average Bit Rate Variation (%)
           50 kbps    60 kbps
Akiyo       0.05       0.06
Foreman    -0.02      -0.28
Mobile      0.03      -0.04

Table 4-10: Comparison results between the proposed method and [63].

Sequence      PMC/11MC      PSNR Variation (dB)    Bit rate Variation (%)
                            Proposed    [63]       Proposed    [63]
Grandma       15360/12352   -0.01       -0.08      2.93        3.72
Bridge-close  15360/11748    0.01       -0.04      1.15        2.90
News          18432/9972    -0.01       -0.01      3.80        3.23
Silent        18432/17368    0.02       -0.02      3.59        4.14

Figure 4-12: Akiyo - bit rate variation vs data capacity for different QPs.

Figure 4-13: Akiyo - PSNR variation vs data capacity for different QPs.


Figure 4-14: Foreman - bit rate variation vs data capacity for different QPs.

Figure 4-15: Foreman - PSNR variation vs data capacity for different QPs.

Figure 4-16: Mobile - bit rate variation vs data capacity for different QPs.


Figure 4-17: Mobile - PSNR variation vs data capacity for different QPs.

Figure 4-18: Akiyo - PSNR variation vs data capacity under bit rate control.

Figure 4-19: Foreman - PSNR variation vs data capacity under bit rate control.


Figure 4-20: Mobile - PSNR variation vs data capacity under bit rate control.


5 Bitrate Transcoding

5.1 INTRODUCTION

Video transcoding [75] performs one or more operations, such as bit rate and format conversions, to transform one compressed video stream into another. Transcoding can enable multimedia devices of diverse capabilities and formats to exchange video content on heterogeneous network platforms such as the Internet. One scenario is delivering a high-quality multimedia source (such as a DVD or HDTV) to various receivers (such as PDAs, Pocket PCs and fast desktop PCs) over wireless and wireline networks. Here, a transcoder (placed at the transmitter, the receiver or somewhere in the network) can generate appropriate bitstreams directly from the original bitstream without having to decode and re-encode. To suit the available network bandwidth, a video transcoder can perform dynamic adjustments of the bit rate of the video bitstream without additional functional requirements on the decoder. Another scenario is a video conferencing system on the Internet in which the participants may be using different terminals. Here, a video transcoder can offer dual functionality: provide video format conversion to enable content exchange and perform dynamic bit rate adjustment to facilitate proper scheduling of network resources. Thus, video transcoding is one of the essential components for current and future multimedia systems that aim to provide universal access.

5.2 PROBLEM FORMULATION

In this section we describe the target application and the issues that we need to address. The application is illustrated in Figure 5-1. Movies are stored in a media-streaming server. The server is connected to a gateway through an error-free high speed channel, e.g. Ethernet. A gateway is a network device that acts as an entrance to another network.


A movie is transmitted to the gateway and then to various devices with wireless capabilities, such as smart phones, PDAs, Tablet PCs and laptops, which may belong to the same or different networks or subnets.

Apparently, the gateway must perform some bit rate control in order to cope both with the bandwidths of the different networks and with network congestion. A universal approach for the gateway is to apply a bit rate transcoding technique. This assumes that the gateway is media aware, i.e. it recognizes that the input data is a media source. However, the gateway may be a standalone device, even an embedded one, with limited CPU and memory capabilities. Therefore, transcoders that are complicated with regard to CPU and buffering requirements cannot be implemented. On the other hand, low complexity transcoders, such as the open-loop transcoders, generate drift errors. Real time operation of the transcoder is, of course, mandatory. The proposed technique addresses all of the above issues.

Figure 5-1: The target application.

5.3 SOLUTION

In the following sections we present a novel bit rate transcoding technique. The basic concept behind our method is to drop frames in order to reduce the bit rate. Frame dropping is the obvious and easiest solution for a bit rate transcoder. As a matter of fact, dropping frames is inevitable when the frames cannot be transmitted (what else can the transmitter do but drop them?).


However, the dropped frames create a gap in the H.264 bitstream and desynchronize the H.264 decoder from the encoder, causing perceptible errors known as drift errors. These errors are accumulated and propagated in time (until the decoder reaches an intra frame), causing further distortions in the video. One way to avoid frame dropping altogether is to reduce the bit rate by applying a bit rate transcoding technique, such as decoding and re-encoding the video with a higher quantization step. Apparently, such techniques are time consuming compared to frame dropping, as we explain in Section 5.4.1.1, and hence they do not serve our target application well. Our method, however, proves that frames can be dropped in a controllable manner, causing imperceptible errors in the transmitted video. The method falls in the applied methods category, as this was defined in the Introduction.


5.4 BIT RATE TRANSCODING BY DROPPING FRAMES IN THE COMPRESSED DOMAIN

5.4.1 Literature review

Due to the nature of the H.264 standard [1] (high compression, low bit rate, etc.), H.264-encoded sequences are well suited to applications such as Video On Demand and video streaming over the Internet or other networks. On the other hand, rate control is an important issue in video streaming applications for both wired and wireless networks. Rate control techniques fall into the video transcoding category when they do not take place during the encoding of the original sequence. A typical scenario is delivering a high quality media source to various receivers (PCs, cell phones, PDAs, etc.) over wireless and wireline networks. The rate controller, hereafter the transcoder, must generate appropriate bitstreams directly from the original bitstream in order to accommodate different network bandwidths. Another scenario is delivering the media source to a receiver that supports only a lower frame rate. In that case the transcoder must reduce the frame rate. In our case the target application is a bit rate transcoder of low complexity and with low memory requirements, which can control the bit rate of H.264-encoded movies in the compressed domain in real time.

Basically, there are two ways of controlling the bit rate in the compressed domain: temporal transcoding, e.g. dropping frames, and bit rate transcoding on a per frame basis. Several bit rate and temporal transcoding techniques have been proposed in the past. An overview of various MPEG transcoding techniques is given in [65, 66]. Most of them were presented many years ago and, although they are applicable to H.264 video sequences, they do not take into account the special characteristics of the H.264 standard. Here, we review some representative bit rate and temporal transcoders, noting their limitations when they are applied to H.264 sequences.

5.4.1.1 Bit Rate Transcoders

In general, there exist four bit rate transcoding categories:

1. Cascaded pixel-domain transcoders: These require decoding and re-encoding of the bitstream.


2. Transform-domain transcoders: These require partial decoding of the bitstream, up to the inverse transformation of the coefficients.

3. Open-loop transcoders: These require entropy decoding and possibly rescaling and re-quantization of the coefficients.

4. A special category, where precautions are taken during the encoding with regard to the subsequent bit rate transcoding, such as hiding of data, detection of regions of interest, extraction of side information, etc.

Lefol et al. [67] evaluate the performance of some known bit rate transcoders when these are applied to H.264 bitstreams. The conclusion is that all of the open-loop transcoders result in severe drift errors. Drift error is defined as the error caused by the encoder-decoder prediction mismatch and is explained in Section 5.4.2.1. In order to avoid such errors, transcoders of the other categories must be used. However, our target application requires low complexity, low memory and real time implementation. Therefore, we exclude the cascaded pixel-domain transcoders since they cannot work in real time. We also exclude the transform-domain transcoders. These are supposed to work in real time when applied to previous standards, e.g. MPEG-2. This may not be true in H.264, especially if the CABAC entropy encoding is used. Besides, their implementation has some complexity, caused mainly by the different inter modes used by the H.264 encoder. Finally, we exclude the fourth category, assuming that the encoder takes no transcoding-related precautions. In conclusion, only the open-loop transcoders meet our requirements, but they cause errors that may lead to unacceptable degradation of the video quality.

5.4.1.2 Temporal Transcoding

The simplest temporal transcoding technique is random frame dropping. This causes severe drift errors, as will be shown in Figure 5-4. Several techniques that try to address this problem have been proposed [65]. The concept behind these techniques is illustrated in Figure 5-2. Consider three consecutive frames at times n-2, n-1 and n. Note that frame n is inter coded and uses n-1 as its reference. For some reason the middle frame n-1 is dropped. As a consequence, the macroblocks (MBs) in frame n will lose their references. Let us examine how a basic technique deals with this problem for a single (current) MB. The best match area referenced by the motion vector a of the current macroblock in frame n overlaps with at most four MBs in its reference frame n-1. Since frame n-1 is dropped, the purpose of this technique is to discover the most suitable motion vector b, which points from the overlapped area in frame n-1 to its best match in the non-dropped frame n-2. Eventually, the composed motion vector c replaces the original vector a.
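A sketch of the vector composition just described follows: vector a anchors the current MB on the dropped frame n-1, vector b carries the dominant overlapped MB of n-1 onward to n-2, and their sum c re-anchors the prediction on the surviving frame. This is our illustrative reading of the technique, not code taken from [65].

/* Quarter-pel motion vector, as used by H.264. */
typedef struct { int x, y; } MV;

/* Compose the current MB's vector a (frame n -> dropped frame n-1)
 * with vector b (overlapped MB in n-1 -> frame n-2); the result c
 * points from the current MB directly to the surviving frame n-2. */
static MV compose_mv(MV a, MV b)
{
    MV c = { a.x + b.x, a.y + b.y };
    return c;
}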


Figure 5-2: The basic concept behind the temporal transcoding.

These techniques work sufficiently well in the previous standards, where the size of the MB is fixed (16×16 for the luma block) during the inter prediction. However, H.264 has various inter modes with sizes of 16×16, 16×8, 8×16, 8×8 and sub8×8. For sub8×8 there are a further four sub-partitions, namely sub8×8, sub8×4, sub4×8 and sub4×4. Such a wide choice of blocks increases the complexity of the aforementioned techniques dramatically. The necessity of keeping the transcoder's complexity low is explained in Section 5.2. Based on the preceding review, we conclude that currently there is no transcoder that serves our target application, as described in Section 5.2, well. In this dissertation we propose a new low complexity bit rate transcoder, which works directly in the compressed domain in real time and either eliminates drift errors or causes only non-perceptible ones.

5.4.2 Main concepts

In this section we describe the main concepts on which the proposed technique is based, such as the H.264 prediction models, the frame types, the Network Abstraction Layer and the shot boundary detection. We also give a detailed description of the drift error, since this is the main problem that the proposed technique addresses.


5.4.2.1 Drift Error

Figure 5-3: The block diagram of the H.264 encoder-decoder.

The block diagram of an H.264 encoder and decoder is illustrated in Figure 5-3. A video inter frame X(n) is predicted from its reference frame and only the prediction differences are coded. As shown in Figure 5-3, the encoder also embeds a decoder in it, except for the entropy decoding part. The reason is that the encoder, in order to perform the motion estimation, must use as a reference the same reconstructed frame that is used by the decoder in order to perform the motion compensation. For example, the current frame X(n) will be predicted from the reconstructed frame X'(n-1) rather than from the original frame X(n-1). The same frame X'(n-1) will be used by the decoder to reconstruct the frame X'(n). If the frame X'(n-1) is either modified or missing from the H.264 bitstream, then drift errors are generated.


These errors accumulate and cause the video quality to deteriorate over time, until an intra frame is reached. The visual side-effect of the drift error is illustrated in Figure 5-4.

Figure 5-4: Drift errors (in circles) when frame 31 of the tennis sequence is missing from the H.264 bitstream. The sequence (in SIF format) was encoded using the JM16.2 reference H.264 encoder (QP = 28). The bitstream was decoded by the JM16.2 reference decoder and the missing frame (31) was concealed by the "Frame Copy" method.

5.4.2.2 Prediction Models

Intra Prediction: H.264 introduces a new model of intra prediction, also known as spatial prediction, where a macroblock is predicted from its neighbors. The macroblock is then subtracted from its prediction. The residuals are transformed using an integer transform and are quantized. An intra prediction is formed either for the complete macroblock or for each 4×4 block of luma samples (and the associated chroma samples) in the macroblock. Refer to Section 4.5.2 for more details.

Inter Prediction: Inter prediction, also known as temporal prediction, creates a prediction model where a macroblock is predicted from a previously encoded video frame using block-based motion compensation. Important differences from earlier standards include the support for a range of block sizes (from 16×16 down to 4×4) and fine sub-sample motion vectors. Refer to Section 3.1 for more details.

5.4.2.3 Frame (and Slice) Types

As explained in Section 2.6, there are three types of frames with regard to the prediction model that is applied to them, namely I, P and B. A brief description of these types is also given below, noting their differences from previous standards. However, there are also two other types, IDR and D (explained below). These two types are not defined by the way they are predicted but rather by the way they are decoded.


I-Frame: (Table 2-2) The macroblocks in an I frame can be predicted only using the intra prediction model.

P-Frame: (Table 2-2) The macroblocks in a P frame are predicted using the inter prediction model. The macroblocks are predicted from one or more (usually up to five) reference frames before the current frame. Another substantial difference from previous standards is that the H.264 encoder also allows the intra prediction of a macroblock in a P frame. The decision on which model will be used is based on the Rate-Distortion Optimization method, meaning that the encoder will choose intra instead of inter prediction if this results in better compression.

B-Frame: (Table 2-2) A B frame is in principle the same as a P frame. However, each inter-predicted macroblock in a B frame may be predicted from one or two reference frames before and after the current frame in temporal order. The difference from previous standards is that the H.264 encoder allows the B frames to be used as reference frames.

IDR-Frame: The H.264 standard introduces the Instantaneous Decoder Refresh (IDR) frame. The IDR is the same as an I frame. However, the subsequent P or B frames of an IDR frame are not allowed to use frames prior to the IDR as references. The first frame in a coded video sequence is always an IDR frame. Besides, the H.264 encoder may inject periodic IDR frames into the bitstream as an error resilience tactic, because the IDR frames stop the accumulation of the temporal prediction errors, such as the drift errors. Of course, the use of IDR frames results in an increased bit rate.

D-Frame: The H.264 standard also introduces the Disposable (D) frame. The D frame is a frame that cannot be used as a reference for other frames. In previous standards the D frame was synonymous with the B frame. Since the H.264 standard allows a B frame to be used as a reference, a distinct D frame had to be defined. The H.264 encoder may generate periodic D frames. However, this affects the inter prediction by limiting the choice of the reference frames. As a result, using D frames penalizes the bit rate. The D frames play a key role in the proposed technique.

5.4.2.4 Network Abstraction Layer (NAL)

H.264 makes a distinction between a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL) [15].


The purpose of specifying the VCL and NAL separately is to distinguish between coding-specific features (at the VCL) and transport-specific features (at the NAL). A coded video sequence is represented by a sequence of NAL units that can be transmitted over a packet-based network or a bitstream transmission link, or stored in a file. Each NAL Unit (NALU) contains a header and a set of data corresponding to coded video data (RBSP), as is shown in Figure 2-2. In the context of this chapter, the term frame often implies a NALU that contains a frame; the opposite also holds, i.e. the term NALU implies a frame. Figure 5-5 shows the first octet of the NALU header.

Figure 5-5: First octet of the NAL Unit (NALU) header.

What is interesting is that the NALU header contains useful information about the video data contained in the NALU, such as:

The F or forbidden_zero_bit was included to support gateways. The H.264 specification declares a value of 1 as a syntax violation.

The NRI or nal_ref_idc signals the relative importance of the NALU. A value of 00 in binary format indicates that the content of the NALU is not used to reconstruct reference frames for inter frame prediction.

The Type or nal_unit_type specifies the NALU payload type as defined in [1] and is also shown in Table 5-1.

The importance of the NALU header is that it reveals information about the video data without requiring it to be decoded. The proposed technique takes advantage of this information in order to decide which frames to drop.
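Since the whole transcoder hinges on this one octet, a small C sketch of its layout may help; the field positions (F in bit 7, NRI in bits 6-5, Type in bits 4-0) follow the H.264 specification.

#include <stdint.h>

/* Decompose the first octet of a NALU header (Figure 5-5):
 * bit 7 = F (forbidden_zero_bit), bits 6-5 = NRI (nal_ref_idc),
 * bits 4-0 = Type (nal_unit_type). */
typedef struct {
    unsigned f;    /* 1 bit  */
    unsigned nri;  /* 2 bits */
    unsigned type; /* 5 bits */
} NaluHeader;

static NaluHeader parse_nalu_header(uint8_t first_octet)
{
    NaluHeader h;
    h.f    = (first_octet >> 7) & 0x01;
    h.nri  = (first_octet >> 5) & 0x03;
    h.type =  first_octet       & 0x1F;
    return h;
}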


Table 5-1: NAL unit type codes.

nal_unit_type   Content of NAL unit
0               Unspecified
1               Coded slice of a non-IDR picture
2               Coded slice data partition A
3               Coded slice data partition B
4               Coded slice data partition C
5               Coded slice of an IDR picture
6               Supplemental enhancement information (SEI)
7               Sequence parameter set
8               Picture parameter set
9               Access unit delimiter
10              End of sequence
11              End of stream
12              Filler data
13..23          Reserved
24..31          Unspecified

5.4.2.5 Shot Boundary Detection

A sequence of frames captured by one camera in a single continuous action in time and space is referred to as a video shot. Shot boundary detection is the automated detection of the transitions between shots in video sequences. The shot is a key element of movies. Usually there are great dissimilarities between two successive shots. The proposed method takes advantage of these dissimilarities in order to decide which frames to drop.

5.4.3 Bit Rate Transcoder

The proposed method is mainly based on the disposable (D) frame concept (see Section 5.4.2.3). A D frame can be dropped without generating drift error because it is not used as a reference for other frames. The problem is that the H.264 encoder does not generate D frames by default. Even if it does, the frequency with which the D frames appear in the bitstream may not be the desirable one. However, many other frames can be regarded as D or "almost D" frames, meaning that only a few macroblocks within these frames are used as references for inter-predicted macroblocks in other frames. These frames can also be dropped, causing a non-perceptible drift error. There are also other frames that can be dropped under certain conditions. All of the aforementioned frames will be referred to collectively as droppable.

The purpose of the proposed transcoder is twofold. First, it must discover these droppable frames and signal them in the NALU header (Figure 5-5). Then it must drop the marked NALUs according to the bit rate requirements of the channel. Figure 5-6 shows the block diagram of the proposed method. There are two distinct components, namely the Droppable Frame Generator (DFG) and the Bit Rate Controller (BRC).


The main advantage of separating the two components is that they can be implemented in different devices, as shown in Figure 5-7. The BRC can be implemented in a gateway of limited capabilities, while the DFG can be implemented in a powerful streaming server. In that way one can correct or implement more advanced techniques for generating droppable frames whilst maintaining the same simple software in the gateway. After all, a gateway cannot be easily upgraded.

Figure 5-6: The block diagram of the proposed bit rate transcoder.

Figure 5-7: The implementation of the bit rate transcoder in separate devices.


5.4.3.1 Droppable Frame Generator (DFG)

The DFG takes into account the special encoding characteristics of the H.264 encoder in order to choose the frames that are candidates for being dropped. When these characteristics do not generate a sufficient number of candidates, the DFG applies several rules in order to increase their number.

Rule 1: If only a few (or, even better, no) macroblocks within a frame are used as references by other frames, then this frame can be dropped, causing a non-perceptible drift error in the decoded video. The proposed technique detects such frames using a shot boundary detection approach. The concept behind these methods is that within an H.264 sequence there is a strong inter-frame correlation unless significant changes occur. As a consequence, the different prediction types and the direction of the reference frames in a frame may indicate severe dissimilarities between consecutive frames, i.e. a shot boundary. A number of shot boundary detection methods for H.264-encoded sequences have been studied [69, 70, 71]. In our work we applied the method described in [69]. This method relies on the MB prediction types, the MB partitions and the display numbers of the reference pictures in order to detect a shot boundary. Figure 5-8 illustrates the possible positions of a shot boundary within an H.264 sequence.

Figure 5-8: Possible positions of a shot boundary.


When a shot boundary occurs at time t and the next frame F(t+1) is intra (I), H.264 will encode the frame's macroblocks exclusively as intra macroblocks, as already explained in Section 2.6. If F(t+1) is an inter frame (P), the H.264 encoder will prefer to spatially predict its macroblocks as if they were intra macroblocks. The reason is that the previous frame F(t) belongs to a different shot and bears little resemblance to F(t+1). Therefore, the spatial prediction is likely to compute smaller residuals than the temporal prediction. If F(t+1) is B, the H.264 encoder will prefer to use the next frame F(t+2) as a reference (Figure 5-8), for the same reason as in the previous (P) case. In any case, the frames that precede the shot boundary are not used as references by the frames that follow. Thus they can be dropped. The advantages of this method, in conjunction with [69], are the following:

- It works in the compressed domain.
- It is fast, because it requires only entropy decoding of the NALU. Moreover, it needs to know only the different macroblock types and the corresponding display numbers of the reference frames.
- It does not require much buffering, because it is applied on a per frame basis.
- A shot boundary leads to high peaks in the bit rate, mainly due to the spatial prediction that takes place at that time, as explained above and illustrated in Figure 5-8. As a matter of fact, there are shot boundary detection methods that examine the bit rate peaks in order to detect a shot boundary. Therefore, the shot boundary detection will probably fire exactly when rate control is actually needed.

Rule 2: If a frame is an IDR, then the previous frames are surely not referenced by the frames that follow. These frames are considered to be D frames and they can be safely dropped without affecting the visual quality of the decoded video.

Rule 3: If a frame is I and the number of the reference frames is one, then again the previous frames are D frames, since there is no way for them to be used as references.

The flow chart of the DFG for rules 1, 2 and 3 is shown in Figure 5-9; a sketch of the corresponding checks follows below.
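As promised above, a sketch of the rule 2 and rule 3 checks: both decisions need nothing more than the NALU type (Table 5-1) and, for rule 3, two fields obtained by entropy decoding the slice header. The function and parameter names are hypothetical.

enum { NAL_SLICE_IDR = 5 };  /* nal_unit_type 5, Table 5-1 */

/* Nonzero when the frame preceding the current NALU may be signalled
 * as droppable: rule 2 fires on an IDR picture, rule 3 on an I slice
 * encoded with a single reference frame
 * (num_ref_idx_l0_active_minus1 == 0). */
static int rule2_or_rule3(unsigned nal_unit_type,
                          int slice_is_intra,
                          int num_ref_idx_l0_active_minus1)
{
    if (nal_unit_type == NAL_SLICE_IDR)
        return 1;                                      /* rule 2 */
    return slice_is_intra
        && num_ref_idx_l0_active_minus1 == 0;          /* rule 3 */
}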


Figure 5-9: Flow chart of rules 1, 2 and 3, as applied by the Droppable Frame Generator (DFG).

Rule 4: The frame X(k-r) is a D-like frame if X(k-r+1) is an I frame and, for every frame X(k-n) with 0 <= n <= r-2, either X(k-n) is an I frame or

$$T_{ref} \;\le\; \sum_{i=0}^{r-(n+1)} \sum_{j=0}^{N_{k-n}} idx(i,j) \;\le\; N_{k-n} \qquad (5\text{-}1)$$

where I denotes an intra frame, r is the number of reference frames used during the encoding, N_{k-n} is the total number of luma blocks and sub-blocks in frame X(k-n), idx(i,j) is an indicator that equals 1 when the list-0 reference index of block j equals i (and 0 otherwise), and T_ref is a threshold used to characterize a D-like frame.


In this work we use T_ref = N_{k-n}.

The condition of eq. (5-1) is very commonly met when the motion between successive frames is slow and/or smooth. Rule 4 is effectively the same as rule 3 for r = 1. Figure 5-10 illustrates this rule.

Figure 5-10: Example of rule 4. Frame X(k-4) is droppable because it is not used as a reference by the following frames.

Rule 5: Frame X(k) can be dropped if the following condition is met:

$$T_m \;\le\; \sum_{i=0}^{P} MB_P(i) \;+\; \sum_{j=0}^{M} MB_{16 \times 16}(j) \;\le\; N \qquad (5\text{-}2)$$

where N is the total number of inter-predicted MBs in frame X(k), MB_P denotes a skipped MB, MB_{16x16} denotes a 16×16 MB with a (0, 0) motion vector, i.e. a static MB, and T_m is a threshold used to denote a droppable frame. In this work T_m = 3N/4.

The condition of eq. (5-2) holds for frames that have large static areas and almost no motion, such as the frames of a surveillance camera. Rule 5 mostly indicates a frame that is almost identical to the previous one and can thus be replaced by it.

Rule 6: A frame is droppable if the NRI field in its NALU header equals zero. This is actually arranged during the encoding by setting the encoder's parameter DisposableP to one, which generates a bitstream where every second P frame is disposable and thus droppable.
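Of the rules above, rule 5 is the most mechanical to test; the following sketch counts skipped and static 16×16 macroblocks against the threshold T_m = 3N/4. The MbInfo layout is hypothetical, not a JM structure.

/* Per-macroblock summary, assumed to be filled in after entropy
 * decoding the frame's macroblock layer. */
typedef struct {
    int is_skipped;     /* P_Skip macroblock             */
    int is_16x16;       /* 16x16 inter partition         */
    int mvx, mvy;       /* motion vector of the 16x16 MB */
} MbInfo;

/* Rule 5: droppable when skipped MBs plus static 16x16 MBs (zero
 * motion vector) make up at least 3/4 of the N inter-predicted MBs. */
static int rule5_droppable(const MbInfo *mb, int n_inter)
{
    int count = 0;
    for (int i = 0; i < n_inter; ++i) {
        if (mb[i].is_skipped ||
            (mb[i].is_16x16 && mb[i].mvx == 0 && mb[i].mvy == 0))
            ++count;
    }
    return 4 * count >= 3 * n_inter;   /* count >= (3/4) * N */
}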


5.4.3.1.1 Signaling of the droppable frames

It is important that the DFG signals the droppable frames in the NALU header without violating its syntax. The DFG does so by modifying the NALU header (Figure 5-5) as follows:

If the NALU contains a droppable frame:
    Set NRI = 0
    Set F = 1 (rules 1, 2, 3, 4)
    Set F = 0 (rules 5, 6)

Here NRI = 0 means that the current NALU contains a droppable frame, and F = 1 indicates that none of the frames preceding the current frame is used as a reference by the frames that follow. For droppable frames detected by rules 5 and 6 we set F = 0 in order to protect the previous frames, since rules 5 and 6 detect one droppable frame at a time. Note that F = 1 normally denotes a syntax violation within the NALU, but this bit is used only by gateways and has no impact on the decoding. Therefore, we can safely use it for our purposes. The BRC will later reset this flag, although this is not strictly necessary.
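The marking itself touches only the first octet of the NALU header; a minimal sketch, with the bit positions taken from Figure 5-5:

#include <stdint.h>

/* Mark a droppable frame in the first octet of its NALU header:
 * NRI (bits 6-5) is cleared for every droppable frame; F (bit 7) is
 * raised for rules 1-4 and left at 0 for rules 5 and 6, which
 * protect the preceding frames. */
static uint8_t mark_droppable(uint8_t first_octet, int rule)
{
    first_octet &= 0x9F;                  /* NRI = 00          */
    if (rule >= 1 && rule <= 4)
        first_octet |= 0x80;              /* F = 1             */
    else
        first_octet &= 0x7F;              /* F = 0 (rules 5,6) */
    return first_octet;
}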


5.4.3.2 Bit Rate Controller (BRC)

The BRC drops frames based on the bit rate constraints of the transmission channel. Here we focus on the frame dropping itself, assuming that the bit rate constraint is known. The BRC is as simple as a parser of the NALU header. As shown in Figure 5-11, when NRI=0 and F=1 the BRC must drop k frames in order to meet the bit rate constraint. The variable k is crucial for the bit rate reduction as well as for the visual quality of the decoded video. The larger k is, the greater the achieved bit rate reduction. However, this comes at the expense of the video quality; the visual side effect of a large k is usually an abrupt shot change and/or an abnormal object movement due to the k missing frames. The value of k clearly depends on the number of frames between two successive droppable frames: if droppable frames are sufficiently frequent within a sequence, k stays small and close to one. In general, k is defined as in eq. (5-3):

$$1 \le k \le N_D \qquad (5\text{-}3)$$

where N_D is the number of frames between two successive droppable frames. This is the reason why the BRC needs to buffer up to N_D + 1 frames.

Figure 5-11: The flow chart of the Bit Rate Controller (BRC).
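To round off the picture, a sketch of the BRC loop corresponding to Figure 5-11; all stream and buffer helpers are hypothetical placeholders.

#include <stdint.h>

typedef struct NalUnit NalUnit;

extern NalUnit *brc_next_nal(void);                  /* pull from the stream */
extern uint8_t  brc_first_octet(const NalUnit *nal);
extern int      brc_reduction_needed(void);          /* channel feedback     */
extern void     brc_drop_buffered(int k);            /* discard k frames     */
extern void     brc_forward(NalUnit *nal);           /* emit to the channel  */

/* Forward NALUs until a marked droppable frame arrives (NRI == 0 and
 * F == 1) while a bit rate reduction is required; then drop k frames,
 * with 1 <= k <= N_D as in eq. (5-3). */
void brc_loop(int k)
{
    NalUnit *nal;
    while ((nal = brc_next_nal()) != NULL) {
        uint8_t  oct = brc_first_octet(nal);
        unsigned f   = (oct >> 7) & 0x01;
        unsigned nri = (oct >> 5) & 0x03;
        if (nri == 0 && f == 1 && brc_reduction_needed())
            brc_drop_buffered(k);
        else
            brc_forward(nal);
    }
}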


5.4.3.3 Semantics Violation

The gaps left by the dropped frames constitute a semantics violation for the decoder. In practice, decoders are deployed in devices, such as cell phones, which expect to receive a bitstream through an error prone channel. As a matter of fact, the JM16.2 reference H.264 decoder supports two error concealment methods, namely Frame Copy and Motion Copy [68]. In our case, we take no action to correct the semantics violation and rely on the decoder's error concealment instead.

5.4.3.4 Performance Aspects

The proposed technique guarantees fast execution, since it works directly in the compressed domain and requires only entropy decoding. The speed is also increased by the separate implementation of the DFG and the BRC (Figure 5-7). We shall therefore focus on the bit rate reduction and the memory requirements. Clearly, the bit rate reduction depends on the capability of the DFG to detect as many droppable frames as possible. This, however, cannot be guaranteed for every sequence. For example, a video sequence with a GOP of the form IPPP..., where only the first frame is I, which also has complex motion (rapid continuous movement of objects, camera zooming, camera motion, etc.) and a long duration without shot changes, is not friendly to the proposed technique. However, in real life, video sequences that are expected to be transmitted wirelessly have some error resilience provisions, such as periodic I or IDR frames and possibly some Flexible Macroblock Ordering (FMO) scheme. Moreover, if the sequence is a movie or sports news, we can assume that it will have many shots. All of these reasonable assumptions make the proposed technique very efficient with regard to the bit rate reduction, because most of the rules described in Section 5.4.3.1 can be applied. The memory requirements, in turn, are determined by how frequently droppable frames appear in the video sequence, as explained in Section 5.4.3.1: we always need to buffer the NALUs between two successive droppable frames.

5.4.4 Simulation Results

5.4.4.1 Simulation Setup

The simulation setup is described in Appendix II. In order to evaluate the performance of the proposed technique we followed two approaches. First, we measured the impact that each rule (described in Section 5.4.3.1) has on the bit rate and on the video quality separately. Several well-known representative video sequences in CIF and QCIF format were used. Secondly, we applied our technique to a movie. For that experiment we used a chunk of the movie "Bourne Ultimatum".


5.4.4.2 Encoding Aspects

We used the JM16.2 reference encoder to encode the testing sequences so that rules 2, 3, 4, 5 and 6 could be applied. The baseline profile was used and the configuration parameters retained their default values apart from the ones shown in Table 5-2. These were differentiated according to the rule to be applied.

5.4.4.3 Decoding Aspects

We used the JM16.2 reference decoder to decode the testing sequences. The configuration parameters of the decoder retained their default values with the exception of the error concealment parameter, which was set to one, i.e. the Frame Copy error concealment method was used. The error concealment was required in order to apply the metrics of eq. (I-1) and eq. (I-4), i.e. the number of decoded frames of the transcoded bitstream had to equal the number of decoded frames of the original bitstream.

5.4.4.4 Results

The Bit Rate Variation (eq. (I-1)) and the average PSNR (eq. (I-4)) for each rule (2, 3, 4, 5 and 6) and for the sequences of Table 5-2 are shown in Table 5-3. The minus sign in the bit rate variation denotes a bit rate reduction. Moreover, the larger the APSNR, the better the visual quality. Figure 5-12, Figure 5-13, Figure 5-14, Figure 5-15 and Figure 5-16 show the per-frame PSNR between the original and the transcoded sequences. A value of 100 means that a transcoded frame is identical to the original frame; usually a value of more than 35 dB is considered good quality. From Table 5-3 and from the figures one can notice the effect of increasing the value of k (see eq. (5-3)).

We also used movies in our experiments. Movies have many shots, so we could easily apply rule 1 along with the other rules. We played back the transcoded sequence using a well-known video player, "Elecard AVC HD" [73], which simply ignores the skipped frames and decodes the rest. Moreover, we applied the MOS metric (Appendix-Table I-1) in order to evaluate the visual quality of the transcoded sequence. In the depicted experiment we set a bit rate constraint of 15 Mb/s, while the bit rate of the original sequence exceeded 20 Mb/s. Then we applied the rules in order to meet that constraint. Figure 5-17 shows the bit rate of the transcoded sequence vs the original bit rate. The bit rate peak indicates the time when a shot boundary occurred. At that time we applied the shot boundary detection technique (rule 1) in order to decrease the bit rate.


Subjective evaluation was performed in accordance with the double stimulus impairment scale (DSIS) method [74], and the Mean Opinion Score (MOS) (Appendix-Table I-1) was obtained using the MSU tool [72]. The MOS measurement of the transcoded sequence was 4.6 according to Appendix-Table I-1, which means that the subjects could hardly detect any errors.

5.4.5 Further improvements

An H.264 encoder is highly configurable. The reference encoder JM16.2 exposes more than 160 configuration parameters, which result in different encoding schemes. Moreover, many of them can be detected directly or indirectly in the compressed domain by an intelligent algorithm. That makes possible the discovery of many other rules, like those described in Section 5.4.3.1, which can detect droppable frames. As a consequence, the chances of detecting droppable frames would increase and the performance of the proposed technique would be further improved.

5.4.6 Conclusions

A new bit rate transcoding technique, suitably adapted to H.264 encoded sequences, is proposed. It works directly in the compressed domain because it requires only entropy decoding of the H.264 bitstream. The basic concept behind the method is to discover the frames within the H.264 bitstream that are not used as references by other frames and to drop them in order to meet the bandwidth constraints of a communication channel. This is achieved by applying several rules, which take into account a number of parameters, such as the number of reference frames used by the encoder as well as the possible dissimilarities between successive frames. The effectiveness of the method clearly depends on the number of non-reference frames that it can detect. The method proved to be very efficient under certain conditions and works in real time. Extensive objective and subjective tests give excellent results. The technique could be further improved by discovering more rules that lead to more droppable frames.


Table 5-2: Configuration parameters of the encoder.

Rule  Sequence                         Parameters
2     akiyo (qcif, 150 frames)         IDRPeriod=5, NumberReferenceFrames=5
3     bridge-close (qcif, 150 frames)  IntraPeriod=8, NumberReferenceFrames=1
4     grandma (qcif, 150 frames)       IntraPeriod=10, NumberReferenceFrames=5
5     bridge-far (qcif, 150 frames)    IntraPeriod=10, NumberReferenceFrames=5
6     stefan (cif, 89 frames)          DisposableP=1

Table 5-3: Bit rate variation and APSNR.

Rule  k  Bit Rate Var. (%), eq. (I-1)  APSNR (dB), eq. (I-4)
2     1  -6.475694                     88.13477
2     2  -7.040593                     67.87395
2     3  -8.377952                     65.93387
3     1  -6.255422                     92.32405
3     2  -10.296116                    42.00108
3     3  -14.074904                    41.31070
4     1  -3.641593                     95.07396
4     2  -7.493855                     70.67783
4     3  -11.168888                    66.59341
5     1  -26.828371                    62.02823
6     1  -44.526611                    60.36717

Figure 5-12: PSNR results between the original and the transcoded sequence akiyo (QCIF, 150 frames) for k=1, 2 and 3.


[Plot: Rule 3 — PSNR (dB) per Frame for bridge-close, k = 1, 2, 3]
Figure 5-13: PSNR results between the original and the transcoded sequence bridge-close (QCIF, 150 frames) for k=1, 2 and 3.

[Plot: Rule 4 — PSNR (dB) per Frame for grandma, k = 1, 2, 3]
Figure 5-14: PSNR results between the original and the transcoded sequence grandma (QCIF, 150 frames) for k=1, 2 and 3.

Figure 5-15: PSNR results between the original and the transcoded sequence bridge-far (QCIF, 150 frames).


Figure 5-16: PSNR results between the original and the transcoded sequence stefan (CIF, 89 frames).

Figure 5-17: Bitrate of the transcoded movie (Bourne Ultimatum) vs. the original one.


6 Epilogue

In the previous chapters we described our research work in three different areas of H.264 video coding, namely Inter Prediction (Chapter 3), Data Hiding (Chapter 4) and Bit Rate Transcoding (Chapter 5). Moreover, our research covered more than one aspect within each area. More specifically, in the context of Inter Prediction, we developed:

- A new fast full search algorithm, which reduces the complexity of the H.264 encoder by 53.57% and 32.34% compared to the full search and the fast full search algorithms officially adopted by the reference encoder, respectively.

- A new spatio-temporal predictor, akin to the median predictor, which in effect defines a new search area during motion estimation. This may result in a 7.3% reduction of the motion estimation time. This is a considerable improvement over the existing fast motion estimation (FME) algorithms, considering that the proposed scheme leaves the main core of these algorithms as is and simply modifies the initial search point.

- A fast multiple reference frame selector, which reduces the number of reference frames used by the motion estimation process. This may result in an 80% reduction of the motion estimation time on average.


- A moving object detection method, which uses the motion vectors produced by the motion estimation in order to detect a moving object. The method is very fast because it works directly in the compressed domain, and thus it suits well a variety of applications with time constraints, such as CCTV-based video surveillance.

In the context of the Data Hiding research we developed:

- A method which manipulates the modes during inter prediction in order to hide data. The method achieves a high capacity of hidden data: it can hide 1600 bits of data in 3 sec (30 fps) of a wide range of video sequences, and the capacity can be further improved due to the expandable nature of the method. Its main advantage is that it achieves this capacity without affecting the visual quality of the video.

- A scene change detection method, which is based on the previous data hiding method. The method can be combined with any existing scene change detection method that works in the uncompressed domain, enabling fast scene detection in the compressed domain. Hence, the method can work in real time and suits well applications such as video indexing.

- A method which exploits the special I_PCM macroblocks in order to hide data. The method can hide 18 Kbits of data in just 10 sec (30 fps) of a wide range of video sequences without affecting the visual quality of the video. In addition, the method has unique capabilities, not present in earlier methods and not yet surpassed. First of all, it can work in real time, directly in the compressed domain. Secondly, the marked H.264 bitstream can be reused for hiding new data numerous times without the need for the original video, without having to decode and re-encode the bitstream, and without degrading the quality of the video. This is due to the nature of the I_PCM macroblock, which allows pixel values to be inserted into the bitstream intact, i.e. without being predicted.

It must be stressed that the data hiding methods open new directions in data hiding research in video, not only because of their unique capabilities (high data capacity, real-time operation, reusability of the marked streams, etc.) but also because, for the first time, they moved the cost of the hidden data from the PSNR to the bit rate, in contrast to all previously existing methods.
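To make the I_PCM property concrete: an I_PCM macroblock stores its 256 luma samples (plus chroma) verbatim in the bitstream, so a spatial-domain hiding scheme can write into those bytes without any prediction or requantization loop. The C sketch below is an illustrative LSB-substitution embed into one I_PCM luma block; it shows the mechanism only and is not necessarily the exact mapping used by the method of Chapter 4.

    /* Illustrative only: embed `nbits` message bits into the least
     * significant bits of an I_PCM macroblock's 16x16 luma samples.
     * Because I_PCM samples are stored uncoded, the block can be
     * re-embedded later without decoding or re-encoding the stream. */
    void ipcm_embed_lsb(unsigned char luma[16][16],
                        const unsigned char *msg_bits, int nbits)
    {
        for (int k = 0; k < nbits && k < 256; k++) {
            int r = k / 16, c = k % 16;
            luma[r][c] = (unsigned char)((luma[r][c] & 0xFE) | (msg_bits[k] & 1));
        }
    }

Re-marking works because the write neither invalidates any predictor nor changes the bitstream length, which is exactly the reusability property stressed above.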


In the context of the Bit Rate Transcoding we developed:

- A transcoder which controls the bit rate of an H.264 bitstream by dropping frames in the compressed domain. Frame dropping would normally be expected to cause severe visual degradation due to "drift" errors. However, our method controls the frame dropping in such a way that the "drift" errors are either eliminated or made imperceptible. It achieves this by detecting the frames which are not used as references by other frames and can thus be dropped. The droppable frame detection is made possible by analyzing the NAL units which compose the H.264 bitstream.

Looking at the above methods, one may get the impression that there is great diversity and possibly some inconsistency among them. However, a closer look reveals that the common underlying core is the inter prediction scheme of H.264. Apart from the obvious inter prediction methods (fast full search algorithm, spatio-temporal predictor and reference frame selector), the object detection is also based on the motion vectors of the inter prediction. The first data hiding method (and its associated scene change detection method) likewise takes advantage of the inter prediction. Even the I_PCM based data hiding method, which takes place during the intra prediction, actually interferes with the mode decision process while encoding an inter frame. This happens because the H.264 encoder allows a macroblock to be predicted as intra (and as I_PCM) even if it belongs to an inter frame. Finally, the bit rate transcoder is also related to the inter prediction because it deals with drift errors; these errors are generated when frames which the H.264 decoder needs to use as references during motion compensation are dropped.

The above brief analysis makes clear the importance, as well as the key role, of the inter prediction within H.264 encoding. As a matter of fact, the inter prediction affects H.264 in multiple ways: it increases the compression ratio but at the same time increases the complexity of the encoder; it puts time constraints on several applications but at the same time enables other applications to work in real time. It is this contradictory behavior of the inter prediction which makes it an excellent research area. In the following sections, we discuss the contribution of the proposed methods to the H.264 field and their potential improvements.


6.1 CONTRIBUTION

In this section we discuss the overall contribution of our work to the field of H.264 video coding. The section is divided into three parts, namely Inter Prediction, Data Hiding and Bit Rate Transcoding, consistent with the research areas described in Chapters 3, 4 and 5, respectively. The contribution of each part is assessed based on the novelty of the research, as well as on whether our work introduced new aspects to H.264. Other criteria that could also be taken into account, though not necessarily, are the number of publications produced (8) and the number of citations so far (30).

6.1.1 Inter prediction

The contribution of the proposed inter prediction methods is somewhat limited. This is partly due to the nature of the proposed methods, in the sense that they do not cover all of the inter prediction aspects; they are rather designed to work in conjunction with other existing techniques in order to improve their performance. The fact that many similar techniques already exist in the literature also justifies the low rating. Apart from the above, there are a couple of other reasons why the contribution of our methods remained low. First of all, most (if not all) of the inter prediction methods are heuristic and based on experimental observations. This means that they are likely to be sequence-dependent and hence cannot perform well for all sequences. Apparently, this puts some limitations on the performance and makes it very difficult for a new method to make an outstanding contribution. Finally, there is great competition in the area, because many researchers, being familiar with the motion estimation techniques of previous standards (MPEG-2, H.263, etc.), continue their research in H.264 inter prediction. As a consequence, all of the inter prediction aspects are covered, and new methods will eventually be similar or comparable to existing ones. The moving object detection method described in Section 3.7 is an exception: it is a novel method, which works exclusively on H.264 videos, directly in the compressed domain. However, its limitations, described in Section 3.7.4, also constrict the method's value.

6.1.2 Data hiding

The contribution of the proposed data hiding methods is high. The main reason is that we followed a different approach from the mainstream.


We moved the cost of the hidden data from the visual quality to the bit rate, i.e. the hidden data do not degrade the visual quality but increase, although slightly, the bit rate. Previous methods hid data at the expense of the visual quality. Moreover, the method described in Section 4.5 is one of the very few data hiding methods which work in real time. In addition, our method has unique capabilities (Section 4.5.3.3) which, to our knowledge, do not exist in any other data hiding method.

6.1.3 Bitrate transcoding

The contribution of the proposed bit rate transcoding method is medium. The method is based on the well-known technique of dropping frames in order to satisfy the bandwidth constraints of a communication channel. However, frame dropping, although simple, is normally very ineffective because it causes severe distortions in the video. Our method controls the frame dropping in such a way that the distortions are either eliminated or, in the worst case, become imperceptible. To our knowledge, there is no other similar method in the open literature.

6.2 FURTHER IMPROVEMENTS

The proposed methods have been categorized as enhancements and applied methods. The inter prediction methods (fast full search algorithm, spatio-temporal predictor and reference frame selector) fall into the first category, while all of the rest fall into the latter. The role of an enhancement is to improve the performance, here by reducing the inter prediction complexity of the H.264 encoder. As such, an enhancement is not open to drastic improvement. However, the enhancements could be combined, either with each other or with other existing techniques, in order to increase the overall performance. The applied methods, on the other hand, present many potential improvements, which are enumerated below.

The current design of the moving object detection has some limitations. First of all, the accuracy of the method, especially the detection of an object's contour, heavily depends on the number of sub-blocks produced during the motion estimation. This means that the lack of a sufficient number of sub-blocks, due either to a high quantization parameter (QP) or to slow motion, may lead to rather crude object detection. Moreover, the method cannot handle complex motions, such as the overlapping motions of two or more moving objects. The improvement has to do with eliminating these limitations.


The current design of the data hiding method during the inter prediction uses only 4 different block types, namely 16×16, 16×8, 8×16 and 8×8. However, the scheme can also use the sub-partitions of the 8×8 type (8×4, 4×8, 4×4), thus increasing the available bits for coding to 8. Apparently, the additional bits will increase the data capacity while at the same time decreasing the number of "tweaked" macroblocks. Moreover, the scheme used consecutive macroblocks within a single frame in order to hide the data. Another improvement would be to spread the macroblocks across the frame, or better still across multiple frames. This approach would improve the coding efficiency, since the "motion error" produced by the scheme would not accumulate in one place. In addition, the assignment of the binary codes in Table 4-1 could be modified so as to take some video statistics into account. For example, the 16×16 block type appears more often than the other types. The message could therefore be coded using Huffman coding, with the shortest Huffman code assigned to the 16×16 block type (see the sketch at the end of this section). The gain of this approach would be that our scheme would most likely choose the block type which the encoder would have chosen in normal operation, without our interference.

The I_PCM based data hiding method inserts raw information into the H.264 bitstream. This opens up a lot of potential improvements, because the I_PCM macroblock can be regarded as part of a still image. Therefore, many data hiding and watermarking techniques which work in the spatial domain can be applied.

The Bit Rate Transcoder could take advantage of the high configurability of the H.264 encoder in order to detect more droppable frames. The reference encoder presents more than 160 configuration parameters, which result in different encoding schemes. Moreover, many of them can be detected, directly or indirectly, in the compressed domain by an intelligent algorithm. That makes possible the discovery of more droppable frames, increasing the performance of the method.
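To illustrate the statistics-aware assignment suggested above, the sketch below pairs message bits with inter partition modes. The fixed two-bit mapping mirrors the flavor of Table 4-1 (the exact table is defined in Chapter 4), while the variable-length alternative gives the shortest codeword to the most frequent 16×16 mode, Huffman-style; the concrete codewords here are illustrative assumptions, not values from the thesis.

    /* Fixed-length mapping in the spirit of Table 4-1:
     * 2 message bits select one of the four top-level partition modes. */
    enum mode { M16x16, M16x8, M8x16, M8x8 };

    static enum mode embed_fixed(unsigned two_bits)
    {
        static const enum mode map[4] = { M16x16, M16x8, M8x16, M8x8 };
        return map[two_bits & 0x3];
    }

    /* Illustrative variable-length alternative (assumed codewords):
     *   "0"   -> 16x16 (most probable)    "10"  -> 8x8
     *   "110" -> 16x8                     "111" -> 8x16
     * A Huffman-coded message then steers the encoder toward the mode
     * it would usually pick anyway. `bits` holds one 0/1 per byte. */
    static enum mode embed_vlc(const unsigned char *bits, int *pos)
    {
        if (bits[(*pos)++] == 0) return M16x16;      /* "0"   */
        if (bits[(*pos)++] == 0) return M8x8;        /* "10"  */
        return bits[(*pos)++] == 0 ? M16x8 : M8x16;  /* "110" / "111" */
    }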


References

[1] ITU-T Rec. H.264 (05/2003), "Advanced video coding for generic audiovisual services", T-REC-H.264-200903-S.
[2] W. Li and E. Salari, "Successive elimination algorithm for motion estimation", IEEE Transactions on Image Processing, Vol. 4, Issue 1, Jan. 1995.
[3] T. Toivonen and J. Heikkila, "Fast full search block motion estimation for H.264/AVC with multilevel successive elimination algorithm", International Conference on Image Processing (ICIP), Singapore, October 2004.
[4] M. Yang, H. Cui and K. Tang, "Efficient tree structured motion estimation using successive elimination", IEEE Proceedings on Vision, Image and Signal Processing, Vol. 151, Issue 5, 30 Oct. 2004.
[5] Ce Zhu, Wei-Song Qi and W. Ser, "Predictive fine granularity successive elimination for fast optimal block-matching motion estimation", IEEE Transactions on Image Processing, Vol. 14, Issue 2, Feb. 2005.
[6] Tianding Chen and Quan Xue, "Fast Motion Estimation with Multilevel Successive Elimination Algorithm and Early Termination for H.264/AVC Video Coding", International Conference on Wireless Communications, Networking and Mobile Computing (WiCOM), Wuhan, China, Sep. 2006.
[7] Yang Song, Zhenyu Liu, T. Ikenaga and S. Goto, "Enhanced Strict Multilevel Successive Elimination Algorithm for Fast Motion Estimation", IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, USA, May 2007.
[8] Jong-Nam Kim and Tae-Sun Choi, "A fast full-search motion-estimation algorithm using representative pixels and adaptive matching scan", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, Issue 7, Oct. 2000.


[9] Chen-Fu Lin and Jin-Jang Leou, "An adaptive fast full search motion estimation algorithm for H.264", IEEE Int. Symp. on Circuits and Systems (ISCAS), Kobe, Japan, May 2005.
[10] I. Ahmad, Weiguo Zheng, Jiancong Luo and Ming Liou, "A fast adaptive motion estimation algorithm", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 3, 2006.
[11] Yan-Ho Kam and Wan-Chi Siu, "A Fast Full Search Scheme for Rate-Distortion Optimization of Variable Block Size and Multi-frame Motion Estimation", IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Puerto Rico, 2006.
[12] Xuan Jing and Lap-Pui Chau, "Partial Distortion Search Algorithm Using Predictive Search Area for Fast Full-Search Motion Estimation", IEEE Signal Processing Letters, Vol. 14, Nov. 2007.
[13] Lung-Chun Chang, Kuo-Liang Chung and Tsung-Cheng Yang, "An improved search algorithm for motion estimation using adaptive search order", IEEE Signal Processing Letters, Vol. 8, Issue 5, May 2001.
[14] Tian Song, K. Ogata, K. Saito and T. Shimamoto, "Adaptive Search Range Motion Estimation Algorithm for H.264/AVC", IEEE International Symposium on Circuits and Systems (ISCAS), New Orleans, USA, May 2007.
[15] I.E.G. Richardson, "H.264 and MPEG-4 Video Compression", John Wiley & Sons Ltd., 2003.
[16] F. Crow, "Summed-area tables for texture mapping", Proceedings of SIGGRAPH, Vol. 18(3), pp. 207-212, 1984.
[17] V.A. Nguyen and Y.P. Tan, "Efficient Block-Matching Motion Estimation Based on Integral Frame Attributes", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 3, March 2006, pp. 375-385.
[18] C.H. Cheung and L.M. Po, "Novel Cross-Diamond-Hexagonal Search Algorithms for Fast Block Motion Estimation", IEEE Trans. Multimedia, Vol. 7, No. 1, pp. 16-22, Feb. 2005.
[19] Y.K. Tu, J.F. Yang, M.T. Sun and Y.T. Tsai, "Fast variable-size block motion estimation for efficient H.264/AVC encoding", Signal Processing: Image Communication, Vol. 20, pp. 595-623, 2005.
[20] L. Yang, K. Yu, J. Li and S. Li, "Prediction-based directional fractional pixel motion estimation for H.264 video coding", Proc. ICASSP, pp. II901-II904, 2005.


[21] J. Stottrup-Andersen, S. Forchhammer and S.M. Aghito, "Rate-distortion-complexity optimization of fast motion estimation in H.264/MPEG-4 AVC", Proc. ICIP 2004, Singapore, Oct. 24-27, 2004.
[22] H.C. Fei, C.J. Chen and S.H. Lai, "Enhanced downhill simplex search for fast video motion estimation", PCM 2005, Part I, LNCS 3767, pp. 512-523, 2005.
[23] X.Q. Banh and Y.P. Tan, "Adaptive dual cross search algorithm for block-matching motion estimation", IEEE Trans. Consumer Electronics, Vol. 50, No. 2, pp. 766-775, May 2004.
[24] P.Y. Burgi, "Motion estimation based on the direction of intensity gradient", Image and Vision Computing, Vol. 22, pp. 637-653, 2004.
[25] A. Tourapis, O.C. Au and M.L. Liou, "Highly efficient predictive zonal algorithm for fast block-matching motion estimation", IEEE Trans. Circuits and Systems for Video Technology, Vol. 12, pp. 934-947, Oct. 2002.
[26] Z. Chen, P. Zhou and Y. He, "Fast integer and fractional pel motion estimation for JVT", JVT-F017r.doc, Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, 6th Meeting, Awaji Island, Japan, Dec. 5-13, 2002.
[27] X. Yi, J. Zhang, N. Ling and W. Shang, "Improved and simplified fast motion estimation for JM", Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 16th Meeting, Poznan, Poland, 24-29 July 2005.
[28] P. Yin, A.M. Tourapis and J.M. Boyce, "Fast mode decision and motion estimation for JVT/H.264", in Proc. ICIP (3), 2003, pp. 853-856.
[29] Hung-Ju Li, Ching-Ting Hsu and Mei-Juan Chen, "Fast Multiple Reference Frame Selection Method for Motion Estimation in JVT/H.264", IEEE Asia-Pacific Conference on Circuits and Systems, 6-9 Dec. 2004.
[30] A. Chang, O.C. Au and Y.M. Yeung, "A Novel Approach To Fast Multi-Frame Selection For H.264 Video Coding", 2003 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 3, 6-10 April 2003.
[31] J-S Sohn and D-G Kim, "Fast Multiple Reference Frame Selection Method Using Correlation of Sequence in JVT/H.264", IEICE Trans. Fundamentals, Vol. E89-A, No. 3, March 2006.
[32] C.W. Ting, L.M. Po and C.H. Cheung, "Center-biased frame selection algorithms for fast multi-frame motion estimation in H.264", Proc. of the 2003 Int. Conf. on Neural Networks and Signal Processing, Vol. 2, 14-17 Dec. 2003.


[33] S.K. Kapotas and A.N. Skodras, "A New Spatio-Temporal Predictor for Motion Estimation in H.264 Video Coding", 8th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2007), Santorini, Greece, 6-8 June 2007.
[34] Z. Chen and K.N. Ngan, "Recent Advances in Rate Control for Video Coding", Signal Processing: Image Communication, Elsevier, Vol. 22, pp. 19-38, 2007.
[35] A. Ahmad, D. Chen and S. Lee, "Robust compressed domain object detection in MPEG videos", Proceedings of Internet and Multimedia Systems and Applications, August 2003.
[36] O. Sukmarg and K. Rao, "Fast object detection and segmentation in MPEG compressed domain", Proceedings of the IEEE Region 10 Technical Conference, September 2000.
[37] R. Wang, H. Zhang and Y. Zhang, "A confidence measure based moving object extraction system for compressed domain", IEEE International Symposium on Circuits and Systems, pp. 21-24, May 2000.
[38] R. Venkatesh Babu and K. Ramakrishnan, "Compressed domain motion segmentation for video object extraction", IEEE International Conference on Acoustics, Speech and Signal Processing, 4:3788-3791, May 2002.
[39] Z. Wei, D. Jun, G. Wen and H. Qingming, "Robust moving object segmentation on H.264/AVC compressed video using the block-based MRF model", Real-Time Imaging, 11(4):290-299, 2005.
[40] M. Ibrahim and S. Rao, "Motion Analysis in Compressed Video - A Hybrid Approach", IEEE Workshop on Motion and Video Computing (WMVC), February 2007.
[41] R. Babu, K.R. Ramakrishnan and S.H. Srinivasan, "Video object segmentation: a compressed domain approach", IEEE Transactions on Circuits and Systems for Video Technology, 14(4):462-474, 2004.
[42] J.J. Chae and B.S. Manjunath, "Data Hiding in Video", IEEE Proc. Int. Conf. on Image Processing, pp. 243-246, 1999.
[43] V. Fotopoulos and A.N. Skodras, "Transform Domain Watermarking: Adaptive Selection of the Watermark's Position and Length", Proc. Visual Communications and Image Processing, VCIP 2003, July 2003.
[44] A. Sarkar, U. Madhow, S. Chandrasekaran and B.S. Manjunath, "Adaptive MPEG-2 Video Data Hiding Scheme", Proc. SPIE Security, Steganography and Watermarking of Multimedia Contents IX, Jan. 2007.


[45] H. Liu, J. Huang and Y.Q. Shi, "DWT-Based Video Data Hiding Robust to MPEG Compression and Frame Loss", Int. Journal of Image and Graphics, Vol. 5, No. 1, pp. 111-134, Jan. 2005.
[46] J. Zhang, J. Li and L. Zhang, "Video Watermark Technique in Motion Vector", Proc. of XIV Symposium on Computer Graphics and Image Processing, pp. 179-182, Oct. 2001.
[47] Y. Bodo, N. Laurent and J.L. Dugelay, "Watermarking Video, Hierarchical Embedding in Motion Vectors", Proc. Int. Conference on Image Processing, Sept. 2003.
[48] D.Y. Fang and L.W. Chang, "Data Hiding for Digital Video with Phase of Motion Vector", IEEE Proc. Int. Symposium on Circuits and Systems, ISCAS 2006, May 2006.
[49] M. Noorkami and R.M. Mersereau, "Towards Robust Compressed-Domain Video Watermarking for H.264", Proc. SPIE, Vol. 6072, pp. 489-497, 2006.
[50] H. Cao, J. Zhou and S. Yu, "An Implement of Fast Hiding Data into H.264 Bitstream based on Inter-Prediction Coding", Proc. SPIE, Vol. 6043, pp. 123-130, 2005.
[51] D. Proefrock, H. Richter, M. Schlauweg and E. Mueller, "H.264/AVC Video Authentication Using Skipped Macroblocks for an Erasable Watermark", Proc. SPIE, Vol. 5960, pp. 1480-1489, 2005.
[52] S. Chen, M. Shyu, C. Zhang and R.L. Kashyap, "Video scene change detection method using unsupervised segmentation and object tracking", IEEE International Conference on Multimedia and Expo (ICME), 2001, pp. 57-60.
[53] S. Han, "Shot detection combining bayesian and structural information", Storage and Retrieval for Media Databases 2001, Vol. 4315, December 2001, pp. 509-516.
[54] J. Oh, K.A. Hua and N. Liang, "A content-based scene change detection and classification technique using background tracking", IS&T/SPIE Conference on Multimedia Computing and Networking 2000, San Jose, CA, January 2000, pp. 254-265.
[55] J. Bescos, "Real-time shot change detection over online MPEG-2 video", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 4, April 2004, pp. 475-484.


[56] C. Dulaverakis, S. Vagionitis, M. Zervakis and E. Petrakis, "Adaptive methods for motion characterization and segmentation of MPEG compressed frame sequences", ICIAR 2004, Porto, Portugal, September 29 - October 1, 2004, pp. 310-317.
[57] E. Saez, J.I. Benavides and N. Guil, "Reliable real time scene change detection in MPEG compressed video", ICME 2004, Taipei, June 2004, pp. 567-570.
[58] W. Fernando, C. Canagarajah and D. Bull, "Scene change detection algorithms for content-based video indexing and retrieval", Electronics & Communication Engineering Journal, June 2001, pp. 117-126.
[59] D. Robie and R. Mersereau, "Video error correction using steganography", EURASIP Journal on Applied Signal Processing, Feb. 2002, pp. 164-173.
[60] Y.J. Jung, H.K. Kang and Y.M. Ro, "Metadata hiding for content adaptation", Int. Workshop on Digital Watermarking (IWDW), 2003, pp. 456-467.
[61] Sung Min Kim, Sang Beom Kim, Youpyo Hong and Chee Sun Won, "Data hiding on H.264/AVC compressed video", Image Analysis and Recognition, Springer, 2007.
[62] R.C. Gonzalez and R.E. Woods, "Digital Image Processing, Second Edition", Prentice Hall, ISBN 0-201-18075-8.
[63] Y. Hu, C. Zhang and Y. Su, "Information Hiding Based on Intra Prediction Modes for H.264/AVC", IEEE International Conference on Multimedia and Expo (ICME), Beijing, China, July 2-5, 2007.
[64] W. Bender, D. Gruhl and N. Morimoto, "Techniques for data hiding", Technical Report, Massachusetts Institute of Technology Media Lab, 1994.
[65] J. Xin, C.W. Lin and M.T. Sun, "Digital video transcoding", Proceedings of the IEEE, Vol. 93, Issue 1, pp. 84-97, January 2005.
[66] A. Vetro, C. Christopoulos and H. Sun, "Video transcoding architectures and techniques: An overview", IEEE Signal Processing Magazine, Vol. 20, No. 2, pp. 18-29, March 2003.
[67] D. Lefol, D. Bull and N. Canagarajah, "Performance evaluation of transcoding algorithms for H.264", IEEE Trans. Consumer Electronics, Vol. 52, Issue 1, pp. 215-222, February 2006.
[68] S.K. Bandyopadhyay, Z. Wu, P. Pandit and J.M. Boyce, "An error concealment scheme for entire frame losses for H.264/AVC", Proc. IEEE Sarnoff Symposium, Mar. 2006.


[69] S. De Bruyne, W. De Neve, K. De Wolf, D. De Schrijver, P. Verhoeve and R. Van de Walle, "Temporal video segmentation on H.264/AVC compressed bitstreams", Lecture Notes in Computer Science, Vol. 4351, pp. 1-12, Springer, Berlin, 2007.
[70] S.M. Kim, J. Byun and C. Won, "A scene change detection in H.264/AVC compression domain", Proc. PCM, Korea, 2005.
[71] W. Zeng and W. Gao, "Shot Change Detection on H.264/AVC compressed video", Proc. IEEE ISCAS, Kobe, Japan, 2005.
[72] MSU Video Quality Measurement Tool.
[73] Elecard AVC HD player, http://www.elecard.com/
[74] F. Pereira and T. Ebrahimi, "The MPEG-4 Book", IMSC Press, Prentice Hall PTR, 2002.
[75] I. Ahmad, X. Wei, Y. Sun and Y. Zhang, "Video Transcoding: An Overview of Various Techniques and Research Issues", IEEE Transactions on Multimedia, Vol. 7, No. 5, October 2005.


Appendices


Appendix I. Metrics

Various metrics, objective and subjective, were used in order to evaluate the performance of the proposed methods. The evaluation was done by comparing the reference encoder provided by the JVT with a modified one. Refer to Appendix II for more details about the methodology that was followed.

I.I. OBJECTIVE METRICS

I.i.i. Bit Rate (bits/sec)

The bit rate refers to the generated bit rate after H.264 compression has been applied to a raw video sequence.

I.i.ii. Bit Rate Variation (%)

This is a comparative metric, which compares the reference H.264 encoder with a modified H.264 encoder in terms of the generated bit rate. The Bit Rate Variation is calculated as follows:

    \Delta R = \frac{R' - R}{R} \times 100\ (\%)    (I-1)

where R is the bit rate of the reference encoder and R' is the bit rate of the modified encoder. A negative \Delta R means that the modified encoder generates fewer bits in the output, i.e. it outperforms the reference encoder. For example, R = 150 kb/s and R' = 140 kb/s give \Delta R \approx -6.7\%.

I.i.iii. Encoding Time (sec)

The Encoding Time refers either to the total encoding time or only to the motion estimation time. Throughout the document, the Encoding Time is assumed to be the total encoding time unless it is explicitly stated to be the motion estimation time.


I.i.iv. Encoding Time Variation (%)

This is a comparative metric, which compares the reference H.264 encoder with a modified H.264 encoder in terms of the Encoding Time. The Encoding Time Variation is calculated as follows:

    \Delta T = \frac{T' - T}{T} \times 100\ (\%)    (I-2)

where T is the Encoding Time of the reference encoder and T' is the Encoding Time of the modified encoder. A negative \Delta T means that the modified encoder takes less time to encode a sequence than the reference encoder, i.e. it outperforms the reference encoder.

I.i.v. PSNR/APSNR (dB)

The PSNR/APSNR metrics are used to evaluate the visual quality of the compressed video. The PSNR is calculated as follows:

    PSNR(n) = 10 \log_{10} \frac{MaxError^2 \times w \times h}{\sum_{i=0}^{w-1} \sum_{j=0}^{h-1} (x_{i,j} - y_{i,j})^2}    (I-3)

where MaxError is the maximum possible absolute value of the color component difference (255 for 8-bit color components), w and h are the width and height of the video frames, and x, y are the pixel luma values of the decoded frames of the original bitstream and of the modified bitstream, respectively.

The APSNR is calculated as follows:

    APSNR = \frac{\sum_{n=1}^{N_t} PSNR(n)}{N_t}    (I-4)

where PSNR(n) is the outcome of eq. (I-3) and N_t is the total number of frames in the video sequence.
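A minimal C sketch of eqs. (I-3) and (I-4) for 8-bit luma frames follows; the frame buffers and dimensions are caller-supplied assumptions, and the 100 dB cap for identical frames mirrors the convention used in Section 5.4.4.4.

    #include <math.h>

    /* PSNR of one frame per eq. (I-3): x = original decode, y = modified
     * decode, both w*h 8-bit luma planes (MaxError = 255). */
    double psnr_frame(const unsigned char *x, const unsigned char *y,
                      int w, int h)
    {
        double sse = 0.0;
        for (int k = 0; k < w * h; k++) {
            double d = (double)x[k] - (double)y[k];
            sse += d * d;
        }
        if (sse == 0.0)
            return 100.0;  /* cap for identical frames */
        return 10.0 * log10(255.0 * 255.0 * w * h / sse);
    }

    /* APSNR per eq. (I-4): average of the per-frame PSNR values. */
    double apsnr(const double *psnr_values, int n_frames)
    {
        double sum = 0.0;
        for (int n = 0; n < n_frames; n++)
            sum += psnr_values[n];
        return sum / n_frames;
    }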


I.i.vi. PSNR Variation (dB)

The PSNR Variation is calculated as follows:

    \Delta PSNR = APSNR' - APSNR    (I-5)

where APSNR is the outcome of eq. (I-4) for the reference encoder, whilst APSNR' refers to the modified encoder. A positive \Delta PSNR means that the modified encoder results in better visual quality than the reference encoder, i.e. it outperforms the reference encoder.

I.II. SUBJECTIVE METRICS

I.ii.i. Mean Opinion Score (MOS)

The perception of visual quality is influenced by spatial fidelity (how clearly parts of the scene can be seen, whether there is any obvious distortion) and temporal fidelity (whether motion appears natural and 'smooth'). However, a viewer's opinion of 'quality' is also affected by other, subjective, factors such as the viewing environment, the observer's state of mind and the extent to which the observer interacts with the visual scene.

As a subjective quality metric, the mean opinion score (MOS) was employed, which provides a numerical indication of the perceived quality of the video stream. MOS is expressed as a single number in the range 1 to 5, where 1 is the lowest perceived video quality and 5 is the highest. MOS is generated by averaging the results of a set of subjective tests in which a number of viewers rate the transcoded video using the rating scheme shown in Appendix-Table I-1.

Appendix-Table I-1: Mean opinion score.

MOS  Quality    Impairment
5    Excellent  Imperceptible
4    Good       Perceptible but not annoying
3    Fair       Slightly annoying
2    Poor       Annoying
1    Bad        Very annoying


Appendix II. Simulation environment

II.I. HARDWARE

The simulation tests were executed under Windows XP on an Intel T2400 CPU at 1.83 GHz with 1.50 GB RAM.

II.II. SOFTWARE

All of the proposed methods were implemented in standard C and incorporated into the reference H.264 code provided by the JVT.

II.III. TESTING SEQUENCES

Several video sequences in YUV 4:2:0 format and of various resolutions, QCIF (176x144), CIF (352x288) and SIF (352x240), were tested. The testing sequences are also separated into the following classes:

Class A  Low spatial detail and low amount of movement
Class B  Medium spatial detail and low amount of movement or vice versa
Class C  High spatial detail and medium amount of movement or vice versa
Class D  Stereoscopic (out of scope)
Class E  Hybrid natural and synthetic (out of scope)

Using a combination of sequences from all of the above classes gives an indication of the generality of the method under test. This indication is important because, in many cases, a method which is good for a Class A sequence gives poor results for a Class C sequence, and vice versa. Appendix-Table II-1 presents the classification of some of the best-known video sequences.

Appendix-Table II-1: Classification of video sequences.

Class A            Class B  Class C            Class D
Mother & daughter  Foreman  Table Tennis       Children
Akiyo              News     Stefan             Bream
Hall Monitor       Silent   Mobile & Calendar  Weather
Container Ship     Paris    Tempete


II.IV. METHODOLOGY

The reference H.264 codec was used as a basis. Each method was embedded into the reference H.264 code seamlessly; the data flow was directed to the embedded code in such a way that no other part of the codec was affected. Then both the reference H.264 codec and the modified one were run successively using the same configuration. The performance of each method was evaluated by comparing the results of the two runs, using the metrics described in Appendix I. The metrics' input parameters, Bit Rate, Encoding Time and PSNR, were obtained from the intrinsic logging mechanism of the H.264 codec. Appendix-Table II-2 presents the log file (log.dat) generated by the H.264 encoder.

Appendix-Table II-2: H.264 reference encoder's log file.

Name        Format                 Purpose
Ver         W.X/Y.Z                Encoder Version
Date        MM/DD                  Simulation End Date
Time        HH:MM                  Simulation End Time
Sequence    %30.30s                Sequence Name
#Img        %5d                    Coded Primary Frames
P/MbInt     %d/%d                  Picture level / Macroblock level
QPI         %-3d                   I slice Quantizer
QPP         %-3d                   P slice Quantizer
QPB         %-3d                   B slice Quantizer
Format      %4dx%4d                Width x Height
Iperiod     %3d                    Intra Period
#B          %3d                    Number of B coded frames
FMES        FS|FFS|HEX|SHEX|EPZS   Fast Motion Estimation usage
Hdmd        %1d%1d%1d              Distortion functions for Motion estimation
S.R         %3d                    Maximum Search Range
#Ref        %2d                    Maximum number of references
Freq        %3d                    Coded Video Frame Rate
Coding      CABAC|CAVLC            Entropy Mode Used
RD-opt      %d                     Rate Distortion Optimization Option
Intra upd   ON|OFF                 Use of MbLineIntraUpdate
8x8Tr       %d                     Mode usage of 8x8 transform
SNRY 1      %-5.3f                 Luma PSNR for first frame in sequence
SNRU 1      %-5.3f                 Chroma U PSNR for first frame in sequence
SNRV 1      %-5.3f                 Chroma V PSNR for first frame in sequence
SNRY N      %-5.3f                 Luma PSNR for entire sequence
SNRU N      %-5.3f                 Chroma U PSNR for entire sequence
SNRV N      %-5.3f                 Chroma V PSNR for entire sequence
#Bitr I     %6.0f                  Bitrate assigned to I coded frames
#Bitr P     %6.0f                  Bitrate assigned to P coded frames
#Bitr B     %6.0f                  Bitrate assigned to B coded frames
#Bitr IPB   %6.0f                  Sequence Bitrate including overheads
Total Time  %12d                   Encoding Time in ms
Me Time     %12d                   Motion Estimation only time in ms
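As a worked example of the two-run comparison, the sketch below takes values as they would be read from the two log.dat files (Total Time, #Bitr IPB, SNRY N) and prints the comparative metrics of Appendix I; the variable names and the hard-coded sample values are illustrative assumptions, not measurements from the thesis.

    #include <stdio.h>

    /* Compare one reference run against one modified run using the
     * comparative metrics of eqs. (I-1), (I-2) and (I-5). */
    int main(void)
    {
        /* Sample values, standing in for fields parsed from log.dat. */
        double R = 152.3,  R_mod = 149.1;  /* #Bitr IPB (kb/s)        */
        double T = 85210,  T_mod = 41873;  /* Total Time (ms)         */
        double P = 36.42,  P_mod = 36.38;  /* SNRY N, used as APSNR   */

        printf("Delta R    = %+.2f %%\n", (R_mod - R) / R * 100.0); /* eq. (I-1) */
        printf("Delta T    = %+.2f %%\n", (T_mod - T) / T * 100.0); /* eq. (I-2) */
        printf("Delta PSNR = %+.2f dB\n", P_mod - P);               /* eq. (I-5) */
        return 0;
    }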
