11.07.2015 Views

4B-1 - ASP-DAC 2013

4B-1 - ASP-DAC 2013

4B-1 - ASP-DAC 2013

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

High-Level ArchitectureExploration for MPEG4 Encoderwith Custom ParametersMarius Bonaciu, Aimen Bouchhima, Wassim Youssef,Xi Chen, Wander Cesario(*), Ahmed JerrayaTIMA Laboratory, SLS Group, Grenoble, FRANCEMND, Paris, FRANCE (*)


MPEG4 Encoder Applications Many application domains different configurations, architectures, constraintsMARIUS BONACIU- 1 - <strong>ASP</strong>-<strong>DAC</strong>’06


Solution Space for MPEG4Architecture/Algorithm ExplorationMPEG4 Algorithm ParametersArchitecture Parameters‣ Video resolution‣ Number of CPUs‣ Frame_rate‣ Type of CPUs‣ Bitrate‣ HW-SW partitioning‣ MotionEstimation precision‣ Communication topology‣ Motion Search Area‣ Arbitration type‣ Progressive/Interlaced‣ Message size‣ Scene Change Detection‣ Data width‣ Quantization range/type‣ Data transfer latency‣…‣ … Finding the optimal solution requires to explore between very large number of solutions Implementing the RTL architecture using wrong configurations might “kill” the projectMARIUS BONACIU- 2 - <strong>ASP</strong>-<strong>DAC</strong>’06


Contribution for High-Level Architecture ExplorationFlexibleAlgorithm/ArchitectureModel ofMPEG4Algorithm/ ArchitectureExecutable ModelGeneration(customized algorithm +abstract architecture)VirtualPrototypeAlg./ Arch.parametersImage sizeFrames/secQualityBitrateCPU numberCPU typesClock speedComm. Type…High-LevelArchitectureExplorationNOCo-simulationSatisfiedrequirements?YESRTLImplementationApplicationspecificMP-SoCMARIUS BONACIU- 5 - <strong>ASP</strong>-<strong>DAC</strong>’06


Outline• Introduction to High-Level Architecture Exploration• Flexible Architecture Model for Architecture Exploration• Flexible Algorithm and Architecture Representation• High-Level Architecture Exploration Flow for MPEG4 Encoder• ConclusionsMARIUS BONACIU<strong>ASP</strong>-<strong>DAC</strong>’06


MPEG4 Video Encoder AlgorithmquantaItYUVPMotionEstimationMotionComp.DCTQuant.VLCMPEG4/ISOBitstreamt-1IntraPredictionIPIDCTDeQuant.ImageReconstructMain DivX TaskVLC Task Compress only the spatio-temporal differences between consecutive frames DivX = a popular algorithm implementation of the MPEG4 videocompression technology (ISO/IEC 14496-2)MARIUS BONACIU- 6 - <strong>ASP</strong>-<strong>DAC</strong>’06


Algorithm Exploration Initial MPEG4 Encoder algorithm‣ sequential algorithm‣ requires a large amount of computation Initial Exploration Parameters‣ Video resolution, Frame rate, Bitrate, MotionEstimation precision, MotionSearch Area, Progressive/Interlaced, Scene Change Detection, Quantizationrange, Quantization type‣ Insufficient for multi-processor implementation Need of adding parameters for multi-processor implementation‣ Insert parallelism and pipeline support‣ Modifying parallelism/pipeline level shouldn’t change the code, only the behavior Flexible MPEG4 Encoder algorithm‣ Supports all the Initial Exploration Parameters, plus Parameters for the level ofParallelism/PipelineMARIUS BONACIU- 7 - <strong>ASP</strong>-<strong>DAC</strong>’06


Flexible Parallel/Pipeline DivX Data FlowquantaMain DivX1Processed MBYUVVideo sourcePreprocessingSplitterArea1 Area2YUV Area1YUV Area1Main MainDivX2DivX4Adapt. YUV Processed MB Compressed MBMPEG4 bitstreamYUV Area4YUV YUV Area16 Area16Main DivX2Main DivX1DivX3Main DivXMain Main DivX13 DivX3Processed MBProcessed MB. ......VLCMain DivX4Main DivX14VLCmMain DivX15Main DivX16Processed MBProcessed MBProcessed MBVLC1Compressed MBCompressed MBPostprocessingCombinerMPEG4 storageArea3 Area4SIMDSIMDMARIUS BONACIU- 8 - <strong>ASP</strong>-<strong>DAC</strong>’06


Targeted Abstract Architecture with 2 SIMDSIMD - MainDivXSIMD - VLCMainDivX CPU SubsystemVLC CPU SubsystemVIDEOSplitterAPICPU MemCPU MemCPU MemCPU MemAdapterAdapterAdapterAdapterCPU MemCPU MemCPU MemCPU MemAdapterAdapterAdapterAdapterCombinerAPIMPEG4Explicit interconnect structure Targeted abstract architecture for MPEG4 encoder Flexible and Scalable Interconnect Structure Imposes some architecture configurations‣Splitter and Combiner IP‣Number of tasks per CPU 1MARIUS BONACIU- 9 - <strong>ASP</strong>-<strong>DAC</strong>’06


Outline• Introduction to High-Level Architecture Exploration• Flexible Architecture Model for Architecture Exploration• Flexible Algorithm and Architecture Representation• High-Level Architecture Exploration Flow for MPEG4 Encoder• ConclusionsMARIUS BONACIU<strong>ASP</strong>-<strong>DAC</strong>’06


Flexible Algorithm/Architecture Model for MPEG4VideoSplitter MainDivX VLC Combiner StorageMPIMPIMPIMPI MPI MPIAPIAbstract interconnect execution model (MPI-SystemC)Fixed taskTask with flexible computations ( belonging to a SIMD)Task with flexible input/output (connected to a SIMD) Model described using macro language (i.e. Rive, M4) Tasks description language: C/C++ Tasks encapsulated into SystemC modules Communication done using Message Passing primitives Communication infrastructure: MPI-SystemC High-Level Parallel Programming ModelMARIUS BONACIU- 10 - <strong>ASP</strong>-<strong>DAC</strong>’06


Tasks with Flexible Computations//------------------ MainDivX task ---------------------------------------------EXTERN *image_memory“ N” ,height“ N” , length“ N” , top_border“ N” , left_border” N” ,bottom_border“ N” , right_border“ N” ,&result“ N” ;void MainDivX” N” _MAIN ( ){//initialization of computationsMainDivX” N” _INIT (&image_memory“ N” , height” N” , length” N” );MainDivX//data_receive_communication from the SplitterMPI_” PROTOCOL” Recv(this,&image_memory” N” ,sizeof(image_memory” N” ),” DATA_WIDTH” ,SPLITTER_ID,22,MPI_COMM_WORLD);//calls the function with flexible computationsMainDivX” N” _COMPUTE (&image_memory“ N” ,height“ N” , length“ N” ,top_border“ N” , left_border” N” ,bottom_border“ N” , right_border“ N” ,&result“ N” ); Described using macro language As customizable as possible}//send_results_communication to the VLCMPI_” PROTOCOL” Send(this,&result“ N” ,sizeof(result“ N” ),” DATA_WIDTH” ,VLC[“ target_vlc” ]_ID,22, MPI_COMM_WORLD); Represents a template description used later to generate multiple MainDivX tasksMARIUS BONACIU- 11 - <strong>ASP</strong>-<strong>DAC</strong>’06


Communication Infrastructure: MPI-SystemC HLPPM Tasks communicate using Message Passing primitivesMP_Init(*this,argc,argv);MP_Finalize(*this);MP_[I]Send(*this,buf,count,datatype,dest,tag,comm);MP_[I]Recv(*this,buf,count,datatype,source,tag,comm,status);MP_[I]BSend(*this,buf,count,datatype,dest,tag,comm);MP_[I]BRecv(*this,buf,count,datatype,source,tag,comm,status);MP_[I]SSend(*this,buf,count,datatype,dest,tag,comm);MP_[I]SRecv(*this,buf,count,datatype,source,tag,comm,status);MPI_Wait(*this,request,status);MPI_Test(*this,request,flag,status); MPI-SystemC High-Level Parallel Programming Model‣ Similar with MPICH‣ Implemented in SystemC‣ Capable of time annotating the communication‣ Splits a communication in 3 phases:initialization, data transfer, releaseDataaccessTask CUCommrequestTransferSystemCchannelCUTaskSystemCprocessMARIUS BONACIU- 12 - <strong>ASP</strong>-<strong>DAC</strong>’06


Executable SystemC Model of Combined Arch./Alg. [1/2]MPIVideoSplitterMPIMainDivX 1MainDivX 2MainDivX 3VLC 1VLC 2Combiner StorageMPIMPIMPIMPIMPIMPIMainDivX 4MPIMPIMPIMPI–SystemC HLPPM Obtained after macro-generating the Flexible Algorithm/Architecture Model Different Configuration Parameters Different Executable SystemC Models Executable model Un-timed model Used for algorithm debugging, synchronization debug, communication debug,performance analysisMARIUS BONACIU- 13 - <strong>ASP</strong>-<strong>DAC</strong>’06


Executable SystemC Model of Combined Arch./Alg. [2/2]Macro-generated code for MainDivX 1//------------------ MainDivX task ---------------------------------------------EXTERN *image_memory1,height1, length1, top_border1, left_border1,bottom_border1, right_border1,&result1 ;void MainDivX1_MAIN ( ){//initialization of computationsMainDivX1_INIT (&image_memory1, height1, length1);//data_receive_communication from the SpliterMainDivX MPI_Recv(this,&image_memory1,sizeof(image_memory1),1 32,SPLITTER_ID,22,MPI_COMM_WORLD);//calls the function with flexible computationsMainDivX1_COMPUTE (&image_memory1,height1, length1,top_border1, left_border1,bottom_border1, right_border1,&result1);}//send_result_communication to the VLCMPI_BSend(this,&result1,sizeof(result1),32,VLC[0]_ID,22, MPI_COMM_WORLD);MARIUS BONACIU- 14 - <strong>ASP</strong>-<strong>DAC</strong>’06


Timed Executable SystemC Model [1/2] Obtained after time annotating the computations and communication in the ExecutableSystemC Model with Combined Algorithm/Architecture//------------------ MainDivX task ---------------------------------------------EXTERN *image_memory1,height1, length1, top_border1, left_border1,bottom_border1, right_border1,&result1 ;void MainDivX1_MAIN ( ){//initialization of computationsMainDivX1_INIT (&image_memory1, height1, length1);WAIT(13.224);MainDivX 1//data_receive_communication from the SpliterMPI_Recv(this,&image_memory1,sizeof(image_memory1),32,SPLITTER_ID,22,MPI_COMM_WORLD);//calls the function with flexible computationsMainDivX1_COMPUTE (&image_memory1,height1, length1,top_border1, left_border1,bottom_border1, right_border1,&result1);WAIT(2.312.564);}//send_result_communication to the VLCMPI_BSend(this,&result1,sizeof(result1),32,VLC[0]_ID,22, MPI_COMM_WORLD);MARIUS BONACIU- 15 - <strong>ASP</strong>-<strong>DAC</strong>’06


Timed Executable SystemC Model [2/2]Computation time annotations Computation times depend on the type of CPU and are determined using an ISSsimulation Can be considered fixed computation times (an average) or variable computationtimes read from a table file Annotation granularity vs. Work EffortCommunication time annotations Automatically integrated and managed by the MPI-SystemC HLPPM Splits every communication in 3 steps : initialization, transfer, release Time annotation for each step depends on the used MPI parametersMARIUS BONACIU- 16 - <strong>ASP</strong>-<strong>DAC</strong>’06


Outline• Introduction to High-Level Architecture Exploration• Flexible Architecture Model for Architecture Exploration• Flexible Algorithm and Architecture Representation• High-Level Architecture Exploration Flow for MPEG4 Encoder• ConclusionsMARIUS BONACIU<strong>ASP</strong>-<strong>DAC</strong>’06


Detailed Representation of the Design FlowAlgorithm & Arch.ParametersFlexible Algorithm/Architecture Modelfor MPEG4CommunicationNetworkComm. timetablesExpansion &timingannotationsExpansionAbstractArchitectureTimed ExecutableSystemC ModelExecutable Model with ExplicitNetworkExecution &performanceestimationsHW-SW InterfacesArchitecture RefinementHigh-LevelArch./Alg.ExplorationNoAcceptablesolution?YesArchitectureconfiguration fileRTL Architecture withexecutable SWClassical RTL design flowMP-SoC chipMARIUS BONACIU- 17 - <strong>ASP</strong>-<strong>DAC</strong>’06


Execution and Performance Estimations Executable Algorithm/Architecture Model‣ generated automatically according to the chosen algorithm and architectureconfigurations‣ insert time annotations Executing the Executable Algorithm/Architecture Model‣ SystemC compilation‣ Execute‣ Performance measured using sc_simulation_time() function after encodingevery frame Results representation‣ Using graphic table for representing the performances measured during thesimulation‣ Compare the obtained performances for different algorithm/architectureconfigurationsMARIUS BONACIU- 18 - <strong>ASP</strong>-<strong>DAC</strong>’06


Performance Estimations: QCIF, 25fps, ARM7QCIF using ARM7 (60MHz) processors8.000.000frame0 5 10 15 20clock cycles7.000.0006.000.0005.000.0004.000.0003.000.0002.000.0001.000.00001 MainDivX + 1 VLC2 MainDivX + 1 VLC4 MainDivX + 1 VLC8 MainDivX + 1 VLC16 MainDivX + 1VLC32 MainDivX + 1 VLCREAL TIMEMARIUS BONACIU- 19 - <strong>ASP</strong>-<strong>DAC</strong>’06


Performance Estimations: QCIF, 25fps, ARM9QCIF using ARM9SE46- 4kI$,4kD$ (60MHz) processors3.500.000frame0 5 10 15 203.000.000clock cycles2.500.0002.000.0001.500.0001.000.0001 MainDivX + 1 VLC2 MainDivX + 1 VLC4 MainDivX + 1 VLC8 MainDivX + 1 VLC16MainDivX + 1 VLC32 MainDivX + 1 VLCREAL TIME500.0000MARIUS BONACIU- 20 - <strong>ASP</strong>-<strong>DAC</strong>’06


Performance Estimations: CIF, 25fps, ARM7CIF using ARM7 (60MHz) processors30.000.000frame0 5 10 15 20clock cycles25.000.00020.000.00015.000.00010.000.0005.000.0001MainDivX + 1 VLC2 MainDivX + 2 VLC4 MainDivX + 3 VLC8 MainDivX + 3 VLC12 MainDivX + 3 VLC16 MainDivX + 3 VLC20 MainDivX + 3 VLC32 MainDivX + 3 VLCREAL TIME0MARIUS BONACIU- 21 - <strong>ASP</strong>-<strong>DAC</strong>’06


Performance Estimations: CIF, 25fps, ARM9CIF using ARM9SE46- 4kI$,4kD$ (60MHz) processorsframe0 5 10 15 2012.000.00010.000.000clock cycles8.000.0006.000.0004.000.0002.000.0001 MainDivX + 1 VLC2 MainDivX + 2 VLC4 MainDivX + 2 VLC8 MainDivX + 2 VLC16 MainDivX + 2 VLC32 MainDivX + 2 VLCREAL TIME0MARIUS BONACIU- 22 - <strong>ASP</strong>-<strong>DAC</strong>’06


Architecture ExplorationChangeConfigurationParametersNoExecution &performanceestimationsOK?YesArchitectureconfiguration fileReady for RTLimplementationCPUs Number 5 (4 MainDivX + 1 VLC)IPs Number 2Splitter IPMainDivX 1 CPU (ARM7)MainDivX 1 60MHzMainDivX 1 no cache… the same for the other 3 MainDivXVLC 1 CPU (ARM7)VLC 1 60MHzVLC 1 no cacheCombiner IPSplitter-MainDivX 1send protocol BlockingMainDivX 1-Splitter recv. protocol Non-BlockingSplitter-MainDivX 1burst size 128 bytesSplitter-MainDivX 1data_width 32 bitsSplitter-MainDivX 1init_latency 2 cyclesSplitter-MainDivX 1data_latency 3 cyclesMainDivX 1-VLC 1send protocol BlockingVLC 1-MainDivX 1recv. protocol BlockingVLC 1recv. arbitration AnySourceMainDivX 1-VLC 1burst size 810 bytesMainDivX 1-VLC 1data_width 32 bits.... similar for the other modules Currently, all the decisions are done manually by the designerMARIUS BONACIU- 23 - <strong>ASP</strong>-<strong>DAC</strong>’06


High-Level Estimation vs RTL Measurements140120100115,02 123,537.39% errorms806040200Estimated (High-Level)Measured (RTL) High-Level Estimation = sufficient precision to assure its reliability Error caused by the impossibility to capture at High-Level:‣ OS (scheduling, service calls, latencies induced by the API calls)‣ Interconnect between the CPU buses and Network Interfaces GRANT on bus‣ HW/SW WrappersMARIUS BONACIU- 24 - <strong>ASP</strong>-<strong>DAC</strong>’06


Results Analysis Proposed exploration at High-Level (i.e. for QCIF using ARM7)‣ ~15 minutes were required to generate the Timed SystemC Model‣ ~2 minutes were required to simulate 25 frames‣ ~1 hour was required to find one architecture solution Exploration at RTL level for QCIF using ARM7‣ ~6 months were required to build manually the RTL architecture‣ ~25 hours were required to simulate 25 frames‣ testing different architecture configuration at RTL level becomes unacceptable Adaptation to other applications‣ Flexible Algorithm/Architecture Model has to be remade.‣ MPI-SystemC has to be readapted only in case new communicationtopologies are requiredMARIUS BONACIU- 25 - <strong>ASP</strong>-<strong>DAC</strong>’06


Outline• Introduction to High-Level Architecture Exploration• Flexible Architecture Model for Architecture Exploration• Flexible Algorithm and Architecture Representation• High-Level Architecture Exploration Flow for MPEG4 Encoder• ConclusionsMARIUS BONACIU<strong>ASP</strong>-<strong>DAC</strong>’06


Conclusions• MPEG4 video encoder requires complex and long design time MP-SoCimplementations• High-Level architecture exploration helps finding reliable architectureconfigurations before the MP-SoC implementation starts• A unique Flexible Algorithm/Architecture Model for MPEG4 Encoder were usedto automatically generate different Algorithm/Architecture Models requiredduring the exploration• Time annotations were used to simulate the performances of computations andcommunications running together• Multiple configurations were experimented• The High-Level performance estimations are close to the RTL performancemeasurements• The proposed approach drastically reduced the time to explore differentconfigurations of MPEG4 Encoder in MP-SoC• This method can be extended to other applicationsMARIUS BONACIU- 26 - <strong>ASP</strong>-<strong>DAC</strong>’06


Thank youMARIUS BONACIUTIMA Laboratory, SLS Group46 Avenue Félix Viallet, Grenoble, FRANCETel : +33 4 76 57 43 34Fax : +33 4 76 47 38 14marius.bonaciu@imag.fr

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!