Understanding Sources of Inefficiency in General-Purpose Chips

By Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz

doi:10.1145/2001269.2001291

Abstract
Scaling the performance of a power-limited processor requires decreasing the energy expended per instruction executed, since energy/op * op/second is power. To better understand what improvement in processor efficiency is possible, and what must be done to capture it, we quantify the sources of the performance and energy overheads of a 720p HD H.264 encoder running on a general-purpose four-processor CMP system. The initial overheads are large: the CMP was 500× less energy efficient than an Application Specific Integrated Circuit (ASIC) doing the same job. We explore methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. Broadly applicable optimizations like single instruction, multiple data (SIMD) units improve CMP performance by 14× and energy by 10×, which is still 50× worse than an ASIC. The problem is that the basic operation costs in H.264 are so small that even with a SIMD unit doing over 10 ops per cycle, 90% of the energy is still overhead. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations.
For each subalgorithm of H.264, we create a large, specialized functional/storage unit capable of executing hundreds of operations per instruction. This improves energy efficiency by 160× (instead of 10×), and the final customized CMP reaches the same performance and comes within 3× of an ASIC solution's energy in comparable area.

1. INTRODUCTION
Most computing systems today are power limited, whether it is the 1 W limit of a cell phone system on a chip (SoC), or the 100 W limit of a processor in a server. Since power is ops/second * energy/op, we need to decrease the energy cost of each op if we want to continue to scale performance at constant power. Traditionally, chip designers were able to make increasingly complex designs both by increasing the system power, and by leveraging the energy gains from technology scaling. Historically each factor of 2 in scaling made each gate evaluation take 8× less energy.7 However, technology scaling no longer provides the energy savings it once did,9 so designers must turn to other techniques to scale energy cost. Most designs use processor-based solutions because of their flexibility and low design costs; however, these are usually not the most energy-efficient solutions. A shift to multi-core systems has helped improve the efficiency of processor systems, but that approach is also going to hit a limit pretty soon.8

On the other hand, using hardware that has been customized for a specific application (an Application Specific Integrated Circuit, or ASIC) can be three orders of magnitude better than a processor in both energy/op and ops/area.6 This paper compares ASIC solutions to processor-based solutions, to try to understand the sources of inefficiency in general-purpose processors. We hope this information will prove to be useful both for building more energy-efficient processors and for understanding why and where customization must be used for efficiency.

To build this understanding, we start with a single video compression application, 720p HD H.264 video encode, and transform the hardware it runs on from a generic multiprocessor to a custom multiprocessor with ASIC-like specialized hardware units. On this task, a general-purpose software solution takes 500× more energy per frame and 500× more area than an ASIC to reach the same performance. We choose H.264 because it demonstrates the large energy advantage of ASIC solutions (500×) and because there exist commercial ASICs that can serve as a benchmark. Moreover, H.264 contains a variety of computational motifs, from highly data-parallel algorithms (motion estimation) to control-intensive ones (Context Adaptive Binary Arithmetic Coding [CABAC]).

To better understand the potential of producing general-purpose chips with better efficiency, we consider two broad strategies for customized hardware.
The first extends the current trend of creating general data-parallel engines on our processors. This approach mimics the addition of SSE instructions, or the recent work in merging graphics processors on die to help with other applications. We claim these are similar to general functional units since they typically have some special instructions for important applications, but are still generally useful. The second approach creates application-specific data storage fused with functional units. In the limit this should be an ASIC-like solution. The first has the advantage of being a programmable solution, while the second provides potentially greater efficiency.

The results are striking. Starting from a 500× energy penalty, adding relatively wide SSE-like parallel execution engines and rewriting the code to use them improves performance/area by 14× and energy efficiency by 10×. Despite these customizations, the resulting solution is still 50× less energy efficient than an ASIC. An examination of the energy breakdown in the paper clearly demonstrates why. Basic arithmetic operations are typically 8–16 bits wide, and even when performing more than 10 such operations per cycle, arithmetic unit energy comprises less than 10% of the total. One must consider the energy cost of the desired operation compared with the energy cost of one processor cycle: for highly efficient machines, these energies should be similar.

The next section provides the background needed to understand the rest of the paper. Section 3 then presents our experimental methodology, describing our baseline, generic H.264 implementation on a Tensilica CMP. The performance and efficiency gains are described in Section 4, which also explores the causes of the overheads and different methods for addressing them. Using the insight gained from our results, Section 5 discusses the broader implications for efficient computing and supporting application-driven design.

A previous version of this paper was published in Proceedings of the 37th Annual International Symposium on Computer Architecture (2010), ACM, NY.

OCTOBER 2011 | vol. 54 | no. 10 | communications of the acm

2. BACKGROUND
We first review the basic ways one can analyze power, and some previous work in creating energy-efficient processors. With this background, we then provide an overview of H.264 encoding and its main compute stages. The section ends by comparing existing hardware and software implementations of an H.264 encoder.

2.1. Power-constrained design and energy efficiency
Power is defined to be energy per second, which can be broken up into two terms, energy/op * ops/second. Thus there are two primary means by which a designer can reduce power consumption: reduce the number of operations per second or reduce the energy per operation. The first approach—reducing the operations per second—simply reduces performance to save power. This approach is analogous to slowing down a factory's assembly line to save electricity costs; although power consumption is reduced, the factory output is also reduced and the energy used (i.e., the electricity bill) per unit of output remains unchanged. If, on the other hand, a designer wishes to maintain or improve the performance under a fixed power budget, a reduction in the fundamental energy per operation is required. It is this reduction in energy per operation—not power—that represents real gains in efficiency.

This distinction between power and energy is an important one. Even though designers typically face physical power constraints, increasing efficiency requires that the fundamental energy of operations be reduced. Although one might be tempted to report power numbers when discussing power efficiency, this can be misleading if the performance is not also reported. What may seem like a power efficiency gain may just be a modulation in performance. Energy per operation, however, is a performance-invariant metric that represents the fundamental efficiency of the work being done.
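To make the factory analogy concrete, here is a toy calculation separating the two knobs in power = energy/op * ops/second. The numbers are illustrative only, not figures from this paper:

```python
# Toy numbers (not from the paper) illustrating power vs. energy/op.
def power_watts(ops_per_second, energy_per_op_joules):
    # power = ops/second * energy/op
    return ops_per_second * energy_per_op_joules

base = power_watts(1e9, 1e-10)      # 1 Gop/s at 100 pJ/op -> 0.1 W
slowed = power_watts(0.5e9, 1e-10)  # halving the rate halves power...
# ...but energy per op (the "electricity bill" per unit of work)
# is unchanged at 100 pJ/op: no real efficiency gain.
better = power_watts(2e9, 0.5e-10)  # halving energy/op doubles ops/s
assert slowed == base / 2           # lower power, same efficiency
assert better == base               # same power budget, 2x performance
```

Only the third case, reducing the fundamental energy per operation, buys more performance at the same power budget.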
Thus, even though the designer may be facing a power constraint, it is energy per operation that the designer needs to focus on improving.

Reducing the energy required for the basic operation can be achieved through a number of techniques, all of which fundamentally reduce the overhead affiliated with the work being done. As one simple example, clock gating improves energy efficiency by eliminating spurious activity in a chip that otherwise causes energy waste.8 As another example, customized hardware can increase efficiency by eliminating overheads. The next section further discusses the use of customization.

2.2. Related work in efficient computing
Processors are often customized to improve their efficiency for specific application domains. For example, SIMD architectures achieve higher performance for multimedia and other data-parallel applications, while DSP processors are tailored for signal-processing tasks. More recently, ELM1 and AnySP24 have been optimized for embedded and mobile signal processing applications, respectively, by reducing processor overheads. While these strategies target a broad spectrum of applications, special instructions are sometimes added to speed up specific applications.
For example, Intel's SSE410 includes instructions to accelerate matrix transpose and sum-of-absolute-differences.

Customizable processors allow designers to take the next step, and create instructions tailored to applications. Extensible processors such as Tensilica's Xtensa provide a base design that the designer can extend with custom instructions and datapath units.15 Tensilica provides an automated ISA extension tool,20 which achieves speedups of 1.2× to 3× for EEMBC benchmarks and signal processing algorithms.21 Other tools have similarly demonstrated significant gains from automated ISA extension.4, 5 While automatic ISA extensions can be very effective, manually creating ISA extensions gives even larger gains: Tensilica reports speedups of 40× to 300× for kernels such as FFT, AES, and DES encryption.18, 19, 22

Recently researchers have proposed another approach for achieving energy efficiency—reducing the cost of creating customized hardware rather than customizing a processor. Examples of the latter include using higher levels of abstraction (e.g., C-to-RTL13) and even full chip generators using extensible processors.16 Independent of whether one customizes a processor, or creates customized hardware, it is important to understand in quantitative terms the types and magnitudes of energy overheads in processors.

While previous studies have demonstrated significant improvements in performance and efficiency moving from general-purpose processors to ASICs, we explore the reasons for these gains, which is essential to determine the nature and degree of customization necessary for future systems. Our approach starts with a generic CMP system. We incrementally customize its memory system and processors to determine the magnitude and sources of overhead eliminated in each step toward achieving a high-efficiency 720p HD H.264 encoder. We explore the basic computation in H.264 next.

2.3. H.264 computational motifs
H.264 is a block-based video encoder which divides each video frame into 16 × 16 macro-blocks and encodes each one separately. Each block goes through five major functions:

i. IME: Integer Motion Estimation
ii. FME: Fractional Motion Estimation
iii. IP: Intra Prediction
iv. DCT/Quant: Transform and Quantization
v. CABAC: Context Adaptive Binary Arithmetic Coding

IME finds the closest match for an image block versus a previous reference image. While it is one of the most compute-intensive parts of the encoder, the basic algorithm lends itself well to data-parallel architectures. On our base CMP, IME takes up 56% of the total encoder execution time and 52% of total energy.

The next step, FME, refines the initial match from integer motion estimation and finds a match at quarter-pixel resolution. FME is also data parallel, but it has some sequential dependencies and a more complex computation kernel that makes it more difficult to parallelize. FME takes up 36% of the total execution time and 40% of total energy on our base CMP design. Since FME and IME together dominate the computational load of the encoder, optimizing these algorithms is essential for an efficient H.264 system design.

IP uses previously encoded neighboring image blocks within the current frame to form an alternate prediction for the current image-block. While the algorithm is still dominated by arithmetic operations, the computations are much less regular than the motion estimation algorithms.
Additionally, there are sequential dependencies not only within the algorithm but also with the transform and quantization function.

Next, in DCT/Quant, the difference between a current and predicted image block is transformed and quantized to generate coefficients to be encoded. The basic function is relatively simple and data parallel. However, it is invoked a number of times for each 16 × 16 image block, which calls for an efficient implementation. For the rest of this paper, we merge these operations into the IP stage. The combined operation accounts for 7% of the total execution time and 6% of total energy.

Finally, CABAC is used to entropy-encode the coefficients and other elements of the bit-stream. Unlike the previous algorithms, CABAC is sequential and control dominated. While it takes only 1.6% of the execution time and 1.7% of total energy on our base design, CABAC often becomes the bottleneck in parallel systems due to its sequential nature.

2.4. Current H.264 implementations
The computationally intensive H.264 encoding algorithm poses a challenge for general-purpose processors, and is typically implemented as an ASIC. For example, T. C. Chen et al. implement a full-system H.264 encoder and demonstrate that real-time HD H.264 encoding is possible in hardware using relatively low power and area cost.2

H.264 software optimizations exist, particularly for motion estimation, which takes most of the encoding time. For example, sparse search techniques speed performance of IME and FME by up to 10×.14, 25 Combining aggressive algorithmic modifications with multiple cores and SSE extensions leads to highly optimized H.264 encoders on Intel processors.3, 12

Despite these optimizations, software implementations of H.264 lag far behind dedicated ASICs. Table 1 compares a software implementation of a 480p SD encoder12 to a 720p HD ASIC implementation.2 The software implementation employs a 2.8 GHz Intel Pentium 4 executing highly optimized SSE code. This results in very high energy consumption and low area efficiency. It is also worth noting that the software implementation relies on various algorithmic simplifications, which drastically reduce the computational complexity, but result in a 20% decrease in compression efficiency for a given SNR. The ASIC hardware, on the other hand, consumes over 500× less energy, is far more efficient in its use of silicon area, and has a negligible drop in compression efficiency.

Table 1. Intel's optimized H.264 encoder versus a 720p HD ASIC.

                          FPS    Area (mm²)    Energy/Frame (mJ)
  Intel (720 × 480 SD)     30       122              742
  Intel (1280 × 720 HD)    11       122             2023
  ASIC                     30         8                4

The second row gives Intel's SD data scaled to HD. ASIC data is scaled from 180 down to 90 nm.

3. EXPERIMENTAL METHODOLOGY
To understand what is needed to gain ASIC-level efficiency, we use existing H.264 partitioning techniques, and modify the H.264 encoder reference code JM 8.611 to remove dependencies and allow mapping of the five major algorithmic blocks to the four-stage macro-block (MB) pipeline shown in Figure 1. This mapping exploits task-level parallelism at the macro-block level and significantly reduces the inter-processor communication bandwidth requirements by sharing data between pipeline stages.

Figure 1. Four-stage macroblock partition of H.264. (a) Data flow between stages. (b) How the pipeline works on different macroblocks. IP includes DCT + Quant. EC is CABAC.
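The stage-by-stage behavior of Figure 1(b) can be sketched in a few lines. This is an illustrative Python model, not the paper's software: each macro-block advances one stage per time step, so in steady state four macro-blocks are in flight, one per stage.

```python
# Sketch (not the paper's code) of the four-stage macro-block pipeline
# of Figure 1(b): each MB flows through IME -> FME -> IP -> EC.
STAGES = ["IME", "FME", "IP", "EC"]

def pipeline_schedule(n_mbs, n_steps):
    """For each time step, report which MB occupies each stage."""
    schedule = []
    for t in range(n_steps):
        # The MB in stage i at time t entered the pipeline at t - i.
        occupancy = {s: (t - i if 0 <= t - i < n_mbs else None)
                     for i, s in enumerate(STAGES)}
        schedule.append(occupancy)
    return schedule

for t, occ in enumerate(pipeline_schedule(n_mbs=6, n_steps=6)):
    print(t, {s: (f"MB{m}" if m is not None else "-") for s, m in occ.items()})
```

At time step 3, for example, MB0 is being entropy-coded while MB3 enters integer motion estimation, which is why neighboring-block data must be forwarded between stages rather than recomputed.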


In the base system, we map this four-stage macro-block partition to a four-processor CMP system where each processor has 16KB 2-way set-associative instruction and data caches. Figure 2 highlights the large efficiency gap between our base CMP and the reference ASIC for individual 720p HD H.264 subalgorithms. The energy required for each RISC instruction is similar and as a result, the energy required for each task (shown in Figure 3) is related to the cycles spent on that task. The RISC implementation of IME, which is the major contributor to performance and energy consumption, has a performance gap of 525× and an energy gap of over 700× compared to the ASIC. IME and FME dominate the overall energy and thus need to be aggressively optimized. However, we also note that while IP, DCT, Quant, and CABAC are much smaller parts of the total energy/delay, even they need about 100× energy improvement to reach ASIC levels.

At approximately 8.6B instructions to process 1 frame, our base system consumes about 140 pJ per instruction—a reasonable value for a general-purpose system. To further analyze the energy efficiency of this base CMP implementation we break the processor's energy into different functional units as shown in Figure 4. This data makes it clear how far we need to go to approach ASIC efficiency. The energy spent in instruction fetch (IF) is an overhead due to the programmable nature of the processors and is absent in a custom hardware state machine, but eliminating all this overhead only increases the energy efficiency by less than one third.
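A quick back-of-the-envelope check ties these numbers together. The instruction count and per-instruction energy are the figures quoted above; the functional-unit fraction is an inference from the roughly 20× bound discussed next, not a measured value:

```python
# Back-of-the-envelope check of the base CMP numbers quoted in the text.
instructions_per_frame = 8.6e9      # ~8.6B instructions per 720p frame
energy_per_instruction = 140e-12    # ~140 pJ per instruction

frame_energy = instructions_per_frame * energy_per_instruction
print(f"{frame_energy:.2f} J per frame")   # on the order of a joule

# If stripping everything but the functional units saves only ~20x,
# the FU share of per-instruction energy is about 1/20:
fu_fraction = 1 / 20
fu_energy = energy_per_instruction * fu_fraction
print(f"~{fu_energy * 1e12:.0f} pJ of useful FU work per instruction")
```

So even a perfect processor that kept only its functional units would spend roughly 7 pJ per instruction on useful work out of the original 140 pJ, which is why the remaining gap to the ASIC must come from restructuring the computation itself.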
Even if we eliminate everything but the functional unit energy, we still end up with energy savings of only 20×—not nearly enough to reach ASIC levels. The next section explores what customizations are needed to reach the efficiency goals.

4. CUSTOMIZATION RESULTS
At first, we restrict our customizations to datapath extensions inspired by GPUs and Intel's SSE instructions. Such extensions are relatively general-purpose data-parallel optimizations and consist of single instruction, multiple data (SIMD) and multiple instruction issue per cycle (we use long instruction word, or LIW), with a limited degree of algorithm-specific customization coming in the form of operation fusion—the creation of new instructions that combine frequently occurring sequences of instructions. However, much like their SSE and GPU counterparts, these new instructions are constrained to the existing instruction formats and datapath structures. This step represents the datapaths in current state-of-the-art optimized CPUs. In the next step, we replace these generic datapaths by custom units, and allow unrestricted tailoring of the datapath by introducing arbitrary new compute operations as well as by adding custom register file structures.

The results of these customizations are shown in Figures 5 through 7. The rest of this section describes these results in detail and evaluates the effectiveness of these customization strategies. Collectively, these results describe how efficiencies improve by 170× over the baseline of Section 3.

Figure 2. The performance and energy gap for the base CMP implementation when compared to an equivalent ASIC. Intra combines IP, DCT, and Quant.

Figure 3. Processor energy breakdown for the base implementation, over the different H.264 subalgorithms. Intra combines IP, DCT, and Quant.

Figure 4. Processor energy breakdown for the base implementation. IF is instruction fetch/decode. D-$ is data cache. P Reg includes the pipeline registers, buses, and clocking. Ctrl is miscellaneous control. RF is register file. FU is the functional units.

Figure 5. Each set of bar graphs represents energy consumption (mJ) at each stage of optimization for IME, FME, IP and CABAC respectively. The first bar in each set represents base RISC energy; followed by RISC augmented with SSE/GPU style extensions; and then RISC augmented with "magic" instructions. The last bar in each group indicates energy consumption by the ASIC.


4.1. SSE/GPU style enhancements
Using Tensilica's TIE extensions we add LIW instructions and SIMD execution units with vector register files of custom depths and widths. A single SIMD instruction performs multiple operations (8 for IP, 16 for IME, and 18 for FME), reducing the number of instructions and consequently reducing IF energy. LIW instructions execute 2 or 3 operations per cycle, further reducing cycle count. Moreover, SIMD operations perform wider register file and data cache accesses, which are more energy efficient compared to narrower accesses. Therefore all components of instruction energy depicted in Figure 4 get a reduction through the use of these enhancements.

We further augment these enhancements with operation fusion, in which we fuse together frequently occurring complex instruction sub-graphs for both RISC and SIMD instructions. To prevent the register file ports from increasing, these instructions are restricted to use up to two input operands and can produce only one output. Operation fusion improves energy efficiency by reducing the number of instructions and also reducing the number of register file accesses by internally consuming short-lived intermediate data. Additionally, fusion gives us the ability to create more energy-efficient hardware implementations of the fused operations, e.g., multiplication implemented using shifts and adds. The reductions due to operation fusion are less than 2× in energy and less than 2.5× in performance.

With SIMD, LIW and Op Fusion support, the IME, FME and IP processors achieve speedups of around 15×, 30× and 10×, respectively. CABAC is not data parallel and benefits only from LIW and op fusion, with a speedup of merely 1.1× and almost no change in energy per operation. Overall, the application gets an energy efficiency gain of almost 10×, but still uses greater than 50× more energy than an ASIC. To reach ASIC levels of efficiency, we need a different approach.

Figure 6. Speedup at each stage of optimization for IME, FME, IP and CABAC.

4.2. Algorithm specific instructions
The root cause of the large energy difference is that the basic operations in H.264 are very simple and low energy. They only require 8–16 bit integer operations, so the fundamental energy per operation bound is on the order of hundreds of femtojoules in a 90 nm process. All other costs in a processor—IF, register fetch, data fetch, control, and pipeline registers—are much larger (140 pJ) and dominate overall power. Standard SIMD and simple fused instructions can only go so far to improve the performance and energy efficiency.
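The amortization arithmetic behind this limit can be sketched as follows. The 140 pJ overhead is the per-instruction figure quoted above; the per-op energy takes the "hundreds of femtojoules" order of magnitude as 500 fJ purely for illustration:

```python
# Illustrative amortization of per-instruction overhead (140 pJ, from
# the text) over N useful ops, assuming ~500 fJ per basic 8-16 bit op.
overhead_per_instruction = 140e-12   # fetch, decode, control, pipeline...
energy_per_op = 500e-15              # assumed fundamental op energy

for ops_per_instruction in (1, 10, 100, 1000):
    total = overhead_per_instruction + ops_per_instruction * energy_per_op
    useful = ops_per_instruction * energy_per_op / total
    print(f"{ops_per_instruction:5d} ops/instr -> {useful:6.1%} useful")
```

With these assumed numbers, ten ops per instruction still leaves well over 90% of the energy in overhead; only at hundreds of ops per instruction does the useful fraction become substantial, which is exactly the regime the "magic" instructions below target.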
It is hard to aggregate more than 10–20 operations<strong>in</strong>to an <strong>in</strong>struction without <strong>in</strong>curr<strong>in</strong>g grow<strong>in</strong>g <strong>in</strong>efficiencies,and with tens <strong>of</strong> operations per cycle we still havea mach<strong>in</strong>e where around 90% <strong>of</strong> the energy is go<strong>in</strong>g <strong>in</strong>tooverhead functions. It is now easy to see how an ASIC canbe 2–3 orders <strong>of</strong> magnitude lower energy than a processor.For computationally limited applications with low-energyoperations, an ASIC can implement hardware which bothhas low overheads, and is a perfect structural match to theapplication. These features allow it to exploit large amounts<strong>of</strong> parallelism efficiently.To match these results <strong>in</strong> a processor we must amortizethe per-<strong>in</strong>struction energy overheads over hundreds <strong>of</strong> theseFigure 7. Processor energy breakdown for H.264. IF is <strong>in</strong>struction fetch/decode. D-$ is data cache. Pip is the pipel<strong>in</strong>e registers, buses, andclock<strong>in</strong>g. Ctl is random control. RF is the register file. FU is the functional elements. Only the top bar or two (FU, RF) contribute useful work <strong>in</strong>the processor. For this application it is hard to achieve much more than 10% <strong>of</strong> the power <strong>in</strong> the FU without add<strong>in</strong>g custom hardware units.100%90%80%70%60%50%40%30%20%FURFCtlPipD-$IF10%0%RISC SSE/GPU Magic RISC SSE/GPU Magic RISC SSE/GPU Magic RISC SSE/GPU MagicIMEFMEIPCABACOCTOBER 2011 | vol. 54 | no. 10 | communications <strong>of</strong> the acm 89


simple operations. To create instructions with this level of parallelism requires custom storage structures with algorithm-specific communication links that directly feed large amounts of data to custom functional units without explicit register accesses. These structures also substantially increase data reuse in the datapath and reduce communication bandwidth and power at all levels of the memory hierarchy (register, cache, and memory).

Once this hardware is in place, the machine can issue "magic" instructions that accomplish large amounts of computation at very low cost. This type of structure eliminates almost all the processor overheads for these functions by eliminating most of the communication overhead associated with processors. We call these instructions "magic" because they can have a large effect on both the energy and performance of an application and yet would be difficult to derive directly from the code. Such instructions typically require an understanding of the underlying algorithms, as well as the capabilities and limitations of existing hardware resources, thus requiring greater effort on the part of the designer. Since the IP stage uses techniques similar to FME, the rest of the section will focus on IME, FME, and CABAC.

IME Strategy: To demonstrate the nature and benefit of magic instructions, we first look at IME, which determines the best alignment of two image blocks.
The best match is defined by the smallest sum of absolute differences (SAD) over all of the pixel values. Since finding the best match requires scanning one image block over a larger piece of the image, one can easily see that while this requires a large number of calculations, it also has very high data locality. Figure 8 shows the custom datapath elements added to the IME processor to accelerate this function.

Figure 8. Custom storage and compute for IME 4 × 4 SAD. Current- and ref-pixel register files feed all pixels to the 16 × 16 SAD array in parallel. Also, the ref-pixel register file allows horizontal and vertical shifts. [Diagram: 16 rows of reference-pixel registers and a current-pixel register file, each with 128-bit load ports, feeding four groups of 16 SAD units.]

At the core is a 16 × 16 SAD array, which can perform 256 SAD operations in one cycle. Since our standard vector register files cannot feed enough data to this unit per cycle, the SAD unit is fed by a custom register structure which allows parallel access to all 16-pixel rows, enabling this datapath to perform one 256-pixel computation per cycle. In addition, the intermediate results of the pixel operations need not be stored, since they can be reduced in place (summed) to create the single desired output. Furthermore, because we need to check many possible alignments, the custom storage structure supports parallel shifts in all four directions, allowing one to shift the entire comparison image in only one cycle.
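As a behavioral sketch of what the SAD array evaluates in a single cycle (a Python model of ours, not the paper's implementation), the 256-way computation is:

```python
def sad_16x16(cur, ref):
    """Sum of absolute differences over two 16x16 pixel blocks.

    In software this is 256 subtract/absolute-value operations plus a
    reduction tree; the custom datapath described above does all of it
    in one cycle, and the in-place reduction means none of the 256
    intermediate differences is ever stored.
    """
    assert len(cur) == len(ref) == 16
    return sum(abs(c - r)
               for crow, rrow in zip(cur, ref)
               for c, r in zip(crow, rrow))
```

A RISC core issuing one or two of these scalar operations per cycle needs on the order of a thousand instructions for the same result, which is the gap the custom unit closes.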
This shift support drastically reduces the instructions wasted on loads, shifts, and pointer arithmetic, as well as data cache accesses. "Magic" instructions and storage elements are also created for the other major algorithmic functions in IME to achieve similar gains.

Thus, by reducing instruction overheads and by amortizing the remaining overheads over larger datapath widths, this functional unit finally consumes around 40% of the total instruction energy. The performance and energy efficiency improve by 200–300× over the base implementation, matching the ASIC's performance and coming within 3× of ASIC energy. This customized solution is 20–30× better than the results using only generic data-parallel techniques.

FME Strategy: FME improves the output of the IME stage by refining the alignment to a fraction of a pixel. To perform the fractional alignment, the FME stage interpolates one image to estimate the values of a 4 × 4 pixel block at fractional pixel coordinates. This operation is done by a filter-and-upsample block, which again has high arithmetic intensity and high data locality. In H.264, upsampling uses a six-tap FIR filter that requires one new pixel per iteration. To reduce instruction fetches and register file transfers, we augment the processor register file with a custom 8-bit wide, six-entry shift register structure which works like a FIFO: every time a new 8-bit value is loaded, all elements are shifted.
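H.264's half-pel interpolation filter has taps (1, -5, 20, 20, -5, 1). A minimal Python model of the six-entry shift-register FIFO feeding that filter (function names are ours; normalization and clipping are omitted):

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 six-tap half-pel filter

def upsample_row(pixels):
    """Model of the shift-register FIFO: each new pixel load shifts all
    six entries, and the filter reads all six in parallel, so neither
    the shifting nor the operand fetch touches the register file."""
    window = list(pixels[:6])                     # the six-entry FIFO
    outputs = [sum(t * p for t, p in zip(TAPS, window))]
    for p in pixels[6:]:
        window = window[1:] + [p]                 # one load, full shift
        outputs.append(sum(t * p for t, p in zip(TAPS, window)))
    return outputs                                # unnormalized sums
```

Each output needs only one new pixel, which is exactly the reuse pattern the custom structure exploits.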
This structure eliminates the use of expensive register file accesses for either data shifting or operand fetch, which are now both handled by short local wires. All six entries can now be accessed in parallel, and we create a six-input multiplier/adder which can do the calculation in a single cycle and can also be implemented much more efficiently than a composition of normal two-input adders. Finally, since we need to perform the upsampling in 2-D, we build a shift register structure that stores the horizontally upsampled data and feeds its outputs to a number of vertical upsampling units (Figure 9).

This transformation yields large savings even beyond the savings in instruction fetch energy. From a pure datapath perspective (register file, pipeline registers, and functional units), this approach dissipates less than 1/30th the energy of a traditional approach.

A look at the FME SIMD code implementation highlights the advantages of this custom hardware approach versus the use of larger SIMD arrays. The SIMD implementation suffers from code replication and excessive local memory and register file accesses, in addition to not having the most efficient functional units.
FME contains seven different sub-block sizes, ranging from 16 × 16 pixel blocks to 4 × 4 blocks, and not all of them can fully exploit the 18-way SIMD datapath. Additionally, to use the 18-way SIMD datapath, each sub-block requires a slightly different code sequence, which results in code replication and more instruction-fetch power because of the larger I-cache.

To avoid these issues, the custom hardware upsampler processes 4 × 4 pixels. This allows it to reuse the same computation loop repeatedly without any code replication,


which, in turn, lets us reduce the I-cache from a 16KB four-way cache to a 2KB direct-mapped cache. Due to the abundance of short-lived data, we remove the vector register files and replace them with custom storage buffers. The "magic" instruction reduces instruction cache energy by 54× and processor fetch and decode energy by 14×.

Figure 9. FME upsampling unit. Customized shift registers, directly wired to function logic, result in efficient upsampling. Ten integer pixels from local memory are used for row upsampling in RFIR blocks. Half-upsampled pixels, along with the appropriate integer pixels, are loaded into shift registers. CFIR accesses six shift registers in each column simultaneously to perform column upsampling. [Diagram: an integer buffer feeding RFIR row-upsampling blocks, whose results fill row-half and column-half buffers read in parallel by CFIR column-upsampling units.]

Figure 10. CABAC arithmetic encoding loop. (a) H.264 reference code. (b) After insertion of "magic" instructions. Much of the control logic in the main loop has been reduced to one constant-time instruction, ENCODE_PIPE_5. [Flowcharts: (a) a loop that repeatedly tests whether to output an indefinite number of 0 or 1 bits per iteration; (b) a single ENCODE_PIPE_5 step with a seldom-executed 64-bit output path.]
Finally, as Figure 7 shows, 35% of the energy is now going into the functional units, and again the energy efficiency of this unit is close to an ASIC.

CABAC Strategy: CABAC originally consumed less than 2% of the total energy, but after the data-parallel components are accelerated by "magic" instructions, CABAC dominates the total energy. However, it requires a different set of optimizations because it is control oriented and not data parallel. Thus, for CABAC, we are more interested in control fusion than operation fusion.

A critical part of CABAC is the arithmetic encoding stage, which is a serial process with small amounts of computation but complex control flow. We break arithmetic coding down into a simple pipeline and drastically change it from the reference code implementation, reducing the binary encoding of each symbol to five instructions. While there are several if–then–else conditionals reduced to single instructions (or with several compressed into one), the most significant reduction came in the encoding loop, as shown in Figure 10a. Each iteration of this loop may or may not trigger execution of an internal loop that outputs an indefinite number of encoded bits.
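To see why this control flow is hostile to a fixed-length pipeline, here is a simplified sketch of a renormalization-style inner loop of the kind arithmetic coders use. This is our illustration, not the H.264 reference code, which additionally tracks "outstanding" bits; the 9-bit low / 8-bit range widths are assumptions for the sketch.

```python
def renormalize(low, rng, out):
    """Simplified model of the variable-length inner loop: while the
    coding interval is too narrow, shift one bit out of 'low' and
    double the interval. A symbol may emit zero bits or many, so the
    iteration count is data dependent -- exactly the behavior the
    constant-time ENCODE_PIPE_5 instruction collapses."""
    while rng < 0x100:            # interval below one quarter: emit a bit
        out.append((low >> 8) & 1)
        low = (low << 1) & 0x1FF  # keep low within 9 bits
        rng <<= 1
    return low, rng
```

The data-dependent trip count is what makes the reference loop serial and branch heavy, and why fusing its control into one instruction pays off.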
By fundamentally changing the algorithm, the while loop was reduced to a single constant-time instruction (ENCODE_PIPE_5) and a rarely executed while loop, as shown in Figure 10b.

The other critical part of CABAC is the conversion of non-binary-valued DCT coefficients to binary codes in the binarization stage. To improve the efficiency of this step, we create a 16-entry LIFO structure to store DCT coefficients. To each LIFO entry we add a single-bit flag to identify zero-valued DCT coefficients. These structures, along with their corresponding logic, reduce register file energy by bringing the most frequently used values out of the register file and into custom storage buffers. Using "magic" instructions we produce Unary and Exponential-Golomb codes with simple operations, which helps reduce datapath energy. These modifications are inspired by the ASIC implementation described in Shojania and Sudharsanan.17 CABAC is optimized to achieve the bit rate required for H.264 level 3.1 at 720p video resolution.

Magic Instructions Summary: To summarize, the magic instructions perform up to hundreds of operations each time they are executed, so the overhead of the instruction is better balanced by the work performed.
Of course this is hard to do in a general way, since the bandwidth requirements and utilization of a larger SIMD array would be problematic. We therefore solved this problem by building custom storage units tailored to the application, and then directly connecting the necessary functional units to these storage units. These custom storage units greatly amplified the register fetch bandwidth, since data in the storage units is used for many different computations. In addition, since the intra-storage and functional-unit communications are fixed and local, they can be managed at ASIC-like energy costs.

After this effort, the processors optimized for data-parallel algorithms have a total speedup of up to 600× and an energy reduction of 60–350× compared to our base CMP. For CABAC, the total performance gain is 17× and the energy gain is 8×. Figure 7 provides the final energy breakdowns. The efficiencies found in these custom datapaths are impressive since, in H.264 at least, they take advantage of data-sharing patterns and create very efficient multiple-input operations. This means that even if researchers are able to create a processor which decreases the instruction and data fetch parts of a processor by more than 10×, these solutions will not be as efficient as solutions with "magic" instructions.

Achieving ASIC-like efficiency required 2–3 special hardware units for each subalgorithm, which is significant customization work. Some might even say we are just building an ASIC in our processor.
While we agree that creating "magic" instructions requires a thorough understanding of


the application as well as the hardware, we feel that adding this hardware in an extensible processor framework has many advantages over just designing an ASIC. These advantages come from the constrained processor design environment and the software, compiler, and debugging tools available in this environment. Many of the low-level issues, like interface design and pipelining, are handled automatically. In addition, since all hardware is wrapped in a general-purpose processor, the application developer retains enough flexibility in the processor to make future algorithmic modifications.

5. ENERGY-EFFICIENT COMPUTERS
It is important to remember that the "overhead" of using a processor depends on the energy required for the desired operation. Floating point (FP) energy costs are about 10× those of the small integer ops we have explored in this paper, so machines with 10-wide FP units will not be far from the maximum efficiency possible for that class of applications. Similarly, customizing the hardware will not have a large impact on the energy efficiency of an application dominated by memory costs; an ASIC's and a processor's energy will not be that different. For these applications, optimization that restructures the algorithm and/or the memory system is needed to reduce energy, and can yield large savings.23
Unfortunately, as we drive to more energy-efficient solutions, we will find ways to transform FP code to fixed-point operations and to restructure our algorithms to minimize memory fetch costs. Said differently, if we want ASIC-like energy efficiencies (100× to 1000× more energy efficient than general-purpose CPUs), we will have to transform our algorithms to be dominated by the simple, low-energy operations we have been studying in this paper. Since the energy of these operations is very low, any overhead, from the register fetch to the pipeline registers in a processor, is likely to dominate energy costs. The good news is that this large overhead per instruction makes estimating the energy savings easy: you simply look at the performance gains. The bad news is that adding state-of-the-art data-parallel hardware like wide SIMD units and media extensions will still leave you far from the desired efficiency.

It is encouraging that we were able to achieve ASIC energy levels in a customized processor by creating customized hardware that easily fits inside a processor framework. Extending a processor instead of building an ASIC seems like the correct approach, since it provides a number of software development advantages and the energy cost of this option seems small.
However, building such custom datapaths still requires significant effort, and thus the key challenge now is to build a design system that lets application designers create and exploit such customizations with much greater ease. The key is to find a parameterization of the space which makes sense to application designers in a specific application domain.

For example, a number of algorithms in a domain often share similar data-flow and computation structures. In H.264, a common computational motif is based on a convolution-like data flow: apply a function to all the data, then perform a reduction, then shift the data, add a small amount of new data, and repeat. A similar pattern of convolution-like computations also exists in a number of other image-processing and media-processing algorithms. While the exact computation will differ for each particular algorithm, we believe that by exploiting the common data-flow structure of these algorithms we can create a generalized convolution abstraction which application designers can customize. If this abstraction is useful for application designers, one can imagine implementing it by creating a flexible hardware unit that is significantly more efficient than a generic SIMD/SSE unit. We also believe that similar patterns exist in other domains, which may allow us to create a set of customized units for each domain.

Even if we could come up with such a set of customized functional units, it is likely that some degree of per-algorithm configurability will be required.
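The convolution motif described above, with its per-algorithm customization points exposed, can be sketched in a few lines (a hypothetical interface of ours, not a proposed hardware specification):

```python
def convolution_engine(window, stream, map_fn, reduce_fn):
    """Generalized convolution motif: map a function over the window,
    reduce, then shift the data and add one new element, and repeat.
    map_fn and reduce_fn are the per-algorithm customization points:
    e.g., absolute difference + sum gives SAD matching, while
    tap-multiply + sum gives FIR filtering."""
    window = list(window)
    results = []
    for x in stream:
        results.append(reduce_fn(map_fn(w) for w in window))
        window = window[1:] + [x]     # shift data, add new data
    return results
```

Both the IME SAD scan and the FME six-tap filter are instances of this pattern, which is what makes a shared, configurable hardware unit plausible.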
For example, in a convolution engine, the convolution size and the resulting datapath size could vary from algorithm to algorithm and thus potentially need to be tuned on a per-processor basis. This leads to the idea of creating a two-step design process. In the first step, a set of chip experts designs a processor generator platform. This is a meta-level design which "knows" about the special functional units and control optimizations, and provides the application designer an application-tailored interface. The application designers can then co-optimize their code and the interface parameters to meet their requirements. After this co-optimization, an optimized implementation based on these parameters is generated automatically. In fact, such a platform will also help in building the more generic domain-customized functional units mentioned earlier, by facilitating the process of rapidly creating and evaluating new designs.

A reconfigurable processor generator alone is not a sufficient solution, since one still needs to take one or more of these processors and create a working chip system. Designing and validating a chip is an extremely hard and expensive task. If application customization will be needed for efficiency (and our data indicates it will be), we need to start creating systems that efficiently allow savvy application experts to create these optimized chip-level solutions. This will require extending the ideas for extensible processors to full chip generation systems. We are currently working on creating this type of system.16
Acknowledgments
This work would not have been possible without great support and cooperation from many people at Tensilica, including Chris Rowen, Dror Maydan, Bill Huffman, Nenad Nedeljkovic, David Heine, Govind Kamat, and others. The authors acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation subsidiary, and earlier support from DARPA. This material is based upon work partially supported under a Sequoia Capital Stanford Graduate Fellowship. The National Science Foundation, under Grant #0937060 to the Computing Research Association, also supports this material for the CIFellows Project. Any opinions, findings,


and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the National Science Foundation or the Computing Research Association.

References
1. Balfour, J., Dally, W., Black-Schaffer, D., Parikh, V., Park, J. An energy-efficient processor architecture for embedded systems. Comput. Archit. Lett. 7, 1 (2007), 29–32.
2. Chen, T.C. Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. IEEE Trans. Circuits Syst. Video Technol. 16, 6 (2006), 673–688.
3. Chen, Y.-K., Li, E.Q., Zhou, X., Ge, S. Implementation of H.264 encoder and decoder on personal computers. J. Vis. Commun. Image Represent. 17 (2006), 509–532.
4. Clark, N., Zhong, H., Mahlke, S. Automated custom instruction generation for domain-specific processor acceleration. IEEE Trans. Comput. 54, 10 (2005), 1258–1270.
5. Cong, J., Fan, Y., Han, G., Zhang, Z. Application-specific instruction generation for configurable processor architectures. In 12th International Symposium on Field Programmable Gate Arrays (2004), 183–189.
6. Davis, W., Zhang, N., Camera, K., Chen, F., Markovic, D., Chan, N., Nikolic, B., Brodersen, R. A design environment for high-throughput, low-power, dedicated signal processing systems. In Custom Integrated Circuits Conference (CICC) (2001).
7. Dennard, R., Gaensslen, F., Yu, H., Rideout, V., Bassous, E., LeBlanc, A. Design of ion-implanted MOSFETs with very small physical dimensions. Proc. IEEE 87, 4 (1999; reprinted from IEEE J. Solid-State Circuits, 1974), 668–678.
8. Esmaeilzadeh, H., Blem, E., Amant, R.S., Sankaralingam, K., Burger, D. Dark silicon and the end of multicore scaling.
In Proceedings of the 38th International Symposium on Computer Architecture (June 2011).
9. Horowitz, M. Scaling, power and the future of CMOS. In Proceedings of the 20th International Conference on VLSI Design, held jointly with the 6th International Conference on Embedded Systems (2007).
10. Intel Corporation. Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4) (2008).
11. ITU-T. Joint Video Team Reference Software JM8.6 (2004).
12. Iverson, V., McVeigh, J., Reese, B. Real-time H.264/AVC codec on Intel architectures. In IEEE International Conference on Image Processing (ICIP'04) (2004).
13. Kathail, V. Creating power-efficient application engines for SOC design. SOC Central (2005).
14. ISO/IEC MPEG, ITU-T VCEG. Fast integer pel and fractional pel motion estimation for JVT. JVT-F017 (2002).
15. Rowen, C., Leibson, S. Flexible architectures for engineering successful SOCs. In Proceedings of the 41st Annual Design Automation Conference (2004), 692–697.
16. Shacham, O., Azizi, O., Wachs, M., Qadeer, W., Asgar, Z., Kelley, K., Stevenson, J., Solomatnikov, A., Firoozshahian, A., Lee, B., Richardson, S., Horowitz, M. Why design must change: Rethinking digital design. IEEE Micro 30, 6 (Nov.–Dec. 2010), 9–24.
17. Shojania, H., Sudharsanan, S. A VLSI architecture for high performance CABAC encoding. In Visual Communications and Image Processing (2005).
18. Tensilica Inc. Implementing the Advanced Encryption Standard on Xtensa processors. Application Notes (2009).
19. Tensilica Inc. Implementing the fast Fourier transform (FFT). Application Notes (2005).
20. Tensilica Inc.
The what, why, and how of configurable processors (2008).
21. Tensilica Inc. Xtensa LX2 benchmarks (2005).
22. Tensilica Inc. Xtensa processor extensions for Data Encryption Standard (DES). Application Notes (2008).
23. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (2007).
24. Woh, M., Seo, S., Mahlke, S., Mudge, T., Chakrabarti, C., Flautner, K. AnySP: Anytime anywhere anyway signal processing. SIGARCH Comput. Archit. News 37, 3 (2009), 128–139.
25. Yin, P., Tourapis, H.-Y.C., Tourapis, A.M., Boyce, J. Fast mode decision and motion estimation for JVT/H.264. In Proceedings of the IEEE International Conference on Image Processing (2003).

Rehan Hameed (rhameed@stanford.edu), Stanford University, Stanford, CA.
Wajahat Qadeer (wqadeer@stanford.edu), Stanford University, Stanford, CA.
Megan Wachs (wachs@stanford.edu), Stanford University, Stanford, CA.
Omid Azizi (oazizi@gmail.com), Hicamp Systems, Menlo Park, CA.
Alex Solomatnikov (solomatnikov@gmail.com), Hicamp Systems, Menlo Park, CA.
Benjamin C. Lee (benjamin.c.lee@duke.edu), Duke University, Durham, NC.
Stephen Richardson (steveri@stanford.edu), Stanford University, Stanford, CA.
Christos Kozyrakis (kozyraki@stanford.edu), Stanford University, Stanford, CA.
Mark Horowitz (horowitz@stanford.edu), Stanford University, Stanford, CA.

© 2011 ACM 0001-0782/11/10 $10.00


