Understanding Sources of Inefficiency in General-Purpose Chips

By Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz

doi:10.1145/2001269.2001291

Abstract
Scaling the performance of a power-limited processor requires decreasing the energy expended per instruction executed, since energy/op * op/second is power. To better understand what improvement in processor efficiency is possible, and what must be done to capture it, we quantify the sources of the performance and energy overheads of a 720p HD H.264 encoder running on a general-purpose four-processor CMP system. The initial overheads are large: the CMP was 500× less energy efficient than an Application Specific Integrated Circuit (ASIC) doing the same job. We explore methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. Broadly applicable optimizations like single instruction, multiple data (SIMD) units improve CMP performance by 14× and energy by 10×, which is still 50× worse than an ASIC. The problem is that the basic operation costs in H.264 are so small that even with a SIMD unit doing over 10 ops per cycle, 90% of the energy is still overhead. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations.
For each subalgorithm of H.264, we create a large, specialized functional/storage unit capable of executing hundreds of operations per instruction. This improves energy efficiency by 160× (instead of 10×), and the final customized CMP reaches the same performance and comes within 3× of an ASIC solution's energy in comparable area.

1. INTRODUCTION
Most computing systems today are power limited, whether it is the 1 W limit of a cell phone system on a chip (SoC), or the 100 W limit of a processor in a server. Since power is ops/second * energy/op, we need to decrease the energy cost of each op if we want to continue to scale performance at constant power. Traditionally, chip designers were able to make increasingly complex designs both by increasing the system power, and by leveraging the energy gains from technology scaling. Historically each factor of 2 in scaling made each gate evaluation take 8× less energy.7 However, technology scaling no longer provides the energy savings it once did,9 so designers must turn to other techniques to scale energy cost. Most designs use processor-based solutions because of their flexibility and low design costs; however, these are usually not the most energy-efficient solutions. A shift to multi-core systems has helped improve the efficiency of processor systems, but that approach is also going to hit a limit pretty soon.8

On the other hand, using hardware that has been customized for a specific application (an Application Specific Integrated Circuit, or ASIC) can be three orders of magnitude better than a processor in both energy/op and ops/area.6 This paper compares ASIC solutions to processor-based solutions, to try to understand the sources of inefficiency in general-purpose processors. We hope this information will prove to be useful both for building more energy-efficient processors and for understanding why and where customization must be used for efficiency.

To build this understanding, we start with a single video compression application, 720p HD H.264 video encode, and transform the hardware it runs on from a generic multiprocessor to a custom multiprocessor with ASIC-like specialized hardware units. On this task, a general-purpose software solution takes 500× more energy per frame and 500× more area than an ASIC to reach the same performance. We choose H.264 because it demonstrates the large energy advantage of ASIC solutions (500×) and because there exist commercial ASICs that can serve as a benchmark. Moreover, H.264 contains a variety of computational motifs, from highly data-parallel algorithms (motion estimation) to control-intensive ones (Context Adaptive Binary Arithmetic Coding [CABAC]).

To better understand the potential of producing general-purpose chips with better efficiency, we consider two broad strategies for customized hardware.
The first extends the current trend of creating general data-parallel engines on our processors. This approach mimics the addition of SSE instructions, or the recent work in merging graphics processors on die to help with other applications. We claim these are similar to general functional units since they typically have some special instructions for important applications, but are still generally useful. The second approach creates application-specific data storage fused with functional units. In the limit this should be an ASIC-like solution. The first has the advantage of being a programmable solution, while the second provides potentially greater efficiency.

The results are striking. Starting from a 500× energy penalty, adding relatively wide SSE-like parallel execution engines and rewriting the code to use them improves performance/area by 14× and energy efficiency by 10×. Despite these customizations, the resulting solution is still 50× less energy efficient than an ASIC. An examination of the energy breakdown in the paper clearly demonstrates why. Basic arithmetic operations are typically 8–16 bits wide, and even when performing more than 10 such operations per cycle, arithmetic unit energy comprises less than 10% of the total. One must consider the energy cost of the desired operation compared with the energy cost of one processor cycle: for highly efficient machines, these energies should be similar.

The next section provides the background needed to understand the rest of the paper. Section 3 then presents our experimental methodology, describing our baseline, generic H.264 implementation on a Tensilica CMP. The performance and efficiency gains are described in Section 4, which also explores the causes of the overheads and different methods for addressing them. Using the insight gained from our results, Section 5 discusses the broader implications for efficient computing and supporting application-driven design.

A previous version of this paper was published in Proceedings of the 37th Annual International Symposium on Computer Architecture (2010), ACM, NY.

OCTOBER 2011 | vol. 54 | no. 10 | communications of the acm

2. BACKGROUND
We first review the basic ways one can analyze power, and some previous work in creating energy-efficient processors. With this background, we then provide an overview of H.264 encoding and its main compute stages. The section ends by comparing existing hardware and software implementations of an H.264 encoder.

2.1. Power-constrained design and energy efficiency
Power is defined to be energy per second, which can be broken up into two terms, energy/op * ops/second. Thus there are two primary means by which a designer can reduce power consumption: reduce the number of operations per second or reduce the energy per operation. The first approach—reducing the operations per second—simply reduces performance to save power. This approach is analogous to slowing down a factory's assembly line to save electricity costs; although power consumption is reduced, the factory output is also reduced and the energy used (i.e., the electricity bill) per unit of output remains unchanged. If, on the other hand, a designer wishes to maintain or improve the performance under a fixed power budget, a reduction in the fundamental energy per operation is required. It is this reduction in energy per operation—not power—that represents real gains in efficiency.

This distinction between power and energy is an important one. Even though designers typically face physical power constraints, increasing efficiency requires that the fundamental energy of operations be reduced. Although one might be tempted to report power numbers when discussing power efficiency, this can be misleading if the performance is not also reported. What may seem like a power efficiency gain may just be a modulation in performance. Energy per operation, however, is a performance-invariant metric that represents the fundamental efficiency of the work being done.
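To make the factory analogy concrete, here is a toy calculation separating the two knobs in power = energy/op * ops/second. The numbers are illustrative only, not figures from this paper:

```python
# Toy numbers (not from the paper) illustrating power vs. energy/op.
def power_watts(ops_per_second, energy_per_op_joules):
    # power = ops/second * energy/op
    return ops_per_second * energy_per_op_joules

base = power_watts(1e9, 1e-10)      # 1 Gop/s at 100 pJ/op -> 0.1 W
slowed = power_watts(0.5e9, 1e-10)  # halving the rate halves power...
# ...but energy per op (the "electricity bill" per unit of work)
# is unchanged at 100 pJ/op: no real efficiency gain.
better = power_watts(2e9, 0.5e-10)  # halving energy/op doubles ops/s
assert slowed == base / 2           # lower power, same efficiency
assert better == base               # same power budget, 2x performance
```

Only the third case, reducing the fundamental energy per operation, buys more performance at the same power budget.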
Thus, even though the designer may be facing a power constraint, it is energy per operation that the designer needs to focus on improving.

Reducing the energy required for the basic operation can be achieved through a number of techniques, all of which fundamentally reduce the overhead affiliated with the work being done. As one simple example, clock gating improves energy efficiency by eliminating spurious activity in a chip that otherwise causes energy waste.8 As another example, customized hardware can increase efficiency by eliminating overheads. The next section further discusses the use of customization.

2.2. Related work in efficient computing
Processors are often customized to improve their efficiency for specific application domains. For example, SIMD architectures achieve higher performance for multimedia and other data-parallel applications, while DSP processors are tailored for signal-processing tasks. More recently, ELM1 and AnySP24 have been optimized for embedded and mobile signal processing applications, respectively, by reducing processor overheads. While these strategies target a broad spectrum of applications, special instructions are sometimes added to speed up specific applications.
For example, Intel's SSE410 includes instructions to accelerate matrix transpose and sum-of-absolute-differences.

Customizable processors allow designers to take the next step, and create instructions tailored to applications. Extensible processors such as Tensilica's Xtensa provide a base design that the designer can extend with custom instructions and datapath units.15 Tensilica provides an automated ISA extension tool,20 which achieves speedups of 1.2× to 3× for EEMBC benchmarks and signal processing algorithms.21 Other tools have similarly demonstrated significant gains from automated ISA extension.4, 5 While automatic ISA extensions can be very effective, manually creating ISA extensions gives even larger gains: Tensilica reports speedups of 40× to 300× for kernels such as FFT, AES, and DES encryption.18, 19, 22

Recently researchers have proposed another approach for achieving energy efficiency—reducing the cost of creating customized hardware rather than customizing a processor. Examples of the latter include using higher levels of abstraction (e.g., C-to-RTL13) and even full chip generators using extensible processors.16 Independent of whether one customizes a processor, or creates customized hardware, it is important to understand in quantitative terms the types and magnitudes of energy overheads in processors.

While previous studies have demonstrated significant improvements in performance and efficiency moving from general-purpose processors to ASICs, we explore the reasons for these gains, which is essential to determine the nature and degree of customization necessary for future systems. Our approach starts with a generic CMP system. We incrementally customize its memory system and processors to determine the magnitude and sources of overhead eliminated in each step toward achieving a high-efficiency 720p HD H.264 encoder. We explore the basic computation in H.264 next.

2.3. H.264 computational motifs
H.264 is a block-based video encoder which divides each video frame into 16 × 16 macro-blocks and encodes each one separately. Each block goes through five major functions:

i. IME: Integer Motion Estimation
ii. FME: Fractional Motion Estimation
iii. IP: Intra Prediction
iv. DCT/Quant: Transform and Quantization
v. CABAC: Context Adaptive Binary Arithmetic Coding

IME finds the closest match for an image block versus a previous reference image. While it is one of the most compute-intensive parts of the encoder, the basic algorithm lends itself well to data-parallel architectures. On our base CMP, IME takes up 56% of the total encoder execution time and 52% of total energy.

The next step, FME, refines the initial match from integer motion estimation and finds a match at quarter-pixel resolution. FME is also data parallel, but it has some sequential dependencies and a more complex computation kernel that makes it more difficult to parallelize. FME takes up 36% of the total execution time and 40% of total energy on our base CMP design. Since FME and IME together dominate the computational load of the encoder, optimizing these algorithms is essential for an efficient H.264 system design.

IP uses previously encoded neighboring image blocks within the current frame to form an alternate prediction for the current image-block. While the algorithm is still dominated by arithmetic operations, the computations are much less regular than the motion estimation algorithms.
Additionally, there are sequential dependencies not only within the algorithm but also with the transform and quantization function.

Next, in DCT/Quant, the difference between a current and predicted image block is transformed and quantized to generate coefficients to be encoded. The basic function is relatively simple and data parallel. However, it is invoked a number of times for each 16 × 16 image block, which calls for an efficient implementation. For the rest of this paper, we merge these operations into the IP stage. The combined operation accounts for 7% of the total execution time and 6% of total energy.

Finally, CABAC is used to entropy-encode the coefficients and other elements of the bit-stream. Unlike the previous algorithms, CABAC is sequential and control dominated. While it takes only 1.6% of the execution time and 1.7% of total energy on our base design, CABAC often becomes the bottleneck in parallel systems due to its sequential nature.

2.4. Current H.264 implementations
The computationally intensive H.264 encoding algorithm poses a challenge for general-purpose processors, and is typically implemented as an ASIC. For example, T. C. Chen et al. implement a full-system H.264 encoder and demonstrate that real-time HD H.264 encoding is possible in hardware using relatively low power and area cost.2

H.264 software optimizations exist, particularly for motion estimation, which takes most of the encoding time. For example, sparse search techniques speed performance of IME and FME by up to 10×.14, 25 Combining aggressive algorithmic modifications with multiple cores and SSE extensions leads to highly optimized H.264 encoders on Intel processors.3, 12

Despite these optimizations, software implementations of H.264 lag far behind dedicated ASICs. Table 1 compares a software implementation of a 480p SD encoder12 to a 720p HD ASIC implementation.2 The software implementation employs a 2.8 GHz Intel Pentium 4 executing highly optimized SSE code. This results in very high energy consumption and low area efficiency. It is also worth noting that the software implementation relies on various algorithmic simplifications, which drastically reduce the computational complexity, but result in a 20% decrease in compression efficiency for a given SNR. The ASIC hardware, on the other hand, consumes over 500× less energy, is far more efficient in its use of silicon area, and has a negligible drop in compression efficiency.

Table 1. Intel's optimized H.264 encoder versus a 720p HD ASIC.

                          FPS    Area (mm²)    Energy/Frame (mJ)
  Intel (720 × 480 SD)     30       122              742
  Intel (1280 × 720 HD)    11       122             2023
  ASIC                     30         8                4

The second row gives Intel's SD data scaled to HD. ASIC data is scaled from 180 down to 90 nm.

3. EXPERIMENTAL METHODOLOGY
To understand what is needed to gain ASIC-level efficiency, we use existing H.264 partitioning techniques, and modify the H.264 encoder reference code JM 8.611 to remove dependencies and allow mapping of the five major algorithmic blocks to the four-stage macro-block (MB) pipeline shown in Figure 1. This mapping exploits task-level parallelism at the macro-block level and significantly reduces the inter-processor communication bandwidth requirements by sharing data between pipeline stages.

Figure 1. Four-stage macroblock partition of H.264. (a) Data flow between stages. (b) How the pipeline works on different macroblocks. IP includes DCT + Quant. EC is CABAC.
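The stage-by-stage behavior of Figure 1(b) can be sketched in a few lines. This is an illustrative Python model, not the paper's software: each macro-block advances one stage per time step, so in steady state four macro-blocks are in flight, one per stage.

```python
# Sketch (not the paper's code) of the four-stage macro-block pipeline
# of Figure 1(b): each MB flows through IME -> FME -> IP -> EC.
STAGES = ["IME", "FME", "IP", "EC"]

def pipeline_schedule(n_mbs, n_steps):
    """For each time step, report which MB occupies each stage."""
    schedule = []
    for t in range(n_steps):
        # The MB in stage i at time t entered the pipeline at t - i.
        occupancy = {s: (t - i if 0 <= t - i < n_mbs else None)
                     for i, s in enumerate(STAGES)}
        schedule.append(occupancy)
    return schedule

for t, occ in enumerate(pipeline_schedule(n_mbs=6, n_steps=6)):
    print(t, {s: (f"MB{m}" if m is not None else "-") for s, m in occ.items()})
```

At time step 3, for example, MB0 is being entropy-coded while MB3 enters integer motion estimation, which is why neighboring-block data must be forwarded between stages rather than recomputed.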


In the base system, we map this four-stage macro-block partition to a four-processor CMP system where each processor has 16KB 2-way set-associative instruction and data caches. Figure 2 highlights the large efficiency gap between our base CMP and the reference ASIC for individual 720p HD H.264 subalgorithms. The energy required for each RISC instruction is similar and as a result, the energy required for each task (shown in Figure 3) is related to the cycles spent on that task. The RISC implementation of IME, which is the major contributor to performance and energy consumption, has a performance gap of 525× and an energy gap of over 700× compared to the ASIC. IME and FME dominate the overall energy and thus need to be aggressively optimized. However, we also note that while IP, DCT, Quant, and CABAC are much smaller parts of the total energy/delay, even they need about 100× energy improvement to reach ASIC levels.

At approximately 8.6B instructions to process 1 frame, our base system consumes about 140 pJ per instruction—a reasonable value for a general-purpose system. To further analyze the energy efficiency of this base CMP implementation we break the processor's energy into different functional units as shown in Figure 4. This data makes it clear how far we need to go to approach ASIC efficiency. The energy spent in instruction fetch (IF) is an overhead due to the programmable nature of the processors and is absent in a custom hardware state machine, but eliminating all this overhead only increases the energy efficiency by less than one third.
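A quick back-of-the-envelope check ties these numbers together. The instruction count and per-instruction energy are the figures quoted above; the functional-unit fraction is an inference from the roughly 20× bound discussed next, not a measured value:

```python
# Back-of-the-envelope check of the base CMP numbers quoted in the text.
instructions_per_frame = 8.6e9      # ~8.6B instructions per 720p frame
energy_per_instruction = 140e-12    # ~140 pJ per instruction

frame_energy = instructions_per_frame * energy_per_instruction
print(f"{frame_energy:.2f} J per frame")   # on the order of a joule

# If stripping everything but the functional units saves only ~20x,
# the FU share of per-instruction energy is about 1/20:
fu_fraction = 1 / 20
fu_energy = energy_per_instruction * fu_fraction
print(f"~{fu_energy * 1e12:.0f} pJ of useful FU work per instruction")
```

So even a perfect processor that kept only its functional units would spend roughly 7 pJ per instruction on useful work out of the original 140 pJ, which is why the remaining gap to the ASIC must come from restructuring the computation itself.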
Even if we eliminate everything but the functional unit energy, we still end up with energy savings of only 20×—not nearly enough to reach ASIC levels. The next section explores what customizations are needed to reach the efficiency goals.

4. CUSTOMIZATION RESULTS
At first, we restrict our customizations to datapath extensions inspired by GPUs and Intel's SSE instructions. Such extensions are relatively general-purpose data-parallel optimizations and consist of single instruction, multiple data (SIMD) and multiple instruction issue per cycle (we use long instruction word, or LIW), with a limited degree of algorithm-specific customization coming in the form of operation fusion—the creation of new instructions that combine frequently occurring sequences of instructions. However, much like their SSE and GPU counterparts, these new instructions are constrained to the existing instruction formats and datapath structures. This step represents the datapaths in current state-of-the-art optimized CPUs. In the next step, we replace these generic datapaths by custom units, and allow unrestricted tailoring of the datapath by introducing arbitrary new compute operations as well as by adding custom register file structures.

The results of these customizations are shown in Figures 5 through 7. The rest of this section describes these results in detail and evaluates the effectiveness of these customization strategies. Collectively, these results describe how efficiencies improve by 170× over the baseline of Section 3.

Figure 2. The performance and energy gap for the base CMP implementation when compared to an equivalent ASIC. Intra combines IP, DCT, and Quant.

Figure 3. Processor energy breakdown for the base implementation, over the different H.264 subalgorithms. Intra combines IP, DCT, and Quant.

Figure 4. Processor energy breakdown for the base implementation. IF is instruction fetch/decode. D-$ is data cache. P Reg includes the pipeline registers, buses, and clocking. Ctrl is miscellaneous control. RF is register file. FU is the functional units.

Figure 5. Each set of bar graphs represents energy consumption (mJ) at each stage of optimization for IME, FME, IP and CABAC respectively. The first bar in each set represents base RISC energy; followed by RISC augmented with SSE/GPU style extensions; and then RISC augmented with "magic" instructions. The last bar in each group indicates energy consumption by the ASIC.


4.1. SSE/GPU style enhancements
Using Tensilica's TIE extensions we add LIW instructions and SIMD execution units with vector register files of custom depths and widths. A single SIMD instruction performs multiple operations (8 for IP, 16 for IME, and 18 for FME), reducing the number of instructions and consequently reducing IF energy. LIW instructions execute 2 or 3 operations per cycle, further reducing cycle count. Moreover, SIMD operations perform wider register file and data cache accesses, which are more energy efficient compared to narrower accesses. Therefore all components of instruction energy depicted in Figure 4 get a reduction through the use of these enhancements.

We further augment these enhancements with operation fusion, in which we fuse together frequently occurring complex instruction sub-graphs for both RISC and SIMD instructions. To prevent the register file ports from increasing, these instructions are restricted to use up to two input operands and can produce only one output. Operation fusion improves energy efficiency by reducing the number of instructions and also reducing the number of register file accesses by internally consuming short-lived intermediate data. Additionally, fusion gives us the ability to create more energy-efficient hardware implementations of the fused operations, e.g., multiplication implemented using shifts and adds. The reductions due to operation fusion are less than 2× in energy and less than 2.5× in performance.

With SIMD, LIW and Op Fusion support, the IME, FME and IP processors achieve speedups of around 15×, 30× and 10×, respectively. CABAC is not data parallel and benefits only from LIW and op fusion, with a speedup of merely 1.1× and almost no change in energy per operation. Overall, the application gets an energy efficiency gain of almost 10×, but still uses greater than 50× more energy than an ASIC. To reach ASIC levels of efficiency, we need a different approach.

Figure 6. Speedup at each stage of optimization for IME, FME, IP and CABAC.

4.2. Algorithm specific instructions
The root cause of the large energy difference is that the basic operations in H.264 are very simple and low energy. They only require 8–16 bit integer operations, so the fundamental energy per operation bound is on the order of hundreds of femtojoules in a 90 nm process. All other costs in a processor—IF, register fetch, data fetch, control, and pipeline registers—are much larger (140 pJ) and dominate overall power. Standard SIMD and simple fused instructions can only go so far to improve the performance and energy efficiency.
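The amortization arithmetic behind this limit can be sketched as follows. The 140 pJ overhead is the per-instruction figure quoted above; the per-op energy takes the "hundreds of femtojoules" order of magnitude as 500 fJ purely for illustration:

```python
# Illustrative amortization of per-instruction overhead (140 pJ, from
# the text) over N useful ops, assuming ~500 fJ per basic 8-16 bit op.
overhead_per_instruction = 140e-12   # fetch, decode, control, pipeline...
energy_per_op = 500e-15              # assumed fundamental op energy

for ops_per_instruction in (1, 10, 100, 1000):
    total = overhead_per_instruction + ops_per_instruction * energy_per_op
    useful = ops_per_instruction * energy_per_op / total
    print(f"{ops_per_instruction:5d} ops/instr -> {useful:6.1%} useful")
```

With these assumed numbers, ten ops per instruction still leaves well over 90% of the energy in overhead; only at hundreds of ops per instruction does the useful fraction become substantial, which is exactly the regime the "magic" instructions below target.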
It is hard to aggregate more than 10–20 operations<strong>in</strong>to an <strong>in</strong>struction without <strong>in</strong>curr<strong>in</strong>g grow<strong>in</strong>g <strong>in</strong>efficiencies,and with tens <strong>of</strong> operations per cycle we still havea mach<strong>in</strong>e where around 90% <strong>of</strong> the energy is go<strong>in</strong>g <strong>in</strong>tooverhead functions. It is now easy to see how an ASIC canbe 2–3 orders <strong>of</strong> magnitude lower energy than a processor.For computationally limited applications with low-energyoperations, an ASIC can implement hardware which bothhas low overheads, and is a perfect structural match to theapplication. These features allow it to exploit large amounts<strong>of</strong> parallelism efficiently.To match these results <strong>in</strong> a processor we must amortizethe per-<strong>in</strong>struction energy overheads over hundreds <strong>of</strong> theseFigure 7. Processor energy breakdown for H.264. IF is <strong>in</strong>struction fetch/decode. D-$ is data cache. Pip is the pipel<strong>in</strong>e registers, buses, andclock<strong>in</strong>g. Ctl is random control. RF is the register file. FU is the functional elements. Only the top bar or two (FU, RF) contribute useful work <strong>in</strong>the processor. For this application it is hard to achieve much more than 10% <strong>of</strong> the power <strong>in</strong> the FU without add<strong>in</strong>g custom hardware units.100%90%80%70%60%50%40%30%20%FURFCtlPipD-$IF10%0%RISC SSE/GPU Magic RISC SSE/GPU Magic RISC SSE/GPU Magic RISC SSE/GPU MagicIMEFMEIPCABACOCTOBER 2011 | vol. 54 | no. 10 | communications <strong>of</strong> the acm 89


simple operations. To create instructions with this level of parallelism requires custom storage structures with algorithm-specific communication links that directly feed large amounts of data to custom functional units without explicit register accesses. These structures also substantially increase data reuse in the datapath and reduce communication bandwidth and power at all levels of the memory hierarchy (register, cache, and memory).

Once this hardware is in place, the machine can issue "magic" instructions that accomplish large amounts of computation at very low cost. This type of structure eliminates almost all the processor overheads for these functions by eliminating most of the communication overhead associated with processors. We call these instructions "magic" because they can have a large effect on both the energy and performance of an application and yet would be difficult to derive directly from the code. Such instructions typically require an understanding of the underlying algorithms, as well as the capabilities and limitations of existing hardware resources, thus requiring greater effort on the part of the designer. Since the IP stage uses techniques similar to FME, the rest of the section will focus on IME, FME, and CABAC.

IME Strategy: To demonstrate the nature and benefit of magic instructions, we first look at IME, which determines the best alignment of two image blocks.
The best match is defined by the smallest sum of absolute differences (SAD) over all of the pixel values. Since finding the best match requires scanning one image block over a larger piece of the image, one can easily see that while this requires a large number of calculations, it also has very high data locality. Figure 8 shows the custom datapath elements added to the IME processor to accelerate this function.

Figure 8. Custom storage and compute for IME 4 × 4 SAD. Current- and ref-pixel register files feed all pixels to the 16 × 16 SAD array in parallel. Also, the ref-pixel register file allows horizontal and vertical shifts. [Diagram: 16 rows of reference-pixel registers and a current-pixel register file, each with 128-bit load ports, feeding four groups of 16 SAD units.]

At the core is a 16 × 16 SAD array, which can perform 256 SAD operations in one cycle. Since our standard vector register files cannot feed enough data to this unit per cycle, the SAD unit is fed by a custom register structure which allows parallel access to all 16-pixel rows, enabling this datapath to perform one 256-pixel computation per cycle. In addition, the intermediate results of the pixel operations need not be stored, since they can be reduced in place (summed) to create the single desired output. Furthermore, because we need to check many possible alignments, the custom storage structure supports parallel shifts in all four directions, allowing one to shift the entire comparison image in only one cycle.
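As a behavioral sketch of what the SAD array evaluates in a single cycle (a Python model of ours, not the paper's implementation), the 256-way computation is:

```python
def sad_16x16(cur, ref):
    """Sum of absolute differences over two 16x16 pixel blocks.

    In software this is 256 subtract/absolute-value operations plus a
    reduction tree; the custom datapath described above does all of it
    in one cycle, and the in-place reduction means none of the 256
    intermediate differences is ever stored.
    """
    assert len(cur) == len(ref) == 16
    return sum(abs(c - r)
               for crow, rrow in zip(cur, ref)
               for c, r in zip(crow, rrow))
```

A RISC core issuing one or two of these scalar operations per cycle needs on the order of a thousand instructions for the same result, which is the gap the custom unit closes.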
This shift support drastically reduces the instructions wasted on loads, shifts, and pointer arithmetic, as well as data cache accesses. "Magic" instructions and storage elements are also created for the other major algorithmic functions in IME to achieve similar gains.

Thus, by reducing instruction overheads and by amortizing the remaining overheads over larger datapath widths, this functional unit finally consumes around 40% of the total instruction energy. The performance and energy efficiency improve by 200–300× over the base implementation, matching the ASIC's performance and coming within 3× of ASIC energy. This customized solution is 20–30× better than the results using only generic data-parallel techniques.

FME Strategy: FME improves the output of the IME stage by refining the alignment to a fraction of a pixel. To perform the fractional alignment, the FME stage interpolates one image to estimate the values of a 4 × 4 pixel block at fractional pixel coordinates. This operation is done by a filter-and-upsample block, which again has high arithmetic intensity and high data locality. In H.264, upsampling uses a six-tap FIR filter that requires one new pixel per iteration. To reduce instruction fetches and register file transfers, we augment the processor register file with a custom 8-bit wide, six-entry shift register structure which works like a FIFO: every time a new 8-bit value is loaded, all elements are shifted.
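H.264's half-pel interpolation filter has taps (1, -5, 20, 20, -5, 1). A minimal Python model of the six-entry shift-register FIFO feeding that filter (function names are ours; normalization and clipping are omitted):

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 six-tap half-pel filter

def upsample_row(pixels):
    """Model of the shift-register FIFO: each new pixel load shifts all
    six entries, and the filter reads all six in parallel, so neither
    the shifting nor the operand fetch touches the register file."""
    window = list(pixels[:6])                     # the six-entry FIFO
    outputs = [sum(t * p for t, p in zip(TAPS, window))]
    for p in pixels[6:]:
        window = window[1:] + [p]                 # one load, full shift
        outputs.append(sum(t * p for t, p in zip(TAPS, window)))
    return outputs                                # unnormalized sums
```

Each output needs only one new pixel, which is exactly the reuse pattern the custom structure exploits.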
This structure eliminates the use of expensive register file accesses for either data shifting or operand fetch, which are now both handled by short local wires. All six entries can now be accessed in parallel, and we create a six-input multiplier/adder which can do the calculation in a single cycle and can also be implemented much more efficiently than a composition of normal two-input adders. Finally, since we need to perform the upsampling in 2-D, we build a shift register structure that stores the horizontally upsampled data and feeds its outputs to a number of vertical upsampling units (Figure 9).

This transformation yields large savings even beyond the savings in instruction fetch energy. From a pure datapath perspective (register file, pipeline registers, and functional units), this approach dissipates less than 1/30th the energy of a traditional approach.

A look at the FME SIMD code implementation highlights the advantages of this custom hardware approach versus the use of larger SIMD arrays. The SIMD implementation suffers from code replication and excessive local memory and register file accesses, in addition to not having the most efficient functional units.
FME contains seven different sub-block sizes, ranging from 16 × 16 pixel blocks to 4 × 4 blocks, and not all of them can fully exploit the 18-way SIMD datapath. Additionally, to use the 18-way SIMD datapath, each sub-block requires a slightly different code sequence, which results in code replication and more instruction-fetch power because of the larger I-cache.

To avoid these issues, the custom hardware upsampler processes 4 × 4 pixels. This allows it to reuse the same computation loop repeatedly without any code replication,


which, in turn, lets us reduce the I-cache from a 16KB four-way cache to a 2KB direct-mapped cache. Due to the abundance of short-lived data, we remove the vector register files and replace them with custom storage buffers. The "magic" instruction reduces instruction cache energy by 54× and processor fetch and decode energy by 14×.

Figure 9. FME upsampling unit. Customized shift registers, directly wired to function logic, result in efficient upsampling. Ten integer pixels from local memory are used for row upsampling in RFIR blocks. Half-upsampled pixels, along with the appropriate integer pixels, are loaded into shift registers. CFIR accesses six shift registers in each column simultaneously to perform column upsampling. [Diagram: an integer buffer feeding RFIR row-upsampling blocks, whose results fill row-half and column-half buffers read in parallel by CFIR column-upsampling units.]

Figure 10. CABAC arithmetic encoding loop. (a) H.264 reference code. (b) After insertion of "magic" instructions. Much of the control logic in the main loop has been reduced to one constant-time instruction, ENCODE_PIPE_5. [Flowcharts: (a) a loop that repeatedly tests whether to output an indefinite number of 0 or 1 bits per iteration; (b) a single ENCODE_PIPE_5 step with a seldom-executed 64-bit output path.]
Finally, as Figure 7 shows, 35% of the energy is now going into the functional units, and again the energy efficiency of this unit is close to an ASIC.

CABAC Strategy: CABAC originally consumed less than 2% of the total energy, but after the data-parallel components are accelerated by "magic" instructions, CABAC dominates the total energy. However, it requires a different set of optimizations because it is control oriented and not data parallel. Thus, for CABAC, we are more interested in control fusion than operation fusion.

A critical part of CABAC is the arithmetic encoding stage, which is a serial process with small amounts of computation but complex control flow. We break arithmetic coding down into a simple pipeline and drastically change it from the reference code implementation, reducing the binary encoding of each symbol to five instructions. While there are several if–then–else conditionals reduced to single instructions (or with several compressed into one), the most significant reduction came in the encoding loop, as shown in Figure 10a. Each iteration of this loop may or may not trigger execution of an internal loop that outputs an indefinite number of encoded bits.
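To see why this control flow is hostile to a fixed-length pipeline, here is a simplified sketch of a renormalization-style inner loop of the kind arithmetic coders use. This is our illustration, not the H.264 reference code, which additionally tracks "outstanding" bits; the 9-bit low / 8-bit range widths are assumptions for the sketch.

```python
def renormalize(low, rng, out):
    """Simplified model of the variable-length inner loop: while the
    coding interval is too narrow, shift one bit out of 'low' and
    double the interval. A symbol may emit zero bits or many, so the
    iteration count is data dependent -- exactly the behavior the
    constant-time ENCODE_PIPE_5 instruction collapses."""
    while rng < 0x100:            # interval below one quarter: emit a bit
        out.append((low >> 8) & 1)
        low = (low << 1) & 0x1FF  # keep low within 9 bits
        rng <<= 1
    return low, rng
```

The data-dependent trip count is what makes the reference loop serial and branch heavy, and why fusing its control into one instruction pays off.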
By fundamentally changing the algorithm, the while loop was reduced to a single constant-time instruction (ENCODE_PIPE_5) and a rarely executed while loop, as shown in Figure 10b.

The other critical part of CABAC is the conversion of non-binary-valued DCT coefficients to binary codes in the binarization stage. To improve the efficiency of this step, we create a 16-entry LIFO structure to store DCT coefficients. To each LIFO entry we add a single-bit flag to identify zero-valued DCT coefficients. These structures, along with their corresponding logic, reduce register file energy by bringing the most frequently used values out of the register file and into custom storage buffers. Using "magic" instructions we produce Unary and Exponential-Golomb codes with simple operations, which helps reduce datapath energy. These modifications are inspired by the ASIC implementation described in Shojania and Sudharsanan.17 CABAC is optimized to achieve the bit rate required for H.264 level 3.1 at 720p video resolution.

Magic Instructions Summary: To summarize, the magic instructions perform up to hundreds of operations each time they are executed, so the overhead of the instruction is better balanced by the work performed.
Of course this is hard to do in a general way, since the bandwidth requirements and utilization of a larger SIMD array would be problematic. We therefore solved this problem by building custom storage units tailored to the application, and then directly connecting the necessary functional units to these storage units. These custom storage units greatly amplified the register fetch bandwidth, since data in the storage units is used for many different computations. In addition, since the intra-storage and functional-unit communications are fixed and local, they can be managed at ASIC-like energy costs.

After this effort, the processors optimized for data-parallel algorithms have a total speedup of up to 600× and an energy reduction of 60–350× compared to our base CMP. For CABAC, the total performance gain is 17× and the energy gain is 8×. Figure 7 provides the final energy breakdowns. The efficiencies found in these custom datapaths are impressive since, in H.264 at least, they take advantage of data-sharing patterns and create very efficient multiple-input operations. This means that even if researchers are able to create a processor which decreases the instruction and data fetch parts of a processor by more than 10×, these solutions will not be as efficient as solutions with "magic" instructions.

Achieving ASIC-like efficiency required 2–3 special hardware units for each subalgorithm, which is significant customization work. Some might even say we are just building an ASIC in our processor.
While we agree that creating "magic" instructions requires a thorough understanding of


the application as well as the hardware, we feel that adding this hardware in an extensible processor framework has many advantages over just designing an ASIC. These advantages come from the constrained processor design environment and the software, compiler, and debugging tools available in this environment. Many of the low-level issues, like interface design and pipelining, are handled automatically. In addition, since all hardware is wrapped in a general-purpose processor, the application developer retains enough flexibility in the processor to make future algorithmic modifications.

5. ENERGY-EFFICIENT COMPUTERS
It is important to remember that the "overhead" of using a processor depends on the energy required for the desired operation. Floating point (FP) energy costs are about 10× those of the small integer ops we have explored in this paper, so machines with 10-wide FP units will not be far from the maximum efficiency possible for that class of applications. Similarly, customizing the hardware will not have a large impact on the energy efficiency of an application dominated by memory costs; an ASIC's and a processor's energy will not be that different. For these applications, optimization that restructures the algorithm and/or the memory system is needed to reduce energy, and can yield large savings.23
Unfortunately, as we drive to more energy-efficient solutions, we will find ways to transform FP code to fixed-point operations and to restructure our algorithms to minimize memory fetch costs. Said differently, if we want ASIC-like energy efficiencies (100× to 1000× more energy efficient than general-purpose CPUs), we will have to transform our algorithms to be dominated by the simple, low-energy operations we have been studying in this paper. Since the energy of these operations is very low, any overhead, from the register fetch to the pipeline registers in a processor, is likely to dominate energy costs. The good news is that this large overhead per instruction makes estimating the energy savings easy: you simply look at the performance gains. The bad news is that adding state-of-the-art data-parallel hardware like wide SIMD units and media extensions will still leave you far from the desired efficiency.

It is encouraging that we were able to achieve ASIC energy levels in a customized processor by creating customized hardware that easily fits inside a processor framework. Extending a processor instead of building an ASIC seems like the correct approach, since it provides a number of software development advantages and the energy cost of this option seems small.
However, building such custom datapaths still requires significant effort, and thus the key challenge now is to build a design system that lets application designers create and exploit such customizations with much greater ease. The key is to find a parameterization of the space which makes sense to application designers in a specific application domain.

For example, a number of algorithms in a domain often share similar data-flow and computation structures. In H.264, a common computational motif is based on a convolution-like data flow: apply a function to all the data, then perform a reduction, then shift the data, add a small amount of new data, and repeat. A similar pattern of convolution-like computations also exists in a number of other image-processing and media-processing algorithms. While the exact computation will differ for each particular algorithm, we believe that by exploiting the common data-flow structure of these algorithms we can create a generalized convolution abstraction which application designers can customize. If this abstraction is useful for application designers, one can imagine implementing it by creating a flexible hardware unit that is significantly more efficient than a generic SIMD/SSE unit. We also believe that similar patterns exist in other domains, which may allow us to create a set of customized units for each domain.

Even if we could come up with such a set of customized functional units, it is likely that some degree of per-algorithm configurability will be required.
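The convolution motif described above, with its per-algorithm customization points exposed, can be sketched in a few lines (a hypothetical interface of ours, not a proposed hardware specification):

```python
def convolution_engine(window, stream, map_fn, reduce_fn):
    """Generalized convolution motif: map a function over the window,
    reduce, then shift the data and add one new element, and repeat.
    map_fn and reduce_fn are the per-algorithm customization points:
    e.g., absolute difference + sum gives SAD matching, while
    tap-multiply + sum gives FIR filtering."""
    window = list(window)
    results = []
    for x in stream:
        results.append(reduce_fn(map_fn(w) for w in window))
        window = window[1:] + [x]     # shift data, add new data
    return results
```

Both the IME SAD scan and the FME six-tap filter are instances of this pattern, which is what makes a shared, configurable hardware unit plausible.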
For example, in a convolution engine, the convolution size and the resulting datapath size could vary from algorithm to algorithm and thus potentially need to be tuned on a per-processor basis. This leads to the idea of creating a two-step design process. In the first step, a set of chip experts designs a processor generator platform. This is a meta-level design which "knows" about the special functional units and control optimizations, and provides the application designer an application-tailored interface. The application designers can then co-optimize their code and the interface parameters to meet their requirements. After this co-optimization, an optimized implementation based on these parameters is generated automatically. In fact, such a platform will also help in building the more generic domain-customized functional units mentioned earlier, by facilitating the process of rapidly creating and evaluating new designs.

A reconfigurable processor generator alone is not a sufficient solution, since one still needs to take one or more of these processors and create a working chip system. Designing and validating a chip is an extremely hard and expensive task. If application customization will be needed for efficiency (and our data indicates it will be), we need to start creating systems that efficiently allow savvy application experts to create these optimized chip-level solutions. This will require extending the ideas for extensible processors to full chip generation systems. We are currently working on creating this type of system.16
Acknowledgments
This work would not have been possible without great support and cooperation from many people at Tensilica, including Chris Rowen, Dror Maydan, Bill Huffman, Nenad Nedeljkovic, David Heine, Govind Kamat, and others. The authors acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation subsidiary, and earlier support from DARPA. This material is based upon work partially supported under a Sequoia Capital Stanford Graduate Fellowship. The National Science Foundation, under Grant #0937060 to the Computing Research Association, also supports this material for the CIFellows Project. Any opinions, findings,


and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of the National Science Foundation or the Computing Research Association.

References
1. Balfour, J., Dally, W., Black-Schaffer, D., Parikh, V., Park, J. An energy-efficient processor architecture for embedded systems. Comput. Archit. Lett. 7, 1 (2007), 29–32.
2. Chen, T.C. Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. IEEE Trans. Circuits Syst. Video Technol. 16, 6 (2006), 673–688.
3. Chen, Y.-K., Li, E.Q., Zhou, X., Ge, S. Implementation of H.264 encoder and decoder on personal computers. J. Vis. Commun. Image Represent. 17 (2006), 509–532.
4. Clark, N., Zhong, H., Mahlke, S. Automated custom instruction generation for domain-specific processor acceleration. IEEE Trans. Comput. 54, 10 (2005), 1258–1270.
5. Cong, J., Fan, Y., Han, G., Zhang, Z. Application-specific instruction generation for configurable processor architectures. In 12th International Symposium on Field Programmable Gate Arrays (2004), 183–189.
6. Davis, W., Zhang, N., Camera, K., Chen, F., Markovic, D., Chan, N., Nikolic, B., Brodersen, R. A design environment for high-throughput, low-power, dedicated signal processing systems. In Custom Integrated Circuits Conference (CICC) (2001).
7. Dennard, R., Gaensslen, F., Yu, H., Rideout, V., Bassous, E., LeBlanc, A. Design of ion-implanted MOSFETs with very small physical dimensions. Proc. IEEE 87, 4 (1999; reprinted from IEEE J. Solid-State Circuits, 1974), 668–678.
8. Esmaeilzadeh, H., Blem, E., Amant, R.S., Sankaralingam, K., Burger, D. Dark silicon and the end of multicore scaling.
In Proceedings of the 38th International Symposium on Computer Architecture (June 2011).
9. Horowitz, M. Scaling, power and the future of CMOS. In Proceedings of the 20th International Conference on VLSI Design, held jointly with the 6th International Conference on Embedded Systems (2007).
10. Intel Corporation. Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4) (2008).
11. ITU-T. Joint Video Team Reference Software JM8.6 (2004).
12. Iverson, V., McVeigh, J., Reese, B. Real-time H.264/AVC codec on Intel architectures. In IEEE International Conference on Image Processing (ICIP'04) (2004).
13. Kathail, V. Creating power-efficient application engines for SOC design. SOC Central (2005).
14. ISO/IEC MPEG, ITU-T VCEG. Fast integer pel and fractional pel motion estimation for JVT. JVT-F017 (2002).
15. Rowen, C., Leibson, S. Flexible architectures for engineering successful SOCs. In Proceedings of the 41st Annual Design Automation Conference (2004), 692–697.
16. Shacham, O., Azizi, O., Wachs, M., Qadeer, W., Asgar, Z., Kelley, K., Stevenson, J., Solomatnikov, A., Firoozshahian, A., Lee, B., Richardson, S., Horowitz, M. Why design must change: Rethinking digital design. IEEE Micro 30, 6 (Nov.–Dec. 2010), 9–24.
17. Shojania, H., Sudharsanan, S. A VLSI architecture for high performance CABAC encoding. In Visual Communications and Image Processing (2005).
18. Tensilica Inc. Implementing the Advanced Encryption Standard on Xtensa processors. Application Notes (2009).
19. Tensilica Inc. Implementing the fast Fourier transform (FFT). Application Notes (2005).
20. Tensilica Inc.
The what, why, and how of configurable processors (2008).
21. Tensilica Inc. Xtensa LX2 benchmarks (2005).
22. Tensilica Inc. Xtensa processor extensions for Data Encryption Standard (DES). Application Notes (2008).
23. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (2007).
24. Woh, M., Seo, S., Mahlke, S., Mudge, T., Chakrabarti, C., Flautner, K. AnySP: Anytime anywhere anyway signal processing. SIGARCH Comput. Archit. News 37, 3 (2009), 128–139.
25. Yin, P., Tourapis, H.-Y.C., Tourapis, A.M., Boyce, J. Fast mode decision and motion estimation for JVT/H.264. In Proceedings of the IEEE International Conference on Image Processing (2003).

Rehan Hameed (rhameed@stanford.edu), Stanford University, Stanford, CA.
Wajahat Qadeer (wqadeer@stanford.edu), Stanford University, Stanford, CA.
Megan Wachs (wachs@stanford.edu), Stanford University, Stanford, CA.
Omid Azizi (oazizi@gmail.com), Hicamp Systems, Menlo Park, CA.
Alex Solomatnikov (solomatnikov@gmail.com), Hicamp Systems, Menlo Park, CA.
Benjamin C. Lee (benjamin.c.lee@duke.edu), Duke University, Durham, NC.
Stephen Richardson (steveri@stanford.edu), Stanford University, Stanford, CA.
Christos Kozyrakis (kozyraki@stanford.edu), Stanford University, Stanford, CA.
Mark Horowitz (horowitz@stanford.edu), Stanford University, Stanford, CA.

© 2011 ACM 0001-0782/11/10 $10.00


