Tutorial on High-Level Synthesis

Michael C. McFarland, SJ, Boston College, Chestnut Hill, MA 02167
Alice C. Parker, University of Southern California, Los Angeles, CA 90007
Raul Camposano, IBM T.J. Watson Research Center, Yorktown Heights, NY

25th ACM/IEEE Design Automation Conference

Abstract. High-level synthesis takes an abstract behavioral specification of a digital system and finds a register-transfer level structure that realizes the given behavior. In this tutorial we will examine the high-level synthesis task, showing how it can be decomposed into a number of distinct but not independent subtasks. Then we will present the techniques that have been developed for solving those subtasks. Finally, we will note those areas related to high-level synthesis that are still open problems.

1. Introduction

1.1 What is High-Level Synthesis?

The synthesis task is to take a specification of the behavior required of a system, together with a set of constraints and goals to be satisfied, and to find a structure that implements the behavior while satisfying the goals and constraints. By behavior we mean the way the system or its components interact with their environment, i.e., the mapping from inputs to outputs. Structure refers to the set of interconnected components that make up the system, something like a netlist. Usually there are many different structures that can be used to realize a given behavior. One of the tasks of synthesis is to find the structure that best meets the constraints, such as limitations on cycle time, area or power, while minimizing other costs. For example, the goal might be to minimize area while achieving a certain minimum processing rate.

Synthesis can take place at various levels of abstraction because designs can be described at various levels of detail. The type of synthesis we will focus on in this tutorial begins with a behavioral specification at what is often called the algorithmic level. The primary data types at this level are integers and/or bit strings and arrays, rather than boolean variables. The input specification gives the required mappings from sequences of inputs to sequences of outputs. It should constrain the internal structure of the system to be designed as little as possible. From that input specification, the synthesis system produces a description of a register-transfer level structure that realizes the specified behavior. This structure includes a data path, that is, a network of registers, functional units, multiplexers and buses, as well as hardware to control the data transfers in that network.
If the control is not integrated into the data path, and it usually is not, the synthesis system must also produce the specification of a finite state machine that drives the data path so as to produce the required behavior. The control specification could be in terms of microcode, a PLA profile or random logic.

High-level synthesis as we define it must be distinguished from other types of synthesis, which operate at different levels of the design hierarchy. For example, high-level synthesis is not to be confused with logic synthesis, where the system is specified in terms of logic equations, which must be optimized and mapped into a given technology. Logic synthesis might in fact be used on a design after high-level synthesis has been done, since it presupposes the sorts of decisions that high-level synthesis makes. At the other end of the spectrum, there is some promising work under way on system-level synthesis, for example on partitioning an algorithm into multiple processes that can run in parallel or be pipelined. However, this work is still in its preliminary stages, and we will not report on it here.

1.2 Why Study High-Level Synthesis?

In recent years there has been a trend toward automating synthesis at higher and higher levels of the design hierarchy. Logic synthesis is gaining acceptance in industry, and there has been considerable interest shown in synthesis at higher levels. There are a number of reasons for this:

- Shorter design cycle. If more of the design process is automated, a company can get a design out the door faster, and thus have a better chance of hitting the market window for that design. Furthermore, since much of the cost of the chip is in design development, automating more of that process can lower the cost significantly.

- Fewer errors. If the synthesis process can be verified to be correct, by no means a trivial task, there is a greater assurance that the final design will correspond to the initial specification. This will mean fewer errors and less debugging time for new chips.

- The ability to search the design space. A good synthesis system can produce several designs for the same specification in a reasonable amount of time. This allows the developer to explore different trade-offs between cost, speed, power and so on, or to take an existing design and produce a functionally equivalent one that is faster or less expensive.

- The design process is self-documenting. An automated system can keep track of what design decisions were made and why, and what the effect of those decisions was.
- Availability of IC technology to more people. As more design expertise is moved into the synthesis system, it becomes easier for a non-expert to produce a chip that meets a given set of specifications.

We expect this trend toward higher levels of synthesis to continue. Already there are a number of research groups working on high-level synthesis, and a great deal of progress has been made in finding good techniques for optimization and for exploring design trade-offs. These techniques are very important because decisions made at the algorithmic level tend to have a much greater impact on the design than those at lower levels.

There is now a sizable body of knowledge on high-level synthesis, although for the most part it has not yet been systematized. In the remainder of this paper, we will describe what the problems are in high-level synthesis, and what techniques have been developed to solve them. To that end, Section 2 will describe the various tasks involved in developing a register-transfer level structure from an algorithmic-level specification. Section 3 will describe the basic techniques that have been developed for performing those tasks. Finally, Section 4 will look at those issues that have not been adequately addressed and thus provide promising areas for future research.


2. The Synthesis Task

The system to be designed is usually represented at the algorithmic level by a programming language such as Pascal [27] or Ada [8], or by a hardware description language that is similar to a programming language, such as ISPS [2]. Most of the languages used are procedural languages. That is, they describe data manipulation in terms of assignment statements that are organized into larger blocks using standard control constructs for sequential execution, conditional execution and iteration. There have been experiments, however, with various types of non-procedural hardware description languages, including applicative, LISP-like languages [11] and declarative languages such as Prolog.

The first step in high-level synthesis is usually the compilation of the formal language into an internal representation. Two types of internal representations are generally used: parse trees and graphs. Most approaches use variations of graphs that contain both the data flow and the control flow implied by the specification [16], [26], [12]. Fig. 1 shows a part of a simple program that computes the square root of X using Newton's method, along with its graphical representation. The number of iterations necessary in practice is very small; in the example, 4 iterations were chosen. A first-degree minimax polynomial approximation for the input interval gives the initial value. The data-flow and control-flow graphs are shown separately in the figure for intelligibility. The control graph is derived directly from the explicit order given in the program and from the compiler's choice of how to parse the arithmetic expressions. The data-flow graph shows the essential ordering of operations in the program imposed by the data relations in the specification. For example, in fig. 1, the addition at the top of the diagram depends for its input on data produced by the multiplication. This implies that the multiplication must be done first in any valid ordering of the operations. On the other hand, there is no dependence between the I + 1 operation inside the loop and any of the operations in the chain that calculates Y, so the I + 1 may be done in parallel with those operations, as well as before or after them. The data-flow graph can also be used to remove the dependence on the way internal variables are used in the specification, since each value produced by one operation and consumed by another is represented uniquely by an arc.
This ability to reassign variables is important both for reordering operations and for simplifying the data paths.

    Y := 0.222222 + 0.888889 * X;
    I := 0;
    DO UNTIL I > 3 LOOP
        Y := 0.5 * (Y + X / Y);
        I := I + 1;
    END LOOP;

Figure 1. High-level specification and graph for the square-root example.

The rest of this section outlines the various steps used in turning the intermediate form into an RT-level structure, using the square-root example to illustrate the different steps.

Since the specification has been written for human readability and not for direct translation into hardware, it is desirable to do some initial optimization of the internal representation. These high-level transformations include such compiler-like optimizations as dead-code elimination, constant propagation, common subexpression elimination, inline expansion of procedures and loop unrolling. Local transformations, including those that are more specific to hardware, are also used. In the example, the loop-ending criterion can be changed to I = 0 by using a two-bit variable for I. The multiplication by 0.5 can be replaced by a right shift by one. The addition of 1 to I can be replaced by an increment operation. The internal representation after these optimizations is depicted on the left in fig. 2. Loop unrolling can also be done in this case, since the number of iterations is fixed and small.

The next two steps in synthesis are the core of transforming behavior into structure: scheduling and allocation. They are closely interrelated and depend on each other. Scheduling consists in assigning the operations to so-called control steps. A control step is the fundamental sequencing unit in synchronous systems; it corresponds to a clock cycle. Allocation consists in assigning the operations to hardware, i.e., allocating functional units, storage and communication paths.

The aim of scheduling is to minimize the amount of time or the number of control steps needed for completion of the program, given certain limits on the available hardware resources. In our example, a trivial special case uses just one functional unit and one memory. Each operation has to be scheduled in a different control step, so the computation takes 3 + 4*5 = 23 control steps. To speed up the computation at the expense of adding more hardware, the control graph can be packed into control steps as tightly as possible, observing only the essential dependencies required by the data-flow graph and by the loop boundaries. This form is shown in fig. 2.
Notice that two dummy nodes were introduced to delimit the loop boundaries. Since the shift operation is free, with two functional units the operations can now be scheduled in 2 + 4*2 = 10 control steps.

Figure 2. Optimized control graph and schedule.
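To make the running example concrete, here is a small Python sketch of the Figure 1 specification and of the strength-reduced form described above. It is illustration only: the function names, the fixed-point encoding and the test value are our own assumptions, not part of the paper's example.

```python
def sqrt_spec(x):
    """Behavioral specification of Fig. 1: Newton's method with a
    first-degree polynomial start value and four iterations."""
    y = 0.222222 + 0.888889 * x   # initial approximation
    i = 0
    while not (i > 3):            # DO UNTIL I > 3
        y = 0.5 * (y + x / y)     # Newton-Raphson update
        i = i + 1
    return y

def sqrt_optimized(x_fixed, frac_bits=16):
    """Sketch of the same computation after the local transformations
    discussed in the text: the multiplication by 0.5 becomes a right
    shift, I becomes a two-bit counter whose wrap-around to 0 ends the
    loop, and I + 1 becomes an increment.  The fixed-point format
    (frac_bits) is our own assumption, added so the shift is meaningful."""
    one = 1 << frac_bits
    y = (14564 * one >> 16) + ((58254 * x_fixed) >> 16)  # ~0.2222 + 0.8889*x
    i = 0
    while True:
        y = (y + (x_fixed * one) // y) >> 1   # 0.5*(Y + X/Y) as a shift
        i = (i + 1) & 0b11                    # two-bit counter
        if i == 0:                            # loop ends when counter wraps
            break
    return y

if __name__ == "__main__":
    x = 0.5
    print(sqrt_spec(x))                                    # close to 0.7071
    print(sqrt_optimized(int(x * (1 << 16))) / (1 << 16))  # same result, fixed point
```

Both versions compute the same mapping from X to Y; what changes between them is only the set of operations the data-flow graph will contain, which is exactly what the local transformations are meant to exploit.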


The Yorktown Silicon Compiler (YSC) [4] does allocation and scheduling together, but in a different way. It begins with each operation being done on a separate functional unit and all operations being done in the same control step. Additional control steps are added for loop boundaries, and as required to avoid conflicts over register and memory usage. The hardware is then optimized so as to share resources as much as possible. If there is too much hardware or there are too many operations chained together in the same control step, more control steps are added and the data path structure is again optimized. This process is repeated until the hardware and time constraints are met.

Finally, functional unit allocation can be done first, followed by scheduling. In the BUD system [17], operations are first partitioned into clusters, using a metric that takes into account potential functional unit sharing, interconnect, and parallelism. Then functional units are assigned to each cluster and the scheduling is done. The number of clusters to be used is determined by searching through a range of possible clusterings, choosing the one that best meets the design objectives.

In the Karlsruhe CADDY/DSL system [25], the data path is built first, assuming maximal parallelism. This is then optimized, locally and globally, guided by both area constraints and timing. The operations are then scheduled, subject to the constraints imposed by the data path.

3.1.2 Scheduling Algorithms

There are two basic classes of scheduling algorithms: transformational and iterative/constructive. A transformational algorithm begins with a default schedule, usually either maximally serial or maximally parallel, and applies transformations to it to obtain other schedules. The transformations move serial operations into parallel and parallel operations into series. Transformational algorithms differ in how they choose which transformations to apply.

Barbacci's EXPL [1], one of the earliest high-level synthesis systems, used exhaustive search. That is, it tried all possible combinations of serial and parallel transformations and chose the best design found. This method has the advantage that it looks through all possible designs, but of course it is computationally very expensive and not practical for sizable designs. Exhaustive search can be improved somewhat by using branch-and-bound techniques, which cut off the search along any path that can be recognized to be suboptimal.
Another approach to scheduling by transformation is to use heuristics to guide the process. Transformations are chosen that promise to move the design closer to the given constraints or to optimize the objective. This is the approach used, for example, in the Yorktown Silicon Compiler [4] and the CAMAD design system [23]. The transformations used in the YSC can be shown to produce a fastest possible schedule for a given specification.

The other class of algorithms, the iterative/constructive ones, build up a schedule by adding operations one at a time until all the operations have been scheduled. They differ in how the next operation to be scheduled is chosen and in how they determine where to schedule each operation.

The simplest type of scheduling, as-soon-as-possible (ASAP) scheduling, is local both in the selection of the operation to be scheduled and in where it is placed. ASAP scheduling assumes that the number of functional units has already been specified. Operations are first sorted topologically; that is, if operation x2 is constrained to follow operation x1 by some necessary dataflow or control relationship, then x2 will follow x1 in the topological order. Operations are taken from the list in order and each is put into the earliest control step possible, given its dependence on other operations and the limits on resource usage. Figure 3 shows a dataflow graph and its ASAP schedule. This was the type of scheduling used in the CMUDA system [10], in the MIMOLA system and in Flamel. The problem with this algorithm is that no priority is given to operations on the critical path, so that when there are limits on resource usage, operations that are less critical can be scheduled first on limited resources and thus block critical operations. This is shown in Figure 3, where operation 1 is scheduled ahead of operation 2, which is on the critical path, so that operation 2 is scheduled later than is necessary, forcing a longer than optimal schedule.

Figure 3. ASAP scheduling.

List scheduling overcomes this problem by using a more global criterion for selecting the next operation to be scheduled. For each control step to be scheduled, the operations that are available to be scheduled into that control step, that is, those whose predecessors have already been scheduled, are kept in a list, ordered by some priority function. Each operation on the list is taken in turn and is scheduled if the resources it needs are still free in that step; otherwise it is deferred to the next step. When no more operations can be scheduled, the algorithm moves to the next control step, the available operations are found and ordered, and the process is repeated. This continues until all the operations have been scheduled. Studies have shown that this form of scheduling works nearly as well as branch-and-bound scheduling in microcode optimization [6]. Figure 4 shows a list schedule for the graph in Figure 3. Here the priority is the length of the path from the operation to the end of the block. Since operation 2 has a higher priority than operation 1, it is scheduled first, giving an optimal schedule for this case.

Figure 4. A list schedule.

A number of schedulers use list scheduling, though they differ somewhat in the priority function they use. The scheduler in the BUD system uses the length of the path from the operation to the end of the block it is in. Elf [8] and ISYN [19] use the "urgency" of an operation, the length of the shortest path from that operation to the nearest local constraint.
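As an illustration of the list-scheduling loop just described, the sketch below schedules a small dataflow graph in Python. The graph encoding, the single resource class with unit-delay operations, and the longest-path-to-end priority are simplifying assumptions made for this example; it is not code from any of the systems cited.

```python
from collections import defaultdict

def path_length(op, succs):
    """Priority: length of the longest path from op to the end of the block."""
    return 1 + max((path_length(s, succs) for s in succs[op]), default=0)

def list_schedule(ops, preds, succs, num_units):
    """Assign each operation to a control step, using at most num_units
    operations per step (one resource class, unit delay, for simplicity)."""
    step_of, step = {}, 0
    while len(step_of) < len(ops):
        step += 1
        # operations whose predecessors are all scheduled in earlier steps
        ready = [o for o in ops if o not in step_of
                 and all(step_of.get(p, step) < step for p in preds[o])]
        ready.sort(key=lambda o: path_length(o, succs), reverse=True)
        for o in ready[:num_units]:          # fill the step, highest priority first
            step_of[o] = step
    return step_of

# Tiny example: a three-operation critical chain c1 -> c2 -> c3 plus three
# independent operations i1..i3, with two functional units per control step.
preds = {'c1': [], 'c2': ['c1'], 'c3': ['c2'], 'i1': [], 'i2': [], 'i3': []}
succs = defaultdict(list)
for o, ps in preds.items():
    succs.setdefault(o, [])
    for p in ps:
        succs[p].append(o)
print(list_schedule(list(preds), preds, succs, num_units=2))
# Priority-driven selection keeps the critical chain moving, so the six
# operations fit in three control steps; a scheduler that filled each step
# in plain ready order could delay c1 and need four, which is the kind of
# blocking illustrated by Figure 3.
```

ASAP scheduling can be seen as the degenerate case of this loop in which the ready list is taken in topological order rather than sorted by a global priority.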


The last type of scheduling algorithm we will consider is global both in the way it selects the next operation to be scheduled and in the way it decides the control step in which to put it. In this type of algorithm, the range of possible control-step assignments for each operation is calculated, given the time constraints and the precedence relations between the operations. In freedom-based scheduling, the operations on the critical path are scheduled first and assigned to functional units. Then the other operations are scheduled and assigned one at a time. At each step the unscheduled operation with the least freedom, that is, the one with the smallest range of control steps into which it can go, is chosen, so that operations that might present more difficult scheduling problems are taken care of first, before they become blocked.

In force-directed scheduling, the range of possible control steps for each operation is used to form a so-called distribution graph. The distribution graph shows, for each control step, how heavily loaded that step is, given that all possible schedules are equally likely. If an operation could be done in any of k control steps, then 1/k is added to each of those control steps in the graph. For example, Figure 5 shows a dataflow graph, the range of steps for each operation, and the corresponding distribution graph for the addition operations, assuming a time constraint of three control steps. Addition a1 must be scheduled in step 1, so it contributes 1 to that step. Similarly, addition a2 adds 1 to control step 2. Addition a3 could be scheduled in either step 2 or step 3, so it contributes 1/2 to each. Operations are then selected and placed so as to balance the distribution as much as possible. In the above example, a3 would first be scheduled into step 3, since that would have the greatest effect in balancing the graph.

Figure 5. A distribution graph.
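Below is a minimal Python sketch of the distribution-graph bookkeeping just described, using the a1/a2/a3 ranges from the Figure 5 discussion. It computes only the per-step load; the full force computation and the scheduler built around it, as in force-directed scheduling [22], are omitted.

```python
from fractions import Fraction

def distribution_graph(ranges, num_steps):
    """Given each operation's range of feasible control steps, add 1/k to
    every step in a k-step range and return the load on each step."""
    load = {s: Fraction(0) for s in range(1, num_steps + 1)}
    for first, last in ranges.values():
        k = last - first + 1
        for s in range(first, last + 1):
            load[s] += Fraction(1, k)
    return load

# The additions from the Figure 5 discussion: a1 fixed in step 1, a2 fixed
# in step 2, a3 free to go in step 2 or step 3 (three-step time constraint).
add_ranges = {'a1': (1, 1), 'a2': (2, 2), 'a3': (2, 3)}
print(distribution_graph(add_ranges, num_steps=3))
# Step loads are 1, 3/2 and 1/2: placing a3 in step 3 best balances the
# graph, which is why it is the first operation scheduled in the example.
```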
3.2 Data Path Allocation

Data path allocation involves mapping operations onto functional units, assigning values to registers, and providing interconnections between operators and registers using buses and multiplexers. The decision to use ALUs instead of simple operators is also made at this time. The optimization goal is usually to minimize some objective function, such as

- total interconnect length,
- total register, bus driver and multiplexer cost, or
- critical path delays.

There may also be one or more constraints on the design which limit the total area of the design, the total throughput, or the delay from input to output.

The techniques used to perform data path allocation can be classified into two types: iterative/constructive and global. Iterative/constructive techniques assign elements one at a time, while global techniques find simultaneous solutions to a number of assignments at a time. Exhaustive search is an extreme case of a global solution technique. Iterative/constructive techniques generally look at less of the search space than global techniques, and therefore are more efficient, but are less likely to find optimal solutions.

3.2.1 Iterative/Constructive Techniques

Iterative/constructive techniques select an operation, value or interconnection to be assigned, make the assignment, and then iterate. The rules which determine the next operation, value or interconnect to be selected can vary from global rules, which examine many or all items before selecting one, to local selection rules, which select the items in a fixed order, usually as they occur in the data flow graph from inputs to outputs. Global selection involves selecting a candidate for assignment on the basis of some metric, for example taking the candidate that would add the minimum additional cost to the design. Hafer's data path allocator, the first RT synthesis program that dealt with TTL chips, was iterative and used local selection [9]. The DAA used a local criterion to select which element to assign next, but chose where to assign it on the basis of rules that encoded expert knowledge about the data path design of microprocessors. Once this knowledge base had been tested and improved through repeated interviews with designers, the DAA was able to produce much cleaner data paths than when it began [13, pages 26-31]. EMUCS [10] used a global selection criterion, based on minimizing both the number of functional units and registers and the multiplexing needed, to choose the next element to assign and where to assign it. The Elf system also sought to minimize interconnect, but used a local selection criterion. The REAL program [15] separated out register allocation and performed it after scheduling, but prior to operator and interconnect allocation. REAL is constructive, and selects the earliest value to assign at each step, sharing registers among values whenever possible.

Figure 6. Greedy data path allocation.

An example of greedy allocation is shown in fig. 6. The dataflow graph on the left is processed from earliest time step to latest. Operators, registers and interconnect are allocated for each time step in sequence. Thus, the selection rule is local, and the allocation constructive. Assignments are made so as to minimize interconnect. In the case shown in the figure, a2 was assigned to adder2 since the increase in multiplexing cost required by that allocation was zero. a4 was assigned to adder1 because there was already a connection from the register to that adder. Other variations are possible, each with different multiplexing costs. For example, if we had assigned a2 to adder1 and a4 to adder1 without checking for interconnection costs, then the final multiplexing would have been more expensive. A more global selection rule also could have been applied. For example, we could have selected the next item for allocation on the basis of minimization of cost increase. In this case, if we had already allocated a3 to adder2, then the next step would be to allocate a4 to the same adder, since they occur in different time steps, and the incremental cost of doing that assignment is less than assigning a2 to adder1.
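The sketch below, in Python, illustrates the greedy, interconnect-minimizing binding rule described above. The operation list, the time steps and the unit-cost multiplexer model are invented stand-ins for the Figure 6 example, not a reconstruction of it.

```python
def greedy_bind(ops, adders, mux_cost=1):
    """Bind each addition to an adder, time step by time step, choosing the
    adder whose existing input connections make the new binding cheapest."""
    connections = {a: set() for a in adders}   # adder -> operand sources already wired in
    binding, busy = {}, {}                     # op -> adder, (adder, step) -> op
    for op, step, sources in sorted(ops, key=lambda t: t[1]):
        best, best_cost = None, None
        for a in adders:
            if (a, step) in busy:              # adder already used in this time step
                continue
            # each operand not already wired to this adder needs a new mux input
            cost = mux_cost * len(set(sources) - connections[a])
            if best_cost is None or cost < best_cost:
                best, best_cost = a, cost
        binding[op], busy[(best, step)] = best, op
        connections[best].update(sources)
    return binding

# Hypothetical data: (operation, time step, source operands)
ops = [('a1', 1, ('r1', 'r2')), ('a2', 1, ('r3', 'r4')),
       ('a3', 2, ('r1', 'r2')), ('a4', 2, ('r3', 'r4'))]
print(greedy_bind(ops, adders=['adder1', 'adder2']))
# a1 -> adder1 and a2 -> adder2; then a3 reuses adder1 and a4 reuses adder2,
# because their operands are already connected there (zero added mux cost),
# in the spirit of the interconnect-driven choices made in Figure 6.
```

Because the selection order is fixed (earliest time step first), this is a local selection rule; a global variant would instead pick, at each iteration, whichever unbound operation has the cheapest best assignment.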


REFERENCES

1. Barbacci, M.R. Automated Exploration of the Design Space for Register Transfer (RT) Systems. PhD Thesis, Carnegie-Mellon University, 1973.
2. Barbacci, M.R. Instruction Set Processor Specifications (ISPS): The Notation and its Applications. IEEE Transactions on Computers C-30, 1 (January 1981), 24-40.
3. Borriello, G. and Katz, R.H. Synthesis and Optimization of Interface Transducer Logic. In Proceedings of the International Conference on Computer-Aided Design (November 1987), 274-277.
4. Brayton, R.K., Camposano, R., DeMicheli, G., Otten, R.H.J.M., and vanEijndhoven, J. The Yorktown Silicon Compiler. In Silicon Compilation, D.D. Gajski, Ed., Addison-Wesley, Reading, MA, 1988, pp. 204-311.
5. Brewer, F.D. and Gajski, D.D. Knowledge Based Control in Micro-Architecture Design. In Proceedings of the 24th Design Automation Conference, ACM and IEEE, June 1987, pp. 203-209.
6. Davidson, S., Landskov, D., Shriver, B.D., and Mallett, P.W. Some Experiments in Local Microcode Compaction for Horizontal Machines. IEEE Transactions on Computers C-30, 7 (July 1981), 460-477.
7. DeMan, H., Rabaey, J., Six, P., and Claesen, L. Cathedral II: A Silicon Compiler for Digital Signal Processing. IEEE Design and Test 3, 6 (December 1986), 13-25.
8. Girczyc, E.F. Automatic Generation of Microsequenced Data Paths to Realize ADA Circuit Descriptions. PhD Thesis, Carleton University, July 1984.
9. Hafer, L.J. and Parker, A.C. Register-Transfer Level Digital Design Automation: The Allocation Process. In Proceedings of the 15th Design Automation Conference, ACM and IEEE, June 1978, pp. 213-219.
10. Hitchcock, C.Y. and Thomas, D.E. A Method of Automatic Data Path Synthesis. In Proceedings of the 20th Design Automation Conference, ACM and IEEE, June 1983, pp. 484-489.
11. Johnson, S.D. Synthesis of Digital Designs from Recursion Equations. PhD Thesis, Indiana University, 1984. MIT Press.
12. Knapp, D., Granacki, J., and Parker, A.C. An Expert Synthesis System. In Proceedings of the International Conference on Computer-Aided Design, ACM and IEEE, September 1984, pp. 419-424.
13. Kowalski, T.J. An Artificial Intelligence Approach to VLSI Design. Kluwer Academic Publishers, Boston, 1985.
14. Kurdahi, F.J. and Parker, A.C. PLEST: A Program for Area Estimation of VLSI Integrated Circuits. In Proceedings of the 23rd Design Automation Conference, ACM and IEEE, June 1986, pp. 467-473.
15. Kurdahi, F.J. and Parker, A.C. REAL: A Program for REgister ALlocation. In Proceedings of the 24th Design Automation Conference, ACM and IEEE, June 1987, pp. 210-215.
16. McFarland, M.C. The VT: A Database for Automated Digital Design. DRC-01-4-80, Design Research Center, Carnegie-Mellon University, December 1978.
17. McFarland, M.C. Using Bottom-Up Design Techniques in the Synthesis of Digital Hardware from Abstract Behavioral Descriptions. In Proceedings of the 23rd Design Automation Conference, IEEE and ACM, June 1986.
18. McFarland, M.C. and Parker, A.C. An Abstract Model of Behavior for Hardware Descriptions. IEEE Transactions on Computers C-32, 7 (July 1983), 621-636.
19. Nestor, J.A. Specification & Synthesis of Digital Systems with Interfaces. CMUCAD-87-10, Department of Electrical and Computer Engineering, Carnegie-Mellon University, April 1987.
20. Park, N. and Parker, A.C. Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications. IEEE Transactions on Computer-Aided Design of Digital Circuits and Systems 7, 3 (March 1988), 356-370.
21. Parker, A.C., Pizarro, J., and Mlinar, M. MAHA: A Program for Datapath Synthesis. In Proceedings of the 23rd Design Automation Conference, ACM and IEEE, June 1986, pp. 461-466.
22. Paulin, P.G. and Knight, J.P. Force-Directed Scheduling in Automatic Data Path Synthesis. In Proceedings of the 24th Design Automation Conference, ACM and IEEE, June 1987, pp. 195-202.
23. Peng, Z. Synthesis of VLSI Systems with the CAMAD Design Aid. In Proceedings of the 23rd Design Automation Conference, IEEE and ACM, June 1986, pp. 278-284.
24. Rajan, J.V. and Thomas, D.E. Synthesis by Delayed Binding of Decisions. In Proceedings of the 22nd Design Automation Conference, ACM and IEEE, June 1985, pp. 367-373.
25. Rosenstiel, W. and Camposano, R. Synthesizing Circuits from Behavioral Level Specifications. In Proceedings of the 7th International Conference on Computer Hardware Description Languages and their Applications, C. Koomen and T. Moto-oka, Eds., North-Holland, August 1985, pp. 391-402.
26. Snow, E.A., Siewiorek, D.P., and Thomas, D.E. A Technology-Relative Computer-Aided Design System: Abstract Representations, Transformations, and Design Tradeoffs. In Proceedings of the 15th Design Automation Conference, ACM and IEEE, 1978, pp. 220-226.
27. Trickey, H. Flamel: A High-Level Hardware Compiler. IEEE Transactions on Computer-Aided Design CAD-6, 2 (March 1987), 259-269.
28. Tseng, C. and Siewiorek, D.P. Automated Synthesis of Data Paths in Digital Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems CAD-5, 3 (July 1986), 379-395.
29. Zimmermann, G. MDS: The Mimola Design Method. Journal of Digital Systems 4, 3 (1980), 337-369.
