Tutorial on High-Level Synthesis

The Yorktown Silicon Compiler (YSC) [4] does allocation and scheduling together, but in a different way. It begins with each operation being done on a separate functional unit and all operations being done in the same control step. Additional control steps are added for loop boundaries, and as required to avoid conflicts over register and memory usage. The hardware is then optimized so as to share resources as much as possible. If there is too much hardware or there are too many operations chained together in the same control step, more control steps are added and the datapath structure is again optimized. This process is repeated until the hardware and time constraints are met.

Finally, functional unit allocation can be done first, followed by scheduling. In the BUD system [17], operations are first partitioned into clusters, using a metric that takes into account potential functional unit sharing, interconnect, and parallelism. Then functional units are assigned to each cluster and the scheduling is done. The number of clusters to be used is determined by searching through a range of possible clusterings, choosing the one that best meets the design objectives.

In the Karlsruhe CADDY/DSL system [25], the datapath is built first, assuming maximal parallelism. This is then optimized, locally and globally, guided by both area constraints and timing. The operations are then scheduled, subject to the constraints imposed by the datapath.

3.1.2 Scheduling Algorithms

There are two basic classes of scheduling algorithms: transformational and iterative/constructive. A transformational algorithm begins with a default schedule, usually either maximally serial or maximally parallel, and applies transformations to it to obtain other schedules. The transformations move serial operations in parallel and parallel operations in series.
Transformational algorithms differ in how they choose what transformations to apply. Barbacci's EXPL [1], one of the earliest high-level synthesis systems, used exhaustive search. That is, it tried all possible combinations of serial and parallel transformations and chose the best design found. This method has the advantage that it looks through all possible designs, but of course it is computationally very expensive and not practical for sizable designs. Exhaustive search can be improved somewhat by using branch-and-bound techniques, which cut off the search along any path that can be recognized to be suboptimal.

Another approach to scheduling by transformation is to use heuristics to guide the process. Transformations are chosen that promise to move the design closer to the given constraints or to optimize the objective. This is the approach used, for example, in the Yorktown Silicon Compiler [4] and the CAMAD design system [23]. The transformations used in the YSC can be shown to produce a fastest possible schedule for a given specification.

The other class of algorithms, the iterative/constructive ones, build up a schedule by adding operations one at a time until all the operations have been scheduled. They differ in how the next operation to be scheduled is chosen and in how they determine where to schedule each operation.

The simplest type of scheduling, as soon as possible (ASAP) scheduling, is local both in the selection of the operation to be scheduled and in where it is placed. ASAP scheduling assumes that the number of functional units has already been specified. Operations are first sorted topologically; that is, if operation x2 is constrained to follow operation x1 by some necessary dataflow or control relationship, then x2 will follow x1 in the topological order. Operations are taken from the list in order and each is put into the earliest control step possible, given its dependence on other operations and the limits on resource usage. Figure 3 shows a dataflow graph and its ASAP schedule. This was the type of scheduling used in the CMUDA system [10], in the MIMOLA system, and in Flamel. The problem with this algorithm is that no priority is given to operations on the critical path, so that when there are limits on resource usage, operations that are less critical can be scheduled first on limited resources and thus block critical operations. This is shown in Figure 3, where operation 1 is scheduled ahead of operation 2, which is on the critical path, so that operation 2 is scheduled later than is necessary, forcing a longer than optimal schedule.

Figure 3. ASAP Scheduling

List scheduling overcomes this problem by using a more global criterion for selecting the next operation to be scheduled. For each control step to be scheduled, the operations that are available to be scheduled into that control step, that is, those whose predecessors have already been scheduled, are kept in a list, ordered by some priority function. Each operation on the list is taken in turn and is scheduled if the resources it needs are still free in that step; otherwise it is deferred to the next step. When no more operations can be scheduled, the algorithm moves to the next control step, the available operations are found and ordered, and the process is repeated. This continues until all the operations have been scheduled. Studies have shown that this form of scheduling works nearly as well as branch-and-bound scheduling in microcode optimization [6]. Figure 4 shows a list schedule for the graph in Figure 3. Here the priority is the length of the path from the operation to the end of the block. Since operation 2 has a higher priority than operation 1, it is scheduled first, giving an optimal schedule for this case.

A number of schedulers use list scheduling, though they differ somewhat in the priority function they use. The scheduler in the BUD system uses the length of the path from the operation to the end of the block it is in. Elf [8] and ISYN [19] use the "urgency" of an operation, the length of the shortest path from that operation to the nearest local constraint.

Figure 4. A List Schedule
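The ASAP and list-scheduling procedures described above can be sketched in a few lines. This is an illustrative sketch, not code from any of the systems cited: the graph encoding (each operation mapped to its list of predecessors), the single shared functional-unit type, and the priority function (longest path to the end of the block, as in BUD) are all assumptions made for the example.

```python
# Sketch of resource-constrained list scheduling. ASAP scheduling under a
# resource limit falls out as the special case of a constant priority
# (first-come, first-served).

def path_to_end(op, succs, memo=None):
    """Priority: length of the longest path from op to the end of the block."""
    if memo is None:
        memo = {}
    if op not in memo:
        memo[op] = 1 + max((path_to_end(s, succs, memo) for s in succs[op]),
                           default=0)
    return memo[op]

def list_schedule(preds, num_units, priority):
    """preds: op -> list of predecessor ops; returns op -> control step."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    schedule, done, step = {}, set(), 0
    while len(done) < len(preds):
        step += 1
        # Available operations: all predecessors scheduled in earlier steps.
        ready = [op for op in preds
                 if op not in done and all(p in done for p in preds[op])]
        ready.sort(key=lambda op: priority(op, succs), reverse=True)
        for op in ready[:num_units]:        # fill the units free in this step
            schedule[op] = step
        done.update(ready[:num_units])
    return schedule

# Invented example in the spirit of Figure 3: the critical path runs
# b1 -> b2 -> b3, while a1 and a2 are independent; two adders are available.
preds = {'a1': [], 'a2': [], 'b1': [], 'b2': ['b1'], 'b3': ['b2']}
lst  = list_schedule(preds, 2, path_to_end)       # critical-path priority
asap = list_schedule(preds, 2, lambda op, s: 0)   # no priority (FIFO)
print(max(lst.values()), max(asap.values()))      # -> 3 4
```

With the path-length priority, b1 wins a unit in step 1 and the schedule finishes in three steps; with no priority, a1 and a2 grab both units first and block the critical path, stretching the schedule to four steps, mirroring the situation described for Figure 3.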

The last type of scheduling algorithm we will consider is global both in the way it selects the next operation to be scheduled and in the way it decides the control step in which to put it. In this type of algorithm, the range of possible control step assignments for each operation is calculated, given the time constraints and the precedence relations between the operations. In freedom-based scheduling, the operations on the critical path are scheduled first and assigned to functional units. Then the other operations are scheduled and assigned one at a time. At each step the unscheduled operation with the least freedom, that is, the one with the smallest range of control steps into which it can go, is chosen, so that operations that might present more difficult scheduling problems are taken care of first, before they become blocked.

In force-directed scheduling, the range of possible control steps for each operation is used to form a so-called distribution graph. The distribution graph shows, for each control step, how heavily loaded that step is, given that all possible schedules are equally likely. If an operation could be done in any of k control steps, then 1/k is added to each of those control steps in the graph. For example, Figure 5 shows a dataflow graph, the range of steps for each operation, and the corresponding distribution graph for the addition operations, assuming a time constraint of three control steps. Addition a1 must be scheduled in step 1, so it contributes 1 to that step. Similarly, addition a2 adds 1 to control step 2. Addition a3 could be scheduled in either step 2 or step 3, so it contributes 1/2 to each. Operations are then selected and placed so as to balance the distribution as much as possible. In the above example, a3 would first be scheduled into step 3, since that would have the greatest effect in balancing the graph.

Figure 5. A Distribution Graph

3.2 Data Path Allocation

Data path allocation involves mapping operations onto functional units, assigning values to registers, and providing interconnections between operators and registers using buses and multiplexers. The decision to use ALUs instead of simple operators is also made at this time. The optimization goal is usually to minimize some objective function, such as
- total interconnect length,
- total register, bus driver and multiplexer cost, or
- critical path delays.
There may also be one or more constraints on the design which limit total area of the design, total throughput, or delay from input to output.

The techniques used to perform data path allocation can be classified into two types, iterative/constructive and global. Iterative/constructive techniques assign elements one at a time, while global techniques find simultaneous solutions to a number of assignments at a time. Exhaustive search is an extreme case of a global solution technique. Iterative/constructive techniques generally look at less of the search space than global techniques, and therefore are more efficient, but are less likely to find optimal solutions.

3.2.1 Iterative/Constructive Techniques

Iterative/constructive techniques select an operation, value or interconnection to be assigned, make the assignment, and then iterate. The rules which determine the next operation, value or interconnect to be selected can vary from global rules, which examine many or all items before selecting one, to local selection rules, which select the items in a fixed order, usually as they occur in the data flow graph from inputs to outputs. Global selection involves selecting a candidate for assignment on the basis of some metric, for example taking the candidate that would add the minimum additional cost to the design. Hafer's data path allocator, the first RT synthesis program which dealt with TTL chips, was iterative and used local selection [9].
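The distribution-graph bookkeeping used by force-directed scheduling, described in the scheduling section above, reduces to a short computation. The sketch below is illustrative only: the encoding of each operation's legal range as an inclusive (earliest, latest) pair of control steps is an assumption for the example, not a data structure from the force-directed literature.

```python
# Sketch of the distribution graph from force-directed scheduling: each
# operation spreads a total weight of 1 uniformly over the control steps
# it could legally occupy.

def distribution_graph(ranges, num_steps):
    """ranges: op -> (earliest, latest) legal control step, inclusive."""
    dg = [0.0] * (num_steps + 1)           # index 1..num_steps used
    for earliest, latest in ranges.values():
        k = latest - earliest + 1           # number of candidate steps
        for step in range(earliest, latest + 1):
            dg[step] += 1.0 / k             # each candidate step gets 1/k
    return dg[1:]

# The addition operations of the Figure 5 example: a1 is fixed in step 1,
# a2 in step 2, and a3 may go in step 2 or step 3.
adds = {'a1': (1, 1), 'a2': (2, 2), 'a3': (2, 3)}
print(distribution_graph(adds, 3))          # -> [1.0, 1.5, 0.5]
```

The result matches the worked example: step 2 is the most heavily loaded (1.5), so scheduling a3 into step 3 best balances the graph.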
The DAA used a local criterion to select which element to assign next, but chose where to assign it on the basis of rules that encoded expert knowledge about the data path design of microprocessors. Once this knowledge base had been tested and improved through repeated interviews with designers, the DAA was able to produce much cleaner data paths than when it began [13, pages 26-31]. EMUCS [10] used a global selection criterion, based on minimizing both the number of functional units and registers and the multiplexing needed, to choose the next element to assign and where to assign it. The Elf system also sought to minimize interconnect, but used a local selection criterion. The REAL program [15] separated out register allocation and performed it after scheduling, but prior to operator and interconnect allocation. REAL is constructive, and selects the earliest value to assign at each step, sharing registers among values whenever possible.

Figure 6. Greedy Data Path Allocation

An example of greedy allocation is shown in Figure 6. The dataflow graph on the left is processed from earliest time step to latest. Operators, registers and interconnect are allocated for each time step in sequence. Thus, the selection rule is local, and the allocation constructive. Assignments are made so as to minimize interconnect. In the case shown in the figure, a2 was assigned to adder2 since the increase in multiplexing cost required by that allocation was zero. a4 was assigned to adder1 because there was already a connection from the register to that adder. Other variations are possible, each with different multiplexing costs. For example, if we had assigned a2 to adder1 and a4 to adder1 without checking for interconnection costs, then the final multiplexing would have been more expensive. A more global selection rule also could have been applied.
For example, we could have selected the next item for allocation on the basis of minimization of cost increase. In this case, if we had already allocated a3 to adder2, then the next step would be to allocate a4 to the same adder, since they occur in different time steps, and the incremental cost of doing that assignment is less than assigning a2 to adder1.
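The greedy, interconnect-minimizing allocation illustrated by Figure 6 can be sketched as follows. Everything here is an illustrative assumption: the operation names echo the figure, but the source registers, the cost model (counting new source-to-unit connections as a stand-in for multiplexing cost), and the data structures are invented for the example.

```python
# Sketch of greedy data path allocation in the style of Figure 6:
# operations are taken in time-step order (a local selection rule) and
# bound constructively to a free functional unit, preferring the unit
# that needs the fewest new source connections.

def greedy_allocate(ops, num_units):
    """ops: list of (name, step, sources), in increasing time-step order.
    Assumes the schedule never needs more than num_units units per step."""
    busy = set()                                  # (unit, step) pairs in use
    inputs = [set() for _ in range(num_units)]    # sources wired to each unit
    binding = {}
    for name, step, sources in ops:
        free = [u for u in range(num_units) if (u, step) not in busy]
        # Local rule: pick the free unit adding the fewest new connections.
        best = min(free, key=lambda u: len(set(sources) - inputs[u]))
        busy.add((best, step))
        inputs[best].update(sources)
        binding[name] = best
    return binding

# Invented operand registers for the four additions of the figure.
ops = [('a1', 1, ('r1', 'r2')), ('a2', 1, ('r3', 'r4')),
       ('a3', 2, ('r1', 'r5')), ('a4', 3, ('r1', 'r6'))]
print(greedy_allocate(ops, 2))   # -> {'a1': 0, 'a2': 1, 'a3': 0, 'a4': 0}
```

As in the figure's narrative, a3 and a4 gravitate to the unit that already has a connection from register r1, so no new multiplexer input is paid for that operand; a more global rule would instead rank all pending assignments by incremental cost before choosing.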

