Tutorial on High-Level Synthesis

Michael C. McFarland, SJ, Boston College, Chestnut Hill, MA 02167
Alice C. Parker, University of Southern California, Los Angeles, CA 90007
Raul Camposano, IBM T.J. Watson Research Center, Yorktown Heights, NY

25th ACM/IEEE Design Automation Conference

Abstract. High-level synthesis takes an abstract behavioral specification of a digital system and finds a register-transfer level structure that realizes the given behavior. In this tutorial we will examine the high-level synthesis task, showing how it can be decomposed into a number of distinct but not independent subtasks. Then we will present the techniques that have been developed for solving those subtasks. Finally, we will note those areas related to high-level synthesis that are still open problems.

1. Introduction

1.1 What is High-Level Synthesis?

The synthesis task is to take a specification of the behavior required of a system, together with a set of constraints and goals to be satisfied, and to find a structure that implements the behavior while satisfying the goals and constraints. By behavior we mean the way the system or its components interact with their environment, i.e., the mapping from inputs to outputs. Structure refers to the set of interconnected components that make up the system, something like a netlist. Usually there are many different structures that can be used to realize a given behavior. One of the tasks of synthesis is to find the structure that best meets the constraints, such as limitations on cycle time, area or power, while minimizing other costs. For example, the goal might be to minimize area while achieving a certain minimum processing rate.

Synthesis can take place at various levels of abstraction because designs can be described at various levels of detail. The type of synthesis we will focus on in this tutorial begins with a behavioral specification at what is often called the algorithmic level. The primary data types at this level are integers and/or bit strings and arrays, rather than boolean variables. The input specification gives the required mappings from sequences of inputs to sequences of outputs. It should constrain the internal structure of the system to be designed as little as possible. From that input specification, the synthesis system produces a description of a register-transfer level structure that realizes the specified behavior. This structure includes a data path, that is, a network of registers, functional units, multiplexers and buses, as well as hardware to control the data transfers in that network.
If the control is not integrated into the data path, and it usually is not, the synthesis system must also produce the specification of a finite state machine that drives the data path so as to produce the required behavior. The control specification could be in terms of microcode, a PLA profile or random logic.

High-level synthesis as we define it must be distinguished from other types of synthesis, which operate at different levels of the design hierarchy. For example, high-level synthesis is not to be confused with logic synthesis, where the system is specified in terms of logic equations, which must be optimized and mapped into a given technology. Logic synthesis might in fact be used on a design after high-level synthesis has been done, since it presupposes the sorts of decisions that high-level synthesis makes. At the other end of the spectrum, there is some promising work under way on system-level synthesis, for example on partitioning an algorithm into multiple processes that can run in parallel or be pipelined. However, this work is still in its preliminary stages, and we will not report on it here.

1.2 Why Study High-Level Synthesis?

In recent years there has been a trend toward automating synthesis at higher and higher levels of the design hierarchy. Logic synthesis is gaining acceptance in industry, and there has been considerable interest shown in synthesis at higher levels. There are a number of reasons for this:

- Shorter design cycle. If more of the design process is automated, a company can get a design out the door faster, and thus have a better chance of hitting the market window for that design. Furthermore, since much of the cost of the chip is in design development, automating more of that process can lower the cost significantly.

- Fewer errors. If the synthesis process can be verified to be correct, by no means a trivial task, there is a greater assurance that the final design will correspond to the initial specification. This will mean fewer errors and less debugging time for new chips.

- The ability to search the design space. A good synthesis system can produce several designs for the same specification in a reasonable amount of time. This allows the developer to explore different trade-offs between cost, speed, power and so on, or to take an existing design and produce a functionally equivalent one that is faster or less expensive.

- The design process is self-documenting. An automated system can keep track of what design decisions were made and why, and what the effect of those decisions was.
- Availability of IC technology to more people. As more design expertise is moved into the synthesis system, it becomes easier for a non-expert to produce a chip that meets a given set of specifications.

We expect this trend toward higher levels of synthesis to continue. Already there are a number of research groups working on high-level synthesis, and a great deal of progress has been made in finding good techniques for optimization and for exploring design trade-offs. These techniques are very important because decisions made at the algorithmic level tend to have a much greater impact on the design than those at lower levels.

There is now a sizable body of knowledge on high-level synthesis, although for the most part it has not yet been systematized. In the remainder of this paper, we will describe what the problems are in high-level synthesis, and what techniques have been developed to solve them. To that end, Section 2 will describe the various tasks involved in developing a register-transfer level structure from an algorithmic-level specification. Section 3 will describe the basic techniques that have been developed for performing those tasks. Finally, Section 4 will look at those issues that have not been adequately addressed and thus provide promising areas for future research.


2. The Synthesis Task

The system to be designed is usually represented at the algorithmic level by a programming language such as Pascal [27] or Ada [8], or by a hardware description language that is similar to a programming language, such as ISPS [2]. Most of the languages used are procedural languages. That is, they describe data manipulation in terms of assignment statements that are organized into larger blocks using standard control constructs for sequential execution, conditional execution and iteration. There have been experiments, however, with various types of non-procedural hardware description languages, including applicative, LISP-like languages [11] and declarative languages such as Prolog.

The first step in high-level synthesis is usually the compilation of the formal language into an internal representation. Two types of internal representations are generally used: parse trees and graphs. Most approaches use variations of graphs that contain both the data flow and the control flow implied by the specification [16], [26], [12]. Fig. 1 shows a part of a simple program that computes the square root of X using Newton's method, along with its graphical representation. The number of iterations necessary in practice is very small; in the example, 4 iterations were chosen. A first-degree minimax polynomial approximation for the input interval gives the initial value. The data-flow and control-flow graphs are shown separately in the figure for intelligibility. The control graph is derived directly from the explicit order given in the program and from the compiler's choice of how to parse the arithmetic expressions. The data-flow graph shows the essential ordering of operations in the program imposed by the data relations in the specification. For example, in fig. 1, the addition at the top of the diagram depends for its input on data produced by the multiplication. This implies that the multiplication must be done first in any valid ordering of the operations. On the other hand, there is no dependence between the I + 1 operation inside the loop and any of the operations in the chain that calculates Y, so the I + 1 may be done in parallel with those operations, as well as before or after them. The data-flow graph can also be used to remove the dependence on the way internal variables are used in the specification, since each value produced by one operation and consumed by another is represented uniquely by an arc.
This ability to reassign variables is important both for reordering operations and for simplifying the data paths.

    Y := 0.222222 + 0.888889 * X;
    I := 0;
    DO UNTIL I > 3 LOOP
        Y := 0.5 * (Y + X / Y);
        I := I + 1;
    END LOOP;

Figure 1. High-level specification and graph for the square-root example.

The rest of this section outlines the various steps used in turning the intermediate form into an RT-level structure, using the square-root example to illustrate the different steps.

Since the specification has been written for human readability and not for direct translation into hardware, it is desirable to do some initial optimization of the internal representation. These high-level transformations include such compiler-like optimizations as dead-code elimination, constant propagation, common subexpression elimination, inline expansion of procedures and loop unrolling. Local transformations, including those that are more specific to hardware, are also used. In the example, the loop-ending criterion can be changed to I = 0 by using a two-bit variable for I. The multiplication by 0.5 can be replaced by a right shift by one. The addition of 1 to I can be replaced by an increment operation. The internal representation after these optimizations is depicted on the left in fig. 2. Loop unrolling can also be done in this case, since the number of iterations is fixed and small.

The next two steps in synthesis are the core of transforming behavior into structure: scheduling and allocation. They are closely interrelated and depend on each other. Scheduling consists in assigning the operations to so-called control steps. A control step is the fundamental sequencing unit in synchronous systems; it corresponds to a clock cycle. Allocation consists in assigning the operations to hardware, i.e., allocating functional units, storage and communication paths.

The aim of scheduling is to minimize the amount of time or the number of control steps needed for completion of the program, given certain limits on the available hardware resources. In our example, a trivial special case uses just one functional unit and one memory. Each operation has to be scheduled in a different control step, so the computation takes 3 + 4*5 = 23 control steps. To speed up the computation at the expense of adding more hardware, the control graph can be packed into control steps as tightly as possible, observing only the essential dependencies required by the data-flow graph and by the loop boundaries. This form is shown in fig. 2.
Notice that two dummy nodes were introduced to delimit the loop boundaries. Since the shift operation is free, with two functional units the operations can now be scheduled in 2 + 4*2 = 10 control steps.

Figure 2. Optimized control graph and schedule.
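To make the running example concrete, here is a small Python sketch of the Figure 1 specification and of the strength-reduced form described above. It is illustration only: the function names, the fixed-point encoding and the test value are our own assumptions, not part of the paper's example.

```python
def sqrt_spec(x):
    """Behavioral specification of Fig. 1: Newton's method with a
    first-degree polynomial start value and four iterations."""
    y = 0.222222 + 0.888889 * x   # initial approximation
    i = 0
    while not (i > 3):            # DO UNTIL I > 3
        y = 0.5 * (y + x / y)     # Newton-Raphson update
        i = i + 1
    return y

def sqrt_optimized(x_fixed, frac_bits=16):
    """Sketch of the same computation after the local transformations
    discussed in the text: the multiplication by 0.5 becomes a right
    shift, I becomes a two-bit counter whose wrap-around to 0 ends the
    loop, and I + 1 becomes an increment.  The fixed-point format
    (frac_bits) is our own assumption, added so the shift is meaningful."""
    one = 1 << frac_bits
    y = (14564 * one >> 16) + ((58254 * x_fixed) >> 16)  # ~0.2222 + 0.8889*x
    i = 0
    while True:
        y = (y + (x_fixed * one) // y) >> 1   # 0.5*(Y + X/Y) as a shift
        i = (i + 1) & 0b11                    # two-bit counter
        if i == 0:                            # loop ends when counter wraps
            break
    return y

if __name__ == "__main__":
    x = 0.5
    print(sqrt_spec(x))                                    # close to 0.7071
    print(sqrt_optimized(int(x * (1 << 16))) / (1 << 16))  # same result, fixed point
```

Both versions compute the same mapping from X to Y; what changes between them is only the set of operations the data-flow graph will contain, which is exactly what the local transformations are meant to exploit.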


The Yorktown Silicon Compiler (YSC) [4] does allocation and scheduling together, but in a different way. It begins with each operation being done on a separate functional unit and all operations being done in the same control step. Additional control steps are added for loop boundaries, and as required to avoid conflicts over register and memory usage. The hardware is then optimized so as to share resources as much as possible. If there is too much hardware or there are too many operations chained together in the same control step, more control steps are added and the data path structure is again optimized. This process is repeated until the hardware and time constraints are met.

Finally, functional unit allocation can be done first, followed by scheduling. In the BUD system [17], operations are first partitioned into clusters, using a metric that takes into account potential functional unit sharing, interconnect, and parallelism. Then functional units are assigned to each cluster and the scheduling is done. The number of clusters to be used is determined by searching through a range of possible clusterings, choosing the one that best meets the design objectives.

In the Karlsruhe CADDY/DSL system [25], the data path is built first, assuming maximal parallelism. This is then optimized, locally and globally, guided by both area constraints and timing. The operations are then scheduled, subject to the constraints imposed by the data path.

3.1.2 Scheduling Algorithms

There are two basic classes of scheduling algorithms: transformational and iterative/constructive. A transformational algorithm begins with a default schedule, usually either maximally serial or maximally parallel, and applies transformations to it to obtain other schedules. The transformations move serial operations into parallel and parallel operations into series. Transformational algorithms differ in how they choose which transformations to apply.

Barbacci's EXPL [1], one of the earliest high-level synthesis systems, used exhaustive search. That is, it tried all possible combinations of serial and parallel transformations and chose the best design found. This method has the advantage that it looks through all possible designs, but of course it is computationally very expensive and not practical for sizable designs. Exhaustive search can be improved somewhat by using branch-and-bound techniques, which cut off the search along any path that can be recognized to be suboptimal.
Another approach to scheduling by transformation is to use heuristics to guide the process. Transformations are chosen that promise to move the design closer to the given constraints or to optimize the objective. This is the approach used, for example, in the Yorktown Silicon Compiler [4] and the CAMAD design system [23]. The transformations used in the YSC can be shown to produce a fastest possible schedule for a given specification.

The other class of algorithms, the iterative/constructive ones, build up a schedule by adding operations one at a time until all the operations have been scheduled. They differ in how the next operation to be scheduled is chosen and in how they determine where to schedule each operation.

The simplest type of scheduling, as-soon-as-possible (ASAP) scheduling, is local both in the selection of the operation to be scheduled and in where it is placed. ASAP scheduling assumes that the number of functional units has already been specified. Operations are first sorted topologically; that is, if operation x2 is constrained to follow operation x1 by some necessary dataflow or control relationship, then x2 will follow x1 in the topological order. Operations are taken from the list in order and each is put into the earliest control step possible, given its dependence on other operations and the limits on resource usage. Figure 3 shows a dataflow graph and its ASAP schedule. This was the type of scheduling used in the CMUDA system [10], in the MIMOLA system and in Flamel. The problem with this algorithm is that no priority is given to operations on the critical path, so that when there are limits on resource usage, operations that are less critical can be scheduled first on limited resources and thus block critical operations. This is shown in Figure 3, where operation 1 is scheduled ahead of operation 2, which is on the critical path, so that operation 2 is scheduled later than is necessary, forcing a longer than optimal schedule.

Figure 3. ASAP scheduling.

List scheduling overcomes this problem by using a more global criterion for selecting the next operation to be scheduled. For each control step to be scheduled, the operations that are available to be scheduled into that control step, that is, those whose predecessors have already been scheduled, are kept in a list, ordered by some priority function. Each operation on the list is taken in turn and is scheduled if the resources it needs are still free in that step; otherwise it is deferred to the next step. When no more operations can be scheduled, the algorithm moves to the next control step, the available operations are found and ordered, and the process is repeated. This continues until all the operations have been scheduled. Studies have shown that this form of scheduling works nearly as well as branch-and-bound scheduling in microcode optimization [6]. Figure 4 shows a list schedule for the graph in Figure 3. Here the priority is the length of the path from the operation to the end of the block. Since operation 2 has a higher priority than operation 1, it is scheduled first, giving an optimal schedule for this case.

Figure 4. A list schedule.

A number of schedulers use list scheduling, though they differ somewhat in the priority function they use. The scheduler in the BUD system uses the length of the path from the operation to the end of the block it is in. Elf [8] and ISYN [19] use the "urgency" of an operation, the length of the shortest path from that operation to the nearest local constraint.
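As an illustration of the list-scheduling loop just described, the sketch below schedules a small dataflow graph in Python. The graph encoding, the single resource class with unit-delay operations, and the longest-path-to-end priority are simplifying assumptions made for this example; it is not code from any of the systems cited.

```python
from collections import defaultdict

def path_length(op, succs):
    """Priority: length of the longest path from op to the end of the block."""
    return 1 + max((path_length(s, succs) for s in succs[op]), default=0)

def list_schedule(ops, preds, succs, num_units):
    """Assign each operation to a control step, using at most num_units
    operations per step (one resource class, unit delay, for simplicity)."""
    step_of, step = {}, 0
    while len(step_of) < len(ops):
        step += 1
        # operations whose predecessors are all scheduled in earlier steps
        ready = [o for o in ops if o not in step_of
                 and all(step_of.get(p, step) < step for p in preds[o])]
        ready.sort(key=lambda o: path_length(o, succs), reverse=True)
        for o in ready[:num_units]:          # fill the step, highest priority first
            step_of[o] = step
    return step_of

# Tiny example: a three-operation critical chain c1 -> c2 -> c3 plus three
# independent operations i1..i3, with two functional units per control step.
preds = {'c1': [], 'c2': ['c1'], 'c3': ['c2'], 'i1': [], 'i2': [], 'i3': []}
succs = defaultdict(list)
for o, ps in preds.items():
    succs.setdefault(o, [])
    for p in ps:
        succs[p].append(o)
print(list_schedule(list(preds), preds, succs, num_units=2))
# Priority-driven selection keeps the critical chain moving, so the six
# operations fit in three control steps; a scheduler that filled each step
# in plain ready order could delay c1 and need four, which is the kind of
# blocking illustrated by Figure 3.
```

ASAP scheduling can be seen as the degenerate case of this loop in which the ready list is taken in topological order rather than sorted by a global priority.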


The last type of scheduling algorithm we will consider is global both in the way it selects the next operation to be scheduled and in the way it decides the control step in which to put it. In this type of algorithm, the range of possible control-step assignments for each operation is calculated, given the time constraints and the precedence relations between the operations. In freedom-based scheduling, the operations on the critical path are scheduled first and assigned to functional units. Then the other operations are scheduled and assigned one at a time. At each step the unscheduled operation with the least freedom, that is, the one with the smallest range of control steps into which it can go, is chosen, so that operations that might present more difficult scheduling problems are taken care of first, before they become blocked.

In force-directed scheduling, the range of possible control steps for each operation is used to form a so-called distribution graph. The distribution graph shows, for each control step, how heavily loaded that step is, given that all possible schedules are equally likely. If an operation could be done in any of k control steps, then 1/k is added to each of those control steps in the graph. For example, Figure 5 shows a dataflow graph, the range of steps for each operation, and the corresponding distribution graph for the addition operations, assuming a time constraint of three control steps. Addition a1 must be scheduled in step 1, so it contributes 1 to that step. Similarly, addition a2 adds 1 to control step 2. Addition a3 could be scheduled in either step 2 or step 3, so it contributes 1/2 to each. Operations are then selected and placed so as to balance the distribution as much as possible. In the above example, a3 would first be scheduled into step 3, since that would have the greatest effect in balancing the graph.

Figure 5. A distribution graph.
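Below is a minimal Python sketch of the distribution-graph bookkeeping just described, using the a1/a2/a3 ranges from the Figure 5 discussion. It computes only the per-step load; the full force computation and the scheduler built around it, as in force-directed scheduling [22], are omitted.

```python
from fractions import Fraction

def distribution_graph(ranges, num_steps):
    """Given each operation's range of feasible control steps, add 1/k to
    every step in a k-step range and return the load on each step."""
    load = {s: Fraction(0) for s in range(1, num_steps + 1)}
    for first, last in ranges.values():
        k = last - first + 1
        for s in range(first, last + 1):
            load[s] += Fraction(1, k)
    return load

# The additions from the Figure 5 discussion: a1 fixed in step 1, a2 fixed
# in step 2, a3 free to go in step 2 or step 3 (three-step time constraint).
add_ranges = {'a1': (1, 1), 'a2': (2, 2), 'a3': (2, 3)}
print(distribution_graph(add_ranges, num_steps=3))
# Step loads are 1, 3/2 and 1/2: placing a3 in step 3 best balances the
# graph, which is why it is the first operation scheduled in the example.
```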
3.2 Data Path Allocation

Data path allocation involves mapping operations onto functional units, assigning values to registers, and providing interconnections between operators and registers using buses and multiplexers. The decision to use ALUs instead of simple operators is also made at this time. The optimization goal is usually to minimize some objective function, such as

- total interconnect length,
- total register, bus driver and multiplexer cost, or
- critical path delays.

There may also be one or more constraints on the design which limit the total area of the design, the total throughput, or the delay from input to output.

The techniques used to perform data path allocation can be classified into two types: iterative/constructive and global. Iterative/constructive techniques assign elements one at a time, while global techniques find simultaneous solutions to a number of assignments at a time. Exhaustive search is an extreme case of a global solution technique. Iterative/constructive techniques generally look at less of the search space than global techniques, and therefore are more efficient, but are less likely to find optimal solutions.

3.2.1 Iterative/Constructive Techniques

Iterative/constructive techniques select an operation, value or interconnection to be assigned, make the assignment, and then iterate. The rules which determine the next operation, value or interconnect to be selected can vary from global rules, which examine many or all items before selecting one, to local selection rules, which select the items in a fixed order, usually as they occur in the data flow graph from inputs to outputs. Global selection involves selecting a candidate for assignment on the basis of some metric, for example taking the candidate that would add the minimum additional cost to the design. Hafer's data path allocator, the first RT synthesis program that dealt with TTL chips, was iterative and used local selection [9]. The DAA used a local criterion to select which element to assign next, but chose where to assign it on the basis of rules that encoded expert knowledge about the data path design of microprocessors. Once this knowledge base had been tested and improved through repeated interviews with designers, the DAA was able to produce much cleaner data paths than when it began [13, pages 26-31]. EMUCS [10] used a global selection criterion, based on minimizing both the number of functional units and registers and the multiplexing needed, to choose the next element to assign and where to assign it. The Elf system also sought to minimize interconnect, but used a local selection criterion. The REAL program [15] separated out register allocation and performed it after scheduling, but prior to operator and interconnect allocation. REAL is constructive, and selects the earliest value to assign at each step, sharing registers among values whenever possible.

Figure 6. Greedy data path allocation.

An example of greedy allocation is shown in fig. 6. The dataflow graph on the left is processed from earliest time step to latest. Operators, registers and interconnect are allocated for each time step in sequence. Thus, the selection rule is local, and the allocation constructive. Assignments are made so as to minimize interconnect. In the case shown in the figure, a2 was assigned to adder2 since the increase in multiplexing cost required by that allocation was zero. a4 was assigned to adder1 because there was already a connection from the register to that adder. Other variations are possible, each with different multiplexing costs. For example, if we had assigned a2 to adder1 and a4 to adder1 without checking for interconnection costs, then the final multiplexing would have been more expensive. A more global selection rule also could have been applied. For example, we could have selected the next item for allocation on the basis of minimization of cost increase. In this case, if we had already allocated a3 to adder2, then the next step would be to allocate a4 to the same adder, since they occur in different time steps, and the incremental cost of doing that assignment is less than assigning a2 to adder1.
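The sketch below, in Python, illustrates the greedy, interconnect-minimizing binding rule described above. The operation list, the time steps and the unit-cost multiplexer model are invented stand-ins for the Figure 6 example, not a reconstruction of it.

```python
def greedy_bind(ops, adders, mux_cost=1):
    """Bind each addition to an adder, time step by time step, choosing the
    adder whose existing input connections make the new binding cheapest."""
    connections = {a: set() for a in adders}   # adder -> operand sources already wired in
    binding, busy = {}, {}                     # op -> adder, (adder, step) -> op
    for op, step, sources in sorted(ops, key=lambda t: t[1]):
        best, best_cost = None, None
        for a in adders:
            if (a, step) in busy:              # adder already used in this time step
                continue
            # each operand not already wired to this adder needs a new mux input
            cost = mux_cost * len(set(sources) - connections[a])
            if best_cost is None or cost < best_cost:
                best, best_cost = a, cost
        binding[op], busy[(best, step)] = best, op
        connections[best].update(sources)
    return binding

# Hypothetical data: (operation, time step, source operands)
ops = [('a1', 1, ('r1', 'r2')), ('a2', 1, ('r3', 'r4')),
       ('a3', 2, ('r1', 'r2')), ('a4', 2, ('r3', 'r4'))]
print(greedy_bind(ops, adders=['adder1', 'adder2']))
# a1 -> adder1 and a2 -> adder2; then a3 reuses adder1 and a4 reuses adder2,
# because their operands are already connected there (zero added mux cost),
# in the spirit of the interconnect-driven choices made in Figure 6.
```

Because the selection order is fixed (earliest time step first), this is a local selection rule; a global variant would instead pick, at each iteration, whichever unbound operation has the cheapest best assignment.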


REFERENCES

1. Barbacci, M.R. Automated Exploration of the Design Space for Register Transfer (RT) Systems. PhD Thesis, Carnegie-Mellon University, 1973.
2. Barbacci, M.R. Instruction Set Processor Specifications (ISPS): The Notation and its Applications. IEEE Transactions on Computers C-30, 1 (January 1981), 24-40.
3. Borriello, G. and Katz, R.H. Synthesis and Optimization of Interface Transducer Logic. In Proceedings of the International Conference on Computer-Aided Design (November 1987), 274-277.
4. Brayton, R.K., Camposano, R., DeMicheli, G., Otten, R.H.J.M., and vanEijndhoven, J. The Yorktown Silicon Compiler. In Silicon Compilation, D.D. Gajski, Ed., Addison-Wesley, Reading, MA, 1988, pp. 204-311.
5. Brewer, F.D. and Gajski, D.D. Knowledge Based Control in Micro-Architecture Design. In Proceedings of the 24th Design Automation Conference, ACM and IEEE, June 1987, pp. 203-209.
6. Davidson, S., Landskov, D., Shriver, B.D., and Mallett, P.W. Some Experiments in Local Microcode Compaction for Horizontal Machines. IEEE Transactions on Computers C-30, 7 (July 1981), 460-477.
7. DeMan, H., Rabaey, J., Six, P., and Claesen, L. Cathedral II: A Silicon Compiler for Digital Signal Processing. IEEE Design and Test 3, 6 (December 1986), 13-25.
8. Girczyc, E.F. Automatic Generation of Microsequenced Data Paths to Realize ADA Circuit Descriptions. PhD Thesis, Carleton University, July 1984.
9. Hafer, L.J. and Parker, A.C. Register-Transfer Level Digital Design Automation: The Allocation Process. In Proceedings of the 15th Design Automation Conference, ACM and IEEE, June 1978, pp. 213-219.
10. Hitchcock, C.Y. and Thomas, D.E. A Method of Automatic Data Path Synthesis. In Proceedings of the 20th Design Automation Conference, ACM and IEEE, June 1983, pp. 484-489.
11. Johnson, S.D. Synthesis of Digital Designs from Recursion Equations. PhD Thesis, Indiana University, 1984. MIT Press.
12. Knapp, D., Granacki, J., and Parker, A.C. An Expert Synthesis System. In Proceedings of the International Conference on Computer-Aided Design, ACM and IEEE, September 1984, pp. 419-424.
13. Kowalski, T.J. An Artificial Intelligence Approach to VLSI Design. Kluwer Academic Publishers, Boston, 1985.
14. Kurdahi, F.J. and Parker, A.C. PLEST: A Program for Area Estimation of VLSI Integrated Circuits. In Proceedings of the 23rd Design Automation Conference, ACM and IEEE, June 1986, pp. 467-473.
15. Kurdahi, F.J. and Parker, A.C. REAL: A Program for REgister ALlocation. In Proceedings of the 24th Design Automation Conference, ACM and IEEE, June 1987, pp. 210-215.
16. McFarland, M.C. The VT: A Database for Automated Digital Design. DRC-01-4-80, Design Research Center, Carnegie-Mellon University, December 1978.
17. McFarland, M.C. Using Bottom-Up Design Techniques in the Synthesis of Digital Hardware from Abstract Behavioral Descriptions. In Proceedings of the 23rd Design Automation Conference, IEEE and ACM, June 1986.
18. McFarland, M.C. and Parker, A.C. An Abstract Model of Behavior for Hardware Descriptions. IEEE Transactions on Computers C-32, 7 (July 1983), 621-636.
19. Nestor, J.A. Specification & Synthesis of Digital Systems with Interfaces. CMUCAD-87-10, Department of Electrical and Computer Engineering, Carnegie-Mellon University, April 1987.
20. Park, N. and Parker, A.C. Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications. IEEE Transactions on Computer-Aided Design of Digital Circuits and Systems 7, 3 (March 1988), 356-370.
21. Parker, A.C., Pizarro, J., and Mlinar, M. MAHA: A Program for Datapath Synthesis. In Proceedings of the 23rd Design Automation Conference, ACM and IEEE, June 1986, pp. 461-466.
22. Paulin, P.G. and Knight, J.P. Force-Directed Scheduling in Automatic Data Path Synthesis. In Proceedings of the 24th Design Automation Conference, ACM and IEEE, June 1987, pp. 195-202.
23. Peng, Z. Synthesis of VLSI Systems with the CAMAD Design Aid. In Proceedings of the 23rd Design Automation Conference, IEEE and ACM, June 1986, pp. 278-284.
24. Rajan, J.V. and Thomas, D.E. Synthesis by Delayed Binding of Decisions. In Proceedings of the 22nd Design Automation Conference, ACM and IEEE, June 1985, pp. 367-373.
25. Rosenstiel, W. and Camposano, R. Synthesizing Circuits from Behavioral Level Specifications. In Proceedings of the 7th International Conference on Computer Hardware Description Languages and their Applications, C. Koomen and T. Moto-oka, Eds., North-Holland, August 1985, pp. 391-402.
26. Snow, E.A., Siewiorek, D.P., and Thomas, D.E. A Technology-Relative Computer-Aided Design System: Abstract Representations, Transformations, and Design Tradeoffs. In Proceedings of the 15th Design Automation Conference, ACM and IEEE, 1978, pp. 220-226.
27. Trickey, H. Flamel: A High-Level Hardware Compiler. IEEE Transactions on Computer-Aided Design CAD-6, 2 (March 1987), 259-269.
28. Tseng, C. and Siewiorek, D.P. Automated Synthesis of Data Paths in Digital Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems CAD-5, 3 (July 1986), 379-395.
29. Zimmermann, G. MDS: The Mimola Design Method. Journal of Digital Systems 4, 3 (1980), 337-369.
