International Journal of Parallel Programming (© 2006)
DOI: 10.1007/s10766-006-0023-0

FEADS: A Framework for Exploring the Application Design Space on Network Processors

Rajani Pai,(1,2) R. Govindarajan(1)

Network processors are designed to handle the inherently parallel nature of network processing applications. However, partitioning and scheduling of application tasks and data allocation to reduce memory contention remain major challenges in realizing the full performance potential of a given network processor. The large variety of processor architectures in use and the increasing complexity of network applications further aggravate the problem. This work proposes a novel framework, called FEADS, for automating the task of application partitioning and scheduling for network processors. FEADS uses the simulated annealing approach to perform design space exploration of application mappings onto processor resources. Further, it uses cyclic and r-periodic scheduling to achieve higher-throughput schedules. To evaluate dynamic performance metrics such as throughput and resource utilization under realistic workloads, FEADS automatically generates a Petri net (PN) which models the application, the architectural resources, the mapping, the constructed schedule, and their interaction. The throughput obtained by schedules constructed by FEADS is comparable to that obtained by manual scheduling for linear task flow graphs; for more complicated task graphs, FEADS' schedules have a throughput up to 2.5 times higher than the manual schedules. Further, static scheduling of tasks results in an increase in throughput by up to 30% compared to an implementation of the same mapping without task scheduling.

KEYWORDS: Cyclic scheduling; design space exploration; network processor; programming model; performance evaluation; Petri nets.

(1) Department of Computer Science and Automation, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India.
(2) To whom correspondence should be addressed. E-mail: rajani@csa.iisc.ernet.in

© 2006 Springer Science+Business Media, Inc.



Fig. 1. Generic network processor architecture: m pipeline stages of n processing elements (PEs) each, with memory attached to each stage through an interconnection network.

application might yield better throughput than a single instance, as will be shown in Section 4.2.2. This is because the initiation interval for the schedule has to be an integral value, and if the minimum initiation interval computed for a given schedule is fractional, it is rounded up to the next integral value. Thus considering n instances, where n is chosen such that the computed minimum initiation interval is an integral value, may improve the overall throughput. This is similar to the r-periodic schedule discussed in Ref. 20.
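The interaction between a fractional minimum initiation interval and the integrality requirement can be sketched as follows. This is only a minimal illustration of the r-periodic idea; the function name and the search bound are our assumptions, not part of the paper's heuristic:

```python
from fractions import Fraction
from math import ceil

def best_periodic_rate(min_ii: Fraction, max_instances: int):
    """Choose the number of task-graph instances r that maximizes
    throughput when the schedule period must be an integer number of
    cycles: r instances scheduled in a period of ceil(r * min_ii)
    cycles give a rate of r / ceil(r * min_ii) packets per cycle."""
    best_r, best_rate = 1, Fraction(1, ceil(min_ii))
    for r in range(2, max_instances + 1):
        rate = Fraction(r, ceil(r * min_ii))
        if rate > best_rate:
            best_r, best_rate = r, rate
    return best_r, best_rate
```

For example, a minimum initiation interval of 7/3 forces a single instance into an integral period of 3 cycles (rate 1/3), whereas scheduling three instances together in a period of 7 cycles recovers the full rate of 3/7.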
Hence the number of instances required to arrive at the optimal throughput needs to be determined.

The above functions, generically referred to as mapping and scheduling of tasks in this work, become increasingly complex as larger network applications are mapped onto newer and different processor families. It is imperative to automate this process so as to explore the design space, choose the most suitable mapping of an application to the given architecture, and find an optimal schedule for that mapping. However, the mapping and scheduling process can only provide a static estimate of (packet) throughput, under the assumption of constant packet arrival. To get a better picture of the performance of scheduling and mapping schemes, it is necessary to obtain dynamic performance metrics, such as packet throughput and the utilization of various resources (e.g., micro-engines, DRAM, and SRAM), under realistic workloads such as packet traffic modeled using a Poisson distribution. Such dynamic performance metrics also reveal bottleneck resources and yield insights for remapping and rescheduling. Hence the performance results obtained from the performance module can be fed back to the scheduler module to further tune the scheduling and mapping process. In order to accomplish this, it is necessary to automate the process of obtaining dynamic performance metrics as well.

In this paper, we propose FEADS, a framework for exploring the application design space on network processors, which can evaluate various alternatives in application partitioning, scheduling, and mapping onto the network processor architecture and choose the best alternative. We address these issues by partitioning the application into tasks based on whether they utilize the processor or memory, thus yielding tasks of finer granularity. The application is represented as an Annotated Directed Acyclic Graph (ADAG) whose nodes represent the fine-grain tasks performed on a packet.
For example, a node in an ADAG may represent the lookup task performed by packet forwarding applications.

The design space of mapping the tasks to resources is explored using simulated annealing.(12) The mapping scheme can take into account various constraints such as code store size, communication options, and their latencies. It explores whether a pipelined mapping, a parallel mapping, or a combination of the two is better. Once a mapping is finalized, the tasks are scheduled using a cyclic scheduling method (software pipelining); we use a variant of the decomposed software pipelining method.(21) Further, we consider unrolling of the task flow graph to obtain better throughput, which is similar to r-periodic scheduling,(20) and we propose a simple heuristic to estimate the number of instances required. To obtain the dynamic performance metrics, we use the Petri net (PN) model developed in Ref. 6 for performance evaluation of network applications on network processors. A salient feature of the PN model is that it models the application, the architecture, the mapping and schedule derived by FEADS, and their interaction in detail. The PN model is automatically generated by our PN Generate tool; the generated PN changes whenever the architecture, application, schedule, or mapping changes. The generated PN is simulated using CNET,(23) a PN simulator tool, from which the performance metrics are obtained.

The throughput obtained by the mapping generated by FEADS for the applications under consideration (IPv4, NAT, and IPSec) is comparable to that obtained by a manual binding of tasks to resources as in Ref. 6. Further, enforcing static scheduling of tasks, as opposed to scheduling packets immediately upon arrival, results in an increase in throughput by 30%.
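The style of simulated-annealing exploration described above can be sketched compactly. The move rule, cooling schedule, and cost-function interface below are our illustrative assumptions, not FEADS's actual implementation:

```python
import math
import random

def anneal_mapping(tasks, resources, cost, steps=2000,
                   t0=1.0, alpha=0.995, seed=0):
    """Simulated-annealing search over task-to-resource mappings.
    `cost(mapping)` returns a value to minimize (e.g., an estimated
    initiation interval plus penalties for violated constraints such
    as code store size)."""
    rng = random.Random(seed)
    mapping = {t: rng.choice(resources) for t in tasks}
    cur = cost(mapping)
    best, best_cost = dict(mapping), cur
    t = t0
    for _ in range(steps):
        task = rng.choice(tasks)              # move: remap one random task
        old = mapping[task]
        mapping[task] = rng.choice(resources)
        new = cost(mapping)
        # always accept improvements; accept regressions with a
        # probability that shrinks as the temperature cools
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_cost:
                best, best_cost = dict(mapping), cur
        else:
            mapping[task] = old               # undo rejected move
        t *= alpha
    return best, best_cost
```

A cost function might, for instance, return the maximum load over all resources, so that minimizing it balances the mapping; the occasional acceptance of worse moves is what lets the search escape local minima that a greedy mapper would get stuck in.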
In the case of IPSec, with code size constraints imposed on the mapping, the mapping and schedule produced by FEADS result in 27% higher throughput than the manual mapping. Further studies with more complicated task graphs showed that our framework yields up to 2.5 times higher throughput than a manual binding under code size constraints. This clearly indicates that, with the growing complexity of network applications and increasing resource constraints, manual binding cannot adequately meet the line speeds required by these applications.

The rest of the paper is organized as follows. First, a discussion of related work is presented. In Section 3, a brief description of the various processors in the IXP family of network processors is presented. Section 4 presents FEADS, a framework which performs task partitioning and scheduling. Section 5 presents the experimental results. Finally, Section 6 concludes the paper.
2. RELATED WORK

Kulkarni et al.(13) studied the performance impact of partitioning on different network processor architectures. An IP forwarding router application was implemented on the IXP1200 and Motorola C-5 network processors. The application was partitioned at the logical cutting points revealed by the application, such as packet receive, header processing, and packet transmit. It was observed that inefficient partitioning reduced throughput by more than 30%, and that localizing computation with respect to the memories it uses increased the available bandwidth on internal buses by a factor of two.

Several methodologies for the design space exploration of network processors have also been proposed. Specifically, Blickle et al.(1) and Thiele et al.(19) use a genetic algorithm to explore the design space of network processors.
Given an application, a flow characterization, and the available resources, they bind and schedule the tasks of the application onto the resources. The objective function considers the cost of implementing the given mapping in hardware. Once a set of optimal points is obtained, simulation can be performed to select the best architectural configuration for the given application and flow characterization. Their work looks at the problem from an architectural design perspective, as reflected by their objective function, while we approach it from an application design viewpoint. Moreover, common features of network processors, such as the availability of multiple threads on a processing engine, are not considered by them while binding tasks to resources.
They also do not accommodate the fact that some tasks can be bound to more than one resource at a given time, such as a processor thread and a memory element, as is the case with memory-bound tasks. Further, they have used an evolutionary approach to explore the solution space, while we use simulated annealing, since we found that it covers the solution space more effectively and arrives at a better solution.

Weng and Wolf(22) construct an ADAG of the application using runtime traces and use this to perform design space exploration. An ADAG is a directed acyclic graph whose nodes are annotated with their processing and memory requirements, while the edges are annotated with the number of bytes of communication between the nodes connected by the edge. They use a randomized algorithm to perform the mapping of nodes to processors and memories. The system throughput is determined by modeling the processing time for each processing element in the system based on its workload, the memory contention on each memory interface, and the communication overhead between pipeline stages. Memory contention is modeled using a queuing network approach. At the end of the mapping process, the best overall mapping is reported.
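The ADAG representation just described (nodes annotated with processing and memory requirements, edges annotated with bytes of communication) can be captured in a small data structure. The class and field names here are ours, chosen for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ADAGNode:
    """A fine-grain task, annotated with its resource demands."""
    name: str
    processing_cycles: int      # processing requirement
    memory_accesses: int        # memory requirement

@dataclass
class ADAG:
    """Annotated directed acyclic graph: each edge carries the number
    of bytes communicated between the tasks it connects."""
    nodes: dict = field(default_factory=dict)   # name -> ADAGNode
    edges: dict = field(default_factory=dict)   # (src, dst) -> bytes

    def add_node(self, node: ADAGNode):
        self.nodes[node.name] = node

    def add_edge(self, src: str, dst: str, nbytes: int):
        assert src in self.nodes and dst in self.nodes
        self.edges[(src, dst)] = nbytes
```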
It should be noted that their model considers a network processor architecture with a fixed binding of processors to memory and communication elements, and hence does not take into account the availability of multiple memory modules for binding even after the processor has been selected. Since we use a simulated annealing approach, we believe that our framework explores the design space more thoroughly than multiple randomized mappings, where each solution point is independent of the others.

Franklin and Datar(4) propose an algorithm which employs a greedy heuristic to schedule tasks derived from multiple application flows on pipelines with an arbitrary number of stages. Tasks may be shared, and different bandwidths may be associated with each of the application flows; however, memory contention is not modeled. They observed that task partitioning results in greater flexibility in assigning tasks to the hardware pipeline, but with a corresponding increase in the complexity of implementing the partitioning and larger inter-task communication costs. Sharing of tasks common to different flows also leads to better throughput, since memory contention is reduced.
Their task graph is partitioned at arbitrary points to achieve load balancing across processors, while in our work we start with a more clearly demarcated set of processing, memory, and communication tasks. Demarcating task boundaries at memory and communication tasks enables us to easily take into account the memory and communication costs and possible thread context switching.

The system of Chen et al.(2) consists of a programming language, a compiler for optimizing network programs using both traditional and specialized optimization techniques, and a runtime system which identifies hot code paths and improves their mapping to processing resources. Tasks from different pipeline stages are aggregated so as to minimize communication cost. Pipeline aggregate duplication is used to improve the throughput of the slowest pipeline aggregate if its throughput is much less than that of the other pipeline aggregates. We have found that scheduling multiple instances of
the application achieves the same result as duplication of individual pipeline stages.

Click(17) and NP-Click(18) provide a programming abstraction for specifying network applications. Click is a software architecture comprising packet processing modules called elements. To build a router configuration, the user connects a collection of elements into a graph, and packets move from element to element along the graph's edges. NP-Click(18) builds on this framework to provide an abstraction of the underlying hardware and data layout that exposes enough architectural detail to write efficient code for that platform, while hiding less essential architectural complexity. In either case, elements have to be rewritten for portability.
Also, the tasks under consideration involve many memory accesses internally, and hence it is difficult to model contention for memory. We have used an ADAG with tasks of finer granularity, which can be generated from an application specified using either of these abstractions.

Ennals et al.(3) describe transformations that can be performed on network processing applications, such as pipelining tasks and merging pipelined tasks. Since their work mainly deals with task partitioning, it is orthogonal to our work.

3. IXP FAMILY OF NETWORK PROCESSORS

Figure 2 shows a block diagram of the Intel IXP2400 network processor,(9) which consists of an Intel XScale core and eight microengines. The Intel XScale core initializes and manages the chip and is used for control plane functions like exception handling. The microengines (MEs) are 32-bit programmable processors specialized for network processing, which handle the data plane processing for the packets.
Each ME supports eight threads and allows zero-overhead context switching.

Fig. 2. Block diagram of IXP2400 (9).


Pai and Govindarajan

The memory architecture of the IXP2400 consists of SRAM, DRAM, scratch memory, local memory, and next neighbor registers. Typically, packets are stored in DRAM, while the SRAM stores route lookup tables. 16 KB of low-latency scratchpad memory is also provided, which along with SRAM can be used for communication between MEs. There are separate SRAM and DRAM controllers for interfacing with the external memories. Next neighbor (NN) registers can be used for communication between adjacent MEs; this communication is unidirectional, as shown in Fig. 2. Additionally, each ME contains 640 words of local memory, which can be used for communication between hardware contexts. Specialized hardware like the hash and crypto units is also available on processors in the IXP family.

As can be seen, there is a wide variety of options for mapping communication and processing tasks.
Moreover, the resource used for communication, and hence the communication latency, is constrained by the processors onto which adjacent tasks in the task flow graph are bound. The size of the resulting solution space implies that the mapping of tasks to processors and memory modules should be automated.

Other processors in the IXP family include the IXP1200, the IXP2800, and the IXP2850. These processors have a similar structure, but the number of MEs, the memory controllers, and the specialized hardware differ. The IXP1200 consists of six MEs with four threads per ME (8). It has only one SRAM controller, one DRAM controller, and no next neighbor registers for communication.
The IXP2800, on the other hand, has 16 MEs distributed over two clusters, four SRAM controllers, and three DRAM controllers (10). Other features, however, remain identical to the IXP2400. The IXP2850 is similar to the IXP2800, but additionally contains two crypto units which implement cryptographic algorithms in hardware (11).

Due to these dissimilarities in the architecture of processors in the IXP family, the binding done for one processor might not lead to an optimal solution for the others. Task partitioning and performance evaluation techniques therefore need to consider the various resources available on each processor in order to utilize them effectively.

4. FRAMEWORK FOR TASK SCHEDULING AND PARTITIONING

Figure 3 depicts the design of FEADS, a framework for task scheduling and performance evaluation for a given architectural specification. The different components of the framework are described in detail below.


Fig. 3. Proposed design for the framework: the architectural specification and the network application code feed the ADAG generation module, the scheduler module, and the dynamic performance evaluation module of FEADS.

4.1. ADAG Generation Module

The application to be scheduled can be specified using a framework like Click or NP-Click. An ADAG for this task flow graph is generated, along with various constraints, such as the resources that the tasks can be mapped onto and the instruction store requirements of the tasks. Task nodes in the ADAG are explicitly classified into processing, memory, communication, or application-specific tasks meant for the hash or crypto units. The nodes of the ADAG are considered atomic and cannot be split further.
Based on the type of node, it is annotated either with its processing requirements in terms of the number of cycles, or with its memory requirements in terms of the number of bytes to be transferred. Memory nodes are also annotated with the type of memory, SRAM or DRAM, that they need to access. Similar constraints can also be imposed on the binding of other nodes. Since we use nodes of finer granularity than those provided by Click or NP-Click, we model the memory contention and communication costs with greater accuracy and hence obtain a better schedule.

An example ADAG is shown in Fig. 4. Processing nodes are denoted by P, memory nodes by M, and communication nodes by C. Communication nodes are annotated with the number of bytes of communication needed by the two adjacent nodes. For example, the communication from node a to node b requires 32 bytes and is represented by node e in Fig. 4.
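The node annotations above can be captured in a small data structure. The following sketch is illustrative only (the `AdagNode` type and its field names are our own, not FEADS' internal format); it encodes the nodes of Fig. 4:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AdagNode:
    name: str
    kind: str                      # 'P' (processing), 'M' (memory), 'C' (communication)
    cycles: int = 0                # processing requirement, for 'P' nodes
    nbytes: int = 0                # bytes transferred, for 'M' and 'C' nodes
    binding: Optional[str] = None  # binding constraint, e.g. 'ME', 'SRAM', 'DRAM'
    succs: List[str] = field(default_factory=list)

# Fig. 4: a 10-cycle processing node a bound to an ME, 32-byte
# communication nodes e and f, and memory nodes b (SRAM) and c (DRAM).
adag = {
    'a': AdagNode('a', 'P', cycles=10, binding='ME', succs=['e', 'f']),
    'e': AdagNode('e', 'C', nbytes=32, succs=['b']),
    'f': AdagNode('f', 'C', nbytes=32, succs=['c']),
    'b': AdagNode('b', 'M', nbytes=32, binding='SRAM'),
    'c': AdagNode('c', 'M', nbytes=32, binding='DRAM'),
}
```

Such an explicit per-node annotation is what lets the scheduler account for memory contention and communication costs separately.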
The annotations for node a indicate that it requires 10 clock cycles for execution and should be bound to an ME. Similarly, node b transfers 32 bytes of data and should be bound to SRAM (S), while node c should be bound to DRAM (D).

In this work, we do not concentrate on the ADAG generation module. We assume that an ADAG representation of the application forms an input to FEADS. For the example network applications considered in this paper, viz., IP forwarding (IPv4), Network Address Translation (NAT), and IP Security (IPsec) (refer to Section 5.1 for a detailed discussion on


these applications), we have shown the ADAGs in Fig. 5. The generation of the ADAG for an application is left for future work.

Fig. 4. Nodes in an ADAG.

4.2. Scheduler Module

The ADAG and the architectural specification serve as inputs to the scheduler, which maps ADAG nodes onto the appropriate resources. The architecture of the processor on which the application is to be scheduled is specified as pairs consisting of a resource type and the number of resources of that type available. The overall design and implementation of the scheduler module is shown in Fig. 6. The outer loop estimates the number of instances which would result in optimal per-instance throughput and performs a mapping and scheduling of the estimated number of instances using simulated annealing; details are discussed in Section 4.2.2. PN_Generate is part of the Dynamic Performance Evaluation module.
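As a concrete illustration of such a specification, the IXP2400 of Section 3 could be described by the following (resource type, count) pairs; the dictionary form and the names are assumptions for illustration, not FEADS' actual input syntax:

```python
# IXP2400: eight MEs, two SRAM controllers, one DRAM controller, a hash unit.
ixp2400_spec = {
    'ME': 8,
    'SRAM': 2,   # SRAM controllers
    'DRAM': 1,   # DRAM controller
    'HASH': 1,
}

def num_resources(spec, rtype):
    """Number of available resources of a given type (nR_f below)."""
    return spec.get(rtype, 0)
```

Describing other IXP family members (e.g. the IXP2800 with 16 MEs and four SRAM controllers) then only requires changing the counts.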
The inner loop uses simulated annealing to generate a mapping for the given ADAG. For each mapping, the tasks are scheduled and the schedule length is obtained. We use schedule length as the static performance metric for comparing different mappings. The simulated annealing algorithm and the other modules are described in the following subsections.

4.2.1. Task Mapping

Simulated annealing (12) is used to explore the task mapping space. Mapping is done using Algorithm 1. The schedule length for a mapping is


used as the objective function. Starting with an initial binding, successive mappings for a given temperature are incrementally derived. The incremental change in the mapping of a task to a given resource is a function of the temperature under consideration. At higher temperatures, there can be larger changes in the binding of tasks to resources, in terms of the distance from the previous mapping. As the temperature decreases, the amount of perturbation allowed also reduces.

Fig. 5. ADAGs for IPv4, NAT, and IPSec (P: processor, S: SRAM, D: DRAM, H: hash, C: crypto).
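A minimal sketch of such a temperature-scaled perturbation is shown below. It assumes a binding maps each task to a (resource type, instance index) pair and simply rebinds a temperature-dependent number of tasks; the `perturb` function and its move rule are illustrative assumptions, not the exact move function used by FEADS.

```python
import random

def perturb(binding, spec, temp, initial_temp):
    """Derive a successor mapping: the number of tasks moved (the 'distance'
    from the previous mapping) shrinks as the temperature drops."""
    new_binding = dict(binding)
    # Fraction of tasks allowed to move scales with the current temperature.
    n_moves = max(1, int(len(binding) * temp / initial_temp))
    for task in random.sample(list(binding), n_moves):
        rtype, _ = binding[task]
        # Rebind the task to another instance of a legal resource type.
        new_binding[task] = (rtype, random.randrange(spec[rtype]))
    return new_binding
```

At the initial temperature every task may be rebound; near the final temperature only a single task moves per step.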
When a mapping changes, some of the bindings previously associated with communication tasks might no longer be valid, due to changes in the bindings of the tasks which they connect, and hence the mapping has to be repaired. For each mapping thus obtained, a schedule is constructed as described in Section 4.2.3. A mapping with a lower schedule length is unconditionally selected for annealing, while mappings with a higher schedule length are selected with probability


$$\exp\left(\frac{-(\mathit{currSchedLen} - \mathit{prevSchedLen})}{k \cdot T}\right),$$

where k is the Boltzmann constant and T the current temperature. After every iteration of the inner loop, the temperature is reduced and the process is repeated. The use of simulated annealing for exploring the design space is acceptable, since the mapping and scheduling need to be done only once, when the processor is programmed.

Fig. 6. Overall design of the scheduler module.

4.2.2. Scheduling Multiple Instances

Mapping a single instance of the application might result in some resources being idle. For example, for the ADAG shown in Fig. 7(a), a single instance results in the schedule shown in Fig. 7(b), with a throughput of 0.5. However, unrolling the ADAG multiple times and cyclically scheduling the unrolled ADAG can result in higher throughput. Scheduling two and three instances of the ADAG results in an II (Initiation Interval) of


Algorithm 1. Mapping using Simulated Annealing

    prevBinding = GenerateInitialBinding()
    prevSchedLen = schedule(prevBinding)
    for T = initialTemp downto finalTemp do
        for j = 1 to N do
            newBinding = f(prevBinding, T)
            newSchedLen = schedule(newBinding)
            if newSchedLen < prevSchedLen or
               random(0, 1) < exp(-(newSchedLen - prevSchedLen) / (k * T)) then
                prevBinding = newBinding
                prevSchedLen = newSchedLen
            end if
        end for
    end for

three and four, respectively, which corresponds to a throughput of 2/3 = 0.67 and 3/4 = 0.75 packets per cycle, respectively, as shown in Fig.
7(c) and (d). A further increase in the number of instances does not yield any improvement in throughput, since all the PEs are saturated.

Replicating instances of the application is equivalent to loop unrolling and results in better resource utilization and per-instance throughput. Since incoming packets are independent of each other, only resource dependencies need to be considered while scheduling instances of tasks across loops. The minimum initiation interval (MII), governed only by resource dependencies for i instances of the ADAG, is given by

$$\mathrm{MII} = \max_{f}\left\lceil \frac{i \cdot nT_f \cdot lat_f}{nR_f} \right\rceil,$$

where f is the resource type, nT_f the number of tasks bound to resource f, and lat_f the latency of resource f. The throughput of a schedule with such an II is i/II, and the average II per instance is II/i. Since we want to obtain maximum throughput with a minimum amount of unrolling, the lower bound for the II of the unrolled and pipelined loop schedule is computed as

$$\min_{i}\left(\frac{\max_{f}\left\lceil i \cdot nT_f \cdot lat_f / nR_f \right\rceil}{i}\right), \qquad (1)$$


where i is the number of instances, f the resource type, nT_f the number of tasks bound to resource f, lat_f the latency of resource f, and nR_f the number of resources of type f.

Fig. 7. Scheduling multiple instances of an ADAG: (a) the ADAG; (b) scheduling a single instance (II = 2, throughput 1/2 = 0.5); (c) scheduling two instances (II = 3, throughput 2/3 = 0.67); (d) scheduling three instances (II = 4, throughput 3/4 = 0.75).

We use the concept of r-periodic scheduling (20) to determine the optimal number of application instances to be scheduled. In this scheme, the schedules of two consecutive iterations of a loop may not be the same, but will repeat every r iterations. In our case, r determines the number of


instances of the application to be scheduled. When multiple instances of the application are to be scheduled, the ADAG is replicated, and the nodes are mapped onto the resources and scheduled.

From Eq. (1) it can be seen that multiples of the count of the resource with the largest occupancy yield the best theoretical lower bound, and hence these can be used as starting points for further exploration in our design.
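The lower bound of Eq. (1) and the search for a good unroll factor can be sketched as follows; here `tasks` aggregates, per resource type, the number of bound tasks and the resource latency (the function and variable names are ours, for illustration):

```python
from math import ceil

def mii(i, tasks, resources):
    """Resource-constrained MII for i unrolled instances: the maximum over
    resource types f of ceil(i * nT_f * lat_f / nR_f), the inner term of Eq. (1).
    tasks: {f: (nT_f, lat_f)}; resources: {f: nR_f}."""
    return max(ceil(i * nT * lat / resources[f]) for f, (nT, lat) in tasks.items())

def best_instance_count(tasks, resources, max_i=8):
    """Unroll factor i minimising the per-instance II, i.e. MII(i) / i."""
    return min(range(1, max_i + 1), key=lambda i: mii(i, tasks, resources) / i)

# Five unit-latency tasks bound to PEs, with three PEs available: the best
# unroll factor is a multiple of the PE count, here 3 (MII(3)/3 = 5/3).
assert best_instance_count({'PE': (5, 1)}, {'PE': 3}) == 3
```

This illustrates the observation above: the ceiling only divides evenly, and hence the bound is only tight, when i is a multiple of the count of the busiest resource.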
It might not always be possible to obtain a schedule with an II equal to the computed lower bound, because our mapping might result in some additional communication tasks being bound to some of these resources. For example, if tasks X and Y operate on some common data but are bound to two different MEs, a communication task with additional latency is generated as part of the mapping. Also, since some tasks might be bound to more than one resource, the actual schedule length might differ from the computed one.

4.2.3. Task Scheduling

For each mapping, FEADS constructs a valid schedule. Since cyclic or software pipelined schedules (14) yield better throughput than block schedules, FEADS considers only cyclic schedules. In cyclic scheduling, successive iterations of the ADAG are overlapped. A specific form of cyclic scheduling, modulo scheduling, schedules different instances of a node, corresponding to different iterations, exactly II cycles apart. The throughput achieved by such a schedule is 1/II; hence cyclic scheduling attempts to minimize the II. Consider the ADAG in Fig.
7(a), with the available resources being three processing engines (PEs) and one memory element. Assume the PEs are non-pipelined. Nodes x and z require two cycles each for execution, while node y takes one cycle. Cyclic scheduling results in the schedule shown in Fig. 7(b). The throughput of this schedule is 1/2 packets per cycle. Note that in this schedule, instances corresponding to three iterations overlap within an II.

To construct an r-periodic cyclic schedule for a given II (computed using Eq. (1)) and the mapping obtained using simulated annealing, we use the decomposed software pipelining algorithm (21). We have used the First Row Last Column (FRLC) algorithm for decomposed software pipelining in our implementation. Here the row numbers denote the cycles of execution and the column numbers denote the iterations. Since there are no loop-carried dependences in our graph, the set of Strongly Connected Components (SCCs) is empty.
We generate the new Loop Data Dependence Graph (LDDG) by removing all edges which do not connect the SCCs, as per the algorithm. We then perform list scheduling (7) on this newly generated LDDG. The row number for each task is determined by the cycle at which


it starts execution. The II for the pipelined loop is given by the number of rows as computed by the FRLC algorithm.

4.3. Performance Evaluation Module

A dynamic performance evaluation technique is incorporated into FEADS to study the performance of the statically generated schedule under realistic workloads. A PN model of the application mapping and schedule on the given processor is generated for dynamic performance evaluation.
This is similar to the technique used by Weng and Wolf (22), wherein a performance model of the architecture under consideration is used to compute the throughput of a given binding using a queuing-network approach.

A brief description of PN models in general, and of their use in modeling applications mapped onto network processors in particular, is given in the following subsections. We also describe the scheme used by our framework to automate the generation of these models for use by the CNET simulator (23) for PNs.

4.3.1. The Petri Net Model

The PN is a mathematical modeling tool which is commonly used to model concurrency and conflict in systems. In its graphical representation, circles represent places and boxes represent timed transitions. Timed Petri nets (TPNs) are an extension of PNs in which a finite firing time is associated with transitions. TPNs have been used in modeling multithreaded processors.
The main advantage of using TPNs for processor modeling is the added ability to capture the latency of operations such as memory access and processor execution at a high level of abstraction.

The PN model generated by our framework models the architecture as well as the application in great detail. An example model of a single microengine in IXP2400 running the IPv4 application is shown in Fig. 8. For clarity, only the part of the model which captures the flow of packets from the external link to DRAM through the MAC is shown. The firing time of a timed transition is assumed to be deterministic. The places UE, THREAD, UE_CMD_Q, DRAM_Q, and CMD_BUS represent the various resources available and thus model the processor architecture, while the timed transitions UE_PROCESSING and RFIFO_DRAM represent specific tasks. The time taken by these transitions models the time taken by the corresponding tasks in the specific unit.
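As a rough illustration of these acquire/fire semantics, the token game of a timed transition can be sketched in Python. This is not the CNET model itself; the place names follow Fig. 8, but the token counts and the 10-cycle latency are assumptions for illustration:

```python
# Minimal timed-Petri-net "token game": a transition starts when every input
# place holds a token (acquiring those resources), and after its deterministic
# firing time it deposits tokens in its output places (releasing resources).
# Place names mirror Fig. 8; token counts and latency are illustrative.
import heapq

places = {"UE": 1, "THREAD": 4, "UE_RFIFO": 1, "MEM_R1": 0}

transitions = {
    # name: (firing time in cycles, input places, output places)
    "UE_PROCESSING": (10, ("UE", "UE_RFIFO"), ("UE", "MEM_R1")),
}

def run():
    clock, pending, fired = 0, [], []
    while True:
        for name, (delay, ins, outs) in transitions.items():
            if all(places[p] > 0 for p in ins):
                for p in ins:                      # acquire resources at start
                    places[p] -= 1
                heapq.heappush(pending, (clock + delay, name, outs))
        if not pending:
            return fired
        clock, name, outs = heapq.heappop(pending)
        for p in outs:                             # release resources on firing
            places[p] += 1
        fired.append((clock, name))
```

With one token in UE and one in UE_RFIFO, UE_PROCESSING fires once, completing at cycle 10 and depositing a token in MEM_R1.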
Thus the PN model is able to capture the processor architecture, the applications, and their interaction in detail.

Fig. 8. PN model for a single microengine in IXP2400 running IPv4.

A more detailed explanation of the model is given in Ref. (6). The PN models for NAT and IPSec are given in Fig. 9.

4.3.2. The PN Generate Tool

The dynamic performance of the mapping and schedule generated by the scheduler module is evaluated using the PN model discussed in the


Fig. 9. PN models for a single ME in IXP2400 running NAT and IPSec.


previous subsection. This model is simulated using the CNET simulator for PNs. (23) We generate the PN model of the application mapping onto the processor, and this forms the input to the CNET simulator. We traverse the ADAG and generate transitions for each node. Resources to which a task is bound are acquired at the beginning of a timed transition and released when the transition fires.
Dummy transitions are generated between two places to pause between transitions and hence impose the computed static schedule on the model.

For the example shown in Fig. 8, if the task UE_PROCESSING takes 10 cycles for execution, its transition for the CNET simulator is generated as

UE_PROCESSING : D * 10 {UE, UE_RFIFO / MEM_R1},

where D indicates a deterministic firing time. Terms to the left of the / correspond to input places and represent resources that the transition needs for execution; those to the right correspond to output places and represent resources released by the task upon completion. In generating the PN, the tool makes use of the mapping generated by the simulated annealing method, and an appropriate resource is used as an input place.
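The transition-generation step can be sketched as follows. The output syntax mirrors the example above; the task list, the second task's latency, and the choice of places are illustrative assumptions, not the actual PN Generate tool:

```python
# Emit CNET-style transition lines of the form
#   NAME : D * <cycles> {in_places / out_places}
# from a task -> (latency, input places, output places) list, as produced
# by the binder. The latency of RFIFO_DRAM here is an assumed value.
def emit_transition(name, cycles, in_places, out_places):
    return "{} : D * {} {{{} / {}}}".format(
        name, cycles, ", ".join(in_places), ", ".join(out_places))

def generate_pn(adag):
    """adag: list of (task, cycles, inputs, outputs) in topological order."""
    return [emit_transition(*node) for node in adag]

lines = generate_pn([
    ("UE_PROCESSING", 10, ["UE", "UE_RFIFO"], ["MEM_R1"]),
    ("RFIFO_DRAM", 54, ["DRAM_Q", "CMD_BUS"], ["DRAM"]),
])
```

Each ADAG node thus becomes one timed transition whose input places are the resources assigned to it by the binding.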
Similarly, the number of instances and the cyclic schedule generated by our scheduling method are used by the PN Generate tool.

The binding, scheduling, and verification stages have been made generic enough to allow FEADS to be used for application mapping on other processors in the IXP family. Only small modifications to the architectural description file need to be made, specifying the change in the number and type of available resources.

5. RESULTS

In this section we first present our experimental setup, including a brief description of the application programs considered in our study. The subsequent sections discuss the performance results of FEADS.

5.1. Experimental Setup

We evaluated the schedules generated by FEADS for three network processing applications — IPv4, IPSec, and NAT.
A brief description of each application and the sizes of their ADAGs are given below.

IPv4: IP forwarding is a fundamental operation performed by a router. IPv4 uses the header of the packet to determine the destination address. A lookup is performed based on the destination address in


Table I. Time Taken by FEADS for Different Benchmarks

Application    Exec. Time (in Sec.)
IPv4           24.7
NAT            33.0
IPSec          30.9

the IP to determine the destination port number and the next hop address. The lookup table is stored in the SRAM. The time-to-live field in the IP header is decremented and the cyclic redundancy checksum (CRC) is recomputed. The packet is then forwarded to the next hop. The ADAG in this case consists of 23 nodes, five of which are memory nodes.

NAT: NAT is a method by which many network addresses and their TCP/UDP ports are translated into a single network address and its TCP/UDP ports. A translation table stores the corresponding translation from the private IP address and port number to the globally visible router IP address and a unique port number. The translation table is stored in SRAM.
The ADAG for NAT consists of 31 nodes, with seven memory nodes.

IPSec: The IPSec protocol is used to provide privacy and authentication services at the IP layer. IPSec supports two protocols depending on the level of security. A 32-bit connection identifier, referred to as a Security Parameter Index (SPI), is associated with the shared key and the algorithm used for encryption. We assume that the SPI is stored in SRAM. The ADAG for IPSec consists of 29 nodes, six of which are memory nodes.

We consider the ADAGs for these applications (IPv4, IPSec, and NAT) as input to FEADS. These ADAGs are shown in Fig. 5. The sizes of the ADAGs themselves vary from seven to 11 nodes, excluding the communication nodes which may be introduced during mapping.
Due to the small sizes of the ADAGs under consideration, converging to a solution using simulated annealing takes less than 30 s of runtime on a standard PC (a 3 GHz Pentium 4 with 1 GB RAM), even while considering multiple instances of the ADAG. The execution time of the FEADS framework for the different benchmarks is shown in Table I.

We do not consider other benchmark applications for the following two reasons. First, the ADAGs for other applications are not readily available and, at this point, we do not have a tool to generate them automatically. Second, most network processing applications show very similar behavior, at least within the class of header processing and payload


processing applications, in terms of their program flow, access to SRAM and DRAM memory, application-specific units, etc. Hence we consider the ADAGs for IPv4, NAT, and IPSec as representative of commonly used network processing applications.

IPv4 and NAT were scheduled on an IXP2400 processor, while IPSec was scheduled on an IXP2850 since it uses crypto units for packet processing. The clock speed of the IXP2400 is 600 MHz and that of the IXP2850 is 1400 MHz. In all cases, we assume packet arrival is exponentially distributed^3 with a mean arrival rate corresponding to a 5 Gbps line for IPv4 and NAT, and a 10 Gbps line for IPSec.
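For concreteness, the mean interarrival time implied by such a line rate can be computed as below. This is a sketch, assuming the 64-byte minimum-size packets used throughout the experiments:

```python
# Mean interarrival time of minimum-size (64-byte) packets at a given line
# rate, plus exponentially distributed (Poisson-arrival) samples around it.
import random

def mean_interarrival_ns(line_rate_bps, packet_bytes=64):
    packets_per_sec = line_rate_bps / (packet_bytes * 8)
    return 1e9 / packets_per_sec

def sample_interarrivals(line_rate_bps, n, seed=0):
    mean_ns = mean_interarrival_ns(line_rate_bps)
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_ns) for _ in range(n)]
```

At 5 Gbps, a 64-byte packet occupies 512 bits on the wire, giving one packet every 102.4 ns on average.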
In all experiments we use minimum-size packets (64 bytes), as is customary in the evaluation of network processors, (2) since this corresponds to the performance of the system under denial-of-service (DoS) attacks. (5)

The results for the various scenarios that we evaluated are presented in the following subsections.

5.2. Naive versus FEADS Mapping

First, we compare FEADS with a naive binding scheme in which all tasks for a given instance of the application are bound to a single ME. Multiple instances are bound to different MEs in a round-robin fashion. This results in an even distribution of tasks across microengines with no communication cost among the tasks. This is similar to the task assignment policy followed in Ref. (6). The tasks were scheduled using decomposed software pipelining. Note that although we call this scheme Naive, only the binding is naive; the scheduling of tasks (within an ME) is still software pipelined.
Thus a comparison of FEADS with the Naive scheme brings out the benefits of the task binding performed by FEADS.

As the number of instances is increased, the throughput for both FEADS and naive mapping improves up to a certain point (refer to Fig. 10). Beyond this, the throughput fluctuates as the number of instances increases. This happens due to the ceil function described in Eq. 1. We see that the FEADS mapping achieves a higher throughput (7% in IPv4, 4% in NAT, and 9% in IPSec) compared to naive mapping. When fewer instances of the application are scheduled, FEADS performs much better than the naive mapping, as the latter loads only a few microengines while the others are left idle.
^3 The performance of FEADS under bursty traffic has been evaluated and is reported in Section 5.4.

As the number of instances is increased, both naive and FEADS mapping perform comparably. This is because tasks are


Fig. 10. Comparison of FEADS Mapping with Naive Mapping.


more evenly balanced across all resources and the schedule thus obtained is equal in length to that computed by FEADS.

5.3. Scheduled versus Immediate Execution

We have evaluated the mapping generated by the FEADS framework with and without imposing the static schedule generated by the cyclic scheduler on the PN model. In the former case, arriving packets are held in the receive buffer until it is time to schedule the packet. Scheduling of packets happens at multiples of II time steps. In the case of immediate execution, packets are scheduled as soon as they arrive.
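The difference between the two release policies can be sketched as follows; the arrival times and II value are made up for illustration, and this is not the CNET simulation:

```python
# Release times under the two policies: "immediate" dispatches a packet at
# its arrival time; "scheduled" holds it in the receive buffer until the
# next multiple of the initiation interval (II).
import math

def release_times(arrival_times, ii=None):
    if ii is None:                              # immediate execution
        return list(arrival_times)
    return [math.ceil(t / ii) * ii for t in arrival_times]
```

For example, with II = 10, arrivals at t = 3, 5, 14, 21 are released at t = 10, 10, 20, 30; the holding time between arrival and release is what drives the scheduled policy's larger buffer requirement discussed in the text.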
The PN model generated by the scheduler module was simulated using the CNET (23) simulator. The throughput and buffer requirements (in the RFIFO) in either case were obtained from the model. The results for the two cases are shown in Fig. 11, and the buffer requirements are compared in Fig. 12. Static scheduling of tasks results in 14% higher throughput on average. Specifically, it achieves an improvement of 16% in IPv4, 13% in NAT, and 14% in IPSec over immediate scheduling. Note that the performance improvement is consistent across applications for all instances. Although the buffer requirement of the FEADS schedule is higher than that of the Immediate schedule, the maximum buffer requirement is still less than 64 packets in all cases. This corresponds to a buffer size of 4 KB, which is quite common. When we consider packets larger than 64 bytes at the same line rate, the interarrival time between packets is higher and hence the buffer requirements will be smaller.

5.4. Performance of FEADS under Bursty Traffic

Further, we evaluated the performance of FEADS for bursty traffic, wherein packets on the line arrive in bursts separated by periods of idle time. Specifically, the bursty traffic is modeled as an aggregate of different substreams. (15) Each substream is assumed to be of constant packet size with finite ON/OFF periods. During the ON time, each substream generates packets of constant size at a specified line rate, and it restarts packet generation after the OFF period. The interarrival time of packets in the ON period within a substream is equal to (packet size)/(line rate). The ON and OFF times of each substream are Pareto distributed with a probability distribution function f(x) given by

f(x) = α β^α / x^(α+1),    (2)


Fig. 11. Comparison of scheduled versus immediate execution.


Fig. 12. Buffer requirements for scheduled versus immediate execution.

where α represents the shape parameter and β represents the scale parameter. The shape parameter α takes a value between 1 and 2. The scale parameter β is the ON/OFF time. The total line rate of the bursty traffic can be controlled through the line rates of the individual substreams and the ON periods.

It has been shown that traffic generated using the above methodology is bursty and similar to the traffic commonly encountered by routers. (15) The above traffic generator is modeled using a PN. The resultant traffic generated is used as the input to the network processor.
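A sketch of one such ON/OFF substream generator, using inverse-CDF sampling of the Pareto density in Eq. (2); all parameter values here are illustrative assumptions:

```python
# ON/OFF substream: Pareto-distributed ON and OFF durations; during an ON
# burst, packets of constant size arrive with a fixed inter-packet gap set
# by the substream's line rate. Default alpha/beta values are illustrative.
import random

def pareto_sample(rng, alpha, beta):
    # Inverse-CDF sampling for f(x) = alpha * beta**alpha / x**(alpha + 1),
    # i.e. F(x) = 1 - (beta / x)**alpha for x >= beta.
    return beta / rng.random() ** (1.0 / alpha)

def substream_arrivals(horizon_ns, gap_ns, alpha=1.5,
                       on_beta=1000.0, off_beta=1000.0, seed=0):
    """Packet arrival times (ns) up to horizon_ns for one substream."""
    rng, t, arrivals = random.Random(seed), 0.0, []
    while t < horizon_ns:
        on_end = t + pareto_sample(rng, alpha, on_beta)   # ON burst
        while t < min(on_end, horizon_ns):
            arrivals.append(t)
            t += gap_ns
        t = on_end + pareto_sample(rng, alpha, off_beta)  # OFF gap
    return arrivals
```

Aggregating several such substreams, each with its own line rate and ON/OFF parameters, yields the bursty input traffic; every Pareto sample is at least β, as Eq. (2) requires.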
The performance results under bursty traffic are shown in Fig. 13. We observe a trend in performance similar to the earlier case. Specifically, FEADS achieves an improvement of 26% in IPv4, 21% in NAT, and 17% in IPSec on average.

5.5. Imposing Code Size Constraints

Next we study the effect of code size constraints on the binding of tasks to resources. Imposing code size constraints precludes mapping the entire application onto a single microengine. Thus, in this case, a naive binding might not result in an even distribution of tasks. The throughput obtained by the naive binding scheme as compared to FEADS is shown in Fig. 14 for a code store constraint of 50 instruction words per ME. We observe that FEADS results in an increase in throughput of 53% in the best case (for IPSec) and 12% on average over all applications under consideration.
IPSec consists of a larger number of processing tasks than the other applications, and hence a naive binding does not exploit the available microengines and threads effectively.


Fig. 13. Throughput with bursty traffic.


Fig. 14. Throughput with code size constraints.


5.6. Performance of Non-linear ADAGs

The ADAGs for the applications considered thus far are linear in nature. With such linear ADAGs, performing manual mapping is straightforward (with or without code constraints) and often results in the same mapping as FEADS. We model more complex task graphs to better reflect the increasing complexity of the network applications being implemented on network processors. We compared the performance of our scheme for two non-linear ADAGs, FEADS_input1 and FEADS_input2. The ADAG for FEADS_input1 was constructed using two parallel branches, one branch each from IPv4 and NAT.
The throughputs obtained by the naive scheme and by FEADS are shown in Fig. 15. In the case of FEADS_input2, the throughput obtained by FEADS is up to 2.5 times higher than that of the naive scheme. We also observe that the performance improvement is consistent across all numbers of instances. These results clearly indicate that, with the increasing complexity of network applications, manual partitioning and binding of tasks is no longer sufficient to achieve the necessary performance.

Fig. 15. Throughput for non-linear ADAGs: (a) FEADS_input1, (b) FEADS_input2.
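Why mapping quality dominates throughput for such graphs can be seen from a first-order model: in steady state, pipeline throughput is bounded by the most heavily loaded processing element. The sketch below, with invented cycle costs and hypothetical mappings (not measured IXP figures), contrasts a naive assignment with a more balanced one:

```python
# Sketch: steady-state throughput bound of a task-to-PE mapping.
# Cycle costs and mappings are invented for illustration.

def throughput_bound(task_cost, mapping):
    """Pipeline throughput is limited by the busiest processing element:
    1 / (total cycles that PE spends per packet)."""
    load = {}
    for task, pe in mapping.items():
        load[pe] = load.get(pe, 0) + task_cost[task]
    return 1.0 / max(load.values())

cost = {"rx": 50, "ipv4_lookup": 120, "nat_translate": 100, "tx": 50}

# A naive mapping packs tasks onto PEs in graph order, ignoring the
# parallel branches; a balanced mapping separates the two heavy branches.
naive = {"rx": 0, "ipv4_lookup": 0, "nat_translate": 1, "tx": 1}
balanced = {"rx": 0, "ipv4_lookup": 1, "nat_translate": 2, "tx": 0}

print(throughput_bound(cost, naive))     # limited by PE0: 1/170
print(throughput_bound(cost, balanced))  # limited by PE1: 1/120
```

Even in this toy setting the balanced mapping is about 1.4x faster; with more branches, more PEs, and memory contention, finding such a balance by hand becomes impractical, which is the gap FEADS' automated search targets.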


5.7. Summary

From the above results, we observe that static task scheduling is beneficial and should be enforced when buffer requirements can be met. While manual binding of tasks to resources can produce a fairly good solution for linear graphs when no additional code-size constraints are imposed, it becomes increasingly difficult when either of these conditions is not met. FEADS yields a mapping whose throughput is comparable to that obtained by manual partitioning and scheduling for linear task flow graphs, and up to 2.5 times higher for more complicated task graphs; it therefore effectively automates the process of application development on network processors.

6. Conclusions

This paper presents FEADS, a framework for automating the process of task partitioning, mapping and scheduling on network processors. FEADS uses an ADAG with tasks of finer granularity than those proposed by previous approaches. It obtains an optimized mapping and scheduling of tasks onto resources using simulated annealing; task scheduling is done using cyclic and r-periodic scheduling. We find that FEADS generates mappings whose throughput is comparable to that obtained by manual binding for linear ADAGs and up to 2.5 times higher for non-linear ADAGs. Further, imposing a static schedule on the generated mapping results in 14% higher throughput compared to unscheduled execution.
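The simulated-annealing search at the heart of this approach can be illustrated with a minimal sketch. The cost function here (maximum per-PE load) is a simplified stand-in for the actual FEADS objective, which also accounts for the constructed schedule and memory contention; all task costs and annealing parameters below are hypothetical:

```python
# Minimal simulated-annealing sketch for task-to-PE mapping.
# The cost function (max per-PE load) is a simplified stand-in for the
# FEADS objective; task costs and parameters here are hypothetical.
import math
import random

def sa_map(task_cost, n_pes, steps=5000, t0=10.0, alpha=0.999, seed=0):
    rng = random.Random(seed)
    tasks = list(task_cost)
    mapping = {t: rng.randrange(n_pes) for t in tasks}

    def cost(m):
        load = [0] * n_pes
        for t, pe in m.items():
            load[pe] += task_cost[t]
        return max(load)  # throughput is limited by the busiest PE

    cur = cost(mapping)
    temp = t0
    for _ in range(steps):
        # Neighbour move: reassign one random task to a random PE.
        t = rng.choice(tasks)
        old = mapping[t]
        mapping[t] = rng.randrange(n_pes)
        new = cost(mapping)
        # Always accept improvements; accept worse moves with
        # Boltzmann probability exp(-delta / temperature).
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
        else:
            mapping[t] = old  # reject: undo the move
        temp *= alpha  # geometric cooling schedule
    return mapping, cur

costs = {"rx": 50, "classify": 120, "nat": 100, "tx": 50}
mapping, worst = sa_map(costs, n_pes=3)
print(mapping, worst)
```

With three PEs and these costs, the best achievable maximum load is 120 cycles (the heavy task alone on one PE); on such a tiny state space the annealing loop typically reaches it, though simulated annealing offers no optimality guarantee in general.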
FEADS therefore provides an effective solution for managing the growing complexity of designing network applications that meet line speeds.

Automating the ADAG generation module from a given application specification is left as future work. The framework can also be extended to the mapping, scheduling and performance evaluation of applications for other network processor families.

7. ACKNOWLEDGMENTS

We acknowledge the Consortium for Embedded and Internetworking Technologies (CEINT) and Arizona State University, Tempe, USA for the partial support provided through a research subcontract. We are thankful to Govind Shenoy for his help with the PN model, and to the members of the High Performance Computing laboratory for their useful suggestions and comments. Lastly, we are grateful to the reviewers for their useful comments.




