International Journal of Parallel Programming (© 2006)
DOI: 10.1007/s10766-006-0023-0

FEADS: A Framework for Exploring the Application Design Space on Network Processors

Rajani Pai,(1,2) R. Govindarajan(1)

Network processors are designed to handle the inherently parallel nature of network processing applications. However, partitioning and scheduling of application tasks and data allocation to reduce memory contention remain major challenges in realizing the full performance potential of a given network processor. The large variety of processor architectures in use and the increasing complexity of network applications further aggravate the problem. This work proposes a novel framework, called FEADS, for automating the task of application partitioning and scheduling for network processors. FEADS uses the simulated annealing approach to perform design space exploration of application mappings onto processor resources. Further, it uses cyclic and r-periodic scheduling to achieve higher-throughput schedules. To evaluate dynamic performance metrics such as throughput and resource utilization under realistic workloads, FEADS automatically generates a Petri net (PN) which models the application, the architectural resources, the mapping, the constructed schedule, and their interaction. The throughput obtained by schedules constructed by FEADS is comparable to that obtained by manual scheduling for linear task flow graphs; for more complicated task graphs, FEADS' schedules have a throughput up to 2.5 times higher than the manual schedules. Further, static scheduling of tasks results in an increase in throughput by up to 30% compared to an implementation of the same mapping without task scheduling.

KEYWORDS: Cyclic scheduling; design space exploration; network processor; programming model; performance evaluation; Petri nets.

(1) Department of Computer Science and Automation, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India.
(2) To whom correspondence should be addressed. E-mail: rajani@csa.iisc.ernet.in

© 2006 Springer Science+Business Media, Inc.



Fig. 1. Generic network processor architecture: m pipeline stages of n processing elements (PEs) each, with memory attached to each stage through an interconnection network.

application might yield better throughput than a single instance, as will be shown in Section 4.2.2. This is because the initiation interval for the schedule has to be an integral value, and if the minimum initiation interval computed for a given schedule is fractional, it is rounded up to the next integral value. Thus considering n instances, where n is chosen such that the computed minimum initiation interval is an integral value, may improve the overall throughput. This is similar to the r-periodic schedule discussed in Ref. 20.
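The interaction between a fractional minimum initiation interval and the integrality requirement can be sketched as follows. This is only a minimal illustration of the r-periodic idea; the function name and the search bound are our assumptions, not part of the paper's heuristic:

```python
from fractions import Fraction
from math import ceil

def best_periodic_rate(min_ii: Fraction, max_instances: int):
    """Choose the number of task-graph instances r that maximizes
    throughput when the schedule period must be an integer number of
    cycles: r instances scheduled in a period of ceil(r * min_ii)
    cycles give a rate of r / ceil(r * min_ii) packets per cycle."""
    best_r, best_rate = 1, Fraction(1, ceil(min_ii))
    for r in range(2, max_instances + 1):
        rate = Fraction(r, ceil(r * min_ii))
        if rate > best_rate:
            best_r, best_rate = r, rate
    return best_r, best_rate
```

For example, a minimum initiation interval of 7/3 forces a single instance into an integral period of 3 cycles (rate 1/3), whereas scheduling three instances together in a period of 7 cycles recovers the full rate of 3/7.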
Hence the number of instances required to arrive at the optimal throughput needs to be determined.

The above functions, generically referred to as mapping and scheduling of tasks in this work, become increasingly complex as larger network applications are mapped onto newer and different processor families. It is imperative to automate this process so as to explore the design space, choose the most suitable mapping of an application to the given architecture, and find an optimal schedule for that mapping. However, the mapping and scheduling process can only provide a static estimate of (packet) throughput, under the assumption of constant packet arrival. To get a better picture of the performance of scheduling and mapping schemes, it is necessary to obtain dynamic performance metrics, such as packet throughput and the utilization of various resources (e.g., micro-engines, DRAM, and SRAM), under realistic workloads such as packet traffic modeled using a Poisson distribution. Such dynamic performance metrics also reveal bottleneck resources and yield insights for remapping and rescheduling. Hence the performance results obtained from the performance module can be fed back to the scheduler module to further tune the scheduling and mapping process. In order to accomplish this, it is necessary to automate the process of obtaining dynamic performance metrics as well.

In this paper, we propose FEADS, a framework for exploring the application design space on network processors, which can evaluate various alternatives in application partitioning, scheduling, and mapping onto the network processor architecture and choose the best alternative. We address these issues by partitioning the application into tasks based on whether they utilize the processor or memory, thus yielding tasks of finer granularity. The application is represented as an Annotated Directed Acyclic Graph (ADAG) whose nodes represent the fine-grain tasks performed on a packet.
For example, a node in an ADAG may represent the lookup task performed by packet forwarding applications.

The design space of mapping the tasks to resources is explored using simulated annealing.(12) The mapping scheme can take into account various constraints such as code store size, communication options, and their latencies. It explores whether a pipelined mapping, a parallel mapping, or a combination of the two is better. Once a mapping is finalized, the tasks are scheduled using a cyclic scheduling method (software pipelining); we use a variant of the decomposed software pipelining method.(21) Further, we consider unrolling of the task flow graph to obtain better throughput, which is similar to r-periodic scheduling,(20) and we propose a simple heuristic to estimate the number of instances required. To obtain the dynamic performance metrics, we use the Petri net (PN) model developed in Ref. 6 for performance evaluation of network applications on network processors. A salient feature of the PN model is that it models the application, the architecture, the mapping and schedule derived by FEADS, and their interaction in detail. The PN model is automatically generated by our PN Generate tool; the generated PN changes whenever the architecture, application, schedule, or mapping changes. The generated PN is simulated using CNET,(23) a PN simulator tool, from which the performance metrics are obtained.

The throughput obtained by the mapping generated by FEADS for the applications under consideration (IPv4, NAT, and IPSec) is comparable to that obtained by a manual binding of tasks to resources as in Ref. 6. Further, enforcing static scheduling of tasks, as opposed to scheduling packets immediately upon arrival, results in an increase in throughput by 30%.
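The style of simulated-annealing exploration described above can be sketched compactly. The move rule, cooling schedule, and cost-function interface below are our illustrative assumptions, not FEADS's actual implementation:

```python
import math
import random

def anneal_mapping(tasks, resources, cost, steps=2000,
                   t0=1.0, alpha=0.995, seed=0):
    """Simulated-annealing search over task-to-resource mappings.
    `cost(mapping)` returns a value to minimize (e.g., an estimated
    initiation interval plus penalties for violated constraints such
    as code store size)."""
    rng = random.Random(seed)
    mapping = {t: rng.choice(resources) for t in tasks}
    cur = cost(mapping)
    best, best_cost = dict(mapping), cur
    t = t0
    for _ in range(steps):
        task = rng.choice(tasks)              # move: remap one random task
        old = mapping[task]
        mapping[task] = rng.choice(resources)
        new = cost(mapping)
        # always accept improvements; accept regressions with a
        # probability that shrinks as the temperature cools
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_cost:
                best, best_cost = dict(mapping), cur
        else:
            mapping[task] = old               # undo rejected move
        t *= alpha
    return best, best_cost
```

A cost function might, for instance, return the maximum load over all resources, so that minimizing it balances the mapping; the occasional acceptance of worse moves is what lets the search escape local minima that a greedy mapper would get stuck in.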
In the case of IPSec, with code size constraints imposed on the mapping, the mapping and schedule produced by FEADS result in 27% higher throughput than the manual mapping. Further studies with more complicated task graphs showed that our framework yields up to 2.5 times higher throughput than a manual binding under code size constraints. This clearly indicates that, with the growing complexity of network applications and increasing resource constraints, manual binding cannot adequately meet the line speeds required by these applications.

The rest of the paper is organized as follows. First, a discussion of related work is presented. In Section 3, a brief description of the various processors in the IXP family of network processors is presented. Section 4 presents FEADS, a framework which performs task partitioning and scheduling. Section 5 presents the experimental results. Finally, Section 6 concludes the paper.
2. RELATED WORK

Kulkarni et al.(13) studied the performance impact of partitioning on different network processor architectures. An IP forwarding router application was implemented on the IXP1200 and Motorola C-5 network processors. The application was partitioned at the logical cutting points revealed by the application, such as packet receive, header processing, and packet transmit. It was observed that inefficient partitioning reduced throughput by more than 30%, and that localizing computation with respect to the memories it uses increased the available bandwidth on internal buses by a factor of two.

Several methodologies for the design space exploration of network processors have also been proposed. Specifically, Blickle et al.(1) and Thiele et al.(19) use a genetic algorithm to explore the design space of network processors.
Given an application, a flow characterization, and the available resources, they bind and schedule the tasks of the application onto the resources. The objective function considers the cost of implementing the given mapping in hardware. Once a set of optimal points is obtained, simulation can be performed to select the best architectural configuration for the given application and flow characterization. Their work looks at the problem from an architectural design perspective, as reflected by their objective function, while we approach it from an application design viewpoint. Moreover, common features of network processors, such as the availability of multiple threads on a processing engine, are not considered by them while binding tasks to resources.
They also do not accommodate the fact that some tasks can be bound to more than one resource at a given time, such as a processor thread and a memory element, as is the case with memory-bound tasks. Further, they have used an evolutionary approach to explore the solution space, while we use simulated annealing, since we found that it covers the solution space more effectively and arrives at a better solution.

Weng and Wolf(22) construct an ADAG of the application using runtime traces and use this to perform design space exploration. An ADAG is a directed acyclic graph whose nodes are annotated with their processing and memory requirements, while the edges are annotated with the number of bytes of communication between the nodes connected by the edge. They use a randomized algorithm to perform the mapping of nodes to processors and memories. The system throughput is determined by modeling the processing time for each processing element in the system based on its workload, the memory contention on each memory interface, and the communication overhead between pipeline stages. Memory contention is modeled using a queuing network approach. At the end of the mapping process, the best overall mapping is reported.
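The ADAG representation just described (nodes annotated with processing and memory requirements, edges annotated with bytes of communication) can be captured in a small data structure. The class and field names here are ours, chosen for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ADAGNode:
    """A fine-grain task, annotated with its resource demands."""
    name: str
    processing_cycles: int      # processing requirement
    memory_accesses: int        # memory requirement

@dataclass
class ADAG:
    """Annotated directed acyclic graph: each edge carries the number
    of bytes communicated between the tasks it connects."""
    nodes: dict = field(default_factory=dict)   # name -> ADAGNode
    edges: dict = field(default_factory=dict)   # (src, dst) -> bytes

    def add_node(self, node: ADAGNode):
        self.nodes[node.name] = node

    def add_edge(self, src: str, dst: str, nbytes: int):
        assert src in self.nodes and dst in self.nodes
        self.edges[(src, dst)] = nbytes
```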
It should be noted that their model considers a network processor architecture with a fixed binding of processors to memory and communication elements, and hence does not take into account the availability of multiple memory modules for binding even after the processor has been selected. Since we use a simulated annealing approach, we believe that our framework explores the design space more thoroughly than multiple randomized mappings, where each solution point is independent of the others.

Franklin and Datar(4) propose an algorithm which employs a greedy heuristic to schedule tasks derived from multiple application flows on pipelines with an arbitrary number of stages. Tasks may be shared, and different bandwidths may be associated with each of the application flows; however, memory contention is not modeled. They observed that task partitioning results in greater flexibility in assigning tasks to the hardware pipeline, but with a corresponding increase in the complexity of implementing the partitioning and larger inter-task communication costs. Sharing of tasks common to different flows also leads to better throughput, since memory contention is reduced.
Their task graph is partitioned at arbitrary points to achieve load balancing across processors, while in our work we start with a more clearly demarcated set of processing, memory, and communication tasks. Demarcating task boundaries at memory and communication tasks enables us to easily take into account the memory and communication costs and possible thread context switching.

The system of Chen et al.(2) consists of a programming language, a compiler for optimizing network programs using both traditional and specialized optimization techniques, and a runtime system which identifies hot code paths and improves their mapping to processing resources. Tasks from different pipeline stages are aggregated so as to minimize communication cost. Pipeline aggregate duplication is used to improve the throughput of the slowest pipeline aggregate if its throughput is much less than that of the other pipeline aggregates. We have found that scheduling multiple instances of
the application achieves the same result as duplication of individual pipeline stages.

Click(17) and NP-Click(18) provide a programming abstraction for specifying network applications. Click is a software architecture comprising packet processing modules called elements. To build a router configuration, the user connects a collection of elements into a graph, and packets move from element to element along the graph's edges. NP-Click(18) builds on this framework to provide an abstraction of the underlying hardware and data layout that exposes enough architectural detail to write efficient code for that platform, while hiding less essential architectural complexity. In either case, elements have to be rewritten for portability.
Also, the tasks under consideration involve many memory accesses internally, and hence it is difficult to model contention for memory. We have used an ADAG with tasks of finer granularity, which can be generated from an application specified using either of these abstractions.

Ennals et al.(3) describe transformations that can be performed on network processing applications, such as pipelining tasks and merging pipelined tasks. Since their work mainly deals with task partitioning, it is orthogonal to our work.

3. IXP FAMILY OF NETWORK PROCESSORS

Figure 2 shows a block diagram of the Intel IXP2400 network processor,(9) which consists of an Intel XScale core and eight microengines. The Intel XScale core initializes and manages the chip and is used for control plane functions like exception handling. The microengines (MEs) are 32-bit programmable processors specialized for network processing, which handle the data plane processing for the packets.
Each ME supports eight threads and allows zero-overhead context switching.

Fig. 2. Block diagram of IXP2400 (9).


Pai and Govindarajan

The memory architecture of the IXP2400 consists of SRAM, DRAM, scratch memory, local memory, and next neighbor registers. Typically, packets are stored in DRAM, while the SRAM stores route lookup tables. 16 KB of low-latency scratchpad memory is also provided, which along with SRAM can be used for communication between MEs. There are separate SRAM and DRAM controllers for interfacing with the external memories. Next neighbor (NN) registers can be used for communication between adjacent MEs; this communication is unidirectional, as shown in Fig. 2. Additionally, each ME contains 640 words of local memory, which can be used for communication between hardware contexts. Specialized hardware like the hash and crypto units is also available on processors in the IXP family.

As can be seen, there is a wide variety of options for mapping communication and processing tasks.
Moreover, the resource used for communication, and hence the communication latency, is constrained by the processors onto which adjacent tasks in the task flow graph are bound. The size of the resulting solution space implies that the mapping of tasks to processors and memory modules should be automated.

Other processors in the IXP family include the IXP1200, the IXP2800, and the IXP2850. These processors have a similar structure, but the number of MEs, the memory controllers, and the specialized hardware differ. The IXP1200 consists of six MEs with four threads per ME (8). It has only one SRAM controller, one DRAM controller, and no next neighbor registers for communication.
The IXP2800, on the other hand, has 16 MEs distributed over two clusters, four SRAM controllers, and three DRAM controllers (10). Other features, however, remain identical to the IXP2400. The IXP2850 is similar to the IXP2800, but additionally contains two crypto units which implement cryptographic algorithms in hardware (11).

Due to these dissimilarities in the architecture of processors in the IXP family, the binding done for one processor might not lead to an optimal solution for the others. Task partitioning and performance evaluation techniques therefore need to consider the various resources available on each processor in order to utilize them effectively.

4. FRAMEWORK FOR TASK SCHEDULING AND PARTITIONING

Figure 3 depicts the design of FEADS, a framework for task scheduling and performance evaluation for a given architectural specification. The different components of the framework are described in detail below.


Fig. 3. Proposed design for the framework: the architectural specification and the network application code feed the ADAG generation module, the scheduler module, and the dynamic performance evaluation module of FEADS.

4.1. ADAG Generation Module

The application to be scheduled can be specified using a framework like Click or NP-Click. An ADAG for this task flow graph is generated, along with various constraints, such as the resources that the tasks can be mapped onto and the instruction store requirements of the tasks. Task nodes in the ADAG are explicitly classified into processing, memory, communication, or application-specific tasks meant for the hash or crypto units. The nodes of the ADAG are considered atomic and cannot be split further.
Based on the type of node, it is annotated either with its processing requirements in terms of the number of cycles, or with its memory requirements in terms of the number of bytes to be transferred. Memory nodes are also annotated with the type of memory, SRAM or DRAM, that they need to access. Similar constraints can also be imposed on the binding of other nodes. Since we use nodes of finer granularity than those provided by Click or NP-Click, we model the memory contention and communication costs with greater accuracy and hence obtain a better schedule.

An example ADAG is shown in Fig. 4. Processing nodes are denoted by P, memory nodes by M, and communication nodes by C. Communication nodes are annotated with the number of bytes of communication needed by the two adjacent nodes. For example, the communication from node a to node b requires 32 bytes and is represented by node e in Fig. 4.
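The node annotations above can be captured in a small data structure. The following sketch is illustrative only (the `AdagNode` type and its field names are our own, not FEADS' internal format); it encodes the nodes of Fig. 4:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AdagNode:
    name: str
    kind: str                      # 'P' (processing), 'M' (memory), 'C' (communication)
    cycles: int = 0                # processing requirement, for 'P' nodes
    nbytes: int = 0                # bytes transferred, for 'M' and 'C' nodes
    binding: Optional[str] = None  # binding constraint, e.g. 'ME', 'SRAM', 'DRAM'
    succs: List[str] = field(default_factory=list)

# Fig. 4: a 10-cycle processing node a bound to an ME, 32-byte
# communication nodes e and f, and memory nodes b (SRAM) and c (DRAM).
adag = {
    'a': AdagNode('a', 'P', cycles=10, binding='ME', succs=['e', 'f']),
    'e': AdagNode('e', 'C', nbytes=32, succs=['b']),
    'f': AdagNode('f', 'C', nbytes=32, succs=['c']),
    'b': AdagNode('b', 'M', nbytes=32, binding='SRAM'),
    'c': AdagNode('c', 'M', nbytes=32, binding='DRAM'),
}
```

Such an explicit per-node annotation is what lets the scheduler account for memory contention and communication costs separately.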
The annotations for node a indicate that it requires 10 clock cycles for execution and should be bound to an ME. Similarly, node b transfers 32 bytes of data and should be bound to SRAM (S), while node c should be bound to DRAM (D).

In this work, we do not concentrate on the ADAG generation module. We assume that an ADAG representation of the application forms an input to FEADS. For the example network applications considered in this paper, viz., IP forwarding (IPv4), Network Address Translation (NAT), and IP Security (IPsec) (refer to Section 5.1 for a detailed discussion on


these applications), we have shown the ADAGs in Fig. 5. The generation of the ADAG for an application is left for future work.

Fig. 4. Nodes in an ADAG.

4.2. Scheduler Module

The ADAG and the architectural specification serve as inputs to the scheduler, which maps ADAG nodes onto the appropriate resources. The architecture of the processor on which the application is to be scheduled is specified as pairs consisting of a resource type and the number of resources of that type available. The overall design and implementation of the scheduler module is shown in Fig. 6. The outer loop estimates the number of instances which would result in optimal per-instance throughput and performs a mapping and scheduling of the estimated number of instances using simulated annealing; details are discussed in Section 4.2.2. PN_Generate is part of the Dynamic Performance Evaluation module.
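As a concrete illustration of such a specification, the IXP2400 of Section 3 could be described by the following (resource type, count) pairs; the dictionary form and the names are assumptions for illustration, not FEADS' actual input syntax:

```python
# IXP2400: eight MEs, two SRAM controllers, one DRAM controller, a hash unit.
ixp2400_spec = {
    'ME': 8,
    'SRAM': 2,   # SRAM controllers
    'DRAM': 1,   # DRAM controller
    'HASH': 1,
}

def num_resources(spec, rtype):
    """Number of available resources of a given type (nR_f below)."""
    return spec.get(rtype, 0)
```

Describing other IXP family members (e.g. the IXP2800 with 16 MEs and four SRAM controllers) then only requires changing the counts.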
The inner loop uses simulated annealing to generate a mapping for the given ADAG. For each mapping, the tasks are scheduled and the schedule length is obtained. We use schedule length as the static performance metric for comparing different mappings. The simulated annealing algorithm and the other modules are described in the following subsections.

4.2.1. Task Mapping

Simulated annealing (12) is used to explore the task mapping space. Mapping is done using Algorithm 1. The schedule length for a mapping is


used as the objective function. Starting with an initial binding, successive mappings for a given temperature are incrementally derived. The incremental change in the mapping of a task to a given resource is a function of the temperature under consideration. At higher temperatures, there can be larger changes in the binding of tasks to resources, in terms of the distance from the previous mapping. As the temperature decreases, the amount of perturbation allowed also reduces.

Fig. 5. ADAGs for IPv4, NAT, and IPSec (P: processor, S: SRAM, D: DRAM, H: hash, C: crypto).
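A minimal sketch of such a temperature-scaled perturbation is shown below. It assumes a binding maps each task to a (resource type, instance index) pair and simply rebinds a temperature-dependent number of tasks; the `perturb` function and its move rule are illustrative assumptions, not the exact move function used by FEADS.

```python
import random

def perturb(binding, spec, temp, initial_temp):
    """Derive a successor mapping: the number of tasks moved (the 'distance'
    from the previous mapping) shrinks as the temperature drops."""
    new_binding = dict(binding)
    # Fraction of tasks allowed to move scales with the current temperature.
    n_moves = max(1, int(len(binding) * temp / initial_temp))
    for task in random.sample(list(binding), n_moves):
        rtype, _ = binding[task]
        # Rebind the task to another instance of a legal resource type.
        new_binding[task] = (rtype, random.randrange(spec[rtype]))
    return new_binding
```

At the initial temperature every task may be rebound; near the final temperature only a single task moves per step.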
When a mapping changes, some of the bindings previously associated with communication tasks might no longer be valid, due to changes in the bindings of the tasks which they connect, and hence the mapping has to be repaired. For each mapping thus obtained, a schedule is constructed as described in Section 4.2.3. A mapping with a lower schedule length is unconditionally selected for annealing, while mappings with a higher schedule length are selected with probability


$$\exp\left(\frac{-(\mathit{currSchedLen} - \mathit{prevSchedLen})}{k \cdot T}\right),$$

where k is the Boltzmann constant and T the current temperature. After every iteration of the inner loop, the temperature is reduced and the process is repeated. The use of simulated annealing for exploring the design space is acceptable, since the mapping and scheduling need to be done only once, when the processor is programmed.

Fig. 6. Overall design of the scheduler module.

4.2.2. Scheduling Multiple Instances

Mapping a single instance of the application might result in some resources being idle. For example, for the ADAG shown in Fig. 7(a), a single instance results in the schedule shown in Fig. 7(b), with a throughput of 0.5. However, unrolling the ADAG multiple times and cyclically scheduling the unrolled ADAG can result in higher throughput. Scheduling two and three instances of the ADAG results in an II (Initiation Interval) of


Algorithm 1. Mapping using Simulated Annealing

    prevBinding = GenerateInitialBinding()
    prevSchedLen = schedule(prevBinding)
    for T = initialTemp downto finalTemp do
        for j = 1 to N do
            newBinding = f(prevBinding, T)
            newSchedLen = schedule(newBinding)
            if newSchedLen < prevSchedLen or
               random(0, 1) < exp(-(newSchedLen - prevSchedLen) / (k * T)) then
                prevBinding = newBinding
                prevSchedLen = newSchedLen
            end if
        end for
    end for

three and four, respectively, which corresponds to a throughput of 2/3 = 0.67 and 3/4 = 0.75 packets per cycle, respectively, as shown in Fig.
7(c) and (d). A further increase in the number of instances does not yield any improvement in throughput, since all the PEs are saturated.

Replicating instances of the application is equivalent to loop unrolling and results in better resource utilization and per-instance throughput. Since incoming packets are independent of each other, only resource dependencies need to be considered while scheduling instances of tasks across loops. The minimum initiation interval (MII), governed only by resource dependencies for i instances of the ADAG, is given by

$$\mathrm{MII} = \max_{f}\left\lceil \frac{i \cdot nT_f \cdot lat_f}{nR_f} \right\rceil,$$

where f is the resource type, nT_f the number of tasks bound to resource f, and lat_f the latency of resource f. The throughput of a schedule with such an II is i/II, and the average II per instance is II/i. Since we want to obtain maximum throughput with a minimum amount of unrolling, the lower bound for the II of the unrolled and pipelined loop schedule is computed as

$$\min_{i}\left(\frac{\max_{f}\left\lceil i \cdot nT_f \cdot lat_f / nR_f \right\rceil}{i}\right), \qquad (1)$$


where i is the number of instances, f the resource type, nT_f the number of tasks bound to resource f, lat_f the latency of resource f, and nR_f the number of resources of type f.

Fig. 7. Scheduling multiple instances of an ADAG: (a) the ADAG; (b) scheduling a single instance (II = 2, throughput 1/2 = 0.5); (c) scheduling two instances (II = 3, throughput 2/3 = 0.67); (d) scheduling three instances (II = 4, throughput 3/4 = 0.75).

We use the concept of r-periodic scheduling (20) to determine the optimal number of application instances to be scheduled. In this scheme, the schedules of two consecutive iterations of a loop may not be the same, but will repeat every r iterations. In our case, r determines the number of


instances of the application to be scheduled. When multiple instances of the application are to be scheduled, the ADAG is replicated, and the nodes are mapped onto the resources and scheduled.

From Eq. (1) it can be seen that multiples of the count of the resource with the largest occupancy yield the best theoretical lower bound, and hence these can be used as starting points for further exploration in our design.
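The lower bound of Eq. (1) and the search for a good unroll factor can be sketched as follows; here `tasks` aggregates, per resource type, the number of bound tasks and the resource latency (the function and variable names are ours, for illustration):

```python
from math import ceil

def mii(i, tasks, resources):
    """Resource-constrained MII for i unrolled instances: the maximum over
    resource types f of ceil(i * nT_f * lat_f / nR_f), the inner term of Eq. (1).
    tasks: {f: (nT_f, lat_f)}; resources: {f: nR_f}."""
    return max(ceil(i * nT * lat / resources[f]) for f, (nT, lat) in tasks.items())

def best_instance_count(tasks, resources, max_i=8):
    """Unroll factor i minimising the per-instance II, i.e. MII(i) / i."""
    return min(range(1, max_i + 1), key=lambda i: mii(i, tasks, resources) / i)

# Five unit-latency tasks bound to PEs, with three PEs available: the best
# unroll factor is a multiple of the PE count, here 3 (MII(3)/3 = 5/3).
assert best_instance_count({'PE': (5, 1)}, {'PE': 3}) == 3
```

This illustrates the observation above: the ceiling only divides evenly, and hence the bound is only tight, when i is a multiple of the count of the busiest resource.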
It might not always be possible to obtain a schedule with an II equal to the computed lower bound, because our mapping might result in some additional communication tasks being bound to some of these resources. For example, if tasks X and Y operate on some common data but are bound to two different MEs, a communication task with additional latency is generated as part of the mapping. Also, since some tasks might be bound to more than one resource, the actual schedule length might differ from the computed one.

4.2.3. Task Scheduling

For each mapping, FEADS constructs a valid schedule. Since cyclic or software pipelined schedules (14) yield better throughput than block schedules, FEADS considers only cyclic schedules. In cyclic scheduling, successive iterations of the ADAG are overlapped. A specific form of cyclic scheduling, modulo scheduling, schedules different instances of a node, corresponding to different iterations, exactly II cycles apart. The throughput achieved by such a schedule is 1/II; hence cyclic scheduling attempts to minimize the II. Consider the ADAG in Fig.
7(a), with the available resources being three processing engines (PEs) and one memory element. Assume the PEs are non-pipelined. Nodes x and z require two cycles each for execution, while node y takes one cycle. Cyclic scheduling results in the schedule shown in Fig. 7(b). The throughput of this schedule is 1/2 packets per cycle. Note that in this schedule, instances corresponding to three iterations overlap within an II.

To construct an r-periodic cyclic schedule for a given II (computed using Eq. (1)) and the mapping obtained using simulated annealing, we use the decomposed software pipelining algorithm (21). We have used the First Row Last Column (FRLC) algorithm for decomposed software pipelining in our implementation. Here the row numbers denote the cycles of execution and the column numbers denote the iterations. Since there are no loop-carried dependences in our graph, the set of Strongly Connected Components (SCCs) is empty.
We generate the new Loop Data Dependence Graph (LDDG) by removing all edges which do not connect the SCCs, as per the algorithm. We then perform list scheduling (7) on this newly generated LDDG. The row number for each task is determined by the cycle at which


it starts execution. The II for the pipelined loop is given by the number of rows as computed by the FRLC algorithm.

4.3. Performance Evaluation Module

A dynamic performance evaluation technique is incorporated into FEADS to study the performance of the statically generated schedule under realistic workloads. A PN model of the application mapping and schedule on the given processor is generated for dynamic performance evaluation.
This is similar to the technique used by Weng and Wolf (22), wherein a performance model of the architecture under consideration is used to compute the throughput of a given binding using a queuing-network approach.

A brief description of PN models in general, and of their use in modeling applications mapped onto network processors in particular, is given in the following subsections. We also describe the scheme used by our framework to automate the generation of these models for use by the CNET simulator (23) for PNs.

4.3.1. The Petri Net Model

The PN is a mathematical modeling tool which is commonly used to model concurrency and conflict in systems. In its graphical representation, circles represent places and boxes represent timed transitions. Timed Petri nets (TPNs) are an extension of PNs in which a finite firing time is associated with transitions. TPNs have been used in modeling multithreaded processors.
The main advantage of using TPNs for processor modeling is the added ability to capture the latency of operations such as memory access and processor execution at a high level of abstraction.

The PN model generated by our framework models the architecture as well as the application in great detail. An example model of a single microengine in IXP2400 running the IPv4 application is shown in Fig. 8. For clarity, only the part of the model which captures the flow of packets from the external link to DRAM through the MAC is shown. The firing time of a timed transition is assumed to be deterministic. The places UE, THREAD, UE_CMD_Q, DRAM_Q, and CMD_BUS represent the various resources available and thus model the processor architecture, while the timed transitions UE_PROCESSING and RFIFO_DRAM represent specific tasks. The time taken by these transitions models the time taken by the corresponding tasks in the specific unit.
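As a rough illustration of these acquire/fire semantics, the token game of a timed transition can be sketched in Python. This is not the CNET model itself; the place names follow Fig. 8, but the token counts and the 10-cycle latency are assumptions for illustration:

```python
# Minimal timed-Petri-net "token game": a transition starts when every input
# place holds a token (acquiring those resources), and after its deterministic
# firing time it deposits tokens in its output places (releasing resources).
# Place names mirror Fig. 8; token counts and latency are illustrative.
import heapq

places = {"UE": 1, "THREAD": 4, "UE_RFIFO": 1, "MEM_R1": 0}

transitions = {
    # name: (firing time in cycles, input places, output places)
    "UE_PROCESSING": (10, ("UE", "UE_RFIFO"), ("UE", "MEM_R1")),
}

def run():
    clock, pending, fired = 0, [], []
    while True:
        for name, (delay, ins, outs) in transitions.items():
            if all(places[p] > 0 for p in ins):
                for p in ins:                      # acquire resources at start
                    places[p] -= 1
                heapq.heappush(pending, (clock + delay, name, outs))
        if not pending:
            return fired
        clock, name, outs = heapq.heappop(pending)
        for p in outs:                             # release resources on firing
            places[p] += 1
        fired.append((clock, name))
```

With one token in UE and one in UE_RFIFO, UE_PROCESSING fires once, completing at cycle 10 and depositing a token in MEM_R1.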
Thus the PN model is able to capture the processor architecture, the applications, and their interaction in detail.

Fig. 8. PN model for a single microengine in IXP2400 running IPv4.

A more detailed explanation of the model is given in Ref. (6). The PN models for NAT and IPSec are given in Fig. 9.

4.3.2. The PN Generate Tool

The dynamic performance of the mapping and schedule generated by the scheduler module is evaluated using the PN model discussed in the


Fig. 9. PN models for a single ME in IXP2400 running NAT and IPSec.


previous subsection. This model is simulated using the CNET simulator for PNs. (23) We generate the PN model of the application mapping onto the processor, and this forms the input to the CNET simulator. We traverse the ADAG and generate transitions for each node. Resources to which a task is bound are acquired at the beginning of a timed transition and released when the transition fires.
Dummy transitions are generated between two places to pause between transitions and hence impose the computed static schedule on the model.

For the example shown in Fig. 8, if the task UE_PROCESSING takes 10 cycles for execution, its transition for the CNET simulator is generated as

UE_PROCESSING : D * 10 {UE, UE_RFIFO / MEM_R1},

where D indicates a deterministic firing time. Terms to the left of the / correspond to input places and represent resources that the transition needs for execution; those to the right correspond to output places and represent resources released by the task upon completion. In generating the PN, the tool makes use of the mapping generated by the simulated annealing method, and an appropriate resource is used as an input place.
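The transition-generation step can be sketched as follows. The output syntax mirrors the example above; the task list, the second task's latency, and the choice of places are illustrative assumptions, not the actual PN Generate tool:

```python
# Emit CNET-style transition lines of the form
#   NAME : D * <cycles> {in_places / out_places}
# from a task -> (latency, input places, output places) list, as produced
# by the binder. The latency of RFIFO_DRAM here is an assumed value.
def emit_transition(name, cycles, in_places, out_places):
    return "{} : D * {} {{{} / {}}}".format(
        name, cycles, ", ".join(in_places), ", ".join(out_places))

def generate_pn(adag):
    """adag: list of (task, cycles, inputs, outputs) in topological order."""
    return [emit_transition(*node) for node in adag]

lines = generate_pn([
    ("UE_PROCESSING", 10, ["UE", "UE_RFIFO"], ["MEM_R1"]),
    ("RFIFO_DRAM", 54, ["DRAM_Q", "CMD_BUS"], ["DRAM"]),
])
```

Each ADAG node thus becomes one timed transition whose input places are the resources assigned to it by the binding.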
Similarly, the number of instances and the cyclic schedule generated by our scheduling method are used by the PN Generate tool.

The binding, scheduling, and verification stages have been made generic enough to allow FEADS to be used for application mapping on other processors in the IXP family. Only small modifications to the architectural description file need to be made, specifying the change in the number and type of available resources.

5. RESULTS

In this section we first present our experimental setup, including a brief description of the application programs considered in our study. The subsequent sections discuss the performance results of FEADS.

5.1. Experimental Setup

We evaluated the schedules generated by FEADS for three network processing applications — IPv4, IPSec, and NAT.
A brief description of each application and the sizes of their ADAGs are given below.

IPv4: IP forwarding is a fundamental operation performed by a router. IPv4 uses the header of the packet to determine the destination address. A lookup is performed based on the destination address in


Table I. Time Taken by FEADS for Different Benchmarks

Application    Exec. Time (in Sec.)
IPv4           24.7
NAT            33.0
IPSec          30.9

the IP to determine the destination port number and the next hop address. The lookup table is stored in the SRAM. The time-to-live field in the IP header is decremented and the cyclic redundancy checksum (CRC) is recomputed. The packet is then forwarded to the next hop. The ADAG in this case consists of 23 nodes, five of which are memory nodes.

NAT: NAT is a method by which many network addresses and their TCP/UDP ports are translated into a single network address and its TCP/UDP ports. A translation table stores the corresponding translation from the private IP address and port number to the globally visible router IP address and a unique port number. The translation table is stored in SRAM.
The ADAG for NAT consists of 31 nodes, with seven memory nodes.

IPSec: The IPSec protocol is used to provide privacy and authentication services at the IP layer. IPSec supports two protocols depending on the level of security. A 32-bit connection identifier, referred to as a Security Parameter Index (SPI), is associated with the shared key and the algorithm used for encryption. We assume that the SPI is stored in SRAM. The ADAG for IPSec consists of 29 nodes, six of which are memory nodes.

We consider the ADAGs for these applications (IPv4, IPSec, and NAT) as input to FEADS. These ADAGs are shown in Fig. 5. The sizes of the ADAGs themselves vary from seven to 11 nodes, excluding the communication nodes which may be introduced during mapping.
Due to the small sizes of the ADAGs under consideration, converging to a solution using simulated annealing takes less than 30 s of runtime on a standard PC (a 3 GHz Pentium 4 with 1 GB RAM), even while considering multiple instances of the ADAG. The execution time of the FEADS framework for the different benchmarks is shown in Table I.

We do not consider other benchmark applications for the following two reasons. First, the ADAGs for other applications are not readily available and, at this point, we do not have a tool to generate them automatically. Second, most network processing applications show very similar behavior, at least within the class of header processing and payload


processing applications, in terms of their program flow, access to SRAM and DRAM memory, application-specific units, etc. Hence we consider the ADAGs for IPv4, NAT, and IPSec as representative of commonly used network processing applications.

IPv4 and NAT were scheduled on an IXP2400 processor, while IPSec was scheduled on an IXP2850 since it uses crypto units for packet processing. The clock speed of the IXP2400 is 600 MHz and that of the IXP2850 is 1400 MHz. In all cases, we assume packet arrival is exponentially distributed^3 with a mean arrival rate corresponding to a 5 Gbps line for IPv4 and NAT, and a 10 Gbps line for IPSec.
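For concreteness, the mean interarrival time implied by such a line rate can be computed as below. This is a sketch, assuming the 64-byte minimum-size packets used throughout the experiments:

```python
# Mean interarrival time of minimum-size (64-byte) packets at a given line
# rate, plus exponentially distributed (Poisson-arrival) samples around it.
import random

def mean_interarrival_ns(line_rate_bps, packet_bytes=64):
    packets_per_sec = line_rate_bps / (packet_bytes * 8)
    return 1e9 / packets_per_sec

def sample_interarrivals(line_rate_bps, n, seed=0):
    mean_ns = mean_interarrival_ns(line_rate_bps)
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_ns) for _ in range(n)]
```

At 5 Gbps, a 64-byte packet occupies 512 bits on the wire, giving one packet every 102.4 ns on average.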
In all experiments we use minimum-size packets (64 bytes), as is customary in the evaluation of network processors, (2) since this corresponds to the performance of the system under denial-of-service (DoS) attacks. (5)

The results for the various scenarios that we evaluated are presented in the following subsections.

5.2. Naive versus FEADS Mapping

First, we compare FEADS with a naive binding scheme in which all tasks for a given instance of the application are bound to a single ME. Multiple instances are bound to different MEs in a round-robin fashion. This results in an even distribution of tasks across microengines with no communication cost among the tasks. This is similar to the task assignment policy followed in Ref. (6). The tasks were scheduled using decomposed software pipelining. Note that although we call this scheme Naive, only the binding is naive; the scheduling of tasks (within an ME) is still software pipelined.
Thus a comparison of FEADS with the Naive scheme brings out the benefits of the task binding performed by FEADS.

As the number of instances is increased, the throughput for both FEADS and naive mapping improves up to a certain point (refer to Fig. 10). Beyond this, the throughput fluctuates as the number of instances increases. This happens due to the ceil function described in Eq. 1. We see that the FEADS mapping achieves a higher throughput (7% in IPv4, 4% in NAT, and 9% in IPSec) compared to naive mapping. When fewer instances of the application are scheduled, FEADS performs much better than the naive mapping, as the latter loads only a few microengines while the others are left idle.
^3 The performance of FEADS under bursty traffic has been evaluated and is reported in Section 5.4.

As the number of instances is increased, both naive and FEADS mapping perform comparably. This is because tasks are


Fig. 10. Comparison of FEADS Mapping with Naive Mapping.


more evenly balanced across all resources and the schedule thus obtained is equal in length to that computed by FEADS.

5.3. Scheduled versus Immediate Execution

We have evaluated the mapping generated by the FEADS framework with and without imposing the static schedule generated by the cyclic scheduler on the PN model. In the former case, arriving packets are held in the receive buffer until it is time to schedule the packet. Scheduling of packets happens at multiples of II time steps. In the case of immediate execution, packets are scheduled as soon as they arrive.
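The difference between the two release policies can be sketched as follows; the arrival times and II value are made up for illustration, and this is not the CNET simulation:

```python
# Release times under the two policies: "immediate" dispatches a packet at
# its arrival time; "scheduled" holds it in the receive buffer until the
# next multiple of the initiation interval (II).
import math

def release_times(arrival_times, ii=None):
    if ii is None:                              # immediate execution
        return list(arrival_times)
    return [math.ceil(t / ii) * ii for t in arrival_times]
```

For example, with II = 10, arrivals at t = 3, 5, 14, 21 are released at t = 10, 10, 20, 30; the holding time between arrival and release is what drives the scheduled policy's larger buffer requirement discussed in the text.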
The PN model generated by the scheduler module was simulated using the CNET (23) simulator. The throughput and buffer requirements (in the RFIFO) in either case were obtained from the model. The results for the two cases are shown in Fig. 11, and the buffer requirements are compared in Fig. 12. Static scheduling of tasks results in 14% higher throughput on average. Specifically, it achieves an improvement of 16% in IPv4, 13% in NAT, and 14% in IPSec over immediate scheduling. Note that the performance improvement is consistent across applications for all instances. Although the buffer requirement of the FEADS schedule is higher than that of the Immediate schedule, the maximum buffer requirement is still less than 64 packets in all cases. This corresponds to a buffer size of 4 KB, which is quite common. When we consider packets larger than 64 bytes at the same line rate, the interarrival time between packets is higher and hence the buffer requirements will be smaller.

5.4. Performance of FEADS under Bursty Traffic

Further, we evaluated the performance of FEADS for bursty traffic, wherein packets on the line arrive in bursts separated by periods of idle time. Specifically, the bursty traffic is modeled as an aggregate of different substreams. (15) Each substream is assumed to be of constant packet size with finite ON/OFF periods. During the ON time, each substream generates packets of constant size at a specified line rate, and it restarts packet generation after the OFF period. The interarrival time of packets in the ON period within a substream is equal to (packet size)/(line rate). The ON and OFF times of each substream are Pareto distributed with a probability distribution function f(x) given by

f(x) = α β^α / x^(α+1),    (2)


Fig. 11. Comparison of scheduled versus immediate execution.


Fig. 12. Buffer requirements for scheduled versus immediate execution.

where α represents the shape parameter and β represents the scale parameter. The shape parameter α takes a value between 1 and 2. The scale parameter β is the ON/OFF time. The total line rate of the bursty traffic can be controlled through the line rates of the individual substreams and the ON periods.

It has been shown that traffic generated using the above methodology is bursty and similar to the traffic commonly encountered by routers. (15) The above traffic generator is modeled using a PN. The resultant traffic generated is used as the input to the network processor.
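A sketch of one such ON/OFF substream generator, using inverse-CDF sampling of the Pareto density in Eq. (2); all parameter values here are illustrative assumptions:

```python
# ON/OFF substream: Pareto-distributed ON and OFF durations; during an ON
# burst, packets of constant size arrive with a fixed inter-packet gap set
# by the substream's line rate. Default alpha/beta values are illustrative.
import random

def pareto_sample(rng, alpha, beta):
    # Inverse-CDF sampling for f(x) = alpha * beta**alpha / x**(alpha + 1),
    # i.e. F(x) = 1 - (beta / x)**alpha for x >= beta.
    return beta / rng.random() ** (1.0 / alpha)

def substream_arrivals(horizon_ns, gap_ns, alpha=1.5,
                       on_beta=1000.0, off_beta=1000.0, seed=0):
    """Packet arrival times (ns) up to horizon_ns for one substream."""
    rng, t, arrivals = random.Random(seed), 0.0, []
    while t < horizon_ns:
        on_end = t + pareto_sample(rng, alpha, on_beta)   # ON burst
        while t < min(on_end, horizon_ns):
            arrivals.append(t)
            t += gap_ns
        t = on_end + pareto_sample(rng, alpha, off_beta)  # OFF gap
    return arrivals
```

Aggregating several such substreams, each with its own line rate and ON/OFF parameters, yields the bursty input traffic; every Pareto sample is at least β, as Eq. (2) requires.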
The performance results under bursty traffic are shown in Fig. 13. We observe a trend in performance similar to the earlier case. Specifically, FEADS achieves an improvement of 26% in IPv4, 21% in NAT, and 17% in IPSec on average.

5.5. Imposing Code Size Constraints

Next we study the effect of code size constraints on the binding of tasks to resources. Imposing code size constraints precludes mapping the entire application onto a single microengine. Thus, in this case, a naive binding might not result in an even distribution of tasks. The throughput obtained by the naive binding scheme as compared to FEADS is shown in Fig. 14 for a code store constraint of 50 instruction words per ME. We observe that FEADS results in an increase in throughput of 53% in the best case (for IPSec) and 12% on average over all applications under consideration.
IPSec consists of a larger number of processing tasks than the other applications, and hence a naive binding does not exploit the available microengines and threads effectively.


Fig. 13. Throughput with bursty traffic.


Fig. 14. Throughput with code size constraints.


5.6. Performance of Non-linear ADAGs

The ADAGs for the applications considered thus far are linear in nature. With such linear ADAGs, performing manual mapping is straightforward (with or without code constraints) and often results in the same mapping as FEADS. We model more complex task graphs to better reflect the increasing complexity of the network applications being implemented on network processors. We compared the performance of our scheme for two non-linear ADAGs, FEADS_input1 and FEADS_input2. The ADAG for FEADS_input1 was constructed using two parallel branches, one branch each from IPv4 and NAT.
The throughputs obtained by the naive scheme and by FEADS are shown in Fig. 15. In the case of FEADS_input2, the throughput obtained by FEADS is up to 2.5 times higher than that of the naive scheme. We also observe that the performance improvement is consistent across all numbers of instances. These results clearly indicate that, with the increasing complexity of network applications, manual partitioning and binding of tasks is no longer sufficient to achieve the necessary performance.

Fig. 15. Throughput for non-linear ADAGs: (a) FEADS_input1, (b) FEADS_input2.
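Why mapping quality dominates throughput for such graphs can be seen from a first-order model: in steady state, pipeline throughput is bounded by the most heavily loaded processing element. The sketch below, with invented cycle costs and hypothetical mappings (not measured IXP figures), contrasts a naive assignment with a more balanced one:

```python
# Sketch: steady-state throughput bound of a task-to-PE mapping.
# Cycle costs and mappings are invented for illustration.

def throughput_bound(task_cost, mapping):
    """Pipeline throughput is limited by the busiest processing element:
    1 / (total cycles that PE spends per packet)."""
    load = {}
    for task, pe in mapping.items():
        load[pe] = load.get(pe, 0) + task_cost[task]
    return 1.0 / max(load.values())

cost = {"rx": 50, "ipv4_lookup": 120, "nat_translate": 100, "tx": 50}

# A naive mapping packs tasks onto PEs in graph order, ignoring the
# parallel branches; a balanced mapping separates the two heavy branches.
naive = {"rx": 0, "ipv4_lookup": 0, "nat_translate": 1, "tx": 1}
balanced = {"rx": 0, "ipv4_lookup": 1, "nat_translate": 2, "tx": 0}

print(throughput_bound(cost, naive))     # limited by PE0: 1/170
print(throughput_bound(cost, balanced))  # limited by PE1: 1/120
```

Even in this toy setting the balanced mapping is about 1.4x faster; with more branches, more PEs, and memory contention, finding such a balance by hand becomes impractical, which is the gap FEADS' automated search targets.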


5.7. Summary

From the above results, we observe that static task scheduling is beneficial and should be enforced when buffer requirements can be met. While manual binding of tasks to resources can produce a fairly good solution for linear graphs when no additional code-size constraints are imposed, it becomes increasingly difficult when either of these conditions is not met. FEADS yields a mapping whose throughput is comparable to that obtained by manual partitioning and scheduling for linear task flow graphs, and up to 2.5 times higher for more complicated task graphs; it therefore effectively automates the process of application development on network processors.

6. Conclusions

This paper presents FEADS, a framework for automating the process of task partitioning, mapping and scheduling on network processors. FEADS uses an ADAG with tasks of finer granularity than those proposed by previous approaches. It obtains an optimized mapping and scheduling of tasks onto resources using simulated annealing; task scheduling is done using cyclic and r-periodic scheduling. We find that FEADS generates mappings whose throughput is comparable to that obtained by manual binding for linear ADAGs and up to 2.5 times higher for non-linear ADAGs. Further, imposing a static schedule on the generated mapping results in 14% higher throughput compared to unscheduled execution.
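The simulated-annealing search at the heart of this approach can be illustrated with a minimal sketch. The cost function here (maximum per-PE load) is a simplified stand-in for the actual FEADS objective, which also accounts for the constructed schedule and memory contention; all task costs and annealing parameters below are hypothetical:

```python
# Minimal simulated-annealing sketch for task-to-PE mapping.
# The cost function (max per-PE load) is a simplified stand-in for the
# FEADS objective; task costs and parameters here are hypothetical.
import math
import random

def sa_map(task_cost, n_pes, steps=5000, t0=10.0, alpha=0.999, seed=0):
    rng = random.Random(seed)
    tasks = list(task_cost)
    mapping = {t: rng.randrange(n_pes) for t in tasks}

    def cost(m):
        load = [0] * n_pes
        for t, pe in m.items():
            load[pe] += task_cost[t]
        return max(load)  # throughput is limited by the busiest PE

    cur = cost(mapping)
    temp = t0
    for _ in range(steps):
        # Neighbour move: reassign one random task to a random PE.
        t = rng.choice(tasks)
        old = mapping[t]
        mapping[t] = rng.randrange(n_pes)
        new = cost(mapping)
        # Always accept improvements; accept worse moves with
        # Boltzmann probability exp(-delta / temperature).
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
        else:
            mapping[t] = old  # reject: undo the move
        temp *= alpha  # geometric cooling schedule
    return mapping, cur

costs = {"rx": 50, "classify": 120, "nat": 100, "tx": 50}
mapping, worst = sa_map(costs, n_pes=3)
print(mapping, worst)
```

With three PEs and these costs, the best achievable maximum load is 120 cycles (the heavy task alone on one PE); on such a tiny state space the annealing loop typically reaches it, though simulated annealing offers no optimality guarantee in general.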
FEADS therefore provides an effective solution for managing the growing complexity of designing network applications that meet line speeds.

Automating the ADAG generation module from a given application specification is left as future work. The framework can also be extended to the mapping, scheduling and performance evaluation of applications for other network processor families.

7. ACKNOWLEDGMENTS

We acknowledge the Consortium for Embedded and Internetworking Technologies (CEINT) and Arizona State University, Tempe, USA for the partial support provided through a research subcontract. We are thankful to Govind Shenoy for his help with the PN model, and to the members of the High Performance Computing laboratory for their useful suggestions and comments. Lastly, we are grateful to the reviewers for their useful comments.




