2 The Problem of Packet Classification

Packet classification is performed using a packet classifier, also called a policy database, flow classifier, or simply a classifier. A classifier is a collection of rules or policies. Each rule specifies a class† that a packet may belong to, based on some criterion on F fields of the packet header, and associates with each class an identifier, classID. This identifier uniquely specifies the action associated with the rule. Each rule has F components. The ith component of rule R, referred to as R[i], is a regular expression on the ith field of the packet header (in practice, the regular expression is limited by syntax to a simple address/mask or operator/number(s) specification). A packet P is said to match a particular rule R if, for all i, the ith field of the header of P satisfies the regular expression R[i]. It is convenient to think of a rule R as the set of all packet headers which could match R. When viewed in this way, two distinct rules are said to be either partially overlapping or non-overlapping, or one is a subset of the other, with corresponding set-related definitions. We will assume throughout this paper that when two rules are not mutually exclusive, the order in which they appear in the classifier determines their relative priority. For example, in a packet classifier that performs longest-prefix address lookups, each destination prefix is a rule, the corresponding next hop is its action, the pointer to the next hop is the associated classID, and the classifier is the whole forwarding table. If we assume that the forwarding table has longer prefixes appearing before shorter ones, the lookup is an example of the packet classification problem.

† For example, each rule in a flow classifier is a flow specification, where each flow is in a separate class.

2.1 Example of a Classifier

All examples used in this paper are classifiers from real ISP and enterprise networks. For privacy reasons, we have sanitized the IP addresses and other sensitive information so that the relative structure in the classifiers is still preserved.‡ First, we'll take a look at the data and its characteristics. Then we'll use the data to evaluate the packet classification algorithm described later.

‡ We wanted to preserve the properties of set relationship, e.g. inclusion, among the rules or their fields. A 32-bit IP address p0.p1.p2.p3 is sanitized as follows: (a) a random 32-bit number c0.c1.c2.c3 is first chosen; (b) a random permutation of the 256 numbers 0...255 is then generated to get perm[0..255]; (c) another random number s between 0 and 255 is generated (these randomly generated numbers are common for all the rules in the classifier); (d) the IP address with bytes perm[(p0 ^ c0 + 0*s) % 256], perm[(p1 ^ c1 + 1*s) % 256], perm[(p2 ^ c2 + 2*s) % 256] and perm[(p3 ^ c3 + 3*s) % 256] is then returned as the sanitized transformation of the original IP address, where ^ denotes the exclusive-or operation.

An example of some rules from a classifier is shown in Table 3. We collected 793 packet classifiers from 101 different ISP and enterprise networks, with a total of 41,505 rules. Each network provided up to ten separate classifiers for different services.††

†† In the collected data, the classifiers for different services are made up of one or more ACLs (access control lists). An ACL rule has only two types of actions, "deny" or "permit". In this discussion, we will assume that each ACL is a separate classifier, a common case in practice.

Table 3: Some example rules from one of the classifiers in our dataset.

    Network-layer             Network-layer            Transport-layer   Transport-layer
    Destination (addr/mask)   Source (addr/mask)       Destination       Protocol
    152.163.190.69/0.0.0.0    152.163.80.11/0.0.0.0    *                 *
    152.168.3.0/0.0.0.255     ...                      eq www            udp
    152.168.3.0/0.0.0.255     ...                      range 20-21       udp
    152.168.3.0/0.0.0.255     ...                      eq www            tcp
    152.163.198.4/0.0.0.0     ...                      gt 1023           tcp
    152.163.198.4/0.0.0.0     152.163.36.0/0.0.0.255   gt 1023           tcp

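As an illustration, here is a minimal C sketch of the sanitization transform described in the footnote above. The permutation table, the per-classifier constant c0.c1.c2.c3 and the stride s are assumed to have been generated at random once; the helper names and the operator grouping (XOR before adding i*s) are our reading of the footnote, not the authors' code.

```c
#include <stdint.h>
#include <stdio.h>

/* Sanitize one IPv4 address p0.p1.p2.p3 using a byte permutation perm[256],
 * a random 32-bit constant c0.c1.c2.c3 and a random stride s (0..255).
 * These are generated once and shared by every rule in a classifier, so
 * set relationships (e.g. inclusion) among addresses are preserved. */
uint32_t sanitize_ip(uint32_t addr, const uint8_t perm[256],
                     const uint8_t c[4], uint8_t s)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t p = (uint8_t)(addr >> (24 - 8 * i));           /* byte p_i of the address */
        uint8_t t = (uint8_t)((p ^ c[i]) + (uint8_t)(i * s));  /* (p_i XOR c_i) + i*s, mod 256 */
        out = (out << 8) | perm[t];
    }
    return out;
}

int main(void)
{
    uint8_t perm[256], c[4] = {0x3c, 0xa1, 0x55, 0x0e};  /* stand-in random constants */
    for (int i = 0; i < 256; i++)
        perm[i] = (uint8_t)(255 - i);                    /* stand-in permutation */
    uint32_t ip = (137u << 24) | (98u << 16) | (217u << 8) | 5u;
    printf("%08x -> %08x\n", ip, sanitize_ip(ip, perm, c, 17));
    return 0;
}
```
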


We found the classifiers to have the following characteristics:

1) The classifiers do not contain a large number of rules. Only 0.7% of the classifiers contain more than 1000 rules, with a mean of 50 rules. The distribution of the number of rules in a classifier is shown in Figure 2. The relatively small number of rules per classifier should not come as a surprise: in most networks today, the rules are configured manually by network operators, and it is a non-trivial task to ensure correct behavior.

2) The syntax allows a maximum of 8 fields to be specified: source/destination Network-layer address (32 bits), source/destination Transport-layer port numbers (16 bits for TCP and UDP), Type-of-service (TOS) field (8 bits), Protocol field (8 bits), and Transport-layer protocol flags (8 bits), with a total of 120 bits. 17% of all rules had 1 field specified, 23% had 3 fields specified and 60% had 4 fields specified.†

3) The Transport-layer protocol field is restricted to a small set of values: in all the packet classifiers we examined, it contained only TCP, UDP, ICMP, IGMP, (E)IGRP, GRE and IPINIP, or the wildcard '*', i.e. the set of all transport protocols.

4) The Transport-layer fields have a wide variety of specifications. Many (10.2%) of them are range specifications (e.g. of the type gt 1023, i.e. greater than 1023, or range 20-24). In particular, the specification 'gt 1023' occurs in about 9% of the rules. This has an interesting consequence. It has been suggested in the literature (for example in [17]) that, to ease implementation, ranges could be represented by a series of prefixes. In that case, this common range would require six separate prefixes (1024-2047, 2048-4095, 4096-8191, 8192-16383, 16384-32767, 32768-65535), resulting in a large increase in the size of the classifier (a short sketch of this expansion appears after this list).

5) About 14% of all the classifiers had a rule with a non-contiguous mask (10.2% of all rules had non-contiguous masks). For example, a rule which has a Network-layer address/mask specification of 137.98.217.0/8.22.160.80 has a non-contiguous mask (i.e. not a simple prefix) in its specification. This observation came as a surprise. One suggested reason is that some network operators choose a specific numbering/addressing scheme for their routers. It tells us that a packet classification algorithm cannot always rely on Network-layer addresses being prefixes.

6) It is common for many different rules in the same classifier to share a number of field specifications (compare, for example, the entries in Table 3). These arise because a network operator frequently wants to specify the same policy for a pair of communicating groups of hosts or subnetworks (e.g. deny every host in group1 access to any host in group2). Hence, given a simple address/mask syntax specification, a separate rule is commonly written for each pair in the two (or more) groups. We will see later how we make use of this observation in our algorithm.

7) We found that 8% of the rules in the classifiers were redundant. We say that a rule R is redundant if one of the following conditions holds:
(a) There exists a rule T appearing earlier than R in the classifier such that R is a subset of T. Thus, no packet will ever match R,‡ and R is redundant. We call this backward redundancy; 4.4% of the rules were backward redundant.
(b) There exists a rule T appearing after R in the classifier such that (i) R is a subset of T, (ii) R and T have the same actions, and (iii) for each rule V appearing in between R and T in the classifier, either V is disjoint from R, or V has the same action as R. We call this forward redundancy; 3.6% of the rules were forward redundant. In this case, R can be eliminated to obtain a new, smaller classifier. A packet matching R in the original classifier will match T in the new classifier, but with the same action.

† If a field is not specified, the wildcard specification is assumed. Note that this is affected by the syntax of the rule specification language.
‡ Recall that rules are prioritized in order of their appearance in the classifier.

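To make the expansion in item 4 concrete, the following C sketch (ours, not part of the paper) greedily splits a port range into maximal aligned power-of-two blocks, each of which corresponds to one prefix; for the range 1024-65535 it prints exactly the six blocks listed above.

```c
#include <stdio.h>
#include <stdint.h>

/* Greedy decomposition of the closed range [lo, hi] into maximal aligned
 * power-of-two blocks; each block corresponds to one prefix on a port field. */
static void range_to_prefixes(uint32_t lo, uint32_t hi)
{
    while (lo <= hi) {
        uint32_t size = 1;
        /* grow the block while it stays aligned at lo and inside [lo, hi] */
        while ((lo & (size * 2 - 1)) == 0 && lo + size * 2 - 1 <= hi)
            size *= 2;
        printf("%u-%u\n", (unsigned)lo, (unsigned)(lo + size - 1));
        lo += size;
    }
}

int main(void)
{
    range_to_prefixes(1024, 65535);   /* 'gt 1023' -> six prefix blocks */
    return 0;
}
```
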
Figure 2: The distribution of the total number of rules per classifier. Note the logarithmic scale on both axes.

3 Goals

In this section, we highlight our objectives when designing a packet-classification scheme:

1) The algorithm should be fast enough to operate at OC48c line rates (2.5 Gb/s) and preferably at OC192c line rates (10 Gb/s). For applications requiring a deterministic classification time, we need to classify 7.8 million packets/s and 31.2 million packets/s respectively (assuming a minimum-length IP datagram of 40 bytes). In some applications, an algorithm that performs well in the average case may be acceptable, using a queue prior to the classification engine. For these applications, we need to classify 0.88 million packets/s and 3.53 million packets/s respectively (assuming an average Internet packet size of 354 bytes [18]). The arithmetic behind these packet rates is reproduced after this list.

2) The algorithm should ideally allow matching on arbitrary fields, including Link-layer, Network-layer, Transport-layer and - in some exceptional cases - the Application-layer headers.†† It makes sense for the algorithm to optimize for the commonly used header fields, but it should not preclude the use of other header fields.

3) The algorithm should support general classification rules, including prefixes, operators (like range, less than, greater than, equal to, etc.) and wildcards. Non-contiguous masks may be required.

4) The algorithm should be suitable for implementation in both hardware and software. Thus it needs to be fairly simple to allow high-speed implementations. For the highest speeds (e.g. for OC192c at this time), we expect that a hardware implementation will be necessary, and therefore the scheme should be amenable to pipelined implementation.

5) Even though memory prices have continued to fall, the memory requirements of the algorithm should not be prohibitively large. Furthermore, when implemented in software, a memory-efficient algorithm can make use of caches. When implemented in hardware, the algorithm can benefit from faster on-chip memory.

6) The algorithm should scale in terms of both memory and speed with the size of the classifier. An example from the destination routing lookup problem is the popular multiway trie [16], which can, in the worst case, require enormous amounts of memory, but performs very well and with much smaller storage on real-life routing tables. We believe it to be important to evaluate algorithms with realistic data sets.

7) In this paper, we will assume that classifiers change infrequently (e.g. when a new classifier is manually added, or at router boot time). When this assumption holds, an algorithm could employ reasonably static data structures. Thus, a preprocessing time of several seconds may be acceptable to calculate the data structures. Note that this assumption may not hold in some applications, such as when routing tables are changing frequently, or when fine-granularity flows are dynamically or automatically allocated.

†† That is why packet-classifying routers have been called "layer-less switches".

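The packet rates in goal 1 follow directly from the line rates and packet sizes; the small C fragment below simply reproduces that arithmetic.

```c
#include <stdio.h>

/* Packets per second required at a given line rate for a given packet size. */
static double pkts_per_sec(double bits_per_sec, double bytes_per_pkt)
{
    return bits_per_sec / (bytes_per_pkt * 8.0);
}

int main(void)
{
    /* Worst case: 40-byte minimum-length IP datagrams */
    printf("OC48c  (2.5 Gb/s), 40B packets:  %.1f Mpps\n", pkts_per_sec(2.5e9, 40) / 1e6);  /* ~7.8  */
    printf("OC192c (10 Gb/s),  40B packets:  %.1f Mpps\n", pkts_per_sec(10e9, 40) / 1e6);   /* ~31.2 */
    /* Average case: 354-byte average Internet packet [18] */
    printf("OC48c  (2.5 Gb/s), 354B packets: %.2f Mpps\n", pkts_per_sec(2.5e9, 354) / 1e6); /* ~0.88 */
    printf("OC192c (10 Gb/s),  354B packets: %.2f Mpps\n", pkts_per_sec(10e9, 354) / 1e6);  /* ~3.53 */
    return 0;
}
```
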


Later, we describe a simple heuristic algorithm called RFC† (Recursive Flow Classification) that seems to work well with a selection of classifiers in use today. It appears practical to use the classification scheme for OC192c rates in hardware and up to OC48c rates in software. However, it runs into problems with space and preprocessing time for big classifiers (more than 6000 rules with 4 fields). We describe an optimization which decreases the storage requirements of the basic RFC scheme and enables it to handle a classifier with 15,000 rules with 4 fields in less than 4MB.

† Not to be confused with "Request For Comments".

4 Previous Work

We start with the simplest classification algorithm: for each arriving packet, evaluate each rule sequentially until a rule is found that matches all the headers of the packet. While simple and efficient in its use of memory, this classifier clearly has poor scaling properties; the time to perform a classification grows linearly with the number of rules.

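A minimal C sketch of this baseline, assuming for simplicity that every field specification has already been reduced to a closed range [lo, hi] (prefixes and operators such as 'gt 1023' can be expressed this way per field, though non-contiguous masks cannot); the structure names are ours.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_FIELDS 4

struct field_spec { uint32_t lo, hi; };          /* matches any value in [lo, hi]   */
struct rule { struct field_spec f[NUM_FIELDS]; int class_id; };

/* Linear search: return the classID of the first (highest-priority) rule
 * matched by the packet, or -1 if none matches.  Classification time grows
 * linearly with the number of rules. */
int classify_linear(const struct rule *rules, size_t n,
                    const uint32_t pkt[NUM_FIELDS])
{
    for (size_t i = 0; i < n; i++) {
        int match = 1;
        for (int j = 0; j < NUM_FIELDS && match; j++)
            if (pkt[j] < rules[i].f[j].lo || pkt[j] > rules[i].f[j].hi)
                match = 0;
        if (match)
            return rules[i].class_id;
    }
    return -1;
}
```
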
A hardware-only algorithm could employ a ternary CAM (content-addressable memory). Ternary CAMs store words with three-valued digits: '0', '1' or 'X' (wildcard). The rules are stored in the CAM array in order of decreasing priority. Given a packet header to classify, the CAM performs a comparison against all of its entries in parallel, and a priority encoder selects the first matching rule. While simple and flexible, CAMs are currently suitable only for small tables; they are too expensive, too small and consume too much power for large classifiers. Furthermore, some operators are not directly supported, and so the memory array may be used very inefficiently. For example, the rule 'gt 1023' requires six array entries to be used. But with continued improvements in semiconductor technology, large ternary CAMs may become viable in the future.

A solution called 'Grid of Tries' was proposed in [17]. In this scheme, the trie data structure is extended to two fields. This is a good solution if the filters are restricted to only two fields, but it is not easily extendible to more fields. In the same paper, a general solution for multiple fields called 'Crossproducting' is described. It takes about 1.5MB for 50 rules, and for bigger classifiers the authors propose a caching technique (on-demand crossproducting) with a non-deterministic classification time.

Another recent proposal [15] describes a scheme optimized for implementation in hardware. Employing bit-level parallelism to match multiple fields concurrently, the scheme is reported to support up to 512 rules, classifying one million packets per second with an FPGA device and five 1Mbit SRAMs. As described, the scheme examines five header fields in parallel and uses bit-level parallelism to complete the operation. In the basic scheme, the memory storage is found to scale quadratically and the memory bandwidth linearly with the size of the classifier. A variation is described that decreases the space requirement at the expense of higher execution time. In the same paper, the authors describe an algorithm for the special case of two fields, with one field containing only intervals created by prefixes. This takes O(number-of-prefix-lengths + log n) time and O(n) space for a classifier with n rules.

There are several standard problems in the field of computational geometry that resemble packet classification. One example is the point location problem in multidimensional space, i.e. the problem of finding the enclosing region of a point, given a set of regions. However, the regions are assumed to be non-overlapping (as opposed to our case). Even for non-overlapping regions, the best bounds for n rules and F fields, for F > 3, are O(log n) time with O(n^F) space, or O(log^(F-1) n) time and O(n) space [6]. Clearly this is impractical: with just 100 rules and 4 fields, n^F space is about 100MB, and log^(F-1) n time is about 350 memory accesses.

5 Proposed Algorithm RFC (Recursive Flow Classification)

5.1 Structure of the Classifiers

As the last example above illustrates, the task of classification is extremely complex in the worst case. However, we can expect there to be structure in the classifiers which, if exploited, can simplify the task of the classification algorithm.

To illustrate the structure we found in our dataset, let's start with an example in just two dimensions, as shown in Figure 3. We can represent a classifier with two fields (e.g. source and destination prefixes) in 2-dimensional space, where each rule is represented by a rectangle. Figure 3a shows three such rectangles, where each rectangle represents a rule with a range of values in each dimension. The classifier contains three explicitly defined rules, and the default (background) rule. Figure 3b shows how three rules can overlap to create five regions (each region is shaded differently), and Figure 3c shows three rules creating seven regions.

A classification algorithm must keep a record of each region and be able to determine the region to which each newly arriving packet belongs. Intuitively, the more regions the classifier contains, the more storage is required, and the longer it takes to classify a packet.

Even though the number of rules is the same in each figure, the task of the classification algorithm becomes progressively harder as it needs to distinguish more regions. In general, it can be shown that the number of regions created by n rules in F dimensions can be as much as O(n^F).

We analyzed the structure in our dataset to determine the number of overlapping regions, and we found it to be considerably smaller than the worst case. Specifically, for the biggest classifier with 1734 rules, we found the number of distinct overlapping regions in four dimensions to be 4316, compared to a worst possible case of approximately 10^13 (roughly 1734^4). Similarly, we found the number of overlaps to be relatively small in each of the classifiers we looked at. As we will see, our algorithm will exploit this structure to simplify its task.



Figure 3: Some possible arrangements of three rectangles: (a) 4 regions, (b) 5 regions, (c) 7 regions.

5.2 Algorithm

The problem of packet classification can be viewed as one of mapping S bits in the packet header to T bits of classID (where T = log N, and T is much smaller than S).



The 24 bits used to express the Transport-layer Destination and Transport-layer Protocol (chunks #6 and #4 respectively) are reduced to just three bits by Phases 0 and 1 of the RFC algorithm. We start with chunk #6, which contains the 16-bit Transport-layer Destination. The corresponding column in Table 3 partitions the possible values into four sets: (a) {www = 80}, (b) {20, 21}, (c) {>1023}, (d) {all remaining numbers in the range 0-65535}, which can be encoded using two bits, 00b through 11b. We call these two-bit values the "equivalence class IDs" (eqIDs). So, in Phase 0 of the RFC algorithm, the memory corresponding to chunk #6 is indexed using the 2^16 different values of chunk #6. In each location, we place the eqID for this Transport-layer Destination. For example, the value in the memory corresponding to "chunk #6 = 20" is 00b, corresponding to the set {20, 21}. In this way, a 16-bit to two-bit reduction is obtained for chunk #6 in Phase 0. Similarly, the eight-bit Transport-layer Protocol column in Table 3 consists of three sets: (a) {tcp}, (b) {udp}, (c) {all remaining numbers in the range 0-255}, which can be encoded using two-bit eqIDs. And so chunk #4 undergoes an eight-bit to two-bit reduction in Phase 0.

In the second phase, we consider the combination of the Transport-layer Destination and Protocol fields. From Table 3 we can see that the five sets are: (a) {({80}, {udp})}, (b) {({20-21}, {udp})}, (c) {({80}, {tcp})}, (d) {({gt 1023}, {tcp})}, (e) {all remaining crossproducts}, which can be represented using three-bit eqIDs. The index into the memory in Phase 1 is constructed from the two two-bit eqIDs from Phase 0 (in this case, by concatenating them). Hence, in Phase 1 we have reduced the number of bits from four to three. If we now consider the combination of both Phase 0 and Phase 1, we find that 24 bits have been reduced to just three bits.

In what follows, we will use the term "Chunk Equivalence Set" (CES) to denote such a set; e.g. each of the three sets (a) {tcp}, (b) {udp}, (c) {all remaining numbers in the range 0-255} is said to be a CES, because if two packets have protocol values lying in the same set and otherwise identical headers, the rules of the classifier do not distinguish between them.

Each CES can be constructed in the following manner:

First Phase (Phase 0): Consider a fixed chunk of size b bits, and those component(s) of the rules in the classifier corresponding to this chunk. Project the rules in the classifier onto the number line [0, 2^b - 1]. Each component projects onto a set of (not necessarily contiguous) intervals on the number line. The end points of all the intervals projected by these components form a set of non-overlapping intervals. Two points in the same interval always belong to the same equivalence set. Also, two intervals are in the same equivalence set if exactly the same rules project onto them.

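A compact C sketch of this Phase 0 construction (the paper's own pseudocode for the same step is Figure 19 in the Appendix). It assumes each rule's component for the chunk has already been reduced to a single closed interval and that there are at most 64 rules, so a class bitmap fits in one 64-bit word; a brute-force scan over all 2^b values is used rather than the end-point scan of Figure 19.

```c
#include <stdint.h>

/* One rule's projection onto this chunk, as a closed interval [lo, hi]. */
struct interval { uint32_t lo, hi; };

/* Build the Phase 0 table for a b-bit chunk (b <= 16 here): table[v] is the
 * eqID of value v.  cbms[] records the class bitmap of each equivalence
 * class and must be large enough to hold one entry per distinct class;
 * the number of distinct classes is returned. */
int build_phase0(int b, const struct interval *rule_iv, int nrules,
                 uint16_t *table, uint64_t *cbms)
{
    int neq = 0;
    for (uint32_t v = 0; v < (1u << b); v++) {
        uint64_t cbm = 0;                        /* which rules match value v */
        for (int r = 0; r < nrules; r++)
            if (v >= rule_iv[r].lo && v <= rule_iv[r].hi)
                cbm |= 1ull << r;
        int id = -1;                             /* reuse the eqID if this CBM was seen before */
        for (int e = 0; e < neq; e++)
            if (cbms[e] == cbm) { id = e; break; }
        if (id < 0) { cbms[neq] = cbm; id = neq++; }
        table[v] = (uint16_t)id;
    }
    return neq;
}
```
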
As an example, consider chunk #6 (the Destination port) of the classifier in Table 3. The end-points of the intervals (I0..I4) and the constructed equivalence sets (E0..E3) are shown in Figure 7. The RFC table for this chunk is filled with the corresponding eqIDs. Thus, in this example, table(20) = 00b, table(23) = 11b, etc. The pseudocode for computing the eqIDs in Phase 0 is shown in Figure 19 in the Appendix.

Figure 7: An example of computing the four equivalence classes E0...E3 for chunk #6 (corresponding to the 16-bit Transport-layer destination port number) in the classifier of Table 3, with interval end-points at 20, 21, 80 and 1023:
    E0 = {20, 21}
    E1 = {80}
    E2 = {1024-65535}
    E3 = {0-19, 22-79, 81-1023}

To facilitate the calculation of the eqIDs for subsequent RFC phases, we assign a class bitmap (CBM) for each CES indicating which rules in the classifier contain this CES for the corresponding chunk. This bitmap has one bit for each rule in the classifier. For example, E0 in Figure 7 will have the CBM 101000, indicating that the first and the third rules of the classifier in Table 3 contain E0 in chunk #6. Note that the class bitmap is not physically stored in the lookup table: it is just used to facilitate the calculation of the stored eqIDs by the preprocessing algorithm.

Subsequent Phases: A chunk in a subsequent phase is formed by a combination of two (or more) chunks obtained from memory lookups in previous phases, with a corresponding CES. If, for example, the resulting chunk is of width b bits, we again create equivalence sets such that two b-bit numbers that are not distinguished by the rules of the classifier belong to the same CES. Thus, (20, udp) and (21, udp) will be in the same CES in the classifier of Table 3 in Phase 1. To determine the new equivalence sets for this phase, we compute all possible intersections of the equivalence sets from the previous phases being combined. Each distinct intersection is an equivalence set for the newly created chunk. The pseudocode for this preprocessing is shown in Figure 20 of the Appendix.

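In the same hypothetical style as the previous sketch, the fragment below shows how the table for a chunk in a later phase could be built from two parent chunks by intersecting their class bitmaps; it mirrors Figure 20 in the Appendix, and the flat index layout (a*neq_b + b) corresponds to the multiplication/addition indexing used later in the lookup.

```c
#include <stdint.h>

/* Combine two parent chunks: parent A has neq_a equivalence classes with
 * class bitmaps cbm_a[], parent B likewise.  Each (a, b) pair of parent
 * eqIDs becomes one entry of the new table, indexed as a*neq_b + b; pairs
 * whose intersected bitmaps are equal share an eqID.  At most 64 rules,
 * so a CBM fits in a uint64_t. */
int build_next_phase(const uint64_t *cbm_a, int neq_a,
                     const uint64_t *cbm_b, int neq_b,
                     uint16_t *table, uint64_t *cbm_out)
{
    int neq = 0;
    for (int a = 0; a < neq_a; a++) {
        for (int b = 0; b < neq_b; b++) {
            uint64_t cbm = cbm_a[a] & cbm_b[b];   /* rules matching both parent classes */
            int id = -1;
            for (int e = 0; e < neq; e++)
                if (cbm_out[e] == cbm) { id = e; break; }
            if (id < 0) { cbm_out[neq] = cbm; id = neq++; }
            table[a * neq_b + b] = (uint16_t)id;
        }
    }
    return neq;
}
```
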
5.3 A Simple Complete Example of RFC

Realizing that the preprocessing steps are involved, we present a complete example of a classifier, showing how the RFC preprocessing is performed to determine the contents of the memories, and how a packet can be looked up as part of the RFC operation. The example is shown in Figure 22 in the Appendix. It is based on the 4-field classifier of Table 6, also in the Appendix.

6 Implementation Results

In this section, we consider how the RFC algorithm can be implemented and how it performs. First, we consider the complexity of preprocessing and the resulting storage requirements. Then we consider the lookup performance to determine the rate at which packets can be classified.

6.1 RFC Preprocessing

With our classifiers, we choose to split the 32-bit Network-layer source and destination address fields into two 16-bit chunks each. These are chunks #0,1 and #2,3 respectively. As we found a maximum of four fields in our classifiers, this means that Phase 0 of RFC has six chunks: chunk #4 corresponds to the protocol and protocol-flags field, and chunk #5 corresponds to the Transport-layer Destination field.

Figure 8: Two example reduction trees for P=3 RFC phases.
Figure 9: Two example reduction trees for P=4 RFC phases.

The performance of RFC can be tuned with two parameters: (i) the number of phases, P, that we choose to use, and (ii) given a value of P, the reduction tree used. For instance, two of the several possible reduction trees for P=3 and P=4 are shown in Figure 8 and Figure 9 respectively. (For P=2, there is only one reduction tree possible.) When there is more than one reduction tree possible for a given value of P, we choose a tree based on two heuristics: (i) we combine those chunks together which have the most "correlation", e.g. we combine the two 16-bit chunks of the Network-layer source address in the earliest phase possible, and (ii) we combine as many chunks as we can without causing unreasonable memory consumption. Following these heuristics, we find that the "best" reduction tree for P=3 is tree-B in Figure 8, and the "best" reduction tree for P=4 is tree-A in Figure 9.

Figure 10: The RFC storage requirements in Megabytes for two phases using the classifiers available to us. This special case of RFC is identical to the Crossproducting method of [17].
Figure 11: The RFC storage requirements in Kilobytes for three phases using the classifiers available to us. The reduction tree used is tree-B in Figure 8.

Now, let us look at the performance of RFC using our set of classifiers. Our first goal is to keep the total amount of memory reasonably small. The memory requirements for each of our classifiers are plotted in Figure 10, Figure 11 and Figure 12 for P=2, 3 and 4 phases respectively. The graphs show how the memory usage increases with the number of rules in each classifier. For practical purposes, it is assumed that memory is only available in widths of 8, 12 or 16 bits. Hence, an eqID requiring 13 bits is assumed to occupy 16 bits in the RFC table.

As we might expect, the graphs show that as we increase the number of phases from three to four, we require a smaller total amount of memory. However, this comes at the expense of two additional memory accesses, illustrating the trade-off between memory consumption and lookup time in RFC. Our second goal is to keep the preprocessing time small. And so in Figure 13 we plot the preprocessing time required for both three and four phases of RFC.†

† The case P=2 is not plotted: it was found to take hours of preprocessing time because of the unwieldy size of the RFC tables.

The graphs indicate that, for these classifiers, RFC is suitable if (and only if) the rules change relatively slowly; for example, not more than once every few seconds. Thus, it may be suitable in environments where rules are changed infrequently, for example if they are added manually, or when a router reboots.



Figure 12: The RFC storage requirements in Kilobytes for four phases using the classifiers available to us. The reduction tree used is tree-A in Figure 9.
Figure 13: The preprocessing times for three and four phases in seconds, using the set of classifiers available to us. This data is taken by running the RFC preprocessing code on a 333MHz Pentium-II PC running the Linux operating system.

For applications where the tables change more frequently, it may be possible to make incremental changes to the tables. This is a subject requiring further investigation.

Finally, note that there are some similarities between the RFC algorithm and the bit-level parallel scheme in [15]; each distinct bitmap in [15] corresponds to a CES in the RFC algorithm. Also, note that when there are just two phases, RFC corresponds to the crossproducting method described in [17].

6.2 RFC Lookup Performance

The RFC lookup operation can be performed both in hardware and in software.† We will discuss the two cases separately, exploring the lookup performance in each case.

† Note that the RFC preprocessing is always performed in software.

Figure 14: An example hardware design for RFC with three phases. The latches for holding data in the pipeline and the on-chip control logic are not shown. This design achieves OC192 rates in the worst case for 40-byte packets, assuming that the phases are pipelined with 4 clock cycles (at a 125MHz clock rate) per pipeline stage.

Hardware

An example hardware implementation for the tree tree-B in Figure 8 (three phases) is illustrated in Figure 14 for four fields (six chunks in Phase 0). This design is suitable for all the classifiers in our dataset, and uses two 4Mbit SRAMs and two 4-bank 64Mbit SDRAMs [19] clocked at 125 MHz.‡ The design is pipelined such that a new lookup may begin every four clock cycles. The pipelined RFC lookup proceeds as follows:

1) Pipeline Stage 0: Phase 0 (clock cycles 0-3): In the first three clock cycles, three accesses are made to the two SRAM devices in parallel to yield the six eqIDs of Phase 0. In the fourth clock cycle, the eqIDs from Phase 0 are combined to compute the two indices for the next phase.

2) Pipeline Stage 1: Phase 1 (clock cycles 4-7): The SDRAM devices can be accessed every two clock cycles, but we assume that a given bank can be accessed again only after eight clock cycles. By keeping the two memories for Phase 1 in different banks of the SDRAM, we can perform the Phase 1 lookups in four clock cycles. The data is replicated in the other two banks (i.e. two banks of memory hold a fully redundant copy of the lookup tables for Phase 1). This allows Phase 1 lookups to be performed on the next packet as soon as the current packet has completed. In this way, any given bank is accessed once every eight clock cycles.

3) Pipeline Stage 2: Phase 2 (clock cycles 8-11): Only one lookup is to be made. The operation is otherwise identical to Phase 1.

‡ These devices are in production in industry at the time of writing this paper. In fact, even bigger and faster devices are available today - see [19].

Hence, we can see that approximately 30 million packets can be classified per second (to be exact, 31.25 million packets per second with a 125MHz clock) with a hardware cost of approximately $50.† This is fast enough to process minimum-length packets at the OC192c rate.

† Under the assumption that SDRAMs are now available at $1.50 per megabyte, and SRAMs are $12 for a 4Mbyte device.

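The 31.25 million packets per second figure quoted above is simply the 125MHz clock divided by the four clock cycles per pipelined lookup, which exactly matches the worst-case OC192c arrival rate for 40-byte packets; a two-line check:

```c
#include <stdio.h>

int main(void)
{
    double lookups = 125e6 / 4.0;        /* one pipelined lookup every 4 cycles at 125 MHz */
    double oc192c  = 10e9 / (40 * 8.0);  /* 40-byte packets arriving at 10 Gb/s */
    printf("pipeline: %.2f Mpps, OC192c worst case: %.2f Mpps\n",
           lookups / 1e6, oc192c / 1e6); /* both print 31.25 */
    return 0;
}
```
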


Software

Figure 21 (Appendix) provides pseudocode to perform RFC lookups. When written in 'C', approximately 30 lines of code are required to implement RFC. When compiled on a 333MHz Pentium-II PC running Windows NT, we found that the worst-case path for the code took (140 clks + 9*tm) for three phases, and (146 clks + 11*tm) for four phases, where tm is the memory access time.‡ With tm = 60ns, this corresponds to 0.98 μs and 1.1 μs for three and four phases respectively. This implies RFC can perform close to one million packets per second in the worst case for our classifiers. The average lookup time was found to be approximately 50% faster than the worst case: Table 4 shows the average time taken per lookup for 100,000 randomly generated packets for some classifiers.

‡ The performance of the lookup code was analyzed using VTune [20], an Intel performance analyzer for processors of the Pentium family.

Table 4: Average lookup times for 100,000 randomly generated packets.

    Number of Rules in Classifier    Average Time per Lookup (ns)
    39                               587
    113                              582
    646                              668
    827                              611
    1112                             733
    1734                             621

The pseudocode in Figure 21 calculates the indices into each memory using multiplication/addition operations on eqIDs from previous phases. Alternatively, the indices can be computed by simple concatenation. This has the effect of increasing the memory consumed because the tables are then not as tightly packed. Given the simpler processing, we might expect the classification time to decrease at the expense of increased memory usage. Indeed the memory consumed grows approximately two-fold on the classifiers. Surprisingly, we saw no significant reduction in classification times. We believe that this is because the processing time is dominated by memory access time as opposed to the CPU cycle time.

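The following C fragment sketches the lookup loop just described, computing each index by multiplication/addition on the parent eqIDs; the table layout (one flat entry array per chunk plus a per-chunk parent list) is our own framing of the Figure 21 pseudocode, not the authors' 30-line implementation.

```c
#include <stdint.h>

#define MAX_CHUNKS  8
#define MAX_PARENTS 6

struct chunk_table {
    const uint16_t *entries;           /* the RFC table for this chunk            */
    int nparents;                      /* number of parent chunks                 */
    int parent_phase[MAX_PARENTS];     /* which phase each parent lives in        */
    int parent_chunk[MAX_PARENTS];     /* which chunk within that phase           */
    int parent_neq[MAX_PARENTS];       /* #eqIDs of each parent (for indexing)    */
};

/* phases[p][c] describes chunk c of phase p; nchunks[p] is the chunk count of
 * phase p.  pkt[] holds the Phase 0 chunk values taken directly from the
 * packet header; eq[p][c] receives the eqID produced by each table lookup. */
uint16_t rfc_lookup(const struct chunk_table *const *phases, const int *nchunks,
                    int nphases, const uint32_t *pkt, uint16_t eq[][MAX_CHUNKS])
{
    for (int c = 0; c < nchunks[0]; c++)                 /* Phase 0: index by header bits */
        eq[0][c] = phases[0][c].entries[pkt[c]];

    for (int p = 1; p < nphases; p++) {
        for (int c = 0; c < nchunks[p]; c++) {
            const struct chunk_table *t = &phases[p][c];
            uint32_t indx = eq[t->parent_phase[0]][t->parent_chunk[0]];
            for (int i = 1; i < t->nparents; i++)        /* combine parent eqIDs */
                indx = indx * (uint32_t)t->parent_neq[i]
                     + eq[t->parent_phase[i]][t->parent_chunk[i]];
            eq[p][c] = t->entries[indx];
        }
    }
    return eq[nphases - 1][0];                           /* final eqID is the classID */
}
```
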
6.3 Larger Classifiers

As we have seen, RFC performs well on the real-life classifiers available to us. But how will RFC perform with larger classifiers that might appear in the future? Unfortunately, it is difficult to accurately predict the memory consumption of RFC as a function of the size of the classifier: the performance of RFC is determined by the structure present in the classifier. With pathological sets of rules, RFC could scale geometrically with the number of rules. Fortunately, such cases do not seem to appear in practice.

To estimate how RFC might perform with future, larger classifiers, we synthesized large artificial classifiers. We used two different ways to create large classifiers (given the importance of the structure, it did not seem meaningful to generate rules randomly):

1) A large classifier can be created by concatenating the classifiers belonging to the same network, and treating the result as a single classifier. Effectively, this means merging together the individual classifiers for different services. Such an implementation is actually desirable in scenarios where the designer may not want more than one set of RFC tables for the whole network. In such cases, the classID obtained would have to be combined with some other information (such as a classifier ID) to obtain the correct intended action. By only concatenating classifiers from the same network, we were able to create classifiers of up to 3,896 rules. For each classifier created, we performed RFC with both three and four phases. The results are shown in Figure 15.

2) To create even larger classifiers, we concatenated all the classifiers of a few (up to ten) different networks. The performance of RFC with four phases is plotted as the 'Basic RFC' curve in Figure 18. We found that RFC frequently runs into storage problems for classifiers with more than 6000 rules. Employing more phases does not help, as we must combine at least two chunks in every phase and end up with one chunk in the final phase.†† An alternative way to process large classifiers would be to split them into two (or more) parts and construct separate RFC tables for each part. This would of course come at the expense of doubling the number of memory accesses.‡‡

†† With six chunks in Phase 0, we could have increased the number of phases to a maximum of six. However, we found no appreciable improvement by doing so.
‡‡ Actually, for Phase 0, we need not look up memory twice for the same chunk if we use wide memories. This would help us access the contents of both the RFC tables in one memory access.

Figure 15: The memory consumed by RFC for three and four phases on classifiers created by merging all the classifiers of one network.

7 Variations

Several variations and improvements of RFC are possible. First, it should be easy to see how RFC can be extended to process a larger number of fields in each packet header.

Second, we can possibly speed up RFC by taking advantage of available fast lookup algorithms that find longest matching prefixes in one field. Note that in our examples, we use three memory accesses each for the source and destination Network-layer address lookups during the first two phases of RFC. This is necessary because of the considerable number of non-contiguous address/mask specifications. In the event that only prefixes are present in the specification, one can use a more sophisticated and faster technique for looking up in one dimension, e.g. one of the methods described in [1], [5], [7], [9] or [16].

Third, we can employ a technique described below to reduce the memory requirements when processing large classifiers.



7.1 Adjacency Groups

Since the size of the RFC tables depends on the number of chunk equivalence classes, we focus our efforts on trying to reduce this number. This we do by merging two or more rules of the original classifier, as explained below. We find that each additional phase of RFC further increases the amount of compaction possible on the original classifier.

First we define some notation. We call two distinct rules R and S, with R appearing first in the classifier, adjacent in dimension 'i' if all of the following three conditions hold: (1) they have the same action, (2) all but the ith field have the exact same specification in the two rules, and (3) all rules appearing in between R and S in the classifier have either the same action or are disjoint from R. Two rules are said to be simply adjacent if they are adjacent in some dimension. Thus the second and third rules in the classifier of Table 3 are adjacent† in the dimension corresponding to the Transport-layer Destination field. Similarly, the fifth rule is adjacent to the sixth but not to the fourth. Once we have determined that two rules R and S are adjacent, we merge them to form a new rule T with the same action as R (or S). It has the same specifications as that of R (or S) for all the fields except the ith, which is simply the logical-OR of the ith field specifications of R and S. The third condition above ensures that the relative priority of the rules in between R and S will not be affected by this merging.

† Adjacency can also be looked at this way: treat each rule with F fields as a boolean expression of F (multi-valued) variables. Initially each rule is a conjunction, i.e. a logical-AND, of these variables. Two rules are defined to be adjacent if they are adjacent vertices in the F-dimensional hypercube created by the symbolic representation of the F fields.

An adjacency group (adjGrp) is defined recursively as follows: (1) every rule in the original classifier is an adjacency group, and (2) every merged rule which is created by merging two or more adjacency groups is an adjacency group.

We compact the classifier as follows. Initially, every rule is in its own adjGrp. Next, we combine adjacent rules to create a new, smaller classifier. One simple way of doing this is to iterate over all fields in turn, checking for adjacency in each dimension. After these iterations are completed, the resulting classifier will have no more adjacent rules. We do similar merging of adjacent rules after each RFC phase. As each RFC phase collapses some dimensions, groups which were not adjacent in earlier phases may become so in later stages. In this way, the number of adjGrps, and hence the size of the classifier, keeps on decreasing with every phase. An example of this operation is shown in Figure 17.

Note that there is absolutely no change in the actual lookup operation: the equivalence IDs are now simply pointers to bitmaps which keep track of adjacency groups rather than the original rules.

Figure 16: The memory consumed with three phases with the adjGrp optimization enabled on the large classifiers created by concatenating all the classifiers of one network.

To demonstrate the benefits of this optimization for both three and four phases, we plot in Figure 16 the memory consumed with three phases on the 101 large classifiers created by concatenating all the classifiers belonging to one network, and in Figure 18, the memory consumed with four phases on the even larger classifiers created by concatenating all the classifiers of different networks together. The figures show that this optimization helps reduce storage requirements. With this optimization, RFC can now handle a 15,000-rule classifier with just 3.85MB. The reason for the reduction in storage is that several rules in the same classifier commonly share a number of specifications for many fields.

However, the space savings come at a cost. For although the classifier will correctly identify the action for each arriving packet, it cannot tell which rule in the original classifier it matched. Because the rules have been merged to form adjGrps, the distinction between each rule has been lost. This may be undesirable in applications that maintain matching statistics for each rule.

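A sketch of the adjacency test and merge in C, assuming field specifications are kept as canonical strings and that the caller supplies a disjoint() predicate for condition (3); re-checking adjacency after each RFC phase, as described above, is not shown.

```c
#include <stdio.h>
#include <string.h>

#define NUM_FIELDS 4

struct rule {
    char field[NUM_FIELDS][32];   /* canonical text of each field specification */
    int  action;                  /* e.g. permit / deny                          */
};

/* Supplied by the caller: do rules a and b match disjoint sets of packets? */
int disjoint(const struct rule *a, const struct rule *b);

/* Rules at indices ri < si are adjacent in dimension dim if they have the
 * same action, agree on every other field, and every rule in between either
 * shares their action or is disjoint from the earlier rule. */
int adjacent_in_dim(const struct rule *rules, int ri, int si, int dim)
{
    const struct rule *r = &rules[ri], *s = &rules[si];
    if (r->action != s->action)
        return 0;
    for (int f = 0; f < NUM_FIELDS; f++)
        if (f != dim && strcmp(r->field[f], s->field[f]) != 0)
            return 0;
    for (int v = ri + 1; v < si; v++)
        if (rules[v].action != r->action && !disjoint(&rules[v], r))
            return 0;
    return 1;
}

/* Merge s into r: the dim field becomes the logical OR of the two specs,
 * written with the '+' notation of Figure 17 (truncation ignored here). */
void merge_adjacent(struct rule *r, const struct rule *s, int dim)
{
    char merged[2 * 32 + 2];
    snprintf(merged, sizeof merged, "%s+%s", r->field[dim], s->field[dim]);
    strncpy(r->field[dim], merged, sizeof r->field[dim] - 1);
    r->field[dim][sizeof r->field[dim] - 1] = '\0';
}
```
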


Table 5: A qualitative comparison of some schemes for packet classification.

Sequential Evaluation
    Pros: Good storage requirements. Works for an arbitrary number of fields.
    Cons: Slow lookup rates.

Grid of Tries [17]
    Pros: Good storage requirements and fast lookup rates for two fields. Suitable for big classifiers.
    Cons: Not easily extendible to more than two fields. Not suitable for non-contiguous masks.

Crossproducting [17]
    Pros: Fast accesses. Suitable for multiple fields. Can be adapted to non-contiguous masks.
    Cons: Large memory requirements. Suitable without caching only for small classifiers of up to 50 rules.

Bit-level Parallelism [15]
    Pros: Suitable for multiple fields. Can be adapted to non-contiguous masks.
    Cons: Large memory bandwidth required. Comparatively slow lookup rate. Hardware only.

Recursive Flow Classification
    Pros: Suitable for multiple fields. Works for non-contiguous masks. Reasonable memory requirements for real-life classifiers. Fast lookup rate.
    Cons: Large preprocessing time and memory requirements for large classifiers (i.e. having more than 6000 rules, without the adjacency group optimization).

Figure 17: Example of adjacency groups. Some rules of a classifier are shown. Each rule is denoted symbolically by RuleName(FieldName1, FieldName2, ...); the '+' denotes a logical OR, and all rules shown are assumed to have the same action.

    Original rules:
        R(a1,b1,c1,d1)  S(a1,b1,c2,d1)  T(a2,b1,c2,d1)  U(a2,b1,c1,d1)
        V(a1,b1,c4,d2)  W(a1,b1,c3,d2)  X(a2,b1,c3,d2)  Y(a2,b1,c4,d2)
    Merge along dimension 3:
        RS(a1,b1,c1+c2,d1)  TU(a2,b1,c1+c2,d1)  VW(a1,b1,c3+c4,d2)  XY(a2,b1,c3+c4,d2)
    Merge along dimension 1:
        RSTU(a1+a2,b1,c1+c2,d1)  VWXY(a1+a2,b1,c3+c4,d2)
    Carry out an RFC phase, combining chunks 1 and 2, and chunks 3 and 4:
        (a1+a2,b1) reduces to m1; (c1+c2,d1) reduces to n1; (c3+c4,d2) reduces to n2
        RSTU(m1,n1)  VWXY(m1,n2)
    Merge again:
        RSTUVWXY(m1,n1+n2)
    Continue with RFC...

8 Comparison with Other Packet Classification Schemes

Table 5 shows a qualitative comparison of some of the schemes for doing packet classification.

Figure 18: The memory consumed with four phases with the adjGrp optimization enabled on the large classifiers created by concatenating all the classifiers of a few different networks. Also shown is the memory consumed when the optimization is not enabled (i.e. the basic RFC scheme). Notice the absence of some points in the Basic RFC curve: for those classifiers, the basic RFC scheme takes too much memory/preprocessing time.

9 Conclusions

It is relatively simple to perform packet classification at high speed using large amounts of storage, or at low speed with small amounts of storage. When matching multiple fields (dimensions) simultaneously, it is difficult to achieve both a high classification rate and modest storage in the worst case.

We have found that real classifiers (today) exhibit a considerable amount of structure and redundancy. This makes possible simple classification schemes that exploit the structure inherent in the classifier. We have presented one such algorithm, called RFC, which appears to perform well with the selection of real-life classifiers available to us. For applications in which the tables do not change frequently (for example, not more than once every few seconds), a custom hardware implementation can achieve OC192c rates with a memory cost of less than $50, and a software implementation can achieve OC48c rates. RFC was found to consume too much storage for classifiers with four fields and more than 6,000 rules.



But by further exploiting structure and redundancy in the classifiers, a modified version of RFC appears to be practical for up to 15,000 rules.

10 Acknowledgments

We would like to thank Darren Kerr at Cisco Systems for help in providing the classifiers used in this paper. We also wish to thank Andrew McRae at Cisco Systems for independently suggesting the use of bitmaps in storing the colliding rule set, and for useful feedback about the RFC algorithm.

11 References

[1] A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small Forwarding Tables for Fast Routing Lookups," Proc. ACM SIGCOMM 1997, pp. 3-14, Cannes, France.
[2] Abhay K. Parekh and Robert G. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: The single node case," IEEE/ACM Transactions on Networking, vol. 1, pp. 344-357, June 1993.
[3] Alan Demers, Srinivasan Keshav, and Scott Shenker, "Analysis and simulation of a fair queueing algorithm," Internetworking: Research and Experience, vol. 1, pp. 3-26, January 1990.
[4] Braden et al., "Resource Reservation Protocol (RSVP) - Version 1 Functional Specification," RFC 2205, September 1997.
[5] B. Lampson, V. Srinivasan, and G. Varghese, "IP lookups using multiway and multicolumn search," Proc. IEEE INFOCOM, San Francisco, California, vol. 3, pp. 1248-1256, March/April 1998.
[6] M.H. Overmars and A.F. van der Stappen, "Range searching and point location among fat objects," Journal of Algorithms, 21(3), pp. 629-656, 1996.
[7] M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable High-Speed IP Routing Lookups," Proc. ACM SIGCOMM 1997, pp. 25-36, Cannes, France.
[8] M. Shreedhar and G. Varghese, "Efficient Fair Queuing Using Deficit Round-Robin," IEEE/ACM Transactions on Networking, vol. 4, no. 3, pp. 375-385, 1996.
[9] P. Gupta, S. Lin, and N. McKeown, "Routing lookups in hardware at memory access speeds," Proc. IEEE INFOCOM, San Francisco, California, vol. 3, pp. 1241-1248, March/April 1998.
[10] P. Newman, T. Lyon and G. Minshall, "Flow labelled IP: a connectionless approach to ATM," Proc. IEEE INFOCOM, San Francisco, California, vol. 3, pp. 1251-1260, April 1996.
[11] R. Guerin, D. Williams, T. Przygienda, S. Kamat, and A. Orda, "QoS routing mechanisms and OSPF extensions," Internet Draft, Internet Engineering Task Force, March 1998. Work in progress.
[12] Richard Edell, Nick McKeown, and Pravin Varaiya, "Billing Users and Pricing for TCP," IEEE JSAC Special Issue on Advances in the Fundamentals of Networking, September 1995.
[13] Sally Floyd and Van Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, pp. 397-413, August 1993.
[14] Steven McCanne and Van Jacobson, "A BSD Packet Filter: A New Architecture for User-level Packet Capture," Proc. Usenix Winter Conference, San Diego, California, pp. 259-269, January 1993.
[15] T.V. Lakshman and D. Stiliadis, "High-Speed Policy-based Packet Forwarding Using Efficient Multi-dimensional Range Matching," Proc. ACM SIGCOMM, pp. 191-202, September 1998.
[16] V. Srinivasan and G. Varghese, "Fast IP Lookups using Controlled Prefix Expansion," Proc. ACM Sigmetrics, June 1998.
[17] V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel, "Fast and Scalable Layer 4 Switching," Proc. ACM SIGCOMM, pp. 203-214, September 1998.
[18] http://www.nlanr.net/NA/Learn/plen.970625.hist
[19] http://www.toshiba.com/taec/nonflash/components/memory.html
[20] http://developer.intel.com/vtune/analyzer/index.htm

12 Appendix

/* Begin Pseudocode */
/* Phase 0, chunk j of width b bits */
for each rule rl in the classifier
begin
    project the jth component of rl onto the number line [0, 2^b - 1],
    marking the start and end points of each of its constituent intervals
endfor
/* Now scan through the number line looking for distinct equivalence classes */
bmp := 0;   /* all bits of bmp are initialised to '0' */
for n in 0..2^b - 1
begin
    if (any rule starts or ends at n)
    begin
        update bmp;
        if (bmp not seen earlier)
        begin
            eq := new_Equivalence_Class();
            eq->cbm := bmp;
        endif
    endif
    else eq := the equivalence class whose cbm is bmp;
    table_0_j[n] := eq->ID;   /* fill ID in the RFC table */
endfor
/* End Pseudocode */

Figure 19: Pseudocode for RFC preprocessing for chunk j of Phase 0.



Table 6: The example 4-field classifier used in the complete RFC example of Figure 22.

/* Begin Pseudocode */
/* Assume that chunk #i of Phase j is formed by combining m distinct chunks
   c1, c2, ..., cm of phases p1, p2, ..., pm, where p1, p2, ..., pm < j */
indx := 0; listEqs := nil;
for each CES, c1eq, of chunk c1
 for each CES, c2eq, of chunk c2
  ......
   for each CES, cmeq, of chunk cm
   begin
       intersectedBmp := c1eq->cbm & c2eq->cbm & ... & cmeq->cbm;   /* bitwise ANDing */
       neweq := searchList(listEqs, intersectedBmp);
       if (not found in listEqs)
       begin
           /* create a new equivalence class */
           neweq := new_Equivalence_Class();
           neweq->cbm := intersectedBmp;
           add neweq to listEqs;
       endif
       /* Fill up the relevant RFC table contents. */
       table_j_i[indx] := neweq->ID;
       indx++;
   endfor
/* End Pseudocode */

Figure 20: Pseudocode for RFC preprocessing for chunk i of Phase j (j > 0).

/* Begin Pseudocode */
for (each chunk, chkNum, of Phase 0)
    eqNums[0][chkNum] = contents of appropriate rfctable at memory address pktFields[chkNum];
for (phaseNum = 1..numPhases-1)
    for (each chunk, chkNum, in Phase phaseNum)
    begin
        /* chd stores the number of, and descriptions of, this chunk's parents
           chkParents[0..numChkParents-1] */
        chd = parent descriptor of (phaseNum, chkNum);
        indx = eqNums[phaseNum of chd->chkParents[0]][chkNum of chd->chkParents[0]];
        for (i = 1..chd->numChkParents-1)
        begin
            indx = indx * (total #equivIDs of chd->chkParents[i]) +
                   eqNums[phaseNum of chd->chkParents[i]][chkNum of chd->chkParents[i]];
            /*** Alternatively: indx = (indx << #eqID-bits of chd->chkParents[i]) ^
                 (eqNums[phaseNum of chd->chkParents[i]][chkNum of chd->chkParents[i]]) ***/
        endfor
        eqNums[phaseNum][chkNum] = contents of appropriate rfctable at address indx;
    endfor
return eqNums[numPhases-1][0];   /* this contains the desired classID */
/* End Pseudocode */

Figure 21: Pseudocode for the RFC lookup operation, with the fields of the packet in pktFields.



Figure 22: The contents of the RFC tables for the example classifier of Table 6, showing the sequence of accesses made by the lookup of a packet with Source Network-layer Address = 0.83.1.32, Destination Network-layer Address = 0.0.4.6, Transport-layer Protocol = 17 (udp), and Destination Transport-layer port number = 22. The memory locations accessed in this sequence are marked in bold.
