13.07.2015 Views

Topology, Routing, and Packaging of Efficient Large-Scale Networks

Topology, Routing, and Packaging of Efficient Large-Scale Networks

Topology, Routing, and Packaging of Efficient Large-Scale Networks

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

TTTRTTTTTTTRRTT TTTTRRTTTTT(a) L = 2, S 1 = 2, S 2 = 4, K = 1, T = 4TRTTTTTTRTTRTTTTTTTRRTT T TRTTR(b) L = 2, S 1 = 3, S 2 = 3, K = 1, T = 4Figure 3: Two examples <strong>of</strong> the HyperX topology.Dark circles (marked T) are terminals, light circles(marked R) are switches.A HyperX network needs at least T P/2 bidirectional linkscrossing any bisection in order to be nonblocking. Thus,the ratio β = K mS m/2T measures the relative bisectionb<strong>and</strong>width <strong>of</strong> the architecture. As discussed above, for aregular (L, S, 2, S) HyperX, i.e. a flattened butterfly withdouble-wide switch-switch links, we have β = 1.3.1 Achievable HyperX networksWe consider the problem <strong>of</strong> finding a best possible, bysome criterion, HyperX satisfying switch radix, network size,<strong>and</strong> bisection b<strong>and</strong>width requirements. HyperX is a directnetwork; for switches with a given radix R we must thereforeconform to the bound:LXT + K k (S k − 1) ≤ R . (2)k=1To create a system consisting <strong>of</strong> N terminal nodes, we needQat least this many terminal links. With a total <strong>of</strong> P =Lk=1 S k switches, each having T terminal links, this constraintbecomes!LYT P = T S k ≥ N . (3)k=1With both R <strong>and</strong> N viewed as given, fixed constants, theseequations provide both an upper bound (2) <strong>and</strong> a lowerbound (3) on T for each possible network shape S. Werequirethat there be enough bisection b<strong>and</strong>width. For some specifiedminimum relative b<strong>and</strong>width B we require thatβ ≡ min(K kS k )2TTTTTTTTRTTTRTTRRTTTTTTTTTR≥ B . (4)TTTTT = <strong>of</strong>f−fabric links per switch1201008060402002 dimensional regular HyperX: K = 1, R = 128, N = 2 17Blue squares show best designsT ≤ R − KL(S−1)T ≥ N S −Lβ = 1β = 1/2β = 1/4β = 1/810 20 30 40 50 60S = mesh size per dimensionFigure 4: HyperX Design Space (L = 2, K = 1, R =128, N = 2 17 ). No solutions: the lower bound exceedsthe upper bound on T for all S.Each dimension must consist <strong>of</strong> at least 2 switches:S k ≥ 2 , k = 1, 2, . . . , L . (5)Our objective is to find HyperX networks that satisfy theconstraints above. Among the feasible networks, we preferthose with least cost, which to a first approximation isproportional to the number <strong>of</strong> switches, Q Lk=1 S k. A lowdimensionalsolution (with small L) is also good, since itdecreases the average hop count <strong>and</strong> yields more easily realizedhardware implementations.In contrast to the folded Clos, with switches <strong>of</strong> a givenradix one cannot build an arbitrarily large HyperX. In fact,the largest HyperX network buildable with radix R switchesis an (R − 1)-cube with one terminal per switch, so clearlywe must have N ≤ 2 R−1 . As N approaches this limit, thesmallest feasible L grows, making these HyperX topologiesunattractive; the study in Section 4.2 bears this out. Forhigh-radix switches, fortunately, the HyperX designs achievablein three or at most four dimensions turn out to be largeenough for the exascale systems we have in mind.3.1.1 The space <strong>of</strong> regular HyperX networksSince it is determined by fewer parameters, we can visualizethe regular HyperX design space more easily than thegeneral. In a regular HyperX, the network shape S <strong>and</strong> thetrunking factor K are scalars.Figures 4 <strong>and</strong> 5 show the design space <strong>of</strong> regular HyperXfor a baseline example with R = 128 (typical <strong>of</strong> thehigh-radix switches we expect in the near future) <strong>and</strong> N =128K (a network size consistent with exascale performance).There are four free design space parameters, namely L, K, S,<strong>and</strong> T . We choose to fix the trunking factor K <strong>and</strong> dimensionL <strong>and</strong> think about the resulting (S, T ) space. Hence, weare seeking integer (S, T ) pairs in the region bounded on theleft by the vertical line S = 2; below by the nonlinear networksize bound (3), <strong>and</strong> above by the linear b<strong>and</strong>width (4)<strong>and</strong> switch port (2) bounds that have opposite slopes <strong>and</strong>form a ro<strong>of</strong> over the allowable space. For some (R, N, K, L)combinations, the lower bound lies completely above thesetwo upper bounds, <strong>and</strong> there are no feasible HyperX de-


T = <strong>of</strong>f−fabric links per switch120100806040203 dimensional regular HyperX: K = 1, R = 128, N = 2 17Blue squares show best designs(14,48)(16,32)(20,17)(23,11)T ≤ R − KL(S−1)T ≥ N S −Lβ = 1β = 1/2β = 1/4β = 1/8010 15 20 25 30 35 40S = mesh size per dimensionFigure 5: HyperX Design Space (L = 3, K = 1, R =128, N = 2 17 ).signs (Figure 4). For others, the region <strong>of</strong> allowed designs isbounded by these three constraints, having a leftmost corner,a topmost corner, <strong>and</strong> a rightmost corner (Figure 5).The leftmost is the most interesting, as it is the point <strong>of</strong>least switch count. These points (for several values <strong>of</strong> β) areshown by blue squares <strong>and</strong> labeled with their (T, S) values.3.1.2 Selecting an optimum HyperX: An algorithmWe have implemented an algorithm to find a general HyperXmeeting the requirements <strong>of</strong> Section 3.1 <strong>and</strong> havingfewest switches. Given the parameters N, R, <strong>and</strong> β <strong>of</strong> thedesign problem, our algorithm chooses a design specified bythe dimension, L; the number <strong>of</strong> terminals per switch, T ;the shape <strong>of</strong> the network, S (an L-vector); <strong>and</strong> the trunkingfactors, K (another L-vector); so that the network size (2),switch port (3), <strong>and</strong> bisection b<strong>and</strong>width (4) inequalities aresatisfied <strong>and</strong> the switch count is smallest possible. We dothis by an exhaustive search <strong>of</strong> the design space, but pruneaway designs early that are known to be either infeasibleor suboptimal. The runtime <strong>of</strong> the algorithm for practical,large cases is a few seconds.To restrict the design space to be searched, we exploitcertain symmetries. First, we lose nothing by restricting thesearch to designs for whichS 1 ≤ S 2 ≤ · · · ≤ S L. (6)Since the bisection bound depends on the smallest <strong>of</strong> theS k K k , we also can assume thatK 1 ≥ K 2 ≥ · · · ≥ K L . (7)Once a feasible design is found, its switch count is an upperbound on the best achievable design. Any other designthat cannot be better will be rejected. Let O represent thesmallest switch count for all feasible designs so far discovered.We add the inequalityLYS k ≤ O (8)k=1to further restrict the set <strong>of</strong> designs to be searched. In particular,the network size constraint tells us that if we are seekingto find configurations that reduce O, we can restrict ourattention to designs for which T is large enough to achievethe desired network size given that the number <strong>of</strong> switches inan improved design is going to be less than O; we thereforerestrict the search values <strong>of</strong> T to those that satisfyLYN ≤ T ( S k ) ≤ T O ⇒ T ≥ ⌈N/O⌉ ≡ T min . (9)k=1As for the range <strong>of</strong> dimensions to search, we have 1 ≤ L ≤R − 1, the upper bound being achieved for a hypercube.3.2 Comparing general <strong>and</strong> regular HyperXWe can ask whether or not an irregular HyperX can besignificantly more attractive than a best possible regular HyperXor flattened butterfly. To that end, we conducted anexperiment. For typical values <strong>of</strong> the design problem parameters(N = 2 17 , R = 2 7 , <strong>and</strong> B ∈ {0.125, 0.25, 0.5, 1})we found best (having fewest switches) regular <strong>and</strong> generalHyperX designs. Table 1 shows the results.The irregular HyperX has fewer switches in all cases, thereduction in number ranging from 8 to 28 percent. Interestingly,the best results were always three dimensional inthe general case, whereas the best regular designs were fourdimensional in two cases. Flattened butterflies are regularHyperX networks for which K = 2. If we add this restrictionthen for the full bisection-b<strong>and</strong>width case (β = 1) we get aninferior design (L = 4, S = 11, K = 2, T = 9) having 14,641switches. If we add the additional restriction imposed by theflattened butterfly, namely T = S, then we have a networkwith S = 11 <strong>and</strong> size N = 11 5 = 161, 051 with 29,979 moreterminal ports than needed.3.3 <strong>Routing</strong> algorithms for the HyperXThe set <strong>of</strong> shortest paths between two switches in a HyperXis determined by their coordinate vectors. Considerpaths from switch zero (all its coordinates are 0) to switchI = (I 1, . . . , I L). The length <strong>of</strong> each shortest path is equalto the number <strong>of</strong> nonzero elements <strong>of</strong> I. Let ∆ be the setdimensions k for which I k ≠ 0. We call these the <strong>of</strong>fset dimensions;the others are aligned dimensions. Every shortestpath goes directly (by one step) from the source 0 to the destinationin each <strong>of</strong> the <strong>of</strong>fset dimensions in some sequence.If the number <strong>of</strong> <strong>of</strong>fset dimensions (i.e. the cardinality <strong>of</strong> ∆)is D, then there are D! shortest paths.A minimal deterministic routing method is obvious: routeto the destination by the shortest path determined by movingin the dimensions <strong>of</strong> ∆ in a fixed order; without loss <strong>of</strong>generality, in order <strong>of</strong> increasing dimension number. This isknown as dimension-order routing. Minimal adaptive routingis possible. When one dimension is blocked due to buffercongestion at the downstream switch, we choose another <strong>of</strong>fsetdimension. We call this Min-AD. Min-AD adaptivelyexplores the space <strong>of</strong> D! shortest paths. An oblivious routingalgorithm, inspired by Valiant [19], routes all packetsthrough an intermediate node chosen at r<strong>and</strong>om with uniformprobability.As shown by Kim, Dally, <strong>and</strong> Abts [6], adaptive nonminimalrouting is in general better than either deterministicor oblivious routing for a flattened butterfly, <strong>and</strong> this is truefor the HyperX. They proposed the adaptive Clos-AD algorithm,which chooses, at a packet’s source node, betweendimension ordered minimal routing <strong>and</strong> non-minimal rout-


B Best Regular Switch Count Best General Switch Count0.125 L = 4, S = 7, K = 2, T = 55 2401 S = (5, 19, 19), K = (4, 1, 1), T = 76 18050.25 L = 3, S = 14, K = 2, T = 48 2744 S = (3, 27, 30), K = (9, 1, 1), T = 54 24300.5 L = 3, S = 16, K = 2, T = 32 4096 S = (3, 35, 36), K = (12, 1, 1), T = 35 37801.0 L = 4, S = 10, K = 3, T = 14 10000 S = (5, 38, 38), K = (8, 1, 1), T = 19 7220Table 1: Best regular <strong>and</strong> general HyperX networks for N = 2 17 <strong>and</strong> R = 128.ing, based on estimates <strong>of</strong> queuing delay. If it chooses toroute non-minimally, Clos-AD routes to a r<strong>and</strong>omly chosenlowest common ancestor switch, considering the network asa transformed folded Clos network. The packet thereforetakes dimension ordered minimal paths from source to intermediate<strong>and</strong> from intermediate to destination nodes. Thedecision is made at the source node, <strong>and</strong> no packet can bederouted (take a non-minimal route) twice.We argue that Clos-AD can be improved in two respects.First, by deciding to route minimally or not only at thesource, it loses significant ability to adapt to congestion encountereden route. Second, by making a choice <strong>of</strong> intermediatenode that is informed by a mapping <strong>of</strong> the foldedClos into the HyperX, it introduces dimensional asymmetrywhich has no apparent advantage due to the inherent symmetryin HyperX dimensions. To address the issues raisedabove, we propose the DAL (Dimensionally-Adaptive, Loadbalanced)routing algorithm:1. Mark all <strong>of</strong>fset dimensions as deroutable (unmarked)on creation <strong>of</strong> the packet at its source. On arrival at aswitch:2. Find an <strong>of</strong>fset dimension with an unblocked path to analigned switch in that dimension. If none exist, then:3. Find an unmarked <strong>of</strong>fset dimension <strong>and</strong> unblocked switchthat is <strong>of</strong>fset from the destination <strong>and</strong> if one exists thenroute <strong>and</strong> mark the dimension as no-longer-deroutable;else4. Push the packet into a minimal, dimension-order, deterministicrouted virtual channel.Dimension-order routing, <strong>and</strong> therefore DAL, is deadlockfree; the virtual channel used as a last resort is used toprevent deadlock.Unlike Clos-AD, DAL deroutes in one <strong>of</strong> the HyperX dimensionsat a time, treating each HyperX dimension independently<strong>and</strong> symmetrically. This tends to cause it to takeshorter adaptive paths. While it may deroute just once perdimension, DAL can deroute as many times as there are <strong>of</strong>fsetdimensions. Unlike Clos-AD, DAL routes a packet onlythrough the subnetwork <strong>of</strong> those nodes that are aligned withboth the source <strong>and</strong> the destination. Once a packet comesinto alignment with the destination in a certain dimension itremains there. Unlike Clos-AD, DAL may deroute a packetat each hop on its route, thereby adapting to congestion notvisible at the source node.4. EVALUATING HYPERXWe evaluate DAL routing for the HyperX topology <strong>and</strong>the DAL/HyperX combination in comparison to the foldedClos. We begin by showing simulation results that explorethe performance <strong>of</strong> various routing algorithms on HyperX.They show that DAL is more effective than adaptive minimal,Valiant, <strong>and</strong> Clos-AD routing. Further simulationsshow that HyperX with DAL is generally better than anadaptively routed folded Clos with similar system configurations.We then analytically compare the number <strong>of</strong> switchesin optimum HyperX <strong>and</strong> folded Clos topologies across arange <strong>of</strong> system parameters. For large networks built withradix 32 switches, folded Clos may be best, but HyperX ismore cost effective when higher radix switches are used, evenfor million-node networks.4.1 Performance resultsWe use a cycle-accurate simulator on a variety <strong>of</strong> synthetictraffic patterns. In all cases, switches have a four cycle delay<strong>and</strong> channels have a one cycle channel transmission delay,where we define a cycle as the time for a switch to send orreceive a flit via one port. We define latency as the differencebetween the time a packet is received at a destinationterminal <strong>and</strong> the time this packet is generated at the sourceterminal. We assume single-flit packets. If a packet makesfour hops through the network from source to destination,it will transit 4 channels <strong>and</strong> 3 switches in 16 cycles. Forsimulation, sequential allocation is used for adaptive routecomputation [5]. For HyperX, we assume 6 virtual channels(VCs) per port: each port has 3 sets <strong>of</strong> 2 VCs. Cut-throughflow control is used. One VC in each set is reserved fordeadlock avoidance in the adaptive minimal <strong>and</strong> DAL routingalgorithms. Each VC can hold up to 32 flits. Switchesare input-queued; the iSLIP algorithm [13] is used for switchallocation. A crossbar has input speedup <strong>of</strong> 2.Figure 6 presents a comparison <strong>of</strong> the routing algorithms(described in Section 3.3) on a typical HyperX (regular, withT = 8, S = 8, <strong>and</strong> K = 1) with 32-port switches, 4096terminals, 512 switches, <strong>and</strong> where β is 0.5. We presentthree st<strong>and</strong>ard traffic patterns: Bit Complement; Bit Rotate;<strong>and</strong> Transpose [3], <strong>and</strong> one additional pattern, Swap2.Swap2 is a permutation designed to highlight the differencebetween Clos-AD <strong>and</strong> DAL. In this pattern, the evennumberednodes exchange messages with a peer across thebisection in one dimension <strong>and</strong> the odd-numbered nodes exchangemessages with a peer across the bisection in anotherdimension. These two <strong>of</strong>fset dimensions correspond to thetop two levels <strong>of</strong> the folded Clos that underlies Clos-AD.Valiant routing [19] reliably achieves throughput <strong>of</strong> β:about half <strong>of</strong> all packets traverse any given bisection, twice,when routed via a r<strong>and</strong>om intermediate node (independent<strong>of</strong> traffic pattern). This throughput certainty comes at theprice <strong>of</strong> higher latency for light traffic loads due to the increasedhop count. Min-AD achieves throughput <strong>of</strong> 2β ontraffic patterns that do not cause congestion, such as uniformr<strong>and</strong>om, but throughput can degrade to 1/T on patternsthat cause congestion at the source.The adaptive algorithms, Clos-AD <strong>and</strong> DAL, are betterchoices. For low loads, both algorithms choose a minimal


lat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-ADlat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-AD606606505505Latency (cycle)403020432Hop countLatency (cycle)403020432Hop count10100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0Load(a) Bit Complement10100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0Load(b) Bit Rotatelat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-ADlat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-AD606606505505Latency (cycle)403020432Hop countLatency (cycle)403020432Hop count101101000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(c) Transpose000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(d) Swap2Figure 6: Load-latency graphs on the (a) Bit Complement, (b) Bit Rotate, (c) Transpose, <strong>and</strong> (d) Swap2traffic patterns [3] <strong>of</strong> a regular HyperX network with N = 4096, R = 32, β = 0.5, T = 8, S = 8, K = 1.route <strong>and</strong> achieve low latency, whereas at high load, theyadapt, choosing nonminimal routes to avoid congestion. Theprimary difference between the two algorithms is when theydecide to adapt. Clos-AD estimates, at the source, the delay<strong>of</strong> a minimal route <strong>and</strong> a non-minimal route <strong>and</strong> makes itsadaptation decision eagerly. DAL makes decisions lazily asthe packet traverses the network <strong>and</strong> decides on the routeindependently per dimension. Eager adaptation means thatClos-AD has less information which could be out <strong>of</strong> date,while lazy adaptation allows DAL to be far more nimble.Clos-AD is usually better than Min-AD <strong>and</strong> Valiant, butfor transpose some limitations become apparent. Clos-ADmakes its adaptation decision at the source. With the limitedinformation available, its delay estimate is inherentlyimperfect. The delay estimate will be biased towards minimalpaths for some patterns <strong>and</strong> towards non-minimal pathsfor others. Our delay estimate tends to underestimate thecost <strong>of</strong> the non-minimal path. This bias is the reason forthe higher average hop counts <strong>of</strong> Clos-AD at all load levelswhen compared to DAL. On the Bit Complement pattern,the terminals connected to a given switch communicatewith terminals connected to a single other switch, a patternwhich congests the links on the shortest paths betweenthem. Moreover, these shortest paths span all the dimensions,so DAL will deroute most packets in each dimension.For these patterns, CLOS-AD has the advantage <strong>of</strong> makingthat deroute decision once <strong>and</strong> for all, early, <strong>and</strong> thisimproves latency for low to medium load. On the Bit Rotationpattern, this congestion occurs but on few links. Sothe average latency <strong>of</strong> DAL approaches that <strong>of</strong> CLOS-ADfor medium load.DAL has higher saturation throughput than the alternativeson all patterns that we have tested. This is becauseit will take the minimal path whenever possible, but routearound congestion when necessary. Furthermore DAL assessescongestion at each hop <strong>and</strong> independently for each dimension.Our Swap2 pattern is a good example <strong>of</strong> why DALexcels. The theoretical throughput <strong>of</strong> this pattern should be2β, which DAL is able to achieve. For this pattern, Clos-ADoverlays a folded Clos network on top <strong>of</strong> the HyperX, <strong>and</strong>can therefore pick a route that goes through <strong>of</strong>fset intermediatenodes in already aligned dimensions.We follow our comparison <strong>of</strong> routing algorithms on HyperXby comparing HyperX networks using DAL to a foldedClos network using adaptive routing. Figure 7 presents theresults <strong>of</strong> this comparison across the same set <strong>of</strong> traffic patterns.N = 4096 <strong>and</strong> R = 32 in all three networks <strong>and</strong> weuse the same configuration as in Figure 6 for the HyperXwith β = 0.5. The HyperX with β = 0.25 has T = 14,S = 7, K = 1 so that P = 343. For a tapered folded Clos,we define β asβ ≡number <strong>of</strong> links from core switches.network sizeThe folded Clos (with β = 0.5) needs three levels. There are205 level 1 switches with 10 uplinks <strong>and</strong> 20 downlinks; 130


HyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.5HyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.560605050Latency (cycle)403020Latency (cycle)403020101000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(a) Bit Complement00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(b) Bit RotateHyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.5HyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.560605050Latency (cycle)403020Latency (cycle)403020101000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(c) Transpose00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(d) Swap2Figure 7: Load-latency graphs <strong>of</strong> HyperX networks using DAL routing compared with folded Clos networksusing adaptive routing on the (a) Bit Complement, (b) Bit Rotate, (c) Transpose, <strong>and</strong> (d) Swap2 trafficpatterns when N = 4096, R = 32, <strong>and</strong> β is varied.level 2 switches that have 16 uplinks <strong>and</strong> 16 downlinks; <strong>and</strong>80 level 3 switches with 13 downlinks, 2 <strong>of</strong> them packaged perphysical switch so that P = 415. In all cases, the folded Closachieves a saturation throughput <strong>of</strong> β because the topologyeffectively forces all patterns to behave as if they are routedby Valiant. For the same reason, the hop count on foldedClos is generally higher than HyperX with DAL as seen inthe Valiant routing case <strong>of</strong> Figure 6. The benefit <strong>of</strong> HyperXwith DAL routing is that it is able to achieve a load <strong>of</strong> 2βon patterns that don’t exhibit congestion while maintaininga load <strong>of</strong> β on patterns that do.4.2 Resource comparisonThe previous section shows that DAL routing on HyperXnetworks provides very low network latencies for a range <strong>of</strong>traffic patterns. The focus <strong>of</strong> this section is a cost comparison<strong>of</strong> the folded Clos topology <strong>and</strong> the HyperX topologyfor a wide range <strong>of</strong> topological parameters.In Figure 8, we vary network size (N), switch radix (R),<strong>and</strong> tapering ratio (β). The vertical axis <strong>of</strong> the graphs depictsthe number <strong>of</strong> switches divided by the number <strong>of</strong> terminals<strong>of</strong> the network. This measure can be thought <strong>of</strong> asan estimation <strong>of</strong> networking cost per terminal node. Stepsin the graphs indicate that the dimension <strong>of</strong> the network increased.This is particularly important because an increasein dimension will result in an increase in cost for both HyperX<strong>and</strong> folded Clos. In HyperX networks, an increase indimension effectively allows fewer terminal nodes to be connectedto each switch resulting in an increase <strong>of</strong> the totalnumber <strong>of</strong> switches to achieve a given network size. FoldedClos networks add another complete layer <strong>of</strong> switches to thenetwork with each increase in dimension. In HyperX, becausefewer terminal nodes can be connected to each switchas dimensions are added, the cost <strong>of</strong> each additional dimensionis higher than a folded Clos where additional dimensionsresult in a constant cost increase. This addition is marginalat very high dimensions thus making folded Clos a betterchoice in this regime. Our data does indeed show that atlow radices (such as radix 32) <strong>and</strong> high node counts, thecost <strong>of</strong> a folded Clos with β = x is comparable with that <strong>of</strong>a HyperX with β = 0.5x. The advent <strong>of</strong> high-radix switcheschanges this story because very large networks can be builtwith few dimensions. In the radix 128 <strong>and</strong> 256 cases, thecost <strong>of</strong> a folded Clos with β = x is comparable with that<strong>of</strong> HyperX with β = x on many configurations. We haveshown that a HyperX can, on many traffic patterns, h<strong>and</strong>lea higher load (as much as double) than a folded Clos <strong>of</strong>equal bisection b<strong>and</strong>width; this makes a compelling case forHyperX.5. PACKAGING HYPERXSystem packaging is a central practical concern in networkdesign, especially for systems that occupy warehousesized machine rooms. A good mapping between the networktopology <strong>and</strong> the packaging hierarchy greatly simplifies the


(Number <strong>of</strong> switchs) / (Network size)0.50.450.40.350.30.250.20.150.1(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)(Number <strong>of</strong> switchs) / (Network size)0.220.20.180.160.140.120.10.080.060.04(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)0.050.0202 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(a) Radix 3202 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(b) Radix 64(Number <strong>of</strong> switchs) / (Network size)0.070.060.050.040.030.020.01(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)(Number <strong>of</strong> switchs) / (Network size)0.030.0250.020.0150.010.005(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)02 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(c) Radix 12802 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(d) Radix 256Figure 8: The number <strong>of</strong> switches while the network size is varied using switches with radix (a) 32, (b) 64,(c) 128, <strong>and</strong> (d) 256.system wiring. For example an entire level <strong>of</strong> interconnectmay be contained within a subsystem, minimizing the number<strong>of</strong> individual cables between subsystems. Depending onthe topology, the number <strong>of</strong> cable assemblies can be furtherreduced by using cables with multiple parallel connections.This is particularly beneficial in optically connected systemswhere the number <strong>of</strong> parallel connections is not constrainedby considerations <strong>of</strong> cable bulk or connector density. Parallelribbon cables with up to 100 fibers have been demonstratedusing MT connectors [16].Minimizing the total cable length <strong>and</strong> using a regularstructure that is easy to deploy are also key considerations.The Dragonfly [7] is an example <strong>of</strong> a system in which network<strong>and</strong> system packaging are co-designed to provide themost cost effective solution. The key difference between HyperX<strong>and</strong> Dragonfly is that HyperX assumes that all cableswill likely be optical in the future, whereas Dragonfly assumesthat intra-rack cables will be electrical (<strong>and</strong> cheaper)while inter-rack cables will be optical. As data rates increase<strong>and</strong> as optical cables become more commonplace, power <strong>and</strong>cost will likely favor optics even for intra-rack applications.To minimize cabling in folded Clos networks, two packagingstrategies are common. The first strategy embeds theswitch hierarchy within the package hierarchy to exploit localinterconnect within the enclosures. The second strategycombines multiple levels <strong>of</strong> switch components into alarger switch chassis. A 128K port folded Clos network usingradix 128 switches requires three levels <strong>of</strong> switches. Thefirst level can be integrated with groups <strong>of</strong> 64 processing elements.The top two levels can be combined to make 8192-port switch units.In this arrangement only one stage <strong>of</strong> cabling is required,between the 2048, 64-node processing enclosures, <strong>and</strong> the16, 8192-port switches. Each processor enclosure has 4 linksto each <strong>of</strong> the switches. This requires a total <strong>of</strong> 32768 cableseach with 8 fibers. There is a wide range <strong>of</strong> connectionlengths, <strong>and</strong> the distribution <strong>of</strong> cable lengths is only knownafter detailed floor planning <strong>of</strong> the entire system.


S 1 = 32boards16/32 links to processing elementsT = 16 links to processing elements16/32 links to processing elements16/32 links to processing elements3131L1HyperX L1 2 x 15L2HyperX router 2 x 15L2switch L3L32 x 152 x 153131L1HyperX L1 2 x 15L2HyperX router 2 x 15L2router L3L32 x 152 x 15L1 32 way fullyconnected networkL2 link aggregationL3 link aggregationL215 fiber ribbonsL215 fiber ribbonsL315 fiber ribbonsL315 fiber ribbonsK 2 * (S 2 – 1) = 30K 3 * (S 3 – 1) = 30Figure 9: HyperX chassis showing wiring.The flexibility <strong>of</strong> the HyperX topology allows a configurationto be selected such that the different dimensions <strong>of</strong>HyperX connectivity map directly onto the physical dimensions<strong>of</strong> the system. In the case <strong>of</strong> a 128K port system wecan choose, for example, a three-dimensional HyperX withT = 16, S 1 = 32, S 2 = S 3 = 16, K 1 = 1, <strong>and</strong> K 2 = K 3 = 2.The first dimension is routed within the enclosure, <strong>and</strong> thetwo further dimensions correspond to the placement <strong>of</strong> theenclosure within a two-dimensional array in the machineroom.Figure 9 shows the connectivity <strong>of</strong> switches with the enclosure.As HyperX is a direct topology the switch can belocated on the same card as the processing elements. Eachboard thus comprises 16 terminal nodes <strong>and</strong> one switch. The32 boards in an enclosure communicate via an optical backplanethat links all switches in the rack in a fully connectednetwork, forming the first HyperX dimension. This connects`S 1´2 = 496 pairs <strong>of</strong> nodes using 992 waveguides orfibers. The optical backplane further aggregates the second<strong>and</strong> third dimension links to enable the use <strong>of</strong> parallel fiberribbons for interconnect between enclosures. The aggregationnetworks require a further 1920 waveguides per level.Figure 10 shows the placement <strong>of</strong> enclosures in a twodimensionalarray. Each enclosure is connected to each <strong>of</strong>its peers in both dimensions. Each pair <strong>of</strong> enclosures isconnected by two separate fiber ribbons for resiliency. Thewiring therefore consists <strong>of</strong> many replicated instances <strong>of</strong> 16-way all-to-all wiring patterns. Along the mid-line, at everyrow or column <strong>of</strong> racks, there are 64 optical fiber ribbonsconnecting all 8 enclosures on one side to the 8 on the other.Each set <strong>of</strong> all-to-all wiring between rows or columns <strong>of</strong>16 enclosures requires 120 fiber ribbon cables, in 15 distinctlengths. This pattern is repeated 64 times, 32 times in theX dimension <strong>and</strong> 32 times in the Y dimension, giving a total<strong>of</strong> 7680, 64-way fiber ribbon cables. Assuming a pitchbetween enclosures <strong>of</strong> 1.2m in each dimension with an additionalmeter <strong>of</strong> overhead, cable lengths range from 2.2m to(S 2 = 16) way all to all L2 wiring x 2(S3 = 16) way all to all L3 wiring x 2Maximum bisection64 cables per all-to-all(S 2 x S 3 = 16 x 16) array <strong>of</strong>(T x S 1 = 512) processor racksFigure 10: HyperX system floor plan.19m. The total length <strong>of</strong> 64 way optical fiber ribbon is 60km.The direct correspondence between the HyperX dimensions<strong>and</strong> the physical dimensions <strong>of</strong> the enclosure arrays greatlysimplifies the task <strong>of</strong> connecting the system. It also avoidswiring congestion points typical <strong>of</strong> folded Clos wiring at thecore level.6. FAULT TOLERANCECables <strong>and</strong> connectors cause most <strong>of</strong> the failures in largescale networks. The move to optics will change the faultcharacteristics in some respects, but the dominant failuremechanism will remain individual links rather than switches.Timeout <strong>and</strong> retry mechanisms can be used to h<strong>and</strong>le transientfailures, however a network with half a million connectionsis likely to suffer hard link failures.The massive path diversity in folded Clos networks createsthe potential for a high degree <strong>of</strong> fault tolerance. Although


there are many alternate uplinks to common ancestors, thechoice <strong>of</strong> downlink is entirely deterministic <strong>and</strong> will block atany failed link; with oblivious routing a packet can thereforeget stuck at a failed link. In this case the packet must beretried from the source. In order to minimize performancedegradation in the presence <strong>of</strong> failed links it is highly desirableto prohibit failed routes from being selected as part <strong>of</strong>the routing process.Adaptive routing in folded Clos networks has the advantagethat it can route around failed uplinks. Links withhard errors are simply eliminated from the possible routingalternates. In order to extend this to route around faileddownlinks a mechanism is required which checks whether agiven choice <strong>of</strong> uplink would cause a failed downlink to beselected in the down path. If this is the case an alternateuplink is selected. The mechanism for detecting potentialdownlink errors uses the packet’s destination to look up atable <strong>of</strong> disallowed uplinks for this destination. The tablesize can be compressed by recognizing that due to the hierarchicaltopology contiguous ranges <strong>of</strong> destinations can beh<strong>and</strong>led by a single entry.HyperX networks <strong>of</strong>fer more possibilities for routing aroundfailed links, depending upon the routing algorithm beingused. In the case <strong>of</strong> DAL routing, a packet can only beblocked by a single failed link when there is only one remaining<strong>of</strong>fset dimension, <strong>and</strong> the deroute has already beentaken in that dimension. A simple solution is to permit afurther deroute. Provided there is only a single failed linkwithin a fully connected subnetwork this will not livelock.In order to avoid selecting a nonminimal route that wouldlead to a failed link, the same technique <strong>of</strong> filters on r<strong>and</strong>omroutes proposed for folded Clos networks can be applied.Under DAL routing, the failure information required to beheld on each switch is more localized, since each switch needonly be concerned with failed links attached to its peers ineach dimension.7. CONCLUSIONTo design a network for an exascale system requires trade<strong>of</strong>fsin performance, power, wiring complexity, <strong>and</strong> fault tolerance.In this paper, we investigated the impact on thosetrade<strong>of</strong>fs <strong>of</strong> high-radix switches <strong>and</strong> the topologies that theyenable.We show that the HyperX topology is an attractive c<strong>and</strong>idatefor exascale networks because <strong>of</strong> its ability to achievea favorable balance among these trade<strong>of</strong>fs. We compare theHyperX to the folded Clos <strong>and</strong> show several design pointswhere the HyperX uses significantly fewer routers to achievesimilar b<strong>and</strong>width on benign patterns. For applications thatdem<strong>and</strong> full bisection b<strong>and</strong>width, the folded Clos is generallymore efficient in terms <strong>of</strong> router components. We furtherinvestigate the routing algorithms possible on HyperX <strong>and</strong>compare our new DAL algorithm to other algorithms forthe folded Clos <strong>and</strong> flattened butterfly <strong>and</strong> show it achievessuperior performance because <strong>of</strong> its ability to adapt to congestionon a per-hop basis. We compare the physical packaging<strong>of</strong> HyperX <strong>and</strong> the folded Clos, <strong>and</strong> show how paralleloptical fibers can be used to minimize the number <strong>of</strong> discretecable connections in each case such that large 100,000networks can be constructed with fewer than 10,000 cables.When these high density connections are available, HyperXfurther allows switches to be colocated with the processingelements in these large computer systems, leading toa simpler structure with a smaller set <strong>of</strong> component partscompared to folded Clos networks.8. REFERENCES[1] “Top500 List,” http://www.top500.org/list.[2] M. Al-Fares, A. Loukissas, <strong>and</strong> A. Vahdat, “AScalable, Commodity Data Center NetworkArchitecture,” in SIGCOMM, Jun 2008.[3] W. J. Dally <strong>and</strong> B. Towles, Principles <strong>and</strong> Practices <strong>of</strong>Interconnection <strong>Networks</strong>. Morgan KaufmannPublishers Inc., 2003.[4] P. Ge<strong>of</strong>fray <strong>and</strong> T. Hoefler, “Adaptive <strong>Routing</strong>Strategies for Modern High Performance <strong>Networks</strong>,” inHigh Performance Interconnects, 2008, pp. 165–172.[5] J. Kim, W. J. Dally, <strong>and</strong> D. Abts, “Adaptive <strong>Routing</strong>in High-Radix Clos Network,” in SC’06, Nov 2006.[6] ——, “Flattened Butterfly: a Cost-efficient <strong>Topology</strong>for High-Radix <strong>Networks</strong>,” in ISCA, Jun 2007.[7] J. Kim, W. J. Dally, S. Scott, <strong>and</strong> D. Abts,“Technology-Driven, Highly-Scalable Dragonfly<strong>Topology</strong>,” in ISCA, Jun 2008.[8] J. Kim, W. J. Dally, B. Towles, <strong>and</strong> A. K. Gupta,“Microarchitecture <strong>of</strong> a High-Radix Router,” in ISCA,Jun 2005.[9] N. Kirman, et al., “Leveraging Optical Technology inFuture Bus-based Chip Multiprocessors,” in MICRO,Nov 2006.[10] A. Komornicki, G. Mullen-Schulz, <strong>and</strong> D. L<strong>and</strong>on,“Roadrunner: Hardware <strong>and</strong> S<strong>of</strong>tware Overview,” IBMRedpaper, 2009.[11] C. Leiserson, “Fat-trees: Universal <strong>Networks</strong> forHardware-<strong>Efficient</strong> Supercomputing,” IEEETransactions on Computers, vol. 34, pp. 892–901,1985.[12] X.-Y. Lin, Y.-C. Chung, <strong>and</strong> T.-Y. Huang, “AMultiple LID <strong>Routing</strong> Scheme for Fat-tree BasedInfiniB<strong>and</strong> <strong>Networks</strong>,” in IPDPS, Apr 2004.[13] N. McKeown, A. Mekkittikul, V. Anantharam, <strong>and</strong>J. Walr<strong>and</strong>, “Achieving 100 throughput in aninput-queued switch,” Communications, IEEETransactions on, vol. 47, no. 8, Aug 1999.[14] National Center for Computational Sciences, “Jaguar,”http://www.nccs.gov/computing-resources/jaguar.[15] S. R. Ohring, M. Ibel, S. K. Das, <strong>and</strong> M. J. Kumar,“Generalized Fat Trees,” in IEEE IPPS, 1995, pp.37–44.[16] T. Sabano, et al., “Development <strong>of</strong> Reference MTFerrule using Insert-molded Metal Plate,”OFC/NFOEC, Feb 2008.[17] S. Scott, D. Abts, J. Kim, <strong>and</strong> W. J. Dally, “TheBlackWidow High-Radix Clos Network,” in ISCA, Jun2006.[18] Semiconductor Industries Association, “InternationalTechnology Roadmap for Semiconductors,”http://www.itrs.net, 2007 Edition.[19] L. G. Valiant, “A Scheme for Fast ParallelCommunication,” SIAM Journal on Computing,vol. 11, no. 2, 1982.[20] D. Vantrease, et al., “Corona: System Implications <strong>of</strong>Emerging Nanophotonic Technology,” in ISCA, Jun2008.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!