Topology, Routing, and Packaging of Efficient Large-Scale Networks

TTTRTTTTTTTRRTT TTTTRRTTTTT(a) L = 2, S 1 = 2, S 2 = 4, K = 1, T = 4TRTTTTTTRTTRTTTTTTTRRTT T TRTTR(b) L = 2, S 1 = 3, S 2 = 3, K = 1, T = 4Figure 3: Two examples of the HyperX topology.Dark circles (marked T) are terminals, light circles(marked R) are switches.A HyperX network needs at least T P/2 bidirectional linkscrossing any bisection in order to be nonblocking. Thus,the ratio β = K mS m/2T measures the relative bisectionbandwidth of the architecture. As discussed above, for aregular (L, S, 2, S) HyperX, i.e. a flattened butterfly withdouble-wide switch-switch links, we have β = 1.3.1 Achievable HyperX networksWe consider the problem of finding a best possible, bysome criterion, HyperX satisfying switch radix, network size,and bisection bandwidth requirements. HyperX is a directnetwork; for switches with a given radix R we must thereforeconform to the bound:LXT + K k (S k − 1) ≤ R . (2)k=1To create a system consisting of N terminal nodes, we needQat least this many terminal links. With a total of P =Lk=1 S k switches, each having T terminal links, this constraintbecomes!LYT P = T S k ≥ N . (3)k=1With both R and N viewed as given, fixed constants, theseequations provide both an upper bound (2) and a lowerbound (3) on T for each possible network shape S. Werequirethat there be enough bisection bandwidth. For some specifiedminimum relative bandwidth B we require thatβ ≡ min(K kS k )2TTTTTTTTRTTTRTTRRTTTTTTTTTR≥ B . (4)TTTTT = off−fabric links per switch1201008060402002 dimensional regular HyperX: K = 1, R = 128, N = 2 17Blue squares show best designsT ≤ R − KL(S−1)T ≥ N S −Lβ = 1β = 1/2β = 1/4β = 1/810 20 30 40 50 60S = mesh size per dimensionFigure 4: HyperX Design Space (L = 2, K = 1, R =128, N = 2 17 ). No solutions: the lower bound exceedsthe upper bound on T for all S.Each dimension must consist of at least 2 switches:S k ≥ 2 , k = 1, 2, . . . , L . (5)Our objective is to find HyperX networks that satisfy theconstraints above. Among the feasible networks, we preferthose with least cost, which to a first approximation isproportional to the number of switches, Q Lk=1 S k. A lowdimensionalsolution (with small L) is also good, since itdecreases the average hop count and yields more easily realizedhardware implementations.In contrast to the folded Clos, with switches of a givenradix one cannot build an arbitrarily large HyperX. In fact,the largest HyperX network buildable with radix R switchesis an (R − 1)-cube with one terminal per switch, so clearlywe must have N ≤ 2 R−1 . As N approaches this limit, thesmallest feasible L grows, making these HyperX topologiesunattractive; the study in Section 4.2 bears this out. Forhigh-radix switches, fortunately, the HyperX designs achievablein three or at most four dimensions turn out to be largeenough for the exascale systems we have in mind.3.1.1 The space of regular HyperX networksSince it is determined by fewer parameters, we can visualizethe regular HyperX design space more easily than thegeneral. In a regular HyperX, the network shape S and thetrunking factor K are scalars.Figures 4 and 5 show the design space of regular HyperXfor a baseline example with R = 128 (typical of thehigh-radix switches we expect in the near future) and N =128K (a network size consistent with exascale performance).There are four free design space parameters, namely L, K, S,and T . We choose to fix the trunking factor K and dimensionL and think about the resulting (S, T ) space. Hence, weare seeking integer (S, T ) pairs in the region bounded on theleft by the vertical line S = 2; below by the nonlinear networksize bound (3), and above by the linear bandwidth (4)and switch port (2) bounds that have opposite slopes andform a roof over the allowable space. For some (R, N, K, L)combinations, the lower bound lies completely above thesetwo upper bounds, and there are no feasible HyperX de-

T = off−fabric links per switch120100806040203 dimensional regular HyperX: K = 1, R = 128, N = 2 17Blue squares show best designs(14,48)(16,32)(20,17)(23,11)T ≤ R − KL(S−1)T ≥ N S −Lβ = 1β = 1/2β = 1/4β = 1/8010 15 20 25 30 35 40S = mesh size per dimensionFigure 5: HyperX Design Space (L = 3, K = 1, R =128, N = 2 17 ).signs (Figure 4). For others, the region of allowed designs isbounded by these three constraints, having a leftmost corner,a topmost corner, and a rightmost corner (Figure 5).The leftmost is the most interesting, as it is the point ofleast switch count. These points (for several values of β) areshown by blue squares and labeled with their (T, S) values.3.1.2 Selecting an optimum HyperX: An algorithmWe have implemented an algorithm to find a general HyperXmeeting the requirements of Section 3.1 and havingfewest switches. Given the parameters N, R, and β of thedesign problem, our algorithm chooses a design specified bythe dimension, L; the number of terminals per switch, T ;the shape of the network, S (an L-vector); and the trunkingfactors, K (another L-vector); so that the network size (2),switch port (3), and bisection bandwidth (4) inequalities aresatisfied and the switch count is smallest possible. We dothis by an exhaustive search of the design space, but pruneaway designs early that are known to be either infeasibleor suboptimal. The runtime of the algorithm for practical,large cases is a few seconds.To restrict the design space to be searched, we exploitcertain symmetries. First, we lose nothing by restricting thesearch to designs for whichS 1 ≤ S 2 ≤ · · · ≤ S L. (6)Since the bisection bound depends on the smallest of theS k K k , we also can assume thatK 1 ≥ K 2 ≥ · · · ≥ K L . (7)Once a feasible design is found, its switch count is an upperbound on the best achievable design. Any other designthat cannot be better will be rejected. Let O represent thesmallest switch count for all feasible designs so far discovered.We add the inequalityLYS k ≤ O (8)k=1to further restrict the set of designs to be searched. In particular,the network size constraint tells us that if we are seekingto find configurations that reduce O, we can restrict ourattention to designs for which T is large enough to achievethe desired network size given that the number of switches inan improved design is going to be less than O; we thereforerestrict the search values of T to those that satisfyLYN ≤ T ( S k ) ≤ T O ⇒ T ≥ ⌈N/O⌉ ≡ T min . (9)k=1As for the range of dimensions to search, we have 1 ≤ L ≤R − 1, the upper bound being achieved for a hypercube.3.2 Comparing general and regular HyperXWe can ask whether or not an irregular HyperX can besignificantly more attractive than a best possible regular HyperXor flattened butterfly. To that end, we conducted anexperiment. For typical values of the design problem parameters(N = 2 17 , R = 2 7 , and B ∈ {0.125, 0.25, 0.5, 1})we found best (having fewest switches) regular and generalHyperX designs. Table 1 shows the results.The irregular HyperX has fewer switches in all cases, thereduction in number ranging from 8 to 28 percent. Interestingly,the best results were always three dimensional inthe general case, whereas the best regular designs were fourdimensional in two cases. Flattened butterflies are regularHyperX networks for which K = 2. If we add this restrictionthen for the full bisection-bandwidth case (β = 1) we get aninferior design (L = 4, S = 11, K = 2, T = 9) having 14,641switches. If we add the additional restriction imposed by theflattened butterfly, namely T = S, then we have a networkwith S = 11 and size N = 11 5 = 161, 051 with 29,979 moreterminal ports than needed.3.3 Routing algorithms for the HyperXThe set of shortest paths between two switches in a HyperXis determined by their coordinate vectors. Considerpaths from switch zero (all its coordinates are 0) to switchI = (I 1, . . . , I L). The length of each shortest path is equalto the number of nonzero elements of I. Let ∆ be the setdimensions k for which I k ≠ 0. We call these the offset dimensions;the others are aligned dimensions. Every shortestpath goes directly (by one step) from the source 0 to the destinationin each of the offset dimensions in some sequence.If the number of offset dimensions (i.e. the cardinality of ∆)is D, then there are D! shortest paths.A minimal deterministic routing method is obvious: routeto the destination by the shortest path determined by movingin the dimensions of ∆ in a fixed order; without loss ofgenerality, in order of increasing dimension number. This isknown as dimension-order routing. Minimal adaptive routingis possible. When one dimension is blocked due to buffercongestion at the downstream switch, we choose another offsetdimension. We call this Min-AD. Min-AD adaptivelyexplores the space of D! shortest paths. An oblivious routingalgorithm, inspired by Valiant [19], routes all packetsthrough an intermediate node chosen at random with uniformprobability.As shown by Kim, Dally, and Abts [6], adaptive nonminimalrouting is in general better than either deterministicor oblivious routing for a flattened butterfly, and this is truefor the HyperX. They proposed the adaptive Clos-AD algorithm,which chooses, at a packet’s source node, betweendimension ordered minimal routing and non-minimal rout-

B Best Regular Switch Count Best General Switch Count0.125 L = 4, S = 7, K = 2, T = 55 2401 S = (5, 19, 19), K = (4, 1, 1), T = 76 18050.25 L = 3, S = 14, K = 2, T = 48 2744 S = (3, 27, 30), K = (9, 1, 1), T = 54 24300.5 L = 3, S = 16, K = 2, T = 32 4096 S = (3, 35, 36), K = (12, 1, 1), T = 35 37801.0 L = 4, S = 10, K = 3, T = 14 10000 S = (5, 38, 38), K = (8, 1, 1), T = 19 7220Table 1: Best regular and general HyperX networks for N = 2 17 and R = 128.ing, based on estimates of queuing delay. If it chooses toroute non-minimally, Clos-AD routes to a randomly chosenlowest common ancestor switch, considering the network asa transformed folded Clos network. The packet thereforetakes dimension ordered minimal paths from source to intermediateand from intermediate to destination nodes. Thedecision is made at the source node, and no packet can bederouted (take a non-minimal route) twice.We argue that Clos-AD can be improved in two respects.First, by deciding to route minimally or not only at thesource, it loses significant ability to adapt to congestion encountereden route. Second, by making a choice of intermediatenode that is informed by a mapping of the foldedClos into the HyperX, it introduces dimensional asymmetrywhich has no apparent advantage due to the inherent symmetryin HyperX dimensions. To address the issues raisedabove, we propose the DAL (Dimensionally-Adaptive, Loadbalanced)routing algorithm:1. Mark all offset dimensions as deroutable (unmarked)on creation of the packet at its source. On arrival at aswitch:2. Find an offset dimension with an unblocked path to analigned switch in that dimension. If none exist, then:3. Find an unmarked offset dimension and unblocked switchthat is offset from the destination and if one exists thenroute and mark the dimension as no-longer-deroutable;else4. Push the packet into a minimal, dimension-order, deterministicrouted virtual channel.Dimension-order routing, and therefore DAL, is deadlockfree; the virtual channel used as a last resort is used toprevent deadlock.Unlike Clos-AD, DAL deroutes in one of the HyperX dimensionsat a time, treating each HyperX dimension independentlyand symmetrically. This tends to cause it to takeshorter adaptive paths. While it may deroute just once perdimension, DAL can deroute as many times as there are offsetdimensions. Unlike Clos-AD, DAL routes a packet onlythrough the subnetwork of those nodes that are aligned withboth the source and the destination. Once a packet comesinto alignment with the destination in a certain dimension itremains there. Unlike Clos-AD, DAL may deroute a packetat each hop on its route, thereby adapting to congestion notvisible at the source node.4. EVALUATING HYPERXWe evaluate DAL routing for the HyperX topology andthe DAL/HyperX combination in comparison to the foldedClos. We begin by showing simulation results that explorethe performance of various routing algorithms on HyperX.They show that DAL is more effective than adaptive minimal,Valiant, and Clos-AD routing. Further simulationsshow that HyperX with DAL is generally better than anadaptively routed folded Clos with similar system configurations.We then analytically compare the number of switchesin optimum HyperX and folded Clos topologies across arange of system parameters. For large networks built withradix 32 switches, folded Clos may be best, but HyperX ismore cost effective when higher radix switches are used, evenfor million-node networks.4.1 Performance resultsWe use a cycle-accurate simulator on a variety of synthetictraffic patterns. In all cases, switches have a four cycle delayand channels have a one cycle channel transmission delay,where we define a cycle as the time for a switch to send orreceive a flit via one port. We define latency as the differencebetween the time a packet is received at a destinationterminal and the time this packet is generated at the sourceterminal. We assume single-flit packets. If a packet makesfour hops through the network from source to destination,it will transit 4 channels and 3 switches in 16 cycles. Forsimulation, sequential allocation is used for adaptive routecomputation [5]. For HyperX, we assume 6 virtual channels(VCs) per port: each port has 3 sets of 2 VCs. Cut-throughflow control is used. One VC in each set is reserved fordeadlock avoidance in the adaptive minimal and DAL routingalgorithms. Each VC can hold up to 32 flits. Switchesare input-queued; the iSLIP algorithm [13] is used for switchallocation. A crossbar has input speedup of 2.Figure 6 presents a comparison of the routing algorithms(described in Section 3.3) on a typical HyperX (regular, withT = 8, S = 8, and K = 1) with 32-port switches, 4096terminals, 512 switches, and where β is 0.5. We presentthree standard traffic patterns: Bit Complement; Bit Rotate;and Transpose [3], and one additional pattern, Swap2.Swap2 is a permutation designed to highlight the differencebetween Clos-AD and DAL. In this pattern, the evennumberednodes exchange messages with a peer across thebisection in one dimension and the odd-numbered nodes exchangemessages with a peer across the bisection in anotherdimension. These two offset dimensions correspond to thetop two levels of the folded Clos that underlies Clos-AD.Valiant routing [19] reliably achieves throughput of β:about half of all packets traverse any given bisection, twice,when routed via a random intermediate node (independentof traffic pattern). This throughput certainty comes at theprice of higher latency for light traffic loads due to the increasedhop count. Min-AD achieves throughput of 2β ontraffic patterns that do not cause congestion, such as uniformrandom, but throughput can degrade to 1/T on patternsthat cause congestion at the source.The adaptive algorithms, Clos-AD and DAL, are betterchoices. For low loads, both algorithms choose a minimal

lat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-ADlat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-AD606606505505Latency (cycle)403020432Hop countLatency (cycle)403020432Hop count10100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0Load(a) Bit Complement10100 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0Load(b) Bit Rotatelat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-ADlat : DALhop : DALValiantValiantClos-ADClos-ADMin-ADMin-AD606606505505Latency (cycle)403020432Hop countLatency (cycle)403020432Hop count101101000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(c) Transpose000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(d) Swap2Figure 6: Load-latency graphs on the (a) Bit Complement, (b) Bit Rotate, (c) Transpose, and (d) Swap2traffic patterns [3] of a regular HyperX network with N = 4096, R = 32, β = 0.5, T = 8, S = 8, K = 1.route and achieve low latency, whereas at high load, theyadapt, choosing nonminimal routes to avoid congestion. Theprimary difference between the two algorithms is when theydecide to adapt. Clos-AD estimates, at the source, the delayof a minimal route and a non-minimal route and makes itsadaptation decision eagerly. DAL makes decisions lazily asthe packet traverses the network and decides on the routeindependently per dimension. Eager adaptation means thatClos-AD has less information which could be out of date,while lazy adaptation allows DAL to be far more nimble.Clos-AD is usually better than Min-AD and Valiant, butfor transpose some limitations become apparent. Clos-ADmakes its adaptation decision at the source. With the limitedinformation available, its delay estimate is inherentlyimperfect. The delay estimate will be biased towards minimalpaths for some patterns and towards non-minimal pathsfor others. Our delay estimate tends to underestimate thecost of the non-minimal path. This bias is the reason forthe higher average hop counts of Clos-AD at all load levelswhen compared to DAL. On the Bit Complement pattern,the terminals connected to a given switch communicatewith terminals connected to a single other switch, a patternwhich congests the links on the shortest paths betweenthem. Moreover, these shortest paths span all the dimensions,so DAL will deroute most packets in each dimension.For these patterns, CLOS-AD has the advantage of makingthat deroute decision once and for all, early, and thisimproves latency for low to medium load. On the Bit Rotationpattern, this congestion occurs but on few links. Sothe average latency of DAL approaches that of CLOS-ADfor medium load.DAL has higher saturation throughput than the alternativeson all patterns that we have tested. This is becauseit will take the minimal path whenever possible, but routearound congestion when necessary. Furthermore DAL assessescongestion at each hop and independently for each dimension.Our Swap2 pattern is a good example of why DALexcels. The theoretical throughput of this pattern should be2β, which DAL is able to achieve. For this pattern, Clos-ADoverlays a folded Clos network on top of the HyperX, andcan therefore pick a route that goes through offset intermediatenodes in already aligned dimensions.We follow our comparison of routing algorithms on HyperXby comparing HyperX networks using DAL to a foldedClos network using adaptive routing. Figure 7 presents theresults of this comparison across the same set of traffic patterns.N = 4096 and R = 32 in all three networks and weuse the same configuration as in Figure 6 for the HyperXwith β = 0.5. The HyperX with β = 0.25 has T = 14,S = 7, K = 1 so that P = 343. For a tapered folded Clos,we define β asβ ≡number of links from core switches.network sizeThe folded Clos (with β = 0.5) needs three levels. There are205 level 1 switches with 10 uplinks and 20 downlinks; 130

HyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.5HyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.560605050Latency (cycle)403020Latency (cycle)403020101000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(a) Bit Complement00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(b) Bit RotateHyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.5HyperX, β = 0.5HyperX, β = 0.25Folded Clos, β = 0.560605050Latency (cycle)403020Latency (cycle)403020101000 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(c) Transpose00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Load(d) Swap2Figure 7: Load-latency graphs of HyperX networks using DAL routing compared with folded Clos networksusing adaptive routing on the (a) Bit Complement, (b) Bit Rotate, (c) Transpose, and (d) Swap2 trafficpatterns when N = 4096, R = 32, and β is varied.level 2 switches that have 16 uplinks and 16 downlinks; and80 level 3 switches with 13 downlinks, 2 of them packaged perphysical switch so that P = 415. In all cases, the folded Closachieves a saturation throughput of β because the topologyeffectively forces all patterns to behave as if they are routedby Valiant. For the same reason, the hop count on foldedClos is generally higher than HyperX with DAL as seen inthe Valiant routing case of Figure 6. The benefit of HyperXwith DAL routing is that it is able to achieve a load of 2βon patterns that don’t exhibit congestion while maintaininga load of β on patterns that do.4.2 Resource comparisonThe previous section shows that DAL routing on HyperXnetworks provides very low network latencies for a range oftraffic patterns. The focus of this section is a cost comparisonof the folded Clos topology and the HyperX topologyfor a wide range of topological parameters.In Figure 8, we vary network size (N), switch radix (R),and tapering ratio (β). The vertical axis of the graphs depictsthe number of switches divided by the number of terminalsof the network. This measure can be thought of asan estimation of networking cost per terminal node. Stepsin the graphs indicate that the dimension of the network increased.This is particularly important because an increasein dimension will result in an increase in cost for both HyperXand folded Clos. In HyperX networks, an increase indimension effectively allows fewer terminal nodes to be connectedto each switch resulting in an increase of the totalnumber of switches to achieve a given network size. FoldedClos networks add another complete layer of switches to thenetwork with each increase in dimension. In HyperX, becausefewer terminal nodes can be connected to each switchas dimensions are added, the cost of each additional dimensionis higher than a folded Clos where additional dimensionsresult in a constant cost increase. This addition is marginalat very high dimensions thus making folded Clos a betterchoice in this regime. Our data does indeed show that atlow radices (such as radix 32) and high node counts, thecost of a folded Clos with β = x is comparable with that ofa HyperX with β = 0.5x. The advent of high-radix switcheschanges this story because very large networks can be builtwith few dimensions. In the radix 128 and 256 cases, thecost of a folded Clos with β = x is comparable with thatof HyperX with β = x on many configurations. We haveshown that a HyperX can, on many traffic patterns, handlea higher load (as much as double) than a folded Clos ofequal bisection bandwidth; this makes a compelling case forHyperX.5. PACKAGING HYPERXSystem packaging is a central practical concern in networkdesign, especially for systems that occupy warehousesized machine rooms. A good mapping between the networktopology and the packaging hierarchy greatly simplifies the

(Number of switchs) / (Network size)0.50.450.40.350.30.250.20.150.1(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)(Number of switchs) / (Network size)0.220.20.180.160.140.120.10.080.060.04(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)0.050.0202 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(a) Radix 3202 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(b) Radix 64(Number of switchs) / (Network size)0.070.060.050.040.030.020.01(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)(Number of switchs) / (Network size)0.030.0250.020.0150.010.005(Folded Clos, β =1.0)(Folded Clos, β =0.5)(Folded Clos, β =0.25)(Folded Clos, β =0.125)(HyperX, β =1.0)(HyperX, β =0.5)(HyperX, β =0.25)(HyperX, β =0.125)02 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(c) Radix 12802 8 2 10 2 12 2 14 2 16 2 18 2 20Network size(d) Radix 256Figure 8: The number of switches while the network size is varied using switches with radix (a) 32, (b) 64,(c) 128, and (d) 256.system wiring. For example an entire level of interconnectmay be contained within a subsystem, minimizing the numberof individual cables between subsystems. Depending onthe topology, the number of cable assemblies can be furtherreduced by using cables with multiple parallel connections.This is particularly beneficial in optically connected systemswhere the number of parallel connections is not constrainedby considerations of cable bulk or connector density. Parallelribbon cables with up to 100 fibers have been demonstratedusing MT connectors [16].Minimizing the total cable length and using a regularstructure that is easy to deploy are also key considerations.The Dragonfly [7] is an example of a system in which networkand system packaging are co-designed to provide themost cost effective solution. The key difference between HyperXand Dragonfly is that HyperX assumes that all cableswill likely be optical in the future, whereas Dragonfly assumesthat intra-rack cables will be electrical (and cheaper)while inter-rack cables will be optical. As data rates increaseand as optical cables become more commonplace, power andcost will likely favor optics even for intra-rack applications.To minimize cabling in folded Clos networks, two packagingstrategies are common. The first strategy embeds theswitch hierarchy within the package hierarchy to exploit localinterconnect within the enclosures. The second strategycombines multiple levels of switch components into alarger switch chassis. A 128K port folded Clos network usingradix 128 switches requires three levels of switches. Thefirst level can be integrated with groups of 64 processing elements.The top two levels can be combined to make 8192-port switch units.In this arrangement only one stage of cabling is required,between the 2048, 64-node processing enclosures, and the16, 8192-port switches. Each processor enclosure has 4 linksto each of the switches. This requires a total of 32768 cableseach with 8 fibers. There is a wide range of connectionlengths, and the distribution of cable lengths is only knownafter detailed floor planning of the entire system.

S 1 = 32boards16/32 links to processing elementsT = 16 links to processing elements16/32 links to processing elements16/32 links to processing elements3131L1HyperX L1 2 x 15L2HyperX router 2 x 15L2switch L3L32 x 152 x 153131L1HyperX L1 2 x 15L2HyperX router 2 x 15L2router L3L32 x 152 x 15L1 32 way fullyconnected networkL2 link aggregationL3 link aggregationL215 fiber ribbonsL215 fiber ribbonsL315 fiber ribbonsL315 fiber ribbonsK 2 * (S 2 – 1) = 30K 3 * (S 3 – 1) = 30Figure 9: HyperX chassis showing wiring.The flexibility of the HyperX topology allows a configurationto be selected such that the different dimensions ofHyperX connectivity map directly onto the physical dimensionsof the system. In the case of a 128K port system wecan choose, for example, a three-dimensional HyperX withT = 16, S 1 = 32, S 2 = S 3 = 16, K 1 = 1, and K 2 = K 3 = 2.The first dimension is routed within the enclosure, and thetwo further dimensions correspond to the placement of theenclosure within a two-dimensional array in the machineroom.Figure 9 shows the connectivity of switches with the enclosure.As HyperX is a direct topology the switch can belocated on the same card as the processing elements. Eachboard thus comprises 16 terminal nodes and one switch. The32 boards in an enclosure communicate via an optical backplanethat links all switches in the rack in a fully connectednetwork, forming the first HyperX dimension. This connects`S 1´2 = 496 pairs of nodes using 992 waveguides orfibers. The optical backplane further aggregates the secondand third dimension links to enable the use of parallel fiberribbons for interconnect between enclosures. The aggregationnetworks require a further 1920 waveguides per level.Figure 10 shows the placement of enclosures in a twodimensionalarray. Each enclosure is connected to each ofits peers in both dimensions. Each pair of enclosures isconnected by two separate fiber ribbons for resiliency. Thewiring therefore consists of many replicated instances of 16-way all-to-all wiring patterns. Along the mid-line, at everyrow or column of racks, there are 64 optical fiber ribbonsconnecting all 8 enclosures on one side to the 8 on the other.Each set of all-to-all wiring between rows or columns of16 enclosures requires 120 fiber ribbon cables, in 15 distinctlengths. This pattern is repeated 64 times, 32 times in theX dimension and 32 times in the Y dimension, giving a totalof 7680, 64-way fiber ribbon cables. Assuming a pitchbetween enclosures of 1.2m in each dimension with an additionalmeter of overhead, cable lengths range from 2.2m to(S 2 = 16) way all to all L2 wiring x 2(S3 = 16) way all to all L3 wiring x 2Maximum bisection64 cables per all-to-all(S 2 x S 3 = 16 x 16) array of(T x S 1 = 512) processor racksFigure 10: HyperX system floor plan.19m. The total length of 64 way optical fiber ribbon is 60km.The direct correspondence between the HyperX dimensionsand the physical dimensions of the enclosure arrays greatlysimplifies the task of connecting the system. It also avoidswiring congestion points typical of folded Clos wiring at thecore level.6. FAULT TOLERANCECables and connectors cause most of the failures in largescale networks. The move to optics will change the faultcharacteristics in some respects, but the dominant failuremechanism will remain individual links rather than switches.Timeout and retry mechanisms can be used to handle transientfailures, however a network with half a million connectionsis likely to suffer hard link failures.The massive path diversity in folded Clos networks createsthe potential for a high degree of fault tolerance. Although

there are many alternate uplinks to common ancestors, thechoice of downlink is entirely deterministic and will block atany failed link; with oblivious routing a packet can thereforeget stuck at a failed link. In this case the packet must beretried from the source. In order to minimize performancedegradation in the presence of failed links it is highly desirableto prohibit failed routes from being selected as part ofthe routing process.Adaptive routing in folded Clos networks has the advantagethat it can route around failed uplinks. Links withhard errors are simply eliminated from the possible routingalternates. In order to extend this to route around faileddownlinks a mechanism is required which checks whether agiven choice of uplink would cause a failed downlink to beselected in the down path. If this is the case an alternateuplink is selected. The mechanism for detecting potentialdownlink errors uses the packet’s destination to look up atable of disallowed uplinks for this destination. The tablesize can be compressed by recognizing that due to the hierarchicaltopology contiguous ranges of destinations can behandled by a single entry.HyperX networks offer more possibilities for routing aroundfailed links, depending upon the routing algorithm beingused. In the case of DAL routing, a packet can only beblocked by a single failed link when there is only one remainingoffset dimension, and the deroute has already beentaken in that dimension. A simple solution is to permit afurther deroute. Provided there is only a single failed linkwithin a fully connected subnetwork this will not livelock.In order to avoid selecting a nonminimal route that wouldlead to a failed link, the same technique of filters on randomroutes proposed for folded Clos networks can be applied.Under DAL routing, the failure information required to beheld on each switch is more localized, since each switch needonly be concerned with failed links attached to its peers ineach dimension.7. CONCLUSIONTo design a network for an exascale system requires tradeoffsin performance, power, wiring complexity, and fault tolerance.In this paper, we investigated the impact on thosetradeoffs of high-radix switches and the topologies that theyenable.We show that the HyperX topology is an attractive candidatefor exascale networks because of its ability to achievea favorable balance among these tradeoffs. We compare theHyperX to the folded Clos and show several design pointswhere the HyperX uses significantly fewer routers to achievesimilar bandwidth on benign patterns. For applications thatdemand full bisection bandwidth, the folded Clos is generallymore efficient in terms of router components. We furtherinvestigate the routing algorithms possible on HyperX andcompare our new DAL algorithm to other algorithms forthe folded Clos and flattened butterfly and show it achievessuperior performance because of its ability to adapt to congestionon a per-hop basis. We compare the physical packagingof HyperX and the folded Clos, and show how paralleloptical fibers can be used to minimize the number of discretecable connections in each case such that large 100,000networks can be constructed with fewer than 10,000 cables.When these high density connections are available, HyperXfurther allows switches to be colocated with the processingelements in these large computer systems, leading toa simpler structure with a smaller set of component partscompared to folded Clos networks.8. REFERENCES[1] “Top500 List,” http://www.top500.org/list.[2] M. Al-Fares, A. Loukissas, and A. Vahdat, “AScalable, Commodity Data Center NetworkArchitecture,” in SIGCOMM, Jun 2008.[3] W. J. Dally and B. Towles, Principles and Practices ofInterconnection Networks. Morgan KaufmannPublishers Inc., 2003.[4] P. Geoffray and T. Hoefler, “Adaptive RoutingStrategies for Modern High Performance Networks,” inHigh Performance Interconnects, 2008, pp. 165–172.[5] J. Kim, W. J. Dally, and D. Abts, “Adaptive Routingin High-Radix Clos Network,” in SC’06, Nov 2006.[6] ——, “Flattened Butterfly: a Cost-efficient Topologyfor High-Radix Networks,” in ISCA, Jun 2007.[7] J. Kim, W. J. Dally, S. Scott, and D. Abts,“Technology-Driven, Highly-Scalable DragonflyTopology,” in ISCA, Jun 2008.[8] J. Kim, W. J. Dally, B. Towles, and A. K. Gupta,“Microarchitecture of a High-Radix Router,” in ISCA,Jun 2005.[9] N. Kirman, et al., “Leveraging Optical Technology inFuture Bus-based Chip Multiprocessors,” in MICRO,Nov 2006.[10] A. Komornicki, G. Mullen-Schulz, and D. Landon,“Roadrunner: Hardware and Software Overview,” IBMRedpaper, 2009.[11] C. Leiserson, “Fat-trees: Universal Networks forHardware-Efficient Supercomputing,” IEEETransactions on Computers, vol. 34, pp. 892–901,1985.[12] X.-Y. Lin, Y.-C. Chung, and T.-Y. Huang, “AMultiple LID Routing Scheme for Fat-tree BasedInfiniBand Networks,” in IPDPS, Apr 2004.[13] N. McKeown, A. Mekkittikul, V. Anantharam, andJ. Walrand, “Achieving 100 throughput in aninput-queued switch,” Communications, IEEETransactions on, vol. 47, no. 8, Aug 1999.[14] National Center for Computational Sciences, “Jaguar,”http://www.nccs.gov/computing-resources/jaguar.[15] S. R. Ohring, M. Ibel, S. K. Das, and M. J. Kumar,“Generalized Fat Trees,” in IEEE IPPS, 1995, pp.37–44.[16] T. Sabano, et al., “Development of Reference MTFerrule using Insert-molded Metal Plate,”OFC/NFOEC, Feb 2008.[17] S. Scott, D. Abts, J. Kim, and W. J. Dally, “TheBlackWidow High-Radix Clos Network,” in ISCA, Jun2006.[18] Semiconductor Industries Association, “InternationalTechnology Roadmap for Semiconductors,”http://www.itrs.net, 2007 Edition.[19] L. G. Valiant, “A Scheme for Fast ParallelCommunication,” SIAM Journal on Computing,vol. 11, no. 2, 1982.[20] D. Vantrease, et al., “Corona: System Implications ofEmerging Nanophotonic Technology,” in ISCA, Jun2008.

Topology, Routing, and Packaging of Efficient Large-Scale Networks

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?