DGX/HGX GPU Cluster Network Topologies: Fat-Tree, Spine-Leaf, and Dragonfly+ Compared
Introduction
Selecting the right network topology is one of the most critical decisions when designing GPU clusters for AI training. The topology determines bandwidth availability, latency characteristics, scalability limits, and total cost of ownership. This article provides an in-depth comparison of the three dominant topologies for DGX and HGX clusters: Fat-Tree, Spine-Leaf (CLOS), and Dragonfly+.
Topology Fundamentals
Network topology defines how switches and compute nodes are interconnected. For AI clusters, the ideal topology must provide the following (a short calculation sketch follows this list):
- High bisection bandwidth: Any half of the cluster can communicate with the other half at full speed
- Low diameter: Minimum number of hops between any two nodes
- Scalability: Ability to grow from hundreds to tens of thousands of nodes
- Fault tolerance: Multiple paths between endpoints for redundancy
- Cost efficiency: Optimal balance of performance and capital expenditure
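To make the first two metrics concrete, the sketch below computes oversubscription and bisection bandwidth for a generic two-tier leaf-spine fabric. The port counts and link speeds are made-up illustrative values, not a specific switch or cluster.

```python
# Illustrative calculations for two of the metrics above. All port counts and
# speeds are example values, not a specific product.

def oversubscription_ratio(downlinks_per_leaf: int, uplinks_per_leaf: int) -> float:
    """Server-facing vs. spine-facing bandwidth on a leaf (assuming equal port speeds).
    1.0 means non-blocking; 2.0 means 2:1 oversubscribed."""
    return downlinks_per_leaf / uplinks_per_leaf

def bisection_bandwidth_gbps(num_leaves: int, uplinks_per_leaf: int, link_speed_gbps: float) -> float:
    """Worst-case bandwidth between two halves of the cluster. With every leaf
    wired to every spine, the limiting cut crosses the uplinks of half the leaves."""
    return num_leaves * uplinks_per_leaf * link_speed_gbps / 2

if __name__ == "__main__":
    # Example: 16 leaves, each with 32 x 400G downlinks and 16 x 400G uplinks (2:1 tapered).
    print("oversubscription:", oversubscription_ratio(32, 16))          # -> 2.0
    print("bisection (Gb/s):", bisection_bandwidth_gbps(16, 16, 400.0)) # -> 51200.0
```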
Fat-Tree Topology
Architecture
Fat-Tree is a multi-rooted tree where bandwidth increases toward the core. A typical 3-tier Fat-Tree consists of:
- Edge/Leaf Layer: Switches directly connected to GPU servers
- Aggregation/Spine Layer: Intermediate switches connecting leaf switches
- Core Layer: Top-tier switches providing inter-pod connectivity (for very large deployments)
In a pure Fat-Tree, every leaf switch connects to every spine switch, creating a non-blocking fabric with full bisection bandwidth.
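For a sense of scale, the sketch below sizes the textbook 3-tier fat-tree built from identical k-port switches (the classic k-ary construction). Real DGX/HGX fabrics mix radixes and tiering, so treat these formulas as an idealized model rather than a vendor design.

```python
# Sizing of the idealized 3-tier "k-ary" fat-tree built from identical k-port switches.

def fat_tree_size(k: int) -> dict:
    assert k % 2 == 0, "the construction assumes an even switch radix"
    hosts = k ** 3 // 4                 # k pods x (k/2 edge switches) x (k/2 hosts each)
    edge = agg = k * (k // 2)           # k pods, each with k/2 edge and k/2 aggregation switches
    core = (k // 2) ** 2
    cables = 3 * hosts                  # host-edge, edge-agg, and agg-core links each total k^3/4
    return {"hosts": hosts, "switches": edge + agg + core, "cables": cables}

if __name__ == "__main__":
    for k in (16, 32, 64):
        print(k, fat_tree_size(k))
    # k=64 (a 64-port switch): 65,536 hosts, 5,120 switches, 196,608 cables
```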
Key Characteristics
- Bisection Bandwidth: 100% (non-blocking)
- Diameter: 4-6 hops in a 3-tier design (leaf → spine → core → spine → leaf, plus the host links)
- Scalability: Up to 100,000+ endpoints with 3-tier design
- Redundancy: up to N equal-cost paths between servers on different leaves (N = number of spine switches)
Advantages
- Predictable, deterministic performance
- Well-understood design patterns and operational practices
- Full bisection bandwidth eliminates network bottlenecks
- Excellent for all-to-all communication (gradient synchronization)
Disadvantages
- High cable count: a full-bisection 3-tier fabric needs roughly two inter-switch cables per endpoint on top of the host links, most of them long cross-rack runs
- Expensive: requires many high-radix switches
- Power consumption scales linearly with cluster size
- Physical cabling complexity in large deployments
Best Use Cases
- Clusters with 100-5,000 GPUs
- Workloads requiring guaranteed bandwidth (LLM training)
- Environments where predictability trumps cost
Spine-Leaf (CLOS) Topology
Architecture
Spine-Leaf is a 2-tier CLOS fabric, a generalization of Fat-Tree optimized for data center deployments:
- Leaf Layer: Top-of-Rack (ToR) switches connecting servers
- Spine Layer: Aggregation switches providing inter-leaf connectivity
Every leaf connects to every spine, but unlike Fat-Tree, Spine-Leaf allows for asymmetric designs (e.g., different port counts, oversubscription ratios).
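The sketch below turns that flexibility into numbers: given a leaf and spine radix and a target oversubscription ratio, it derives a port split per leaf and the resulting fabric ceiling. The port counts are illustrative assumptions, not a particular switch model.

```python
# Rough sizing of a two-tier spine-leaf fabric from switch radix and a chosen
# oversubscription ratio. Assumes one uplink from every leaf to every spine.

def leaf_spine_plan(leaf_ports: int, spine_ports: int, oversub: float) -> dict:
    """Split each leaf's ports so that downlinks / uplinks == oversub, then
    derive the spine count and the maximum fabric size."""
    uplinks = int(leaf_ports / (1 + oversub))
    downlinks = leaf_ports - uplinks
    max_leaves = spine_ports            # each spine port terminates one leaf
    return {
        "downlinks_per_leaf": downlinks,
        "uplinks_per_leaf": uplinks,
        "spines": uplinks,              # one uplink per spine from every leaf
        "max_leaves": max_leaves,
        "max_servers": downlinks * max_leaves,
    }

if __name__ == "__main__":
    print("1:1 :", leaf_spine_plan(64, 64, 1.0))  # 32 down / 32 up per leaf, 2,048 servers max
    print("2:1 :", leaf_spine_plan(64, 64, 2.0))  # 43 down / 21 up per leaf (~2:1), 2,752 servers max
```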
Key Characteristics
- Bisection Bandwidth: 50-100% (configurable via oversubscription)
- Diameter: 2 hops (leaf → spine → leaf)
- Scalability: 10,000-100,000 endpoints
- Flexibility: Supports tapered designs (2:1, 4:1 oversubscription)
Advantages
- Lower latency than Fat-Tree (fewer hops)
- Flexible oversubscription allows cost optimization
- Industry-standard design with broad vendor support
- Easier to scale incrementally (add spine switches as needed)
Disadvantages
- Oversubscribed designs can create bottlenecks
- Requires careful traffic engineering to avoid hotspots
- Still requires significant cabling (though less than Fat-Tree)
Best Use Cases
- General-purpose GPU clusters (mixed training/inference)
- Deployments prioritizing cost-performance balance
- Clusters with locality-aware workload placement
DGX SuperPOD Example
NVIDIA's DGX SuperPOD reference architectures use a rail-optimized, non-blocking Spine-Leaf InfiniBand compute fabric:
- DGX A100 generation: NVIDIA Quantum QM8700 switches (40 ports @ 200 Gb/s HDR), organized in scalable units of 20 DGX A100 systems
- DGX H100 generation: NVIDIA Quantum-2 QM9700 switches (64 ports @ 400 Gb/s NDR, 25.6 Tb/s of switching capacity per switch)
- Configuration: each of a node's eight compute HCAs connects to its own leaf switch ("rail"), and every leaf connects to every spine
- Oversubscription: 1:1 (non-blocking) on the compute fabric; the port-budget sketch below checks the A100-generation numbers
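As a sanity check on those figures, here is a minimal port-budget sketch for one rail-optimized scalable unit, assuming 20 nodes per unit, 8 compute HCAs per node, and 40-port leaf switches as quoted above. It illustrates the arithmetic and is not a reproduction of NVIDIA's reference architecture tables.

```python
# Port-budget check for one rail-optimized scalable unit (SU). The values in the
# example call mirror the DGX A100-generation figures quoted above and are
# otherwise assumptions for illustration.

def su_leaf_budget(nodes_per_su: int, rails: int, leaf_ports: int, oversub: float = 1.0) -> dict:
    # Rail-optimized wiring: HCA i of every node lands on leaf switch i, so each
    # of the `rails` leaves sees exactly one port from every node in the SU.
    downlinks = nodes_per_su
    uplinks = int(round(downlinks / oversub))
    return {
        "leaf_switches": rails,
        "downlinks_per_leaf": downlinks,
        "uplinks_per_leaf": uplinks,
        "ports_used_per_leaf": downlinks + uplinks,
        "fits": downlinks + uplinks <= leaf_ports,
    }

if __name__ == "__main__":
    print(su_leaf_budget(nodes_per_su=20, rails=8, leaf_ports=40))
    # -> 20 down + 20 up = 40 ports, exactly filling a 40-port leaf at 1:1
```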
Dragonfly+ Topology
Architecture
Dragonfly+ is a hierarchical topology designed for extreme-scale systems (10,000+ nodes). It organizes nodes into groups with dense intra-group connectivity and sparse inter-group links:
- Intra-Group: Switches within a group are densely connected; in Dragonfly+ each group is a small leaf-spine (CLOS) pod, rather than the full mesh of classic Dragonfly
- Inter-Group: Each group's spine switches carry global links to switches in other groups
- Hierarchical: Can be extended to multiple levels (groups of groups)
Key Characteristics
- Bisection Bandwidth: 40-60% (lower than Fat-Tree, but sufficient for most workloads)
- Diameter: 3 hops on a minimal path (source leaf → local spine → remote spine via a global link → destination leaf)
- Scalability: 100,000+ endpoints with 2-level hierarchy
- Cable Efficiency: far fewer long global cables than an equivalent Fat-Tree, since each pair of groups needs only a small number of direct links (see the sizing sketch below)
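A rough sizing sketch makes the cable argument tangible. It models groups the way the text describes Dragonfly+ (a small leaf/spine pod per group, with the spines carrying global links); the parameters L, S, p, and h are generic assumptions for illustration, not values from any vendor document.

```python
# Dragonfly+ sizing sketch: L leaf and S spine switches per group, p endpoints
# per leaf, h global links per spine. With one global link between every pair
# of groups, a fully connected group graph supports up to S*h + 1 groups.

def dragonfly_plus(L: int, S: int, p: int, h: int) -> dict:
    groups = S * h + 1
    return {
        "groups": groups,
        "endpoints": groups * L * p,
        "switches": groups * (L + S),
        "intra_group_links": groups * L * S,          # full bipartite wiring inside each group
        "global_links": groups * (groups - 1) // 2,   # the long-reach, typically optical, cables
    }

if __name__ == "__main__":
    # Example: 16 leaves + 16 spines per group, 16 endpoints per leaf, 16 global links per spine.
    print(dragonfly_plus(L=16, S=16, p=16, h=16))
    # -> 257 groups, 65,792 endpoints, 8,224 switches, 65,792 intra-group + 32,896 global links
```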
Advantages
- Dramatically reduced cable count (50-70% fewer than Fat-Tree)
- Lower cost per port at extreme scale
- Excellent for workloads with locality (model parallelism within groups)
- Lower power consumption due to fewer switches
Disadvantages
- Complex routing algorithms required (adaptive routing essential)
- Performance depends heavily on traffic patterns
- Less predictable than Fat-Tree for all-to-all traffic
- Requires sophisticated workload placement strategies
Best Use Cases
- Extreme-scale clusters (10,000+ GPUs)
- Workloads with strong locality (pipeline parallelism, federated learning)
- Cost-sensitive deployments where 100% bisection bandwidth isn't required
Topology Comparison Table
| Dimension | Fat-Tree | Spine-Leaf | Dragonfly+ |
|---|---|---|---|
| Bisection BW | 100% | 50-100% | 40-60% |
| Diameter | 4-6 hops | 2 hops | 3 hops |
| Scalability | 100K nodes | 100K nodes | 1M+ nodes |
| Cable Count | Very High | High | Medium |
| Cost (relative) | Highest | Medium | Lowest |
| Complexity | Low | Low | High |
| Predictability | Excellent | Good | Fair |
Choosing the Right Topology
For Small-Medium Clusters (100-1,000 GPUs)
Recommendation: Spine-Leaf (2-tier CLOS)
- Optimal balance of cost, performance, and simplicity
- 2-hop latency ideal for training workloads
- Easy to deploy and operate
For Large Clusters (1,000-10,000 GPUs)
Recommendation: Fat-Tree or Spine-Leaf with minimal oversubscription
- Full bisection bandwidth critical at this scale
- Predictable performance justifies higher cost
- Operational maturity of these topologies reduces risk
For Extreme-Scale Clusters (10,000+ GPUs)
Recommendation: Dragonfly+ or multi-tier CLOS
- Cable reduction becomes critical at this scale
- Workload placement strategies can mitigate lower bisection bandwidth
- Cost savings of 30-50% vs. Fat-Tree; the helper below encodes these sizing rules of thumb
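As a compact summary of this guidance, the helper below encodes the article's own rules of thumb. The thresholds come from the recommendations above, not from any formal standard, and the `needs_full_bisection` flag is a hypothetical stand-in for a real workload analysis.

```python
# Encodes the sizing guidance above; thresholds are this article's rules of thumb.

def recommend_topology(gpu_count: int, needs_full_bisection: bool = True) -> str:
    if gpu_count < 1_000:
        return "2-tier Spine-Leaf (CLOS)"
    if gpu_count < 10_000:
        return "Fat-Tree or Spine-Leaf with minimal oversubscription"
    # Extreme scale: locality-aware placement can offset Dragonfly+'s lower bisection bandwidth.
    return "Multi-tier CLOS" if needs_full_bisection else "Dragonfly+"

if __name__ == "__main__":
    for n in (512, 4_096, 32_768):
        print(n, "->", recommend_topology(n))
```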
Hybrid Approaches
Many deployments use hybrid topologies:
- Intra-Pod Fat-Tree + Inter-Pod Dragonfly: Full bandwidth within training pods, sparse connectivity between pods
- Spine-Leaf with Rail Optimization: each GPU NIC ("rail") attaches to its own leaf switch so same-rail traffic stays local, typically alongside separate fabrics for storage and management traffic (a small wiring sketch follows this list)
- Hierarchical CLOS: Multiple spine layers for mega-scale deployments
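To make the rail idea concrete, here is a minimal wiring sketch for rail optimization; the node and switch names are illustrative placeholders rather than any specific reference architecture.

```python
# Rail-optimized cabling sketch: NIC i of every server connects to leaf switch i
# ("rail i"), so collectives that pair the same NIC index across servers stay on
# one leaf before reaching the spine. Names and counts are illustrative.

from collections import defaultdict

def rail_wiring(num_servers: int, nics_per_server: int) -> dict:
    """Return {rail_leaf: [(server, nic), ...]} for a rail-optimized layout."""
    wiring = defaultdict(list)
    for server in range(num_servers):
        for nic in range(nics_per_server):
            wiring[f"leaf-rail-{nic}"].append((f"node-{server:03d}", nic))
    return dict(wiring)

if __name__ == "__main__":
    plan = rail_wiring(num_servers=4, nics_per_server=8)
    for leaf, ports in plan.items():
        print(leaf, ports)   # each rail leaf sees one NIC from every server
```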
Conclusion
There is no one-size-fits-all topology for GPU clusters. Fat-Tree and Spine-Leaf dominate the 100-10,000 GPU range due to their predictability and operational maturity. Dragonfly+ emerges as the cost-effective choice for extreme-scale deployments where workload locality can be exploited.
When selecting a topology, consider:
- Cluster size and growth trajectory
- Workload characteristics (all-to-all vs. localized communication)
- Budget constraints (CapEx and OpEx)
- Operational expertise and tooling
For most organizations deploying DGX or HGX clusters today, a 2-tier Spine-Leaf fabric with 400G/800G optics and 1:1 or 2:1 oversubscription represents the sweet spot of performance, cost, and operational simplicity.