DGX/HGX GPU Cluster Network Topologies: Fat-Tree, Spine-Leaf, and Dragonfly+ Compared

Introduction

Selecting the right network topology is one of the most critical decisions when designing GPU clusters for AI training. The topology determines bandwidth availability, latency characteristics, scalability limits, and total cost of ownership. This article provides an in-depth comparison of the three dominant topologies for DGX and HGX clusters: Fat-Tree, Spine-Leaf (CLOS), and Dragonfly+.

Topology Fundamentals

Network topology defines how switches and compute nodes are interconnected. For AI clusters, the ideal topology must provide:

  • High bisection bandwidth: Any half of the cluster can communicate with the other half at full speed
  • Low diameter: Minimum number of hops between any two nodes
  • Scalability: Ability to grow from hundreds to tens of thousands of nodes
  • Fault tolerance: Multiple paths between endpoints for redundancy
  • Cost efficiency: Optimal balance of performance and capital expenditure
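
To make the first two metrics concrete, the sketch below computes the hop-count diameter of a toy four-switch leaf-spine fabric with a breadth-first search. The switch names and link list are illustrative only, not taken from any product.

```python
from collections import deque

# Toy fabric: two leaf and two spine switches, every leaf wired to every spine.
# Names and links are illustrative.
links = {
    "leaf1": {"spine1", "spine2"},
    "leaf2": {"spine1", "spine2"},
    "spine1": {"leaf1", "leaf2"},
    "spine2": {"leaf1", "leaf2"},
}

def hops(src: str, dst: str) -> int:
    """Shortest switch-to-switch hop count, found with BFS."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nbr in links[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    raise ValueError(f"no path from {src} to {dst}")

# Diameter = worst-case shortest path over all switch pairs.
print(max(hops(a, b) for a in links for b in links if a != b))  # 2
```

Fault tolerance shows up in the same structure: removing either spine still leaves a path between the two leaves.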

Fat-Tree Topology

[Figure: three-tier Fat-Tree network topology]

Architecture

Fat-Tree is a multi-rooted tree where bandwidth increases toward the core. A typical 3-tier Fat-Tree consists of:

  • Edge/Leaf Layer: Switches directly connected to GPU servers
  • Aggregation/Spine Layer: Intermediate switches connecting leaf switches
  • Core Layer: Top-tier switches providing inter-pod connectivity (for very large deployments)

In a pure Fat-Tree, every leaf connects to every switch in the tier above it within its pod, and each tier offers as much uplink as downlink capacity, creating a non-blocking fabric with full bisection bandwidth.

Key Characteristics

  • Bisection Bandwidth: 100% (non-blocking)
  • Diameter: 4-6 hops (leaf → aggregation → core → aggregation → leaf in a 3-tier design)
  • Scalability: Up to 100,000+ endpoints with a 3-tier design (see the sizing sketch below)
  • Redundancy: N paths between any two servers (N = number of spine switches)
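
To ground the scalability figure, the classic k-ary Fat-Tree construction built entirely from k-port switches supports k³/4 hosts using 5k²/4 switches. The sketch below is a minimal application of those standard formulas; the 64-port radix is just an example.

```python
def fat_tree_sizing(k: int) -> dict:
    """3-tier k-ary Fat-Tree built entirely from k-port switches (k even)."""
    assert k % 2 == 0
    hosts = k ** 3 // 4          # k pods, k/2 edge switches per pod, k/2 hosts each
    edge = agg = k ** 2 // 2     # k/2 edge and k/2 aggregation switches per pod
    core = k ** 2 // 4
    cables = 3 * k ** 3 // 4     # host-edge + edge-agg + agg-core links
    return {"hosts": hosts, "switches": edge + agg + core, "cables": cables}

# Example: 64-port switches, the radix of current 400G-class switch silicon.
print(fat_tree_sizing(64))
# {'hosts': 65536, 'switches': 5120, 'cables': 196608}
```

Pushing past roughly 65K hosts therefore requires higher-radix switches or additional tiers, which is consistent with the 100,000+ endpoint figure above.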

Advantages

  • Predictable, deterministic performance
  • Well-understood design patterns and operational practices
  • Full bisection bandwidth eliminates network bottlenecks
  • Excellent for all-to-all communication (gradient synchronization)

Disadvantages

  • High cable count: O(N²) cables for N switches
  • Expensive: requires many high-radix switches
  • Power consumption scales linearly with cluster size
  • Physical cabling complexity in large deployments

Best Use Cases

  • Clusters with 100-5,000 GPUs
  • Workloads requiring guaranteed bandwidth (LLM training)
  • Environments where predictability trumps cost

Spine-Leaf (CLOS) Topology

[Figure: Spine-Leaf (CLOS) network topology]

Architecture

Spine-Leaf is a 2-tier CLOS fabric, a close relative of Fat-Tree optimized for data center deployments:

  • Leaf Layer: Top-of-Rack (ToR) switches connecting servers
  • Spine Layer: Aggregation switches providing inter-leaf connectivity

Every leaf connects to every spine, but unlike Fat-Tree, Spine-Leaf allows for asymmetric designs (e.g., different port counts, oversubscription ratios).

Key Characteristics

  • Bisection Bandwidth: 50-100% (configurable via oversubscription)
  • Diameter: 2 hops (leaf → spine → leaf)
  • Scalability: 10,000-100,000 endpoints
  • Flexibility: Supports tapered designs (2:1 or 4:1 oversubscription; see the sketch below)
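
Oversubscription is simply the ratio of server-facing to spine-facing bandwidth at each leaf. A minimal calculation, with illustrative port counts and speeds rather than any specific switch model:

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of server-facing to spine-facing bandwidth at one leaf switch."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Non-blocking (1:1): 32 x 400G down to servers, 32 x 400G up to spines.
print(oversubscription(32, 400, 32, 400))  # 1.0

# Tapered 2:1 design: 32 x 400G down, only 16 x 400G up.
print(oversubscription(32, 400, 16, 400))  # 2.0
```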

Advantages

  • Lower latency than Fat-Tree (fewer hops)
  • Flexible oversubscription allows cost optimization
  • Industry-standard design with broad vendor support
  • Easier to scale incrementally (add spine switches as needed)

Disadvantages

  • Oversubscribed designs can create bottlenecks
  • Requires careful traffic engineering to avoid hotspots
  • Still requires significant cabling (though less than Fat-Tree)

Best Use Cases

  • General-purpose GPU clusters (mixed training/inference)
  • Deployments prioritizing cost-performance balance
  • Clusters with locality-aware workload placement

DGX SuperPOD Example

[Figure: NVIDIA DGX SuperPOD reference architecture]

NVIDIA's DGX SuperPOD uses a Spine-Leaf design with InfiniBand:

  • Leaf and Spine Switches: NVIDIA Quantum-2 QM9700 (64 ports @ 400Gbps NDR) in current-generation SuperPODs; the earlier DGX A100 generation used NVIDIA Quantum QM8790 (40 ports @ 200Gbps HDR)
  • Configuration: compute nodes are grouped into scalable units (SUs) of 20 DGX A100 or 32 DGX H100 systems, cabled rail-optimized (one leaf switch per compute rail, 8 rails per system), with every leaf connected to every spine
  • Bisection Bandwidth: the compute fabric is built non-blocking (1:1); a single 64-port Quantum-2 switch carries 25.6Tbps (64 × 400Gbps) in each direction
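
The scalable-unit figures above translate into injection bandwidth with simple arithmetic. The sketch below assumes the per-generation NIC speeds and SU sizes listed above and should be read as rough capacity math rather than an official specification.

```python
def su_injection_tbps(nodes: int, rails: int, rail_gbps: float) -> float:
    """Aggregate server-side (injection) bandwidth of one scalable unit, in Tb/s."""
    return nodes * rails * rail_gbps / 1000

# DGX A100 generation: 20 nodes per SU, 8 x 200Gbps compute rails per node.
print(su_injection_tbps(20, 8, 200))   # 32.0 Tb/s
# DGX H100 generation: 32 nodes per SU, 8 x 400Gbps compute rails per node.
print(su_injection_tbps(32, 8, 400))   # 102.4 Tb/s
```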

Dragonfly+ Topology

[Figure: hierarchical structure of a Dragonfly+ network topology]

Architecture

Dragonfly+ is a hierarchical topology designed for extreme-scale systems (10,000+ nodes). It organizes switches into groups with dense intra-group connectivity and sparse inter-group links:

  • Intra-Group: classic Dragonfly fully meshes the switches within a group; Dragonfly+ instead builds each group as a small leaf-spine (CLOS) fabric
  • Inter-Group: a subset of each group's switches (the spine switches in Dragonfly+) carry global links to other groups
  • Hierarchical: Can be extended to multiple levels (groups of groups)
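
Dragonfly+ groups are internally leaf-spine rather than fully meshed, but the classic Dragonfly sizing rules still convey why the topology reaches such scale on modest switch radix. The sketch below uses the standard parameters (p terminal ports, a routers per group, h global links per router); the example values are illustrative, not a product configuration.

```python
def dragonfly_max_endpoints(p: int, a: int, h: int) -> dict:
    """Classic Dragonfly sizing: p terminals, a routers per group, h global
    links per router; balanced designs use a = 2p = 2h."""
    radix_needed = p + (a - 1) + h   # terminal + intra-group + global ports
    groups = a * h + 1               # max groups with one link per group pair
    return {"radix_needed": radix_needed,
            "groups": groups,
            "endpoints": a * p * groups}

# Example with 64-port routers: p = h = 16, a = 32.
print(dragonfly_max_endpoints(p=16, a=32, h=16))
# {'radix_needed': 63, 'groups': 513, 'endpoints': 262656}
```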

Key Characteristics

  • Bisection Bandwidth: 40-60% (lower than Fat-Tree, but sufficient for most workloads)
  • Diameter: 3 hops with minimal routing (a local hop in the source group → one global link → a local hop in the destination group)
  • Scalability: 100,000+ endpoints with 2-level hierarchy
  • Cable Efficiency: O(N^1.5) vs. O(N²) for Fat-Tree

Advantages

  • Dramatically reduced cable count (50-70% fewer than Fat-Tree)
  • Lower cost per port at extreme scale
  • Excellent for workloads with locality (model parallelism within groups)
  • Lower power consumption due to fewer switches

Disadvantages

  • Complex routing algorithms required (adaptive routing is essential; see the sketch after this list)
  • Performance depends heavily on traffic patterns
  • Less predictable than Fat-Tree for all-to-all traffic
  • Requires sophisticated workload placement strategies
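
The adaptive-routing requirement usually means a UGAL-style rule: take the minimal path unless it looks congested, otherwise detour through a randomly chosen intermediate group (Valiant routing). The toy decision function below illustrates the idea; the queue depths are made up, and real implementations add biases and per-virtual-channel state.

```python
def ugal_choice(min_queue: int, min_hops: int,
                nonmin_queue: int, nonmin_hops: int) -> str:
    """UGAL-style choice: compare estimated delay (queue depth x hop count)
    of the minimal path against one candidate non-minimal path."""
    if min_queue * min_hops <= nonmin_queue * nonmin_hops:
        return "minimal"
    return "non-minimal via intermediate group"

# Lightly loaded fabric: the minimal path wins.
print(ugal_choice(min_queue=4, min_hops=3, nonmin_queue=6, nonmin_hops=5))
# Hot spot on the minimal path: detour around it.
print(ugal_choice(min_queue=40, min_hops=3, nonmin_queue=6, nonmin_hops=5))
```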

Best Use Cases

  • Extreme-scale clusters (10,000+ GPUs)
  • Workloads with strong locality (pipeline parallelism, federated learning)
  • Cost-sensitive deployments where 100% bisection bandwidth isn't required

Topology Comparison Table

Dimension         Fat-Tree     Spine-Leaf   Dragonfly+
Bisection BW      100%         50-100%      40-60%
Diameter          4-6 hops     2 hops       3 hops
Scalability       100K nodes   100K nodes   1M+ nodes
Cable Count       Very High    High         Medium
Cost (relative)   Highest      Medium       Lowest
Complexity        Low          Low          High
Predictability    Excellent    Good         Fair

Choosing the Right Topology

For Small-Medium Clusters (100-1,000 GPUs)

Recommendation: Spine-Leaf (2-tier CLOS)

  • Optimal balance of cost, performance, and simplicity
  • 2-hop latency ideal for training workloads
  • Easy to deploy and operate

For Large Clusters (1,000-10,000 GPUs)

Recommendation: Fat-Tree or Spine-Leaf with minimal oversubscription

  • Full bisection bandwidth critical at this scale
  • Predictable performance justifies higher cost
  • Operational maturity of these topologies reduces risk

For Extreme-Scale Clusters (10,000+ GPUs)

Recommendation: Dragonfly+ or multi-tier CLOS

  • Cable reduction becomes critical at this scale
  • Workload placement strategies can mitigate lower bisection bandwidth
  • Cost savings of 30-50% vs. Fat-Tree
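
The three recommendations above collapse into a simple lookup. The thresholds in the sketch below simply restate this article's guidance and are no substitute for a detailed design exercise.

```python
def recommend_topology(num_gpus: int, oversubscription_ok: bool = True) -> str:
    """Rough topology pick following the size thresholds in this article."""
    if num_gpus < 1_000:
        return "2-tier Spine-Leaf (oversubscription optional)"
    if num_gpus <= 10_000:
        return ("Spine-Leaf with minimal oversubscription" if oversubscription_ok
                else "Fat-Tree (non-blocking)")
    return "Dragonfly+ or multi-tier CLOS, with workload placement"

print(recommend_topology(512))                               # small-medium cluster
print(recommend_topology(4_096, oversubscription_ok=False))  # large cluster
print(recommend_topology(24_576))                            # extreme scale
```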

Hybrid Approaches

Many deployments use hybrid topologies:

  • Intra-Pod Fat-Tree + Inter-Pod Dragonfly: Full bandwidth within training pods, sparse connectivity between pods
  • Spine-Leaf with Rail Optimization: each of a node's compute NICs ("rails") lands on its own leaf switch so rail-aligned collective traffic stays local, often combined with separate fabrics for compute, storage, and management traffic
  • Hierarchical CLOS: Multiple spine layers for mega-scale deployments
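
Rail optimization is easy to picture as a cabling rule: NIC i of every node in a pod lands on leaf switch i, so collectives that stay on their own rail never leave a single leaf. A minimal sketch with made-up names:

```python
def rail_leaf(node_id: int, nic_index: int) -> str:
    """Rail-optimized cabling: NIC i of every node connects to leaf i,
    regardless of which node it belongs to (names are illustrative)."""
    return f"leaf-rail{nic_index:02d}"

# Every node's NIC 3 shares one leaf, so rail-3 traffic is a single hop.
print({f"node{n}": rail_leaf(n, 3) for n in range(4)})
# {'node0': 'leaf-rail03', 'node1': 'leaf-rail03', ...}
```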

Conclusion

There is no one-size-fits-all topology for GPU clusters. Fat-Tree and Spine-Leaf dominate the 100-10,000 GPU range due to their predictability and operational maturity. Dragonfly+ emerges as the cost-effective choice for extreme-scale deployments where workload locality can be exploited.

When selecting a topology, consider:

  • Cluster size and growth trajectory
  • Workload characteristics (all-to-all vs. localized communication)
  • Budget constraints (CapEx and OpEx)
  • Operational expertise and tooling

For most organizations deploying DGX or HGX clusters today, a 2-tier Spine-Leaf fabric with 400G/800G optics and 1:1 (non-blocking) or 2:1 oversubscription represents the sweet spot of performance, cost, and operational simplicity.
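
As a sanity check on that sweet spot: with today's 64-port switch radix, a 2-tier fabric tops out around 2,048 non-blocking 400G endpoints (one endpoint per GPU NIC), and tapering buys roughly proportional headroom. The sketch below shows the arithmetic, with port splits chosen for illustration and one link assumed between each leaf-spine pair.

```python
def two_tier_capacity(k: int, down_per_leaf: int) -> dict:
    """Max endpoints of a 2-tier leaf-spine built from k-port switches,
    assuming one link from every leaf to every spine."""
    up_per_leaf = k - down_per_leaf   # spine-facing ports on each leaf
    spines = up_per_leaf              # one uplink per spine
    leaves = k                        # each spine port feeds one leaf
    return {"leaves": leaves, "spines": spines,
            "oversubscription": round(down_per_leaf / up_per_leaf, 2),
            "endpoints": leaves * down_per_leaf}

# 64-port switches, non-blocking: 32 ports down / 32 up per leaf.
print(two_tier_capacity(64, 32))   # 2048 endpoints at 1:1
# 64-port switches, tapered: 42 ports down / 22 up per leaf (~2:1).
print(two_tier_capacity(64, 42))   # 2688 endpoints at ~1.9:1
```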
