Data Center East-West Traffic: Optical Module Requirements for Modern Workloads

Introduction

The traditional data center traffic model—where most communication flows north-south between clients and servers—has fundamentally shifted. Modern applications, particularly AI training, distributed databases, microservices architectures, and hyper-converged infrastructure, generate massive east-west traffic flows between servers within the data center. This transformation has profound implications for optical module selection, network architecture, and capacity planning. Understanding east-west traffic patterns and their impact on optical networking is essential for designing efficient, scalable AI data centers.

Understanding East-West vs North-South Traffic

Traditional North-South Model

Characteristics: In traditional three-tier architectures (access-distribution-core), traffic primarily flows vertically between end users and centralized servers. A typical ratio was 80% north-south (client-server) to 20% east-west (server-server).

Network Design: Optimized for north-south bandwidth with oversubscribed east-west paths. Core and distribution layers had high bandwidth, while server-to-server communication traversed multiple hops with limited capacity.

Optical Module Deployment: High-speed modules concentrated at core and distribution layers (40G, 100G), while access layer used lower speeds (1G, 10G).

Modern East-West Dominance

Traffic Shift: Modern data centers exhibit 70-90% east-west traffic, with some AI training clusters approaching 95% east-west during training operations.

Drivers:

  • Distributed Computing: MapReduce, Spark, and other frameworks distribute computation across hundreds or thousands of servers
  • Microservices: Applications decomposed into dozens or hundreds of services communicating constantly
  • AI Training: Gradient synchronization requires all-to-all communication between GPUs
  • Distributed Storage: Ceph, HDFS, and other systems replicate data across multiple nodes
  • Virtual Machine Migration: Live migration moves VMs between hosts, generating large data transfers

Network Implications: Requires non-blocking or minimally oversubscribed east-west bandwidth, fundamentally changing network topology and optical module requirements.

AI Training: The Ultimate East-West Workload

Communication Patterns in Distributed Training

Data Parallelism: The most common training strategy splits data across GPUs, each processing different batches:

  • Forward Pass: Minimal communication, each GPU processes independently
  • Backward Pass: Compute gradients locally
  • Gradient Synchronization: All-reduce operation exchanges gradients between all GPUs—pure east-west traffic
  • Traffic Volume: For a 175B-parameter model (~350GB of FP16 gradients), each of the 1,024 GPUs must exchange roughly the full gradient volume every iteration (see the sketch after this list)
  • Frequency: from several iterations per second for smaller models to one iteration every few seconds for the largest, generating near-continuous east-west traffic bursts
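To make the traffic-volume point concrete, here is a back-of-the-envelope sketch. It assumes a flat ring all-reduce, in which each participant sends and receives roughly 2×(N−1)/N times the gradient size per iteration, and an assumed communication time budget; the function names and the 2-second budget are illustrative, not measurements.

```python
# Back-of-the-envelope all-reduce traffic estimate (assumed values, not measurements).

def ring_allreduce_bytes_per_gpu(gradient_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in one flat ring all-reduce."""
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes

def required_bandwidth_gbps(bytes_per_gpu: float, comm_budget_s: float) -> float:
    """Per-GPU bandwidth (Gbps) needed to finish the exchange within the budget."""
    return bytes_per_gpu * 8 / comm_budget_s / 1e9

if __name__ == "__main__":
    gradients = 350e9   # ~350 GB of FP16 gradients for a 175B-parameter model
    gpus = 1024
    comm_budget = 2.0   # assumed seconds of each iteration available for communication

    per_gpu = ring_allreduce_bytes_per_gpu(gradients, gpus)
    print(f"Per-GPU traffic per iteration: {per_gpu / 1e9:.0f} GB")
    print(f"Bandwidth to hide it in {comm_budget}s: "
          f"{required_bandwidth_gbps(per_gpu, comm_budget):.0f} Gbps")
```

The resulting figures (hundreds of gigabytes per GPU per iteration, multi-Tbps if pushed naively over the fabric) are why real deployments reduce within a node over NVLink first, overlap communication with the backward pass, and still budget 400-800 Gbps of network bandwidth per GPU.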

Model Parallelism: Large models split across GPUs create different patterns:

  • Pipeline Parallelism: Sequential stages pass activations forward and gradients backward—linear east-west traffic pattern
  • Tensor Parallelism: Layers split across GPUs require frequent all-reduce within each layer—extremely high east-west bandwidth
  • Mixture-of-Experts: Routing mechanism creates dynamic east-west traffic to different expert GPUs

Bandwidth Requirements: For optimal GPU utilization, network bandwidth must keep pace with the rate at which GPUs produce data to exchange. An NVIDIA H100 delivering roughly 1,000 TFLOPS of FP16/BF16 compute requires approximately 400-800 Gbps of network bandwidth per GPU to avoid communication bottlenecks in large-scale training.

Optical Module Implications

Server Connectivity:

  • Single-GPU Servers: 200G or 400G NIC sufficient
  • 8-GPU Servers: 2×400G or 8×400G (rail-optimized) required
  • Form Factor: QSFP-DD or OSFP depending on thermal and density requirements
  • Latency: <500ns module latency critical for maintaining GPU utilization

Switch Infrastructure:

  • Leaf Switches: 400G or 800G server-facing ports
  • Spine Switches: 800G or 1.6T for aggregation
  • Oversubscription: 1:1 (non-blocking) to 2:1 maximum for AI training
  • Total Modules: 10,000 GPU cluster requires 10,000-20,000 optical modules depending on architecture

Microservices and Container Networking

Service Mesh Communication

Architecture: Modern applications consist of hundreds of microservices, each running in containers, communicating via service mesh (Istio, Linkerd, Consul).

Traffic Characteristics:

  • High Connection Count: Thousands of concurrent TCP connections between services
  • Small Messages: Many requests are small (kilobytes), but high frequency
  • Unpredictable Patterns: Traffic flows change dynamically based on user requests and service dependencies
  • East-West Dominance: 80-90% of traffic is service-to-service within the data center

Network Requirements:

  • Low Latency: Service-to-service latency must be <1ms to maintain application responsiveness
  • High Packet Rate: Millions of packets per second (Mpps) capacity needed
  • Bandwidth: Aggregate bandwidth more important than per-flow bandwidth
  • Quality of Service: Differentiate latency-sensitive services from batch workloads

Optical Module Selection:

  • Server NICs: 25G or 100G sufficient for most microservices workloads
  • Aggregation: 400G for leaf-spine links to handle aggregate traffic
  • Latency Optimization: Use low-latency modules (LPO, SR8) for latency-critical services
  • Cost Optimization: Microservices don't require 800G per server, allowing cost-effective 100G deployment

Kubernetes Networking

Pod-to-Pod Communication: Kubernetes CNI plugins (Calico, Flannel, Cilium) typically create overlay networks for pod-to-pod communication:

  • Encapsulation Overhead: VXLAN or other tunneling adds 50-100 bytes per packet, increasing bandwidth requirements (quantified in the sketch after this list)
  • Network Policies: Firewall rules processed in software can add latency
  • Service Discovery: DNS and service mesh add communication overhead
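To put the encapsulation overhead in perspective, a short calculation (assuming VXLAN's roughly 50-byte outer headers: Ethernet 14 + IP 20 + UDP 8 + VXLAN 8) shows how the relative cost depends heavily on packet size, which is why small-message microservices traffic feels the overlay tax more than bulk transfers.

```python
# Relative VXLAN encapsulation overhead for different inner packet sizes (assumed 50-byte overhead).

VXLAN_OVERHEAD = 50  # outer Ethernet (14) + IP (20) + UDP (8) + VXLAN (8) bytes

def overlay_overhead_pct(inner_packet_bytes: int, overhead: int = VXLAN_OVERHEAD) -> float:
    """Extra wire bytes as a percentage of the inner (pre-encapsulation) packet."""
    return overhead / inner_packet_bytes * 100

for size in (128, 256, 512, 1450):
    print(f"{size:>5}-byte packet: +{overlay_overhead_pct(size):.1f}% on the wire")
```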

Optimization Strategies:

  • SR-IOV: Direct hardware access bypasses software networking stack, reducing latency and CPU overhead
  • DPDK: User-space networking for high packet rates
  • eBPF: Efficient packet processing in kernel for network policies
  • Optical Module Impact: High-performance NICs with SR-IOV require 100G or 200G optical modules to fully utilize hardware capabilities

Distributed Storage Systems

Object Storage (Ceph, MinIO)

Replication Traffic: Object storage systems replicate data across multiple nodes for durability:

  • Write Amplification: 3× replication means each client write generates roughly three copies' worth of network traffic: one copy to the primary plus two replica copies east-west (see the sketch after this list)
  • Rebalancing: Adding or removing nodes triggers massive data movement
  • Erasure Coding: More efficient than replication but still generates significant east-west traffic
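As a rough illustration of the write-amplification point, the snippet below compares the east-west bytes generated by a single client write under 3× replication versus a k+m erasure-coded layout. It is a simplified model with assumed parameters (3 replicas, an 8+3 code, one chunk kept local), not an exact trace of any particular storage system's data path.

```python
# East-west bytes generated per client write: replication vs erasure coding (rough model).

def replication_eastwest_bytes(write_bytes: float, replicas: int = 3) -> float:
    """Primary forwards the object to (replicas - 1) peers."""
    return (replicas - 1) * write_bytes

def erasure_coding_eastwest_bytes(write_bytes: float, k: int = 8, m: int = 3) -> float:
    """Primary stripes the object into k data + m parity chunks and ships them to peers.
    Assumes one chunk stays local, so (k + m - 1) chunks of ~write_bytes / k move east-west."""
    chunk = write_bytes / k
    return (k + m - 1) * chunk

gb = 1e9
print(f"1 GB write, 3x replication:    {replication_eastwest_bytes(gb) / 1e9:.2f} GB east-west")
print(f"1 GB write, 8+3 erasure coding: {erasure_coding_eastwest_bytes(gb) / 1e9:.2f} GB east-west")
```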

Bandwidth Requirements:

  • Storage Nodes: 25G or 100G per node depending on disk count and performance tier
  • Aggregation: 400G for storage cluster aggregation switches
  • Separation: Dedicated storage network fabric isolates storage traffic from compute traffic

Example Deployment: 1000-node Ceph cluster with 100TB per node:

  • Each node: 2×25G (50G total) for redundancy
  • Leaf switches: ~42 switches, each with 48×25G server ports and 4×400G uplinks
  • Spine switches: 4 switches with 64×400G ports (one uplink from every leaf to each spine)
  • Total optical modules: ~4,000×25G (both ends of the 2,000 server links) + ~336×400G (both ends of the 168 leaf-spine links)

Distributed File Systems (HDFS, GlusterFS)

Data Locality: Distributed file systems attempt to place computation near data, but still generate east-west traffic:

  • Block Replication: HDFS typically uses 3× replication
  • MapReduce Shuffle: Intermediate data transferred between map and reduce tasks
  • Data Skew: Uneven data distribution creates hotspots

Network Design:

  • Rack Awareness: Place replicas in different racks to survive rack failures
  • Bandwidth Provisioning: Ensure sufficient inter-rack bandwidth for replication and shuffle
  • Optical Modules: 100G or 200G server connections, 400G inter-rack links

Network Topology Optimization for East-West Traffic

Spine-Leaf (Clos) Architecture

Design Principles:

  • Two-Tier: Leaf switches connect to servers, spine switches provide interconnection
  • Full Mesh: Every leaf connects to every spine
  • Equal-Cost Paths: Multiple paths between any two servers for load balancing
  • Scalability: Add spine switches to increase bandwidth, add leaf switches to increase server count

Optical Module Deployment:

  • Leaf-to-Server: 400G or 800G depending on server requirements
  • Leaf-to-Spine: 800G or 1.6T for maximum bisection bandwidth
  • Oversubscription: 1:1 (non-blocking) for AI, 2:1 or 3:1 acceptable for general workloads

Example: 1024-Server AI Cluster

  • Servers: 1024 × 2×400G NICs = 2,048×400G modules
  • Leaf switches: 32 switches, each with 64×400G server-facing ports and 32×800G uplinks = 2,048×400G + 1,024×800G
  • Spine switches: 16 switches × 64×800G ports = 1,024×800G
  • Total: 4,096×400G + 2,048×800G optical modules (recomputed in the sketch below)
  • Bisection bandwidth: 409.6 Tbps (non-blocking)
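The counts above follow mechanically from a few parameters, so a small planning helper makes it easy to re-run the arithmetic for other cluster sizes. The function below is an illustrative sketch, not a vendor tool: it assumes a non-blocking two-tier leaf-spine fabric and a module at each end of every link.

```python
# Rough optical-module count for a non-blocking two-tier leaf-spine fabric (planning sketch).

def leaf_spine_module_count(servers: int, nics_per_server: int, nic_gbps: int,
                            leaf_server_ports: int, uplink_gbps: int) -> dict:
    server_links = servers * nics_per_server                 # server-to-leaf links
    leaves = -(-server_links // leaf_server_ports)           # ceiling division
    # Non-blocking: uplink bandwidth per leaf matches its server-facing bandwidth.
    uplinks_per_leaf = (leaf_server_ports * nic_gbps) // uplink_gbps
    leaf_spine_links = leaves * uplinks_per_leaf
    return {
        "leaf_switches": leaves,
        f"{nic_gbps}G_modules": 2 * server_links,             # NIC end + leaf end
        f"{uplink_gbps}G_modules": 2 * leaf_spine_links,      # leaf end + spine end
        "bisection_tbps": leaf_spine_links * uplink_gbps / 2 / 1000,
    }

print(leaf_spine_module_count(servers=1024, nics_per_server=2, nic_gbps=400,
                              leaf_server_ports=64, uplink_gbps=800))
```

With the example's parameters this reproduces 4,096×400G, 2,048×800G, and 409.6 Tbps of bisection bandwidth.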

Fat-Tree Topology

Characteristics: A multi-stage folded-Clos topology built with additional tiers:

  • Three-Tier: Access, aggregation, core layers
  • Oversubscription: Typically 4:1 or 8:1 at aggregation layer
  • Cost Optimization: Reduces optical module count compared to non-blocking Clos

Suitability: Appropriate for mixed workloads where not all traffic is east-west. AI training clusters require lower oversubscription (2:1 maximum).

Dragonfly and Dragonfly+

Design: Hierarchical topology with groups of switches, optimized for high-radix switches:

  • Intra-Group: All-to-all connections within each group
  • Inter-Group: Sparse connections between groups
  • Routing: Adaptive routing to balance load across paths

Advantages:

  • Scalability: Can scale to 100,000+ servers with fewer switch tiers
  • Diameter: Low network diameter (2-3 hops) reduces latency
  • Cost: Fewer optical modules than full Clos at large scale

Challenges:

  • Complexity: Requires sophisticated routing algorithms
  • Hotspots: Inter-group links can become bottlenecks
  • Adoption: Less common than Clos in commercial data centers

Traffic Engineering and Load Balancing

ECMP (Equal-Cost Multi-Path)

Mechanism: Distribute traffic across multiple equal-cost paths using hash-based selection (sketched in code after this list):

  • Hash Function: Typically 5-tuple (src IP, dst IP, src port, dst port, protocol)
  • Per-Flow: All packets in a flow take the same path to avoid reordering
  • Load Distribution: Ideally uniform, but hash collisions can cause imbalance
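The path-selection step is conceptually simple. The sketch below uses zlib.crc32 purely as an illustrative hash (real switch ASICs use their own hardware hash functions) to show how a 5-tuple maps to one of several equal-cost next hops, and why every packet of a flow sticks to the same path.

```python
# Illustrative ECMP next-hop selection from the 5-tuple (not a real ASIC hash).

import zlib

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  protocol: int, num_paths: int) -> int:
    """Hash the 5-tuple and map it onto one of num_paths equal-cost paths."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key) % num_paths

# Every packet of this flow hashes to the same value, so it stays on one path (no reordering).
path = ecmp_next_hop("10.0.1.5", "10.0.2.9", 49152, 443, 6, num_paths=16)
print(f"Flow pinned to spine uplink {path}")
```

Because each flow is pinned to a single path, a few elephant flows can land on the same uplink, which leads directly to the limitations below.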

Limitations:

  • Elephant Flows: Large flows can saturate individual paths
  • Hash Polarization: Multiple switches using the same hash function can create persistent imbalances
  • Adaptation: Cannot react to congestion or link failures quickly

Optical Module Impact: ECMP effectiveness depends on having sufficient parallel paths. More optical modules (higher port count switches) enable better load distribution.

Adaptive Routing

Congestion-Aware Routing: Dynamically select paths based on real-time congestion:

  • Mechanisms: Monitor queue depths, packet loss, or explicit congestion signals
  • Rerouting: Move flows from congested to underutilized paths
  • Granularity: Per-flow or per-packet rerouting

Technologies:

  • CONGA: Congestion-aware load balancing for data centers
  • HULA: Hop-by-hop load balancing using in-network telemetry
  • LetFlow: Flowlet-based adaptive routing

Benefits for East-West Traffic: Adaptive routing can improve utilization of optical module capacity by 20-40% compared to static ECMP, effectively increasing bisection bandwidth without additional hardware.

Monitoring and Visibility

Traffic Telemetry

Flow-Level Monitoring:

  • sFlow/NetFlow: Sample traffic flows to understand patterns
  • Granularity: 1-in-1000 or 1-in-10000 sampling for high-speed links (sampled counts are scaled back up, as in the sketch after this list)
  • Analysis: Identify top talkers, traffic matrices, application breakdown
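A minimal example of turning sampled counts back into a traffic estimate, assuming simple 1-in-N packet sampling; production sFlow/NetFlow collectors apply the same scaling with additional corrections, and the numbers here are illustrative.

```python
# Scale sampled byte counts back to an estimated link rate (simple 1-in-N sampling model).

def estimate_gbps(sampled_bytes: int, sampling_rate: int, interval_s: float) -> float:
    """Estimated traffic in Gbps from bytes carried by the sampled packets."""
    return sampled_bytes * sampling_rate * 8 / interval_s / 1e9

# 1-in-1000 sampling: 150 MB carried by sampled packets over 60 s on one uplink
print(f"Estimated rate: {estimate_gbps(150_000_000, 1000, 60):.2f} Gbps")
```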

Optical Module Telemetry:

  • DDM (Digital Diagnostics Monitoring): Temperature, optical power, voltage, current
  • Error Counters: FEC corrected errors, uncorrectable errors, symbol errors
  • Utilization: Bandwidth utilization per module and per lane

Correlation: Correlate traffic patterns with optical module performance to identify:

  • Overutilized links requiring capacity upgrades
  • Underutilized links indicating routing inefficiencies
  • Optical module degradation causing packet loss or retransmissions
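A simple way to act on the telemetry above is to sweep DDM readings and FEC counters against alarm thresholds. The sketch below uses an invented reading structure and illustrative threshold values; real alarm limits come from the module vendor's datasheet and the values would normally be polled from the switch OS rather than constructed by hand.

```python
# Flag modules whose DDM readings or FEC counters look unhealthy (illustrative thresholds).

from dataclasses import dataclass

@dataclass
class ModuleReading:
    port: str
    temperature_c: float        # DDM temperature
    rx_power_dbm: float         # DDM receive optical power
    fec_uncorrectable: int      # uncorrectable FEC codewords since the last poll

# Illustrative limits; real alarm thresholds come from the vendor datasheet.
MAX_TEMP_C = 70.0
MIN_RX_POWER_DBM = -8.0
MAX_UNCORRECTABLE = 0

def unhealthy(readings: list[ModuleReading]) -> list[str]:
    alerts = []
    for r in readings:
        if r.temperature_c > MAX_TEMP_C:
            alerts.append(f"{r.port}: temperature {r.temperature_c:.1f} C")
        if r.rx_power_dbm < MIN_RX_POWER_DBM:
            alerts.append(f"{r.port}: rx power {r.rx_power_dbm:.1f} dBm")
        if r.fec_uncorrectable > MAX_UNCORRECTABLE:
            alerts.append(f"{r.port}: {r.fec_uncorrectable} uncorrectable FEC codewords")
    return alerts

print(unhealthy([ModuleReading("leaf3/Ethernet12", 74.2, -9.5, 3)]))
```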

Capacity Planning

Traffic Growth Modeling:

  • Historical Analysis: Analyze traffic growth over past 6-12 months
  • Workload Forecasting: Project future AI training, storage, and application traffic
  • Headroom: Maintain 30-50% headroom on east-west links for bursts and growth
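A quick way to turn the growth and headroom guidance into a planning date is to project compound traffic growth until a link crosses its headroom ceiling. The snippet below assumes a constant monthly growth rate, which is a simplification of real traffic curves; the 45% starting utilization and 5% monthly growth are example inputs.

```python
# Months until a link exceeds its headroom ceiling, assuming constant compound growth.

def months_until_upgrade(current_util: float, monthly_growth: float,
                         ceiling: float = 0.70) -> int:
    """A 70% ceiling corresponds to keeping ~30% headroom for bursts and growth."""
    months, util = 0, current_util
    while util < ceiling:
        util *= 1 + monthly_growth
        months += 1
    return months

# 45% utilized today, growing 5% per month, 30% headroom kept (70% ceiling)
months = months_until_upgrade(0.45, 0.05)
print(f"Link crosses the 70% ceiling in ~{months} months; "
      f"subtract the 8-16 week module lead time to set the order date")
```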

Optical Module Procurement:

  • Lead Time: 8-16 weeks for large optical module orders
  • Inventory: Maintain 10-15% spare inventory for rapid deployment
  • Phased Deployment: Deploy capacity in phases aligned with workload growth

Cost Optimization Strategies

Workload Segmentation

Tiered Network Design: Not all workloads require the same east-west bandwidth:

  • Tier 1 (AI Training): 800G per server, 1:1 oversubscription, premium optical modules
  • Tier 2 (Inference, Databases): 400G per server, 2:1 oversubscription, standard modules
  • Tier 3 (Web Servers, Batch): 100G per server, 4:1 oversubscription, cost-optimized modules

Cost Impact: For 10,000 server data center:

  • Uniform 800G: 20,000×800G modules = $24M
  • Tiered (2000 Tier 1, 5000 Tier 2, 3000 Tier 3): 4,000×800G + 10,000×400G + 6,000×100G = $11.6M (52% savings)
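The savings figure above follows from the module counts and per-unit prices. The sketch below reproduces the arithmetic using unit prices implied by the totals in the text (roughly $1,200 per 800G, $600 per 400G, and about $133 per 100G module); these are assumptions used to make the example reproducible, not quoted market prices.

```python
# Reproduce the uniform-vs-tiered cost comparison (unit prices are assumptions implied by the text).

UNIT_PRICE = {"800G": 1_200, "400G": 600, "100G": 133}   # assumed USD per module

def fabric_cost(module_counts: dict) -> float:
    return sum(count * UNIT_PRICE[speed] for speed, count in module_counts.items())

uniform = fabric_cost({"800G": 20_000})
tiered = fabric_cost({"800G": 4_000, "400G": 10_000, "100G": 6_000})

print(f"Uniform 800G: ${uniform / 1e6:.1f}M")
print(f"Tiered:       ${tiered / 1e6:.1f}M  ({(1 - tiered / uniform) * 100:.0f}% savings)")
```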

Incremental Capacity Expansion

Just-in-Time Deployment: Deploy optical modules as needed rather than all upfront:

  • Phase 1: Deploy 70% of planned capacity at launch
  • Phase 2: Add 20% when utilization exceeds 60%
  • Phase 3: Add final 10% when utilization exceeds 75%

Benefits:

  • Spread capital costs over time
  • Take advantage of price declines (10-20% annually for new technologies)
  • Align capacity with actual demand

Risks:

  • Supply chain delays can prevent timely expansion
  • Price increases if market tightens
  • Operational complexity of multiple deployment phases
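One way to see the financial effect of phasing is to discount later purchases by the expected annual price decline. The sketch below uses the 70/20/10 phasing and the 10-20% yearly declines mentioned above; the one-year and two-year phase timing, module count, and $1,200 unit price are assumptions for illustration.

```python
# Capital spend for phased vs upfront optical module purchases, assuming annual price declines.

def phased_spend(total_modules: int, unit_price: float,
                 phases: list[tuple[float, float]], annual_decline: float) -> float:
    """phases: (fraction of modules, years after launch when purchased)."""
    return sum(total_modules * frac * unit_price * (1 - annual_decline) ** years
               for frac, years in phases)

modules, price = 20_000, 1_200                      # e.g. 800G modules at an assumed $1,200 each
phases = [(0.70, 0.0), (0.20, 1.0), (0.10, 2.0)]    # assumed timing of phases 2 and 3

upfront = modules * price
for decline in (0.10, 0.20):
    spend = phased_spend(modules, price, phases, decline)
    print(f"{decline:.0%} annual decline: ${spend / 1e6:.1f}M phased vs ${upfront / 1e6:.1f}M upfront")
```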

Future Trends in East-West Networking

Optical Circuit Switching

Concept: Dynamically reconfigure optical paths for predictable traffic patterns:

  • AI Training: All-reduce operations follow predictable patterns, can be scheduled on optical circuits
  • Bulk Data Transfer: Large dataset movements between storage and compute
  • Benefits: Near-zero switching latency, no packet processing overhead

Technology:

  • MEMS Switches: Mechanically reconfigurable, 1-10ms switching time
  • Silicon Photonic Switches: Electronically reconfigurable, 10-100ns switching time
  • Hybrid Networks: Combine packet switching for control plane with circuit switching for data plane

In-Network Computing

Aggregation in Network: Perform gradient aggregation within switches rather than at endpoints:

  • Mechanism: Programmable switches (P4) or specialized ASICs perform sum/average operations
  • Benefit: Reduces east-west traffic by 50-90% for all-reduce operations
  • Example: SwitchML achieves 5-10× faster all-reduce for small messages

Optical Module Impact: In-network computing reduces bandwidth requirements, potentially allowing use of 400G instead of 800G modules for same workload, or enabling larger clusters with same optical module count.

Conclusion

The shift from north-south to east-west traffic dominance has fundamentally transformed data center network design and optical module requirements. Modern AI workloads, distributed applications, and hyper-converged infrastructure demand high-bandwidth, low-latency east-west connectivity that was unimaginable a decade ago.

Key Takeaways:

  • East-West Dominance: 70-95% of traffic in modern data centers is server-to-server
  • AI as Driver: AI training represents the most demanding east-west workload, requiring 400-800G per server
  • Architecture Evolution: Spine-leaf topologies with minimal oversubscription are essential
  • Optical Module Scale: Large deployments require tens of thousands of high-speed modules
  • Cost Optimization: Tiered approaches and phased deployment can reduce costs while maintaining performance

High-speed optical modules—400G, 800G, and beyond—are the critical enablers of east-west traffic at scale. Their importance in modern data center architecture cannot be overstated. As workloads continue to evolve toward more distributed, communication-intensive patterns, the role of optical modules in providing high-bandwidth, low-latency east-west connectivity will only grow. Organizations that understand these traffic patterns and design their optical networking infrastructure accordingly will be best positioned to support the demanding applications of today and tomorrow.
