Data Center East-West Traffic: Optical Module Requirements for Modern Workloads

Introduction

The traditional data center traffic model—where most communication flows north-south between clients and servers—has fundamentally shifted. Modern applications, particularly AI training, distributed databases, microservices architectures, and hyper-converged infrastructure, generate massive east-west traffic flows between servers within the data center. This transformation has profound implications for optical module selection, network architecture, and capacity planning. Understanding east-west traffic patterns and their impact on optical networking is essential for designing efficient, scalable AI data centers.

Understanding East-West vs North-South Traffic

Traditional North-South Model

Characteristics: In traditional three-tier architectures (access-distribution-core), traffic primarily flows vertically between end users and centralized servers. A typical ratio was 80% north-south (client-server) to 20% east-west (server-server).

Network Design: Optimized for north-south bandwidth with oversubscribed east-west paths. Core and distribution layers had high bandwidth, while server-to-server communication traversed multiple hops with limited capacity.

Optical Module Deployment: High-speed modules concentrated at core and distribution layers (40G, 100G), while access layer used lower speeds (1G, 10G).

Modern East-West Dominance

Traffic Shift: Modern data centers exhibit 70-90% east-west traffic, with some AI training clusters approaching 95% east-west during training operations.

Drivers:

  • Distributed Computing: MapReduce, Spark, and other frameworks distribute computation across hundreds or thousands of servers
  • Microservices: Applications decomposed into dozens or hundreds of services communicating constantly
  • AI Training: Gradient synchronization requires all-to-all communication between GPUs
  • Distributed Storage: Ceph, HDFS, and other systems replicate data across multiple nodes
  • Virtual Machine Migration: Live migration moves VMs between hosts, generating large data transfers

Network Implications: Requires non-blocking or minimally oversubscribed east-west bandwidth, fundamentally changing network topology and optical module requirements.

AI Training: The Ultimate East-West Workload

Communication Patterns in Distributed Training

Data Parallelism: The most common training strategy splits data across GPUs, each processing different batches:

  • Forward Pass: Minimal communication, each GPU processes independently
  • Backward Pass: Compute gradients locally
  • Gradient Synchronization: All-reduce operation exchanges gradients between all GPUs—pure east-west traffic
  • Traffic Volume: For a 175B-parameter model (~350GB of FP16 gradients), each of the 1,024 GPUs must exchange roughly the full gradient volume every iteration (see the sketch after this list)
  • Frequency: from several iterations per second for smaller models to one iteration every few seconds for the largest, generating near-continuous east-west traffic bursts
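To make the traffic-volume point concrete, here is a back-of-the-envelope sketch. It assumes a flat ring all-reduce, in which each participant sends and receives roughly 2×(N−1)/N times the gradient size per iteration, and an assumed communication time budget; the function names and the 2-second budget are illustrative, not measurements.

```python
# Back-of-the-envelope all-reduce traffic estimate (assumed values, not measurements).

def ring_allreduce_bytes_per_gpu(gradient_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in one flat ring all-reduce."""
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes

def required_bandwidth_gbps(bytes_per_gpu: float, comm_budget_s: float) -> float:
    """Per-GPU bandwidth (Gbps) needed to finish the exchange within the budget."""
    return bytes_per_gpu * 8 / comm_budget_s / 1e9

if __name__ == "__main__":
    gradients = 350e9   # ~350 GB of FP16 gradients for a 175B-parameter model
    gpus = 1024
    comm_budget = 2.0   # assumed seconds of each iteration available for communication

    per_gpu = ring_allreduce_bytes_per_gpu(gradients, gpus)
    print(f"Per-GPU traffic per iteration: {per_gpu / 1e9:.0f} GB")
    print(f"Bandwidth to hide it in {comm_budget}s: "
          f"{required_bandwidth_gbps(per_gpu, comm_budget):.0f} Gbps")
```

The resulting figures (hundreds of gigabytes per GPU per iteration, multi-Tbps if pushed naively over the fabric) are why real deployments reduce within a node over NVLink first, overlap communication with the backward pass, and still budget 400-800 Gbps of network bandwidth per GPU.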

Model Parallelism: Large models split across GPUs create different patterns:

  • Pipeline Parallelism: Sequential stages pass activations forward and gradients backward—linear east-west traffic pattern
  • Tensor Parallelism: Layers split across GPUs require frequent all-reduce within each layer—extremely high east-west bandwidth
  • Mixture-of-Experts: Routing mechanism creates dynamic east-west traffic to different expert GPUs

Bandwidth Requirements: For optimal GPU utilization, network bandwidth must keep pace with the rate at which GPUs produce data to exchange. An NVIDIA H100 delivering roughly 1,000 TFLOPS of FP16/BF16 compute requires approximately 400-800 Gbps of network bandwidth per GPU to avoid communication bottlenecks in large-scale training.

Optical Module Implications

Server Connectivity:

  • Single-GPU Servers: 200G or 400G NIC sufficient
  • 8-GPU Servers: 2×400G or 8×400G (rail-optimized) required
  • Form Factor: QSFP-DD or OSFP depending on thermal and density requirements
  • Latency: <500ns module latency critical for maintaining GPU utilization

Switch Infrastructure:

  • Leaf Switches: 400G or 800G server-facing ports
  • Spine Switches: 800G or 1.6T for aggregation
  • Oversubscription: 1:1 (non-blocking) to 2:1 maximum for AI training
  • Total Modules: 10,000 GPU cluster requires 10,000-20,000 optical modules depending on architecture

Microservices and Container Networking

Service Mesh Communication

Architecture: Modern applications consist of hundreds of microservices, each running in containers, communicating via service mesh (Istio, Linkerd, Consul).

Traffic Characteristics:

  • High Connection Count: Thousands of concurrent TCP connections between services
  • Small Messages: Many requests are small (kilobytes), but high frequency
  • Unpredictable Patterns: Traffic flows change dynamically based on user requests and service dependencies
  • East-West Dominance: 80-90% of traffic is service-to-service within the data center

Network Requirements:

  • Low Latency: Service-to-service latency must be <1ms to maintain application responsiveness
  • High Packet Rate: Millions of packets per second (Mpps) capacity needed
  • Bandwidth: Aggregate bandwidth more important than per-flow bandwidth
  • Quality of Service: Differentiate latency-sensitive services from batch workloads

Optical Module Selection:

  • Server NICs: 25G or 100G sufficient for most microservices workloads
  • Aggregation: 400G for leaf-spine links to handle aggregate traffic
  • Latency Optimization: Use low-latency modules (LPO, SR8) for latency-critical services
  • Cost Optimization: Microservices don't require 800G per server, allowing cost-effective 100G deployment

Kubernetes Networking

Pod-to-Pod Communication: Kubernetes CNI plugins (Calico, Flannel, Cilium) typically create overlay networks for pod-to-pod communication:

  • Encapsulation Overhead: VXLAN or other tunneling adds 50-100 bytes per packet, increasing bandwidth requirements (quantified in the sketch after this list)
  • Network Policies: Firewall rules processed in software can add latency
  • Service Discovery: DNS and service mesh add communication overhead
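To put the encapsulation overhead in perspective, a short calculation (assuming VXLAN's roughly 50-byte outer headers: Ethernet 14 + IP 20 + UDP 8 + VXLAN 8) shows how the relative cost depends heavily on packet size, which is why small-message microservices traffic feels the overlay tax more than bulk transfers.

```python
# Relative VXLAN encapsulation overhead for different inner packet sizes (assumed 50-byte overhead).

VXLAN_OVERHEAD = 50  # outer Ethernet (14) + IP (20) + UDP (8) + VXLAN (8) bytes

def overlay_overhead_pct(inner_packet_bytes: int, overhead: int = VXLAN_OVERHEAD) -> float:
    """Extra wire bytes as a percentage of the inner (pre-encapsulation) packet."""
    return overhead / inner_packet_bytes * 100

for size in (128, 256, 512, 1450):
    print(f"{size:>5}-byte packet: +{overlay_overhead_pct(size):.1f}% on the wire")
```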

Optimization Strategies:

  • SR-IOV: Direct hardware access bypasses software networking stack, reducing latency and CPU overhead
  • DPDK: User-space networking for high packet rates
  • eBPF: Efficient packet processing in kernel for network policies
  • Optical Module Impact: High-performance NICs with SR-IOV require 100G or 200G optical modules to fully utilize hardware capabilities

Distributed Storage Systems

Object Storage (Ceph, MinIO)

Replication Traffic: Object storage systems replicate data across multiple nodes for durability:

  • Write Amplification: 3× replication means each client write generates roughly three copies' worth of network traffic: one copy to the primary plus two replica copies east-west (see the sketch after this list)
  • Rebalancing: Adding or removing nodes triggers massive data movement
  • Erasure Coding: More efficient than replication but still generates significant east-west traffic
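As a rough illustration of the write-amplification point, the snippet below compares the east-west bytes generated by a single client write under 3× replication versus a k+m erasure-coded layout. It is a simplified model with assumed parameters (3 replicas, an 8+3 code, one chunk kept local), not an exact trace of any particular storage system's data path.

```python
# East-west bytes generated per client write: replication vs erasure coding (rough model).

def replication_eastwest_bytes(write_bytes: float, replicas: int = 3) -> float:
    """Primary forwards the object to (replicas - 1) peers."""
    return (replicas - 1) * write_bytes

def erasure_coding_eastwest_bytes(write_bytes: float, k: int = 8, m: int = 3) -> float:
    """Primary stripes the object into k data + m parity chunks and ships them to peers.
    Assumes one chunk stays local, so (k + m - 1) chunks of ~write_bytes / k move east-west."""
    chunk = write_bytes / k
    return (k + m - 1) * chunk

gb = 1e9
print(f"1 GB write, 3x replication:    {replication_eastwest_bytes(gb) / 1e9:.2f} GB east-west")
print(f"1 GB write, 8+3 erasure coding: {erasure_coding_eastwest_bytes(gb) / 1e9:.2f} GB east-west")
```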

Bandwidth Requirements:

  • Storage Nodes: 25G or 100G per node depending on disk count and performance tier
  • Aggregation: 400G for storage cluster aggregation switches
  • Separation: Dedicated storage network fabric isolates storage traffic from compute traffic

Example Deployment: 1000-node Ceph cluster with 100TB per node:

  • Each node: 2×25G (50G total) for redundancy
  • Leaf switches: ~42 switches, each with 48×25G server ports and 4×400G uplinks
  • Spine switches: 4 switches with 64×400G ports (one uplink from every leaf to each spine)
  • Total optical modules: ~4,000×25G (both ends of the 2,000 server links) + ~336×400G (both ends of the 168 leaf-spine links)

Distributed File Systems (HDFS, GlusterFS)

Data Locality: Distributed file systems attempt to place computation near data, but still generate east-west traffic:

  • Block Replication: HDFS typically uses 3× replication
  • MapReduce Shuffle: Intermediate data transferred between map and reduce tasks
  • Data Skew: Uneven data distribution creates hotspots

Network Design:

  • Rack Awareness: Place replicas in different racks to survive rack failures
  • Bandwidth Provisioning: Ensure sufficient inter-rack bandwidth for replication and shuffle
  • Optical Modules: 100G or 200G server connections, 400G inter-rack links

Network Topology Optimization for East-West Traffic

Spine-Leaf (Clos) Architecture

Design Principles:

  • Two-Tier: Leaf switches connect to servers, spine switches provide interconnection
  • Full Mesh: Every leaf connects to every spine
  • Equal-Cost Paths: Multiple paths between any two servers for load balancing
  • Scalability: Add spine switches to increase bandwidth, add leaf switches to increase server count

Optical Module Deployment:

  • Leaf-to-Server: 400G or 800G depending on server requirements
  • Leaf-to-Spine: 800G or 1.6T for maximum bisection bandwidth
  • Oversubscription: 1:1 (non-blocking) for AI, 2:1 or 3:1 acceptable for general workloads

Example: 1024-Server AI Cluster

  • Servers: 1024 × 2×400G NICs = 2,048×400G modules
  • Leaf switches: 32 switches, each with 64×400G server-facing ports and 32×800G uplinks = 2,048×400G + 1,024×800G
  • Spine switches: 16 switches × 64×800G ports = 1,024×800G
  • Total: 4,096×400G + 2,048×800G optical modules (recomputed in the sketch below)
  • Bisection bandwidth: 409.6 Tbps (non-blocking)
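The counts above follow mechanically from a few parameters, so a small planning helper makes it easy to re-run the arithmetic for other cluster sizes. The function below is an illustrative sketch, not a vendor tool: it assumes a non-blocking two-tier leaf-spine fabric and a module at each end of every link.

```python
# Rough optical-module count for a non-blocking two-tier leaf-spine fabric (planning sketch).

def leaf_spine_module_count(servers: int, nics_per_server: int, nic_gbps: int,
                            leaf_server_ports: int, uplink_gbps: int) -> dict:
    server_links = servers * nics_per_server                 # server-to-leaf links
    leaves = -(-server_links // leaf_server_ports)           # ceiling division
    # Non-blocking: uplink bandwidth per leaf matches its server-facing bandwidth.
    uplinks_per_leaf = (leaf_server_ports * nic_gbps) // uplink_gbps
    leaf_spine_links = leaves * uplinks_per_leaf
    return {
        "leaf_switches": leaves,
        f"{nic_gbps}G_modules": 2 * server_links,             # NIC end + leaf end
        f"{uplink_gbps}G_modules": 2 * leaf_spine_links,      # leaf end + spine end
        "bisection_tbps": leaf_spine_links * uplink_gbps / 2 / 1000,
    }

print(leaf_spine_module_count(servers=1024, nics_per_server=2, nic_gbps=400,
                              leaf_server_ports=64, uplink_gbps=800))
```

With the example's parameters this reproduces 4,096×400G, 2,048×800G, and 409.6 Tbps of bisection bandwidth.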

Fat-Tree Topology

Characteristics: A multi-stage folded-Clos topology built with additional tiers:

  • Three-Tier: Access, aggregation, core layers
  • Oversubscription: Typically 4:1 or 8:1 at aggregation layer
  • Cost Optimization: Reduces optical module count compared to non-blocking Clos

Suitability: Appropriate for mixed workloads where not all traffic is east-west. AI training clusters require lower oversubscription (2:1 maximum).

Dragonfly and Dragonfly+

Design: Hierarchical topology with groups of switches, optimized for high-radix switches:

  • Intra-Group: All-to-all connections within each group
  • Inter-Group: Sparse connections between groups
  • Routing: Adaptive routing to balance load across paths

Advantages:

  • Scalability: Can scale to 100,000+ servers with fewer switch tiers
  • Diameter: Low network diameter (2-3 hops) reduces latency
  • Cost: Fewer optical modules than full Clos at large scale

Challenges:

  • Complexity: Requires sophisticated routing algorithms
  • Hotspots: Inter-group links can become bottlenecks
  • Adoption: Less common than Clos in commercial data centers

Traffic Engineering and Load Balancing

ECMP (Equal-Cost Multi-Path)

Mechanism: Distribute traffic across multiple equal-cost paths using hash-based selection (sketched in code after this list):

  • Hash Function: Typically 5-tuple (src IP, dst IP, src port, dst port, protocol)
  • Per-Flow: All packets in a flow take the same path to avoid reordering
  • Load Distribution: Ideally uniform, but hash collisions can cause imbalance
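The path-selection step is conceptually simple. The sketch below uses zlib.crc32 purely as an illustrative hash (real switch ASICs use their own hardware hash functions) to show how a 5-tuple maps to one of several equal-cost next hops, and why every packet of a flow sticks to the same path.

```python
# Illustrative ECMP next-hop selection from the 5-tuple (not a real ASIC hash).

import zlib

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  protocol: int, num_paths: int) -> int:
    """Hash the 5-tuple and map it onto one of num_paths equal-cost paths."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key) % num_paths

# Every packet of this flow hashes to the same value, so it stays on one path (no reordering).
path = ecmp_next_hop("10.0.1.5", "10.0.2.9", 49152, 443, 6, num_paths=16)
print(f"Flow pinned to spine uplink {path}")
```

Because each flow is pinned to a single path, a few elephant flows can land on the same uplink, which leads directly to the limitations below.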

Limitations:

  • Elephant Flows: Large flows can saturate individual paths
  • Hash Polarization: Multiple switches using the same hash function can create persistent imbalances
  • Adaptation: Cannot react to congestion or link failures quickly

Optical Module Impact: ECMP effectiveness depends on having sufficient parallel paths. More optical modules (higher port count switches) enable better load distribution.

Adaptive Routing

Congestion-Aware Routing: Dynamically select paths based on real-time congestion:

  • Mechanisms: Monitor queue depths, packet loss, or explicit congestion signals
  • Rerouting: Move flows from congested to underutilized paths
  • Granularity: Per-flow or per-packet rerouting

Technologies:

  • CONGA: Congestion-aware load balancing for data centers
  • HULA: Hop-by-hop load balancing using in-network telemetry
  • LetFlow: Flowlet-based adaptive routing

Benefits for East-West Traffic: Adaptive routing can improve utilization of optical module capacity by 20-40% compared to static ECMP, effectively increasing bisection bandwidth without additional hardware.

Monitoring and Visibility

Traffic Telemetry

Flow-Level Monitoring:

  • sFlow/NetFlow: Sample traffic flows to understand patterns
  • Granularity: 1-in-1000 or 1-in-10000 sampling for high-speed links (sampled counts are scaled back up, as in the sketch after this list)
  • Analysis: Identify top talkers, traffic matrices, application breakdown
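A minimal example of turning sampled counts back into a traffic estimate, assuming simple 1-in-N packet sampling; production sFlow/NetFlow collectors apply the same scaling with additional corrections, and the numbers here are illustrative.

```python
# Scale sampled byte counts back to an estimated link rate (simple 1-in-N sampling model).

def estimate_gbps(sampled_bytes: int, sampling_rate: int, interval_s: float) -> float:
    """Estimated traffic in Gbps from bytes carried by the sampled packets."""
    return sampled_bytes * sampling_rate * 8 / interval_s / 1e9

# 1-in-1000 sampling: 150 MB carried by sampled packets over 60 s on one uplink
print(f"Estimated rate: {estimate_gbps(150_000_000, 1000, 60):.2f} Gbps")
```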

Optical Module Telemetry:

  • DDM (Digital Diagnostics Monitoring): Temperature, optical power, voltage, current
  • Error Counters: FEC corrected errors, uncorrectable errors, symbol errors
  • Utilization: Bandwidth utilization per module and per lane

Correlation: Correlate traffic patterns with optical module performance to identify:

  • Overutilized links requiring capacity upgrades
  • Underutilized links indicating routing inefficiencies
  • Optical module degradation causing packet loss or retransmissions
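A simple way to act on the telemetry above is to sweep DDM readings and FEC counters against alarm thresholds. The sketch below uses an invented reading structure and illustrative threshold values; real alarm limits come from the module vendor's datasheet and the values would normally be polled from the switch OS rather than constructed by hand.

```python
# Flag modules whose DDM readings or FEC counters look unhealthy (illustrative thresholds).

from dataclasses import dataclass

@dataclass
class ModuleReading:
    port: str
    temperature_c: float        # DDM temperature
    rx_power_dbm: float         # DDM receive optical power
    fec_uncorrectable: int      # uncorrectable FEC codewords since the last poll

# Illustrative limits; real alarm thresholds come from the vendor datasheet.
MAX_TEMP_C = 70.0
MIN_RX_POWER_DBM = -8.0
MAX_UNCORRECTABLE = 0

def unhealthy(readings: list[ModuleReading]) -> list[str]:
    alerts = []
    for r in readings:
        if r.temperature_c > MAX_TEMP_C:
            alerts.append(f"{r.port}: temperature {r.temperature_c:.1f} C")
        if r.rx_power_dbm < MIN_RX_POWER_DBM:
            alerts.append(f"{r.port}: rx power {r.rx_power_dbm:.1f} dBm")
        if r.fec_uncorrectable > MAX_UNCORRECTABLE:
            alerts.append(f"{r.port}: {r.fec_uncorrectable} uncorrectable FEC codewords")
    return alerts

print(unhealthy([ModuleReading("leaf3/Ethernet12", 74.2, -9.5, 3)]))
```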

Capacity Planning

Traffic Growth Modeling:

  • Historical Analysis: Analyze traffic growth over past 6-12 months
  • Workload Forecasting: Project future AI training, storage, and application traffic
  • Headroom: Maintain 30-50% headroom on east-west links for bursts and growth
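A quick way to turn the growth and headroom guidance into a planning date is to project compound traffic growth until a link crosses its headroom ceiling. The snippet below assumes a constant monthly growth rate, which is a simplification of real traffic curves; the 45% starting utilization and 5% monthly growth are example inputs.

```python
# Months until a link exceeds its headroom ceiling, assuming constant compound growth.

def months_until_upgrade(current_util: float, monthly_growth: float,
                         ceiling: float = 0.70) -> int:
    """A 70% ceiling corresponds to keeping ~30% headroom for bursts and growth."""
    months, util = 0, current_util
    while util < ceiling:
        util *= 1 + monthly_growth
        months += 1
    return months

# 45% utilized today, growing 5% per month, 30% headroom kept (70% ceiling)
months = months_until_upgrade(0.45, 0.05)
print(f"Link crosses the 70% ceiling in ~{months} months; "
      f"subtract the 8-16 week module lead time to set the order date")
```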

Optical Module Procurement:

  • Lead Time: 8-16 weeks for large optical module orders
  • Inventory: Maintain 10-15% spare inventory for rapid deployment
  • Phased Deployment: Deploy capacity in phases aligned with workload growth

Cost Optimization Strategies

Workload Segmentation

Tiered Network Design: Not all workloads require the same east-west bandwidth:

  • Tier 1 (AI Training): 800G per server, 1:1 oversubscription, premium optical modules
  • Tier 2 (Inference, Databases): 400G per server, 2:1 oversubscription, standard modules
  • Tier 3 (Web Servers, Batch): 100G per server, 4:1 oversubscription, cost-optimized modules

Cost Impact: For 10,000 server data center:

  • Uniform 800G: 20,000×800G modules = $24M
  • Tiered (2000 Tier 1, 5000 Tier 2, 3000 Tier 3): 4,000×800G + 10,000×400G + 6,000×100G = $11.6M (52% savings)
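The savings figure above follows from the module counts and per-unit prices. The sketch below reproduces the arithmetic using unit prices implied by the totals in the text (roughly $1,200 per 800G, $600 per 400G, and about $133 per 100G module); these are assumptions used to make the example reproducible, not quoted market prices.

```python
# Reproduce the uniform-vs-tiered cost comparison (unit prices are assumptions implied by the text).

UNIT_PRICE = {"800G": 1_200, "400G": 600, "100G": 133}   # assumed USD per module

def fabric_cost(module_counts: dict) -> float:
    return sum(count * UNIT_PRICE[speed] for speed, count in module_counts.items())

uniform = fabric_cost({"800G": 20_000})
tiered = fabric_cost({"800G": 4_000, "400G": 10_000, "100G": 6_000})

print(f"Uniform 800G: ${uniform / 1e6:.1f}M")
print(f"Tiered:       ${tiered / 1e6:.1f}M  ({(1 - tiered / uniform) * 100:.0f}% savings)")
```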

Incremental Capacity Expansion

Just-in-Time Deployment: Deploy optical modules as needed rather than all upfront:

  • Phase 1: Deploy 70% of planned capacity at launch
  • Phase 2: Add 20% when utilization exceeds 60%
  • Phase 3: Add final 10% when utilization exceeds 75%

Benefits:

  • Spread capital costs over time
  • Take advantage of price declines (10-20% annually for new technologies)
  • Align capacity with actual demand

Risks:

  • Supply chain delays can prevent timely expansion
  • Price increases if market tightens
  • Operational complexity of multiple deployment phases
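One way to see the financial effect of phasing is to discount later purchases by the expected annual price decline. The sketch below uses the 70/20/10 phasing and the 10-20% yearly declines mentioned above; the one-year and two-year phase timing, module count, and $1,200 unit price are assumptions for illustration.

```python
# Capital spend for phased vs upfront optical module purchases, assuming annual price declines.

def phased_spend(total_modules: int, unit_price: float,
                 phases: list[tuple[float, float]], annual_decline: float) -> float:
    """phases: (fraction of modules, years after launch when purchased)."""
    return sum(total_modules * frac * unit_price * (1 - annual_decline) ** years
               for frac, years in phases)

modules, price = 20_000, 1_200                      # e.g. 800G modules at an assumed $1,200 each
phases = [(0.70, 0.0), (0.20, 1.0), (0.10, 2.0)]    # assumed timing of phases 2 and 3

upfront = modules * price
for decline in (0.10, 0.20):
    spend = phased_spend(modules, price, phases, decline)
    print(f"{decline:.0%} annual decline: ${spend / 1e6:.1f}M phased vs ${upfront / 1e6:.1f}M upfront")
```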

Future Trends in East-West Networking

Optical Circuit Switching

Concept: Dynamically reconfigure optical paths for predictable traffic patterns:

  • AI Training: All-reduce operations follow predictable patterns, can be scheduled on optical circuits
  • Bulk Data Transfer: Large dataset movements between storage and compute
  • Benefits: Near-zero switching latency, no packet processing overhead

Technology:

  • MEMS Switches: Mechanically reconfigurable, 1-10ms switching time
  • Silicon Photonic Switches: Electronically reconfigurable, 10-100ns switching time
  • Hybrid Networks: Combine packet switching for control plane with circuit switching for data plane

In-Network Computing

Aggregation in Network: Perform gradient aggregation within switches rather than at endpoints:

  • Mechanism: Programmable switches (P4) or specialized ASICs perform sum/average operations
  • Benefit: Reduces east-west traffic by 50-90% for all-reduce operations
  • Example: SwitchML achieves 5-10× faster all-reduce for small messages

Optical Module Impact: In-network computing reduces bandwidth requirements, potentially allowing use of 400G instead of 800G modules for same workload, or enabling larger clusters with same optical module count.

Conclusion

The shift from north-south to east-west traffic dominance has fundamentally transformed data center network design and optical module requirements. Modern AI workloads, distributed applications, and hyper-converged infrastructure demand high-bandwidth, low-latency east-west connectivity that was unimaginable a decade ago.

Key Takeaways:

  • East-West Dominance: 70-95% of traffic in modern data centers is server-to-server
  • AI as Driver: AI training represents the most demanding east-west workload, requiring 400-800G per server
  • Architecture Evolution: Spine-leaf topologies with minimal oversubscription are essential
  • Optical Module Scale: Large deployments require tens of thousands of high-speed modules
  • Cost Optimization: Tiered approaches and phased deployment can reduce costs while maintaining performance

High-speed optical modules—400G, 800G, and beyond—are the critical enablers of east-west traffic at scale. Their importance in modern data center architecture cannot be overstated. As workloads continue to evolve toward more distributed, communication-intensive patterns, the role of optical modules in providing high-bandwidth, low-latency east-west connectivity will only grow. Organizations that understand these traffic patterns and design their optical networking infrastructure accordingly will be best positioned to support the demanding applications of today and tomorrow.
