Internet-Draft Scenarios and Deployment Considerations October 2024
Zhao & Xiong Expires 24 April 2025 [Page]
Workgroup:
Network Working Group
Internet-Draft:
draft-zhao-hpwan-scenarios-deployment-00
Published:
Intended Status:
Informational
Expires:
24 April 2025
Authors:
J. Zhao
CAICT
Q. Xiong
ZTE Corporation

Scenarios and Deployment Considerations for High Performance Wide Area Network

Abstract

This document describes the typical scenarios and deployment considerations for High Performance Wide Area Networks (HP-WANs). It also provides simulation results for data transmission in WANs and analyses the impacts on throughput.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 24 April 2025.

1. Introduction

As per [I-D.xiong-hpwan-uc-req-problem], High Performance Wide Area Network (HP-WAN) puts forward higher performance requirements for WANs. High-performance data transmission should provide low latency, high throughput, and low CPU utilization, which can significantly improve the performance and efficiency of intra-DC and DC interconnection networks. At present, tests and deployments of long-distance, high-performance data transmission have been carried out in operators' WANs, cloud service providers' DC interconnection networks, and research institutions' private networks. However, there are still challenges in providing high performance in long-distance and wide-area network deployments.

This document describes the typical scenarios and deployment considerations for High Performance Wide Area Networks (HP-WANs). It also provides simulation results for data transmission in WANs and analyses the impacts on throughput.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Terminology

This document uses the terminology defined in [I-D.xiong-hpwan-uc-req-problem].

3. Typical Scenarios for HP-WANs

According to transmission distance and deployment requirements, high-throughput transmission covers two types of scenarios: high-volume data transmission over thousands of kilometers in WANs, and collaborative data transmission over hundreds of kilometers in MANs.

3.1. Long-distance Data Transmission

There are two types of scenarios: massive research data transmission between HPCs, and transmission of training samples between DCs for AI. The long-distance data transmission scenario is shown in Figure 1, where data flows are transmitted between two sites or DCs separated by 100 km to 1000 km.


                    +---100km~1000km---+
                    |                  |
    +--------+      |                  |      +--------+
    | Host A |------+       WAN        +------| Host B |
    +--------+      |                  |      +--------+
     Site/DC        |                  |       Site/DC
                    +------------------+

Figure 1: Long-distance Data Transmission over WANs

Massive research data transmission between HPCs: The thousands-of-kilometers big data migration scenario mainly refers to high-throughput transmission of massive data between scientific research institutions. At present, research institutions in some countries, such as the US ESnet6 and the EU EuroHPC program, are deploying wide-area high-performance networks to support the construction and operation of high-performance computing and data interconnection infrastructure. In this scenario, data transmission is usually carried out periodically or on demand, with each transmission ranging from a few terabytes to several hundred terabytes. Data transmission cost and security need to be balanced.

Data transmission of training samples between the DCs for AI: The construction of large-scale DCs for AI is limited by energy and land resources, so allocating training tasks to data centers with lower computing-power and electricity prices has become a cost-effective option. When the distance between DCs exceeds 1000 km, a wide-area high-performance network is required to transmit high-throughput training samples and corpus data. Training large models on billions to trillions of tokens usually requires several hundred terabytes to over a petabyte of corpus data, with a large amount of data transmitted per session, which places high demands on transmission throughput and stability.

3.2. Collaborative and Interactive Data Transmission

There are two types of scenarios: data transmission between data centers with separated storage and computing, and high-throughput data transmission between DCs under distributed intelligent computing. The collaborative and interactive data transmission scenario is shown in Figure 2, where data flows are transmitted between two or more DCs separated by 80 km to 100 km.


         +-------------80km~100km-----------------+
         |                                        |
    +----+----+                              +----+----+
    | Core DC |                              | Core DC |
    +----+----+            MAN               +----+----+
         |                                        |
         |                                        |
    +----+----+          +---------+         +----+----+
    | Edge DC +----------+ Edge DC +---------+ Edge DC |
    +---------+          +---------+         +---------+

Figure 2: Collaborative and Interactive Data Transmission over MANs

Storage and computing separation scenario: Cloud service providers deploy multiple data centers in a MAN (under 100 km), with storage and intelligent computing devices deployed separately. By extending the high-performance transmission technology originally used within a DC to operate across data centers, a DC cluster with separated storage and computing is constructed. In 2023, Amazon implemented such a storage-and-computing-separated data center deployment, with high-throughput data transmission over the MAN at 100 Gbps across 100 km. In addition, the training samples of customers in industries such as government and finance are sensitive data, and the consequences of data leakage are very serious. The sample data needs to be stored in the customer's private DC and connected to the cloud service provider's AI DC through a wide-area high-performance network.

Distributed coordination reasoning scenario: To improve the user experience of computing services, an architecture with centralized training and distributed reasoning is deployed. Training is carried out at core computing nodes far from the user, while inference responds to the user at distributed edge nodes with closer distance, shorter latency, and better experience. Local sample data needs to be transmitted between the core and edge DCs through a high-performance MAN to fine-tune and optimize the trained model. In addition, user inference requests and response data require low-latency transmission.

4. Deployment Considerations for HP-WANs

4.1. Host Optimization Deployment

The host optimization deployment mainly adopts an improved transport-layer protocol on the NIC of the host server to achieve long-distance, efficient transmission over lossy networks. The optimization of the transport-layer protocol may involve caching and reassembling out-of-order packets, packet-loss-tolerant and error-correction mechanisms for lossy networks, etc. The host optimization deployment is shown in Figure 3.

         +--------------+      +---------------+      +--------------+
         |              |      |               |      |              |
    +----+----+         |      |      WAN      |      |         +----+----+
    | Host A  |         +------+     (lossy)   +------+         | Host B  |
    +----+----+         |      |               |      |         +----+----+
         |DCN or        |      |               |      |DCN or        |
         |dedicated line|      |               |      |dedicated line|
         +--------------+      +---------------+      +--------------+
   The NIC with transport                             The NIC with transport
   protocol optimization                              protocol optimization

Figure 3: Host Optimization Deployment Consideration
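One of the host-side mechanisms named above, caching and reassembling out-of-order packets, can be sketched as follows. This is an illustrative sketch only; the ReorderBuffer class and its API are invented for this example and are not part of any cited implementation:

```python
# Hypothetical sketch of a host-side transport optimization from
# Section 4.1: buffer out-of-order packets so the application sees an
# in-order stream even when the lossy WAN reorders deliveries.
class ReorderBuffer:
    def __init__(self):
        self.expected = 0   # next sequence number to deliver in order
        self.pending = {}   # out-of-order packets held back, keyed by seq

    def receive(self, seq, payload):
        """Accept a packet; return the payloads now deliverable in order."""
        delivered = []
        self.pending[seq] = payload
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered

buf = ReorderBuffer()
buf.receive(1, "b")         # held back: packet 0 not yet seen
buf.receive(2, "c")         # held back
print(buf.receive(0, "a"))  # gap filled; delivers ['a', 'b', 'c'] in order
```

A real NIC-offloaded implementation would additionally bound the buffer and trigger retransmission requests for persistent gaps; the sketch shows only the reassembly step.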

4.2. WAN optimization Deployment

WAN optimization improves packet loss, bandwidth utilization, and latency to provide high-throughput data transmission between DCs. The optimization of wide area networks may involve path selection, congestion control, flow control, etc. Deterministic forwarding may also reduce the packet loss ratio, latency, and jitter in wide area networks. The WAN optimization deployment is shown in Figure 4.

      +--------------+      +------------------+      +--------------+
      |              |      |                  |      |              |
 +----+----+         |      |      WAN         |      |         +----+----+
 | Host A  |         +------+(High performance)+------+         | Host B  |
 +----+----+         |      |                  |      |         +----+----+
      |DCN or        |      |                  |      |DCN or        |
      |dedicated line|      |                  |      |dedicated line|
      +--------------+      +------------------+      +--------------+
                             The optimization of
                             packet loss, bandwidth
                             utilization, and latency
                             in WAN
Figure 4: WAN Optimization Deployment Consideration

4.3. Gateway Deployment

The solution requires the deployment of gateway devices at the DC edge to isolate or relay traffic between the data center and the wide area network. The gateway devices should support packet caching, buffering, and retransmission for high-performance services, and implement collaboration and interaction between the gateway and the WAN by running optimized high-performance transport-layer protocols, including service awareness, route selection, and congestion control. In addition, the gateway needs to support mapping and conversion between the different high-performance protocols running in the data center and in the WAN. The gateway deployment is shown in Figure 5.

                            +-------------+
+---------+   +---------+   |             |   +---------+   +---------+
| Host A  +---+ Gateway +---+   WAN       +---+ Gateway +---+ Host B  |
+---------+   +---------+   |  (Lossy)    |   +---------+   +---------+
                            +-------------+

Figure 5: Gateway Deployment Consideration

5. Simulation Results

5.1. The Impact of Long-distance Delay

Based on current implementations over 100 km, the delay parameters in this experiment are mainly aimed at wide-area scenarios of 100~2000 km, with round-trip times (RTTs) of 1-20 ms. In terms of parameter selection, this experiment verifies incrementally from 100 km (1 ms RTT) to 2000 km (20 ms RTT).
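The distance-to-RTT mapping used in this experiment (100 km to roughly 1 ms RTT, 2000 km to roughly 20 ms) follows from the propagation speed of light in optical fiber, about two thirds of c, i.e. roughly 5 microseconds per kilometer one way. A quick check (an illustration, not part of the original test setup):

```python
# RTT from fiber propagation delay: light in glass travels at ~2e8 m/s,
# i.e. about 5 microseconds per kilometer each way.
def fiber_rtt_ms(distance_km):
    one_way_us_per_km = 5.0  # approximate one-way delay in fiber
    return 2 * distance_km * one_way_us_per_km / 1000.0

for km in (100, 500, 1000, 2000):
    print(f"{km} km -> {fiber_rtt_ms(km):.0f} ms RTT")
# 100 km gives ~1 ms and 2000 km gives ~20 ms, matching the 1-20 ms range
```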

The impact of long-distance delay on throughput is shown in Figure 6.


 +-------------+---------------------+---------------+---------------------+
 |RTT latency  |message length(bytes)|  distance     |Throughput(Gbps)     |
 +-------------+---------------------+---------------+---------------------+
 |less than 1ms|less than 1024       |less than 100km|more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     1ms     |   256K              |     100km     |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     2ms     |   512K              |     200km     |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     5ms     |   1M                |     500km     |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
 |     10ms    |   8M                |     1000km    |more than 90%@100Gbps|
 +-------------+---------------------+---------------+---------------------+
Figure 6: The Impact of Long-distance Delay on Throughput

The transmission performance of RDMA in different network environments was verified. The impact of long distance and latency on throughput performance is shown in Figure 6. As latency increases (1~20 ms), the RDMA message size needs to be continuously increased to achieve high-performance transmission with near-100% throughput. With a maximum message length of 2 GB, a bandwidth of 100 Gbit/s can be achieved without loss, satisfying the theoretical throughput equation:

Throughput = Window_Size/RTT (1)

The overall analysis shows that by adjusting RDMA parameters (such as message length), high-performance transmission over 1000 km (with over 90% throughput) can be achieved. In practice, the message length setting is related to the specific network application, device cache space, and cache threshold settings, so the message length cannot be increased without limit.
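Equation (1) is the bandwidth-delay product relation: to keep the link full, the data in flight must equal Throughput * RTT. A small sketch of what that implies for the window size (my own illustration of the equation, not the draft's test code):

```python
# From Equation (1): Throughput = Window_Size / RTT, so the window
# (bytes in flight) needed to sustain a target rate grows linearly with RTT.
def window_mbytes(throughput_gbps, rtt_ms):
    bits_in_flight = throughput_gbps * 1e9 * (rtt_ms / 1000.0)
    return bits_in_flight / 8 / 1e6   # convert bits to megabytes

print(window_mbytes(100, 1))   # 100 Gbps at 1 ms RTT: 12.5 MB in flight
print(window_mbytes(100, 10))  # 100 Gbps at 10 ms RTT: 125 MB in flight
```

This is why the message length in Figure 6 must grow with distance: more data must be outstanding to fill the longer pipe.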

5.2. The Impact of Packet Loss

Traditional RDMA adopts the Go-Back-N retransmission mechanism, which retransmits all data packets starting from the lost packet N, so packet loss can cause significant performance degradation in RDMA. TCP, by contrast, only needs to retransmit the individual lost packets, and the latest RDMA network cards have started using selective repeat. The TCP throughput as a function of packet loss rate (p), message size (MSS), latency (RTT), and an empirical constant (C) can be referred to:

Throughput = (MSS/RTT) * C/sqrt(p) (2)

The actual testing performance of RDMA differs from that of TCP; the main impact of wide area networks is latency, while the retransmission and congestion control algorithm models are similar. Therefore, the theoretical rate of RDMA can be empirically estimated by adjusting the value of the parameter C in Equation (2) (TCP empirical value: C = 1.0).
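The retransmission-volume difference between Go-Back-N and selective repeat described above can be illustrated with a toy count (the window size and loss position are hypothetical numbers chosen for illustration):

```python
# Toy comparison: packets retransmitted after a single loss in a window.
# Go-Back-N resends the lost packet and everything sent after it in flight;
# selective repeat resends only the lost packet.
def retransmissions(window, loss_index, selective):
    return 1 if selective else window - loss_index

window = 1000  # packets in flight when packet 10 is lost
print(retransmissions(window, 10, selective=False))  # Go-Back-N: 990
print(retransmissions(window, 10, selective=True))   # selective repeat: 1
```

Since the window grows with the bandwidth-delay product, Go-Back-N's penalty per loss grows with distance, which is why selective repeat matters most in wide-area deployments.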

When both large delay and packet loss coexist, achieving over 80% throughput on a 100G link requires a packet loss rate of less than 0.005% within the data center. In the wide-area DC interconnection scenario, the retransmission cost and response time increase with propagation delay, so the packet loss threshold is even stricter than in the data center, requiring the network to be as lossless as possible. In a wide-area scenario, even with a selective-retransmission optimization algorithm and a packet loss rate below 0.001%, it is difficult to achieve a bandwidth utilization rate of over 70%.

In general, the network performance indicators for RDMA over a wide area of 1000 kilometers are as follows: the throughput of RDMA over a wide area is directly proportional to the message size, and inversely proportional to the network packet loss rate and latency. To ensure 80% throughput on 100 Gbps links over 1000 kilometers, the message length needs to be greater than 512 KB, resulting in extremely strict packet loss rate requirements due to the increased latency.
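As a worked example of how strict the loss requirement becomes, the loss-limited rate model of Equation (2) can be inverted to find the largest tolerable loss rate for a target throughput. This sketch uses the classic Mathis-style form rate = (MSS/RTT) * C/sqrt(p) with C = 1.0; the specific numbers are illustrative, not the draft's measured results:

```python
# Invert the loss-limited rate model rate = (MSS/RTT) * C / sqrt(p)
# to get the maximum loss rate p that still sustains a target rate.
def max_loss_rate(mss_bytes, rtt_s, target_bps, c=1.0):
    rate_pkts_per_s = target_bps / (8 * mss_bytes)  # target in messages/s
    return (c / (rate_pkts_per_s * rtt_s)) ** 2

# 512 KB messages over ~1000 km (10 ms RTT), targeting 80 Gbit/s:
p = max_loss_rate(512 * 1024, 0.010, 80e9)
print(f"max tolerable loss rate: {p:.2e}")  # on the order of 1e-5
```

The quadratic dependence on the rate-RTT product is what makes the loss budget collapse as distance grows: doubling the RTT at the same target rate cuts the tolerable loss rate by a factor of four.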

6. Security Considerations

TBA

7. IANA Considerations

This document makes no requests for IANA action.

8. Acknowledgements

TBA

9. References

9.1. Normative References

[I-D.xiong-hpwan-uc-req-problem]
Xiong, Q., Yao, K., Huang, C., Zhengxin, H., and J. Zhao, "Use Cases, Requirements and Problems for High Performance Wide Area Network", Work in Progress, Internet-Draft, draft-xiong-hpwan-uc-req-problem-00, <https://datatracker.ietf.org/doc/html/draft-xiong-hpwan-uc-req-problem-00>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, <https://www.rfc-editor.org/info/rfc3168>.
[RFC7424]
Krishnan, R., Yong, L., Ghanwani, A., So, N., and B. Khasnabish, "Mechanisms for Optimizing Link Aggregation Group (LAG) and Equal-Cost Multipath (ECMP) Component Link Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424, <https://www.rfc-editor.org/info/rfc7424>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8664]
Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W., and J. Hardwick, "Path Computation Element Communication Protocol (PCEP) Extensions for Segment Routing", RFC 8664, DOI 10.17487/RFC8664, <https://www.rfc-editor.org/info/rfc8664>.
[RFC9232]
Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and A. Wang, "Network Telemetry Framework", RFC 9232, DOI 10.17487/RFC9232, <https://www.rfc-editor.org/info/rfc9232>.
[RFC9438]
Xu, L., Ha, S., Rhee, I., Goel, V., and L. Eggert, Ed., "CUBIC for Fast and Long-Distance Networks", RFC 9438, DOI 10.17487/RFC9438, <https://www.rfc-editor.org/info/rfc9438>.

Authors' Addresses

Junfeng Zhao
CAICT
Beijing
China
Quan Xiong
ZTE Corporation
China