This document describes the typical scenarios and deployment considerations for High Performance Wide Area Networks (HP-WANs). It also provides simulation results for data transmission in WANs and analyses the impacts on throughput.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 April 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
As described in [I-D.xiong-hpwan-uc-req-problem], a High Performance Wide Area Network (HP-WAN) places higher performance requirements on WANs. High-performance data transmission should provide low latency, high throughput, and low CPU utilization, which can significantly improve the performance and efficiency of intra-DC and DC interconnection networks. At present, tests and deployments of long-distance, high-performance data transmission have been carried out in operators' WANs, cloud service providers' DC interconnection networks, and research institutions' private networks. However, challenges remain in providing high performance in long-distance, wide-area network deployments.¶
This document describes the typical scenarios and deployment considerations for High Performance Wide Area Networks (HP-WANs). It also provides simulation results for data transmission in WANs and analyses the impacts on throughput.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The terminology used in this document is defined in [I-D.xiong-hpwan-uc-req-problem].¶
According to transmission distance and deployment requirements, high-throughput transmission includes two types of scenarios: high-volume data transmission over thousands of kilometers in WANs, and collaborative data transmission over hundreds of kilometers in MANs.¶
There are two types of scenarios: massive research data transmission between HPCs, and transmission of training samples between DCs for AI. The long-distance data transmission scenario is shown in Figure 1, where data flows are transmitted between two sites or DCs separated by distances ranging from 100km to 1000km.¶
Massive research data transmission between HPCs: the scenario of big data migration over thousands of kilometers mainly refers to high-throughput transmission of massive data between scientific research institutions. At present, research institutions in some countries, such as the US ESnet6 and the EU EuroHPC program, are deploying wide-area high-performance networks to support the construction and operation of high-performance computing and data interconnection infrastructure. In this scenario, data transmission is usually carried out periodically or on demand, with each transmission ranging from a few terabytes to several hundred terabytes, and data transmission cost and security need to be balanced.¶
Data transmission of training samples between DCs for AI: the construction of large-scale DCs for AI is limited by energy and land resources. Allocating training tasks to data centers with lower computing and electricity prices has become a cost-effective option. When the distance between DCs exceeds 1000km, a wide-area high-performance network is required to transmit high-throughput training samples and corpus data. Training large models on billions to trillions of tokens usually requires several hundred terabytes to over a petabyte of corpus data, with a large amount of data transmitted per session, which places high demands on transmission throughput and stability.¶
There are two types of scenarios: data transmission between data centers with separated storage and computing, and high-throughput data transmission between DCs under distributed intelligent computing. The collaborative and interactive data transmission scenario is shown in Figure 2, where data flows are transmitted between two or more DCs separated by distances ranging from 80km to 100km.¶
Storage and computing separation scenario: cloud service providers deploy multiple data centers in a MAN (under 100km), with storage and intelligent computing devices deployed separately. By extending the high-performance transmission technology used within a single DC across data centers, a DC cluster with separated storage and computing is constructed. In 2023, Amazon implemented a storage/computing-separated data center design with 100Gbps high-throughput data transmission over a 100-kilometer MAN. In addition, the training samples of customers in industries such as government and finance are sensitive data, and the consequences of data leakage are very serious; such sample data needs to be stored in the customer's private DC and connected to the cloud service provider's DC for AI through a wide-area high-performance network.¶
Distributed coordination reasoning scenario: to improve the user experience of computing services, an architecture with centralized training and distributed inference is deployed. Training is carried out at core computing nodes far away from the user, while inference responds to the user at distributed edge nodes with shorter distances, lower latency, and better experience. Local sample data needs to be transmitted between the core and edge DCs through a high-performance MAN to fine-tune and optimize the trained model. In addition, user inference requests and response data require low-latency transmission.¶
The host optimization deployment mainly adopts an improved transport-layer protocol on the NIC of the host server to achieve long-distance, efficient transmission over lossy networks. The optimization of the transport-layer protocol may involve caching and reassembly of out-of-order packets, packet-loss tolerance and error-correction mechanisms for lossy networks, etc. The host optimization deployment is shown in Figure 3.¶
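As a purely illustrative aid (not a mechanism defined by this document), the following Python sketch shows a minimal reassembly buffer for out-of-order packets, one of the transport-layer optimizations mentioned above; the class name and sequence-number scheme are assumptions for illustration.¶

   # Minimal sketch (illustrative only) of out-of-order packet
   # reassembly, as performed by an optimized transport on the NIC.
   class ReassemblyBuffer:
       def __init__(self):
           self.expected = 0    # next in-order sequence number
           self.stash = {}      # out-of-order payloads keyed by seq

       def receive(self, seq, payload):
           """Cache the packet; return any in-order run it releases."""
           self.stash[seq] = payload
           released = []
           while self.expected in self.stash:
               released.append(self.stash.pop(self.expected))
               self.expected += 1
           return released

   buf = ReassemblyBuffer()
   print(buf.receive(1, b"B"))  # arrives early: [] (stashed)
   print(buf.receive(0, b"A"))  # releases [b'A', b'B'] in order

A real implementation would also bound the stash and trigger loss recovery (e.g., selective retransmission) when gaps persist.¶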
The WAN optimizes packet loss, bandwidth utilization, and latency to provide high-throughput data transmission between DCs. The optimization of wide-area networks may involve path selection, congestion control, flow control, etc. Deterministic forwarding may also reduce the packet loss ratio, latency, and jitter in wide-area networks. The WAN optimization deployment is shown in Figure 4.¶
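As an illustrative aside (assumed, textbook values, not measurements from this document), the following sketch suggests why congestion control is a key WAN optimization target: classical additive increase of one MSS per RTT needs roughly an hour to fill a 100Gbit/s link at a 20ms RTT.¶

   # Sketch: ramp-up time of classical additive increase (one MSS
   # per RTT) to the bandwidth-delay product of a long-haul link.
   # All values are illustrative assumptions.
   rtt_s, link_bps, mss_bytes = 0.020, 100e9, 1500
   bdp_packets = link_bps * rtt_s / 8 / mss_bytes   # ~166,667 MSS
   ramp_s = bdp_packets * rtt_s                     # one MSS per RTT
   print(f"~{ramp_s / 60:.0f} minutes to reach full window")  # ~56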
This solution requires the deployment of gateway devices at the DC edge to isolate or relay traffic between the data center and the wide-area network. The gateway devices should support packet caching, buffering, and retransmission for high-performance services, and implement collaboration and interaction between the gateway and the WAN by running optimized high-performance transport-layer protocols, including service awareness, route selection, and congestion control. In addition, the gateway needs to support mapping and conversion between the different high-performance protocols running in the data center and the WAN. The gateway deployment is shown in Figure 5.¶
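The following hypothetical Python sketch (names and framing are invented for illustration, not defined by this document) shows the gateway's relay role: terminate the intra-DC protocol, buffer the payload, and re-originate it with WAN-side framing, and vice versa.¶

   # Hypothetical sketch of the gateway's mapping/relay function.
   from collections import deque

   class HPWanGateway:
       def __init__(self):
           self.to_wan = deque()    # buffered DC -> WAN messages
           self.to_dc = deque()     # buffered WAN -> DC messages

       def from_dc(self, payload: bytes):
           # Map the intra-DC protocol (e.g., RoCE) onto an assumed
           # WAN framing; real gateways convert headers, not prefixes.
           self.to_wan.append(b"WAN|" + payload)

       def from_wan(self, frame: bytes):
           # Strip the assumed WAN framing and relay to the DC side.
           assert frame.startswith(b"WAN|")
           self.to_dc.append(frame[len(b"WAN|"):])

Buffering on both sides is what allows the gateway to retransmit locally rather than end-to-end across the WAN.¶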
Based on current implementations over 100km, the selection of delay parameters in this experiment is mainly aimed at wide-area scenarios of 100~2000 km, with round-trip times (RTT) of 1-20ms. In terms of parameter selection, this experiment verifies the range step by step from 100km (1ms delay) to 2000km (20ms delay).¶
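For orientation (an approximation, not this experiment's calibration), the distance-to-RTT mapping above is consistent with a fiber propagation delay of roughly 5 microseconds per kilometer one way:¶

   # Approximate RTT from fiber distance; assumes ~5 us/km one-way
   # propagation (about 2/3 of c). Real paths add queuing,
   # serialization, and routing detours.
   def rtt_ms(distance_km):
       return 2 * distance_km * 5.0 / 1000.0

   for d in (100, 500, 1000, 2000):
       print(f"{d:>5} km -> RTT ~ {rtt_ms(d):.0f} ms")
   # 100 km -> ~1 ms ... 2000 km -> ~20 ms, matching the range above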
The impact of long-distance delay on throughput is shown in Figure 6.¶
The transmission performance of RDMA in different network environments was verified. The impact of long distance and latency on throughput performance is shown in Table 1. As latency increases (1~20ms), the RDMA message size needs to be continuously increased to achieve high-performance transmission with 100% throughput. With the maximum message length of 2GB, a bandwidth of 100Gbit/s can be achieved without loss, consistent with the theoretical throughput equation:¶
Throughput = Window_Size/RTT (1)¶
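A short worked sketch of equation (1) (illustrative values only): the window, i.e., the data in flight, needed to sustain 100Gbit/s grows linearly with RTT.¶

   # Equation (1) rearranged: Window_Size = Throughput * RTT.
   def required_window_bytes(throughput_bps, rtt_s):
       return throughput_bps * rtt_s / 8

   for rtt_ms in (1, 5, 10, 20):
       w = required_window_bytes(100e9, rtt_ms / 1000)
       print(f"RTT {rtt_ms:>2} ms -> window ~ {w / 2**20:.0f} MiB")
   # 1 ms -> ~12 MiB; 20 ms -> ~238 MiB of data in flight

Even at 20ms, the required in-flight data (~238 MiB) is well below the 2GB maximum message length, consistent with the lossless 100Gbit/s result above.¶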
The overall analysis shows that by adjusting RDMA parameters (such as message length), high-performance transmission over 1000km (with over 90% throughput) can be achieved. The message length setting is related to the specific network application, device cache space, and cache threshold settings, and the message length cannot be increased without limit.¶
Traditional RDMA adopts the Go-Back-N retransmission mechanism, which retransmits the dropped packet N and all packets sent after it; packet loss can therefore cause significant performance degradation in RDMA. TCP, by contrast, only needs to retransmit the individual lost packets, and the latest RDMA network cards have started using selective repeat. Therefore, the TCP throughput model, expressed in terms of packet loss rate (p), message size (MSS), latency (RTT), and an empirical constant (C), can be referenced:¶
Throughput = (MSS / RTT) * C / sqrt(p)    (2)¶
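Evaluating equation (2) with illustrative values (1500-byte MSS, 10ms RTT for roughly 1000km, C = 1.0) shows how sharply small-message throughput collapses under loss over long distances; the numbers are assumptions, not measurements from Table 1.¶

   import math

   # Equation (2), a Mathis-style model; C is an empirical constant
   # (C = 1.0 for TCP per the text; RDMA is approximated by
   # adjusting C).
   def throughput_bps(mss_bytes, rtt_s, p, c=1.0):
       return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(p))

   # 1500-byte MSS, 10 ms RTT (~1000 km), 0.001% packet loss:
   print(throughput_bps(1500, 0.010, 1e-5) / 1e9)  # ~0.38 Gbit/s

With these assumptions, a 1500-byte MSS yields well under 1Gbit/s, which is why larger messages and very low loss rates are required.¶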
The actual tested performance of RDMA differs from that of TCP; the main impact of wide-area networks is latency, while the retransmission and congestion control algorithm models are similar. Therefore, the theoretical rate of RDMA is estimated empirically by adjusting the value of parameter C in equation (2) (for TCP, the empirical value is C = 1.0).¶
When higher delay and packet loss coexist, maintaining over 80% throughput on a 100G link requires the packet loss rate within the data center to be less than 0.005%. In the wide-area DC interconnection scenario, due to the increase in retransmission cost and response time caused by propagation delay, the packet loss threshold is even stricter than in the data center, requiring the network to be as lossless as possible. In the wide-area scenario, even with a selective-retransmission optimization algorithm, it is difficult to achieve a bandwidth utilization of over 70% even at a packet loss rate as low as 0.001%.¶
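To make the sensitivity concrete, equation (2) can be inverted to estimate the loss rate a target throughput tolerates. The following sketch uses assumed values (512KB messages, 10ms RTT, 80% of 100Gbit/s); the result is indicative only and differs from the measured thresholds cited above.¶

   # Invert equation (2): p_max = (MSS_bits * C / (RTT * target))^2.
   def max_loss_rate(mss_bytes, rtt_s, target_bps, c=1.0):
       return (mss_bytes * 8 * c / (rtt_s * target_bps)) ** 2

   p = max_loss_rate(512 * 1024, 0.010, 0.8 * 100e9)
   print(f"max tolerable loss ~ {p:.1e}")  # ~2.7e-05, i.e. ~0.003%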
In general, the network performance indicators for RDMA over a wide area of 1000 kilometers are as follows: the throughput of RDMA over a wide area is directly proportional to the message size, and inversely proportional to the network packet loss rate and latency. To ensure 80% throughput on 100Gbps links over 1000 kilometers, the message length needs to be greater than 512KB, and the increased latency results in extremely strict packet loss rate requirements.¶
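As a final worked check of the 512KB figure (illustrative arithmetic, combining equation (1) with the stated target): sustaining 80% of 100Gbit/s at a 10ms RTT requires about 100MB in flight, i.e., on the order of 200 concurrent 512KB messages.¶

   # Equation (1): in-flight data for 80 Gbit/s over a 10 ms RTT,
   # expressed as a count of 512 KB messages (illustrative only).
   bdp_bytes = 0.8 * 100e9 * 0.010 / 8      # ~100 MB in flight
   messages = bdp_bytes / (512 * 1024)
   print(f"~{bdp_bytes / 1e6:.0f} MB ~ {messages:.0f} messages")  # ~191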
TBA¶
This document makes no requests for IANA action.¶
TBA¶