Congestion Control Working Group                                   S. Ji
Internet-Draft                                                     C. Li
Intended status: Standards Track                           China Telecom
Expires: 24 April 2025                                            K. Zhu
                                                     Huawei Technologies
                                                         21 October 2024


   A congestion control mechanism based on distributed AIDC lossless
                                network
            draft-ji-ccwg-distributed-lossless-mechanism-00

Abstract

   This document proposes a congestion control mechanism for distributed
   AI data center (AIDC) lossless networks.  It addresses the decline in
   model training performance caused by congestion and packet loss on
   long-distance links when large models are trained across multiple
   data centers within a region.  In addition, this document describes
   a practice scenario in which the mechanism has been tested.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 24 April 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components



Ji, et al.                Expires 24 April 2025                 [Page 1]

Internet-Draft   Congestion Control for Distributed AIDC    October 2024


   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  Terminology
   2.  Congestion Control Mechanism
     2.1.  Congestion Control Principle
     2.2.  Congestion Control Process
   3.  Practice Scenario
   4.  Conclusion
   5.  Security Considerations
   6.  IANA Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Authors' Addresses

1.  Introduction

   With the rapid development of big data and artificial intelligence
   (AI) technology, AI solutions represented by large models have
   gradually penetrated various industries, and the demand for computing
   power is increasing.  A large-scale GPU cluster is a prerequisite for
   large model training.  However, when deploying a cluster with 10,000
   or even 100,000 GPUs, the computing power that a single intelligent
   data center (DC) can host is limited by the space, power supply, and
   heat dissipation of the computer room.  To solve this problem,
   multiple intelligent DCs within a region can be interconnected into a
   large virtual intelligent computing cluster, which realizes
   collaborative computing among the DCs through a distributed AIDC
   lossless network (also known as RDMA remote) and thereby meets the
   demand for high computing power.

   However, in exploring the use of multiple intelligent DCs to build a
   larger-scale intelligent computing cluster, we have encountered many
   challenges.  For example, RDMA remote generates traffic flows that
   cross long distances.  If congestion occurs on a long-distance link,
   traditional congestion control mechanisms such as PFC and ECN may
   become ineffective because the congestion feedback takes longer to
   arrive; the buffers of network devices are then exhausted and
   packets are eventually lost.

   To address congestion and packet loss when DCs are interconnected
   across long distances, this document proposes a congestion control
   mechanism that alleviates network congestion by shortening the
   congestion feedback time and adjusting the flow rate of the
   transmitting node according to the degree of congestion.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2.  Terminology

   The following terms are used in this document:

   RDMA remote: The interconnection of multiple intelligent DCs within a
   region into a large virtual intelligent computing cluster, realizing
   collaborative computing among the DCs.

   PFC (Priority-based Flow Control): A mechanism that provides
   hop-by-hop, priority-based flow control, enabling multiple types of
   traffic flows to run on Ethernet links without affecting each other.

   ECN (Explicit Congestion Notification): An end-to-end congestion
   control mechanism in which a congested network device marks packets
   and the receiving node responds by sending CNPs to the transmitting
   node, which then reduces its flow rate.

   CNP: Congestion Notification Packet.

2.  Congestion Control Mechanism

2.1.  Congestion Control Principle

   At present, the most widely used congestion control mechanism in RoCE
   (RDMA over Converged Ethernet) networks is ECN.  When congestion
   occurs on a network device, the device marks packets with an ECN
   label and forwards them to the receiving node, and the receiving node
   then sends CNPs to the transmitting node to notify it to reduce its
   transmitting rate, thus alleviating network congestion.  However, in
   distributed AIDC lossless networks, training large models
   cooperatively across multiple DCs generates long-distance data
   transmission.  If the long-distance link is congested, the CNPs
   generated by the traditional ECN mechanism have a longer feedback
   path, which may prevent the flow rate of the transmitting node from
   being reduced in time, resulting in packet loss and ultimately
   degrading the training performance of large models.  To meet the
   lossless requirements of distributed AIDC networks, this document
   proposes a congestion control mechanism that moves the "congestion
   point" from the long-distance link to the network device closest to
   the transmitting node, so that congestion over long distances can be
   handled with low latency.
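
   As a rough illustration of why feedback delay matters on such links,
   the following Python sketch estimates the propagation delay of a
   feedback message that crosses a long-distance link once, and the
   amount of data the transmitting node sends in the meantime.  The
   fiber delay of about 5 microseconds per kilometer is a common
   approximation, and the 400 Gbit/s sending rate is an assumed example
   value that this document does not specify.

```python
# Assumption: light travels through fiber at about 5 microseconds per km.
FIBER_DELAY_US_PER_KM = 5.0


def one_way_delay_us(distance_km: float) -> float:
    """Propagation delay of a message crossing the link once."""
    return distance_km * FIBER_DELAY_US_PER_KM


def bytes_sent_during(rate_gbps: float, delay_us: float) -> float:
    """Data a sender injects at rate_gbps while feedback is in transit.

    1 Gbit/s sustained for 1 microsecond is 1000 bits, i.e. 125 bytes.
    """
    return rate_gbps * delay_us * 125.0


if __name__ == "__main__":
    delay = one_way_delay_us(120.0)            # 120 km long-distance link
    backlog = bytes_sent_during(400.0, delay)  # hypothetical 400 Gbit/s sender
    print(f"delay: {delay:.0f} us, data sent meanwhile: {backlog / 1e6:.0f} MB")
```

   With these assumed values, one crossing of a 120 km link takes about
   600 microseconds, during which roughly 30 MB of data is sent.  Queuing
   and processing delays are ignored, so real feedback would be slower.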

2.2.  Congestion Control Process

   Figure 1 shows the process of the congestion control mechanism.  H1
   and H2 are the transmitting node and the receiving node,
   respectively.  R11 is the next-hop device closest to the transmitting
   node (known as the proximal device), and R12 is the device at the
   other end of the long-distance link.  The distance between R11 and
   R12 can be in the range of hundreds of kilometers (120 km in
   Figure 1).


                  1.notification message
                  <-------------------
   +-------+     +------+  120km +------+     +-------+
   |  H1   #-----#  R11 #--------#  R12 #-----#   H2  |
   +-------+     +------+        +------+     +-------+
        2.flow-control
        protocol packets
      <--------------

   Figure 1: The Process of Congestion Control Mechanism

   • First, each device monitors the network state, including the queue
   accumulation and buffer usage of each port, to determine whether
   congestion occurs on the link;

   • If congestion occurs on the link and the congested device (R12) is
   not the proximal device (R11) of the transmitting node, R12 sends a
   notification message to R11.  The notification message contains
   information such as the number of the congested port and the queue
   depth and buffer usage of that port;

   • R11 determines the degree of congestion from the content of the
   notification message and calculates the number of CNPs or other
   flow-control protocol packets that need to be sent to H1.  The
   flow-control protocol packets contain information about the
   congested traffic flows;

   • After receiving the flow-control protocol packets, H1 reduces the
   transmitting rate of the corresponding congested traffic flows to
   alleviate the congestion of network devices.
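
   The four steps above can be sketched in Python as follows.  This is
   a minimal illustration under stated assumptions, not the normative
   behavior: the notification fields follow the text, but the
   thresholds, the mapping from congestion degree to the number of
   CNPs, and the per-CNP rate reduction are hypothetical values chosen
   only for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Notification:
    """Notification message sent by the congested device (R12) to R11."""
    port: int            # number of the congested port
    queue_depth: int     # queue depth of the congested port, in bytes
    buffer_usage: float  # buffer usage of the congested port, 0.0 to 1.0


# Hypothetical tuning values; the document does not define any of them.
QUEUE_LIMIT_BYTES = 2_000_000  # queue depth treated as fully congested
BUFFER_THRESHOLD = 0.5         # buffer usage at which a port is congested
MAX_CNPS = 8                   # upper bound on CNPs per notification
RATE_FACTOR_PER_CNP = 0.9      # H1 multiplies its rate by this per CNP


def is_congested(queue_depth: int, buffer_usage: float) -> bool:
    """Step 1: a device checks the queue and buffer state of one port."""
    half_full = queue_depth >= QUEUE_LIMIT_BYTES // 2
    return half_full or buffer_usage >= BUFFER_THRESHOLD


def congestion_degree(note: Notification) -> float:
    """Step 3a: R11 maps a notification to a degree in [0.0, 1.0]."""
    queue_ratio = min(note.queue_depth / QUEUE_LIMIT_BYTES, 1.0)
    return max(queue_ratio, min(note.buffer_usage, 1.0))


def cnp_count(note: Notification) -> int:
    """Step 3b: heavier congestion yields more CNPs (at least one)."""
    return max(1, math.ceil(congestion_degree(note) * MAX_CNPS))


def reduced_rate(rate_gbps: float, cnps: int) -> float:
    """Step 4: H1 lowers the rate of the congested flow once per CNP."""
    return rate_gbps * RATE_FACTOR_PER_CNP ** cnps


if __name__ == "__main__":
    # Step 2: R12 detects congestion on port 3 and notifies R11.
    note = Notification(port=3, queue_depth=1_500_000, buffer_usage=0.6)
    if is_congested(note.queue_depth, note.buffer_usage):
        n = cnp_count(note)
        print(n, round(reduced_rate(100.0, n), 2))
```

   In this sketch the degree of congestion is the larger of the queue
   fill ratio and the buffer usage, so that either signal alone can
   trigger a strong reaction; other mappings are equally possible.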

   The traffic of large model training is periodic: if a certain flow
   is congested in the first training period, it is expected to be
   congested in every subsequent period.  Therefore, in this mechanism
   the network devices record information about the forwarded packets in
   flow table entries, including which flows are congested.  When a
   congested flow recurs periodically, R11 directly sends CNPs or other
   flow-control protocol packets to H1 based on the learned flow table
   entries to control the transmitting rate, and the remote congested
   device (R12) no longer needs to send notification messages.  In this
   way, after the congestion information of the entire network has been
   obtained in the first training period, the traffic flows can be
   lossless in the remaining periods.
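
   The learning behavior described above can be sketched as a small
   flow table kept by R11.  The class name, the use of a string as the
   flow identifier, and the assumption that the caller supplies the
   index of the current training period are all hypothetical; this
   document does not define how flows are identified or how periods
   are detected.

```python
from typing import Dict


class ProximalFlowTable:
    """Hypothetical flow table kept by the proximal device (R11)."""

    def __init__(self) -> None:
        self._cnps: Dict[str, int] = {}         # flow id -> CNPs to send
        self._last_period: Dict[str, int] = {}  # flow id -> last period handled

    def learn(self, flow_id: str, cnps: int, period: int) -> None:
        """Record a flow that R12 reported as congested in this period."""
        self._cnps[flow_id] = cnps
        self._last_period[flow_id] = period

    def cnps_to_send(self, flow_id: str, period: int) -> int:
        """Return the CNPs R11 sends to H1 without waiting for R12.

        A learned flow is throttled once per later period; unknown flows
        and flows already handled in this period yield 0.
        """
        if flow_id not in self._cnps:
            return 0
        if self._last_period[flow_id] >= period:
            return 0
        self._last_period[flow_id] = period
        return self._cnps[flow_id]


if __name__ == "__main__":
    table = ProximalFlowTable()
    table.learn("flow-A", cnps=6, period=1)   # learned from R12 in period 1
    print(table.cnps_to_send("flow-A", 2))    # period 2: R11 acts on its own
```

   Keying the table on the flow and remembering the last handled period
   keeps R11 from sending the same CNPs repeatedly within one period.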

3.  Practice Scenario

   Lossless interconnection technology for distributed AIDC networks
   has been a research hotspot in recent years.  The congestion control
   mechanism proposed in this document has been applied in a test
   environment on the live network.

   Figure 2 and Figure 3 show the test environments of two AI training
   clusters, each of which deploys 512 GPUs.  The distance between
   cluster A and cluster B is 120 km, and the spine switches of the two
   clusters are interconnected through wavelength division equipment
   with a capacity of 25.6 Tbit/s to collaboratively train large models
   with billions of parameters.

               +-------------+                +-------------+
               |    Spine1   |                |    Spine2   |
               +-+---+--+--+-+                +--+---+--+-+-+
                /    |  |  |                     |   |  |  |
               /     |  |  |                     |   |  |  |
              /   +--+--+--+---------------------+   |  +  +
             /   /   |  |  |   +---------------------+ /    \
            /   /    |  |  +---|-----------------+----/----+ \
           /   /     +  +------|----------+          /      \ \
          /   /       \        |          |         /        \ \
  +------+---++      +-+-------+-+      +-+-------+-+      +--+-+------+
  |   leaf1   |      |   leaf2   |      |   leaf3   | .... |   leaf16  |
  +--+----+---+      +--+----+---+      +--+----+---+      +--+----+---+
     |    |             |    |             |    |             |    |
     H1...H4           H5...H8            H9...H12           H61...H64

                       Figure 2:   Cluster A

               +-------------+                +-------------+
               |    Spine3   |                |    Spine4   |
               +-+---+--+--+-+                +--+---+--+-+-+
                /    |  |  |                     |   |  |  |
               /     |  |  |                     |   |  |  |
              /   +--+--+--+---------------------+   |  +  +
             /   /   |  |  |   +---------------------+ /    \
            /   /    |  |  +---|-----------------+----/----+ \
           /   /     +  +------|----------+          /      \ \
          /   /       \        |          |         /        \ \
  +------+---++      +-+-------+-+      +-+-------+-+      +--+-+------+
  |   leaf17  |      |   leaf18  |      |   leaf19  | .... |   leaf32  |
  +--+----+---+      +--+----+---+      +--+----+---+      +--+----+---+
     |    |             |    |             |    |             |    |
    H65...H68           H69...H72         H73...H76          H125...H128

                        Figure 3:   Cluster B

   The experimental results show that, with the same number of GPUs, the
   training performance of the distributed intelligent DCs reaches over
   90% of that of a single centralized intelligent DC, which supports
   the feasibility of the distributed AIDC lossless network scheme and
   the proposed congestion control mechanism.

4.  Conclusion

   Building distributed AI training clusters across multiple data
   centers is an important research direction for AIDC lossless
   networks.  The congestion control mechanism proposed in this document
   addresses congestion and packet loss in long-distance DC
   interconnection by shortening the congestion feedback time and
   adjusting the flow rate of the transmitting node according to the
   degree of congestion, and can thereby help promote the construction
   of distributed AIDC lossless networks.

5.  Security Considerations

   There is no additional security risk introduced by this design.

6.  IANA Considerations

   This document introduces no additional considerations for IANA.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001,
              <https://www.rfc-editor.org/info/rfc3168>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References

   [I-D.hcl-rtgwg-ai-network-problem]
              Huo, P., Chen, G., Lin, C., and Z. Jiang, "Gap Analysis,
              Problem Statement, and Requirements in AI Networks", Work
              in Progress, Internet-Draft, draft-hcl-rtgwg-ai-network-
              problem-01, 23 August 2024,
              <https://datatracker.ietf.org/doc/html/draft-hcl-rtgwg-ai-
              network-problem-01>.

   [I-D.he-huang-rtgwg-wan-lossless-framework]
              He, T., Huang, H., Zhengxin, H., Wang, N., and T. Zhou,
              "Framework for Implementing Lossless Techniques in Wide
              Area Networks", Work in Progress, Internet-Draft, draft-
              he-huang-rtgwg-wan-lossless-framework-00, 5 July 2024,
              <https://datatracker.ietf.org/doc/html/draft-he-huang-
              rtgwg-wan-lossless-framework-00>.

   [I-D.huang-rtgwg-wan-lossless-uc]
              Zhengxin, H., He, T., Huang, H., and T. Zhou, "Use Cases
              and Requirements for Implementing Lossless Techniques in
              Wide Area Networks", Work in Progress, Internet-Draft,
              draft-huang-rtgwg-wan-lossless-uc-01, 8 July 2024,
              <https://datatracker.ietf.org/doc/html/draft-huang-rtgwg-
              wan-lossless-uc-01>.

Authors' Addresses

   Siwei Ji
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   China
   Email: jisw@chinatelecom.cn


   Cong Li
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   China
   Email: licong@chinatelecom.cn


   Keyi Zhu
   Huawei Technologies
   Huawei Campus, No.156 Beiqing Road
   Beijing, 100095
   China
   Email: zhukeyi@huawei.com