IP ECMP Load Balancing

Equal-Cost Multipath (ECMP) refers to the distribution of packets over two or more outgoing links that share the same routing cost. Static, IS-IS, OSPF, and BGP routes to IPv4 and IPv6 destinations can be programmed into the datapath by their respective applications with multiple IP ECMP next hops.

The SR Linux device load-balances traffic over multiple equal-cost links with a hashing algorithm that uses header fields from incoming packets to calculate which link to use. When an IPv4 or IPv6 packet is received on a subinterface, and it matches a route with a number of IP ECMP next hops, the next hop that forwards the packet is selected based on a computation using this hashing algorithm. The goal of the hash computation is to keep packets in the same flow on the same network path, while distributing traffic proportionally across the ECMP next hops, so that each of the N ECMP next hops carries approximately 1/Nth of the load.
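The selection logic can be illustrated with a minimal Python sketch (the function and field names are illustrative, not the device's actual algorithm): a stable digest of the flow's header fields, taken modulo the number of next hops, keeps each flow pinned to one member while spreading distinct flows across all members.

```python
import hashlib

def select_next_hop(flow_tuple, next_hops, hash_seed=0):
    """Pick an ECMP next hop for a flow.

    flow_tuple holds the header fields that identify the flow, e.g.
    (src_ip, dst_ip, protocol, l4_src_port, l4_dst_port). All packets
    of a flow yield the same digest, so they stay on the same path;
    distinct flows spread over the N members, each carrying ~1/N.
    """
    key = repr((hash_seed,) + tuple(flow_tuple)).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

hops = ["nh-A", "nh-B", "nh-C"]
flow = ("192.0.2.1", "198.51.100.7", 6, 49152, 443)
# Same flow always maps to the same member:
assert select_next_hop(flow, hops) == select_next_hop(flow, hops)
```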

The hash computation takes various key and packet header field values as inputs and returns a value that indicates the next hop. The key and field values that can be used by the hash computation depend on the platform, packet type, and configuration options, as follows:

On 7250 IXR platforms, the following can be used in the hash computation:

  • Hash-seed (0 to 65535)

    On 7250 IXR-6, 7250 IXR-10, 7250 IXR-X1b, and 7250 IXR-X3b devices, the hash-seed can be system-generated (the default) or user-specified. If the hash-seed is system-generated, SR Linux generates a hash-seed using the least-significant 16 bits of the base chassis MAC address.
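As a rough illustration of the derivation described above (the MAC address below is hypothetical, and the parsing is a sketch rather than the device's internal code), the least-significant 16 bits of a MAC address are simply its last two octets:

```python
def seed_from_mac(mac: str) -> int:
    """Derive a 16-bit hash-seed from the least-significant 16 bits
    (the last two octets) of a colon-separated MAC address string."""
    octets = [int(b, 16) for b in mac.split(":")]
    return (octets[-2] << 8) | octets[-1]

# Hypothetical base chassis MAC; 0x08:0x9b -> 0x089B = 2203
assert seed_from_mac("00:ab:cd:12:08:9b") == 2203
```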

    On 7250 IXR-6e, 7250 IXR-10e, and 7250 IXR-18e devices with 36 x 800 IMM, the system generates a per-interface hash-seed using the chassis base MAC address and the 64-bit port ID as inputs; because these inputs are fixed, the hash-seed remains the same after every restart of the port, LAG, or IMM.

  • For IPv4 TCP/UDP non-fragmented packets: source IPv4 address, destination IPv4 address, IP protocol, Layer 4 source port, Layer 4 destination port. The algorithm is asymmetric; that is, inverting source and destination pairs does not produce the same result.
  • For IPv6 TCP/UDP non-fragmented packets: source IPv6 address, destination IPv6 address, IPv6 flow label (even if it is 0), IP protocol (IPv6 next-header value in the last extension header), Layer 4 source port, Layer 4 destination port. The algorithm is symmetric; that is, inverting source and destination pairs produces the same result.
  • For all other packets: source IPv4 or IPv6 address, destination IPv4 or IPv6 address.
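The symmetric/asymmetric distinction above can be sketched in Python (hypothetical helper names; SHA-256 stands in for the hardware hash): an asymmetric hash feeds the fields in order, so swapping source and destination changes the result, while a symmetric hash normalizes each pair first, so both directions of a conversation hash identically.

```python
import hashlib

def _digest(parts):
    key = "|".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def asymmetric_hash(src, dst, proto, sport, dport):
    # Field order matters: (src, dst) and (dst, src) hash differently.
    return _digest([src, dst, proto, sport, dport])

def symmetric_hash(src, dst, proto, sport, dport):
    # Sorting each pair makes the result direction-independent.
    a, b = sorted([src, dst])
    p, q = sorted([sport, dport])
    return _digest([a, b, proto, p, q])

fwd = ("10.0.0.1", "10.0.0.2", 6, 1111, 2222)
rev = ("10.0.0.2", "10.0.0.1", 6, 2222, 1111)
assert asymmetric_hash(*fwd) != asymmetric_hash(*rev)
assert symmetric_hash(*fwd) == symmetric_hash(*rev)
```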

On 7250 IXR, 7220 IXR-H4, and 7220 IXR-H5 devices, if an IP packet being forwarded has a UDP destination port of 4791, indicating it is carrying an RDMA over Converged Ethernet version 2 (ROCEv2) payload, the 24-bit Dest Queue-pair value in the ROCEv2 header (BTH+) is added to the hash algorithm. In this case, hashing is based on the existing 5-tuple flow and the new Dest Queue-pair value. The system determines an IP packet's Dest Queue-pair value based on the format of the BTH+ header by looking for the 24-bit value at a 5-byte offset from the end of the UDP header. On 7220 IXR-H4 and 7220 IXR-H5 devices, the UDF mechanism is used to match on qualifying packets and extract the Dest Queue-pair value from the specified offset.
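A sketch of the extraction described above, assuming a well-formed 12-byte BTH immediately after the UDP header (the sample bytes and helper name are hypothetical):

```python
ROCEV2_UDP_PORT = 4791

def dest_qp_from_bth(payload: bytes) -> int:
    """Extract the 24-bit Destination Queue Pair from a BTH+ header,
    where payload starts at the end of the UDP header. The Dest QP
    field sits at a 5-byte offset and is 3 bytes long."""
    return int.from_bytes(payload[5:8], "big")

# Hypothetical 12-byte BTH: opcode, flags, P_Key, reserved,
# Dest QP = 0x0000d2 (210), then the remaining BTH bytes.
bth = bytes([0x64, 0x40, 0xff, 0xff, 0x00,
             0x00, 0x00, 0xd2, 0x80, 0x00, 0x00, 0x01])
assert dest_qp_from_bth(bth) == 210
```

In the device's hashing, this value would be appended to the existing 5-tuple inputs before the hash computation.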

On the 7220 IXR-D1, 7220 IXR-D2, 7220 IXR-D3, 7220 IXR-H2, and 7220 IXR-H3 devices, the following can be used in the hash computation:

  • Hash-seed (0 to 65535), which can be system-generated (the default) or user-specified. If the hash-seed is system-generated, SR Linux generates a hash-seed using the least-significant 16 bits of the base chassis MAC address.
  • For IPv4 TCP/UDP non-fragmented packets: VLAN ID, source IPv4 address, destination IPv4 address, IP protocol, Layer 4 source port, Layer 4 destination port. The algorithm is asymmetric.
  • For IPv6 TCP/UDP non-fragmented packets: VLAN ID, source IPv6 address, destination IPv6 address, IPv6 flow label (even if it is 0), IP protocol (IPv6 next-header value in the last extension header), Layer 4 source port, Layer 4 destination port.
  • For all other packets: source IPv4 or IPv6 address, destination IPv4 or IPv6 address.

On 7215 IXS platforms, the following can be used in the hash computation:

  • Source IP address
  • Destination IP address
  • Layer 4 source port
  • Layer 4 destination port
  • Hash seed
  • IPv6 flow label
  • Received MPLS labels (terminated and non-terminated)
  • IP protocol number

Avoiding hash polarization

Hash polarization occurs when the hash algorithm selects ECMP next hops inefficiently; for example, when the system always chooses the same next hop for specific packet flows. Hash polarization can occur when adjacent routers use the same hash-seed.

To avoid hash polarization effects, ensure that directly connected nodes have unique hash-seeds. You can do this by explicitly configuring the hash-seeds, or by verifying that the state value of system-generated hash-seeds is different on adjacent routers.

To check the state value of system-generated hash-seeds, use the info from state command.

The following example displays the system-wide hash-seed (either user-configured or system-generated) on 7220 IXR and 7250 IXR platforms:

--{ + running }--[  ]--
# info with-context from state system load-balancing hash-options
    system {
        load-balancing {
            hash-options {
                hash-seed 2203
            }
        }
    }

The following example displays the system-generated, interface-specific hash-seed on 7250 IXR-6e, 7250 IXR-10e, and 7250 IXR-18e platforms:

--{ + running }--[  ]--
# info with-context from state interface ethernet-1/1 load-balancing hash-seed
    interface ethernet-1/1 {
        load-balancing {
            hash-seed 41521
        }
    }

Checking hash polynomials (7250 IXR platforms)

On 7250 IXR platforms, the system computes a set of load-balancing keys for each received packet. Packets that belong to the same 5-tuple flow have the same load-balancing keys, ensuring they follow the same path through the network and do not get misordered.

For each received packet, the system computes load-balancing keys for the following clients or "hash-users":

  • key 1 is used to select an ECMP level 1 FEC member
  • key 2 is used to select an ECMP level 2 FEC member
  • key 3 is used to select an ECMP level 3 FEC member
  • key 4 is used to select a LAG member
  • key 5 is used to generate a value to be stamped into a network header at egress (for example, IPv6 flow label or VXLAN UDP source port)

To create the load-balancing keys, the system sends a master key (computed from the combined CRC values of each of the packet header layers), along with the user-configured or system-generated hash-seed, to one of eight polynomial functions available on the device. Each hash-user is assigned one of the eight polynomial functions. The system uses the master key and the hash-seed as input to the polynomial function, which returns the load-balancing key for the hash-user as output.
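Conceptually, the key derivation looks like the following sketch, where zlib's CRC-32 with the function index mixed in stands in for the eight hardware polynomial functions (the master key and seed values are hypothetical):

```python
import zlib

def polynomial_fn(index: int, master_key: int, hash_seed: int) -> int:
    """Stand-in for one of the eight polynomial functions: mixes the
    master key (the combined CRC of the packet header layers) with the
    hash-seed, tagged with the function index assigned to the hash-user."""
    data = (master_key ^ hash_seed).to_bytes(8, "big") + bytes([index])
    return zlib.crc32(data)

master_key = 0x5ACEBEEF12345678  # hypothetical combined-CRC master key
seed_a, seed_b = 2203, 41521     # two different hash-seeds

# Different seeds (or different function indexes) yield different
# load-balancing keys for the same packet, which is what avoids
# polarization between adjacent routers.
assert polynomial_fn(1, master_key, seed_a) != polynomial_fn(1, master_key, seed_b)
assert polynomial_fn(1, master_key, seed_a) != polynomial_fn(2, master_key, seed_a)
```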

There is a greater risk of hash polarization if two adjacent routers use the same polynomial function for the same hash-user, instead of two different polynomial functions. You can use an info from state command to display the polynomial function assigned to each hash-user; for example:

--{ + running }--[  ]--
# info with-context from state platform linecard 1 forwarding-complex 1 load-balancing hash-user * hash-polynomial
    platform {
        linecard 1 {
            forwarding-complex 1 {
                load-balancing {
                    hash-user level-1-fec {
                        hash-polynomial 1
                    }
                    hash-user level-2-fec {
                        hash-polynomial 2
                    }
                    hash-user level-3-fec {
                        hash-polynomial 3
                    }
                    hash-user lag {
                        hash-polynomial 4
                    }
                    hash-user network-header {
                        hash-polynomial 5
                    }
                }
            }
        }
    }

If adjacent routers use the same hash-polynomial function for the same hash-user, you can avoid potential hash polarization by changing the hash-seed on one of the routers.

Configuring IP ECMP load balancing

To configure IP ECMP load balancing, you specify hash-options that are used as input fields for the hash calculation, which determines the next hop for packets matching routes with multiple ECMP next hops.

Configure hash options for IP ECMP load balancing

The following example configures hash options for IP ECMP load balancing, including a hash-seed and packet header field values to be used in the hash computation.

--{ * candidate shared default }--[  ]--
# info with-context system load-balancing
    system {
        load-balancing {
            hash-options {
                hash-seed 128
                ipv6-flow-label false
            }
        }
    }

On 7250 IXR-6, 7250 IXR-10, 7250 IXR-X1b, and 7250 IXR-X3b devices, if no value is configured for the hash-seed, the system by default generates a hash-seed using the least-significant 16 bits of the base chassis MAC address. If a hash option is not explicitly configured as either true or false, it defaults to true. The user-configured or system-generated hash-seed applies system-wide.

On 7250 IXR-6e, 7250 IXR-10e, and 7250 IXR-18e devices, SR Linux generates a per-interface hash-seed using the chassis base MAC address and the 64-bit port ID as inputs.

On 7250 IXR devices, if source-address is configured as a hash option, the destination-address must also be configured as a hash option. Similarly, if source-port is configured as a hash option, the destination-port must also be configured as a hash option.

Configure IP ECMP load balancing based only on IPv4 source and destination address (7250 IXR-X1b only)

The following example configures the hash options so that the SR Linux device load-balances IPv4 traffic using only the IPv4 source and destination address. In this example, fields such as Layer 4 protocol and Layer 4 source/destination ports are not used in the load-balancing calculation. The result of this configuration is that all IPv4 traffic with the same source and destination address pair is always forwarded to the same next hop on the same port.

--{ * candidate shared default }--[  ]--
# info with-context system load-balancing hash-options
    system {
        load-balancing {
            hash-options {
                destination-address true
                destination-port false
                ipv6-flow-label false
                protocol false
                source-address true
                source-port false
                mpls-label-stack false
            }
        }
    }

Dynamic Load Balancing

7220 IXR-Hx platforms support dynamic load balancing for ECMP distribution of packets over outgoing links. Dynamic load balancing improves on hash-based load balancing by considering the state of aggregate ECMP group members when assigning flows to groups.

At packet ingress, a flow is identified, and the state of the flow is evaluated. Based on this evaluation, the flow is assigned to an aggregate ECMP group. The dynamic load balancing algorithm analyzes the aggregate ECMP groups and detects ECMP load imbalances among the egress paths based on three load-balancing factors: total egress port queue fill size, Ingress Traffic Manager (ITM) port queue size, and egress port utilization. When the algorithm detects an ECMP load imbalance, it can reassign flows to different aggregate ECMP groups so that balance is restored.
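A simplified sketch of how those three weighted factors might combine into a per-member link quality score (the field names, units, and combination logic are illustrative; the real algorithm is internal to the ASIC):

```python
def link_quality(member, weights):
    """Score an aggregate ECMP group member; lower is better.
    'weights' mirrors the weighting-factor knobs: relative weight for
    total egress port queue fill, ITM port queue size, and egress
    port utilization (all expressed here as illustrative 0-100 values)."""
    return (weights["queue-utilization"] * member["egress_queue_fill"]
            + weights["itm-utilization"] * member["itm_queue_size"]
            + weights["port-utilization"] * member["port_utilization"])

def best_member(members, weights):
    """Return the member a rebalance would prefer."""
    return min(members, key=lambda m: link_quality(m, weights))

weights = {"port-utilization": 75, "queue-utilization": 10, "itm-utilization": 15}
members = [
    {"name": "e1", "egress_queue_fill": 40, "itm_queue_size": 10, "port_utilization": 90},
    {"name": "e2", "egress_queue_fill": 60, "itm_queue_size": 20, "port_utilization": 30},
]
# With port-utilization weighted heaviest, the lightly used port wins.
assert best_member(members, weights)["name"] == "e2"
```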

Dynamic load balancing is enabled for specific prefixes in a network-instance. The following options are configurable:

  • flowset-size: the number of flow entries reserved for each aggregate ECMP group.
  • inactivity-timer: the amount of time a flow must be idle before it is eligible for reassignment to a different aggregate ECMP group member.
  • mode: whether the system can reassign an inactive flow to a different aggregate ECMP group member after the initial assignment (flow-dynamic), or whether flows do not move after the initial assignment (flow-fixed).
  • link-quality-sampling-interval: how often the system samples the link quality.
  • weighting-factor: the relative weight the dynamic load balancing algorithm gives to the total egress port queue fill size, ITM port queue size, and egress port utilization.

Dynamic load balancing is supported for physical interfaces only; it is not supported for LAG interfaces. Dynamic load balancing applies to unicast traffic only.
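The inactivity-timer and mode options interact as in this small sketch (the function name and defaults are illustrative):

```python
def eligible_for_reassignment(mode: str, idle_ms: int,
                              inactivity_timer_ms: int = 100) -> bool:
    """A flow may move to a better aggregate ECMP group member only in
    flow-dynamic mode, and only once it has been idle for at least the
    inactivity-timer; flow-fixed flows never move after assignment."""
    return mode == "flow-dynamic" and idle_ms >= inactivity_timer_ms

assert eligible_for_reassignment("flow-dynamic", 150) is True
assert eligible_for_reassignment("flow-dynamic", 20) is False   # still active
assert eligible_for_reassignment("flow-fixed", 150) is False    # pinned
```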

Configuring dynamic load balancing

You can configure options to adjust how the dynamic load balancing algorithm balances traffic, the network prefixes for which traffic is load balanced, and thresholds for monitoring resources used by dynamic load balancing.

Configure system options for dynamic load balancing

The following example configures system-level parameters to adjust the dynamic load balancing algorithm.

--{ +* candidate shared default }--[  ]--
# info with-context system load-balancing dynamic
    system {
        load-balancing {
            dynamic {
                flowset-size 512
                inactivity-timer 100
                link-quality-sampling-interval 7
                weighting-factor {
                    port-utilization 75
                    queue-utilization 10
                    itm-utilization 15
                }
            }
        }
    }

In this example, the number of movable flows per ECMP group is set to 512 flows. A flow must be inactive for a minimum of 100 milliseconds to be moved to a better quality interface. The quality of the link is evaluated at an interval of 7 milliseconds. The dynamic load balancing algorithm is configured so that port-utilization is weighted most heavily.

Configure network prefixes for dynamic load balancing

Dynamic load balancing is enabled on an IP prefix-by-prefix basis within a network-instance. Routes matching the specified prefix have dynamic load balancing enabled on their associated ECMP next-hop group.

The following example enables dynamic load balancing for a prefix within the default network-instance.

--{ +* candidate shared default }--[  ]--
# info with-context network-instance default ip-load-balancing dynamic-load-balancing
    network-instance default {
        ip-load-balancing {
            dynamic-load-balancing {
                prefix 10.101.12.0/24 {
                }
            }
        }
    }

Configure resource monitoring thresholds for dynamic load balancing

The following example configures the system to generate a warning message when the usage level for dynamic load balancing ECMP groups exceeds a threshold, and a notice message when the usage level for dynamic load balancing ECMP groups drops below a threshold.

--{ +* candidate shared default }--[  ]--
# info with-context platform resource-monitoring datapath asic resource dynamic-load-balancing-ecmp-groups
    platform {
        resource-monitoring {
            datapath {
                asic {
                    resource dynamic-load-balancing-ecmp-groups {
                        upper-threshold-set 75
                        upper-threshold-clear 50
                    }
                }
            }
        }
    }

In this example, a warning message is generated and used-upper-threshold-exceeded is set to true whenever utilization of the datapath resource in any line card, forwarding complex, or pipeline reaches 75% in a rising direction. A notice message is generated and used-upper-threshold-exceeded is set to false whenever utilization of the datapath resource in any line card, forwarding complex, or pipeline reaches 50% in a falling direction.
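The set/clear pair implements hysteresis: the state latches on crossing the upper threshold and releases only after falling back to the clear threshold, so utilization hovering between the two values does not generate repeated messages. A sketch (the function name and return values are illustrative):

```python
def update_threshold_state(exceeded: bool, utilization: int,
                           upper_set: int = 75, upper_clear: int = 50):
    """Hysteresis for used-upper-threshold-exceeded: a warning fires
    when utilization rises to upper-threshold-set, and a notice clears
    the state only when it falls back to upper-threshold-clear."""
    if not exceeded and utilization >= upper_set:
        return True, "warning: used-upper-threshold-exceeded true"
    if exceeded and utilization <= upper_clear:
        return False, "notice: used-upper-threshold-exceeded false"
    return exceeded, None  # no crossing, no message

state, msg = update_threshold_state(False, 80)  # rises past 75
assert state is True and msg.startswith("warning")
state, msg = update_threshold_state(True, 60)   # between clear and set
assert state is True and msg is None
state, msg = update_threshold_state(True, 45)   # falls past 50
assert state is False and msg.startswith("notice")
```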