Unequal ECMP for EVPN IP prefix routes

SR Linux supports unequal (weighted) Equal Cost Multi-Path (ECMP) for EVPN IP prefix interface-less (IFL) routes. To do this, SR Linux uses the EVPN link-bandwidth extended community (EC) defined in draft-ietf-bess-evpn-unequal-lb. This extended community indicates the weight for a specific IP prefix; that is, the number of PE-CE multi-paths for an IP prefix that is re-advertised into an EVPN IP prefix route.

The following figure shows an example of weighted ECMP. Assuming each Container Network Function (CNF) advertises the anycast subnet 10.1.1.0/24 from a different next-hop, each TOR ends up with a different number of multi-paths in its PE-CE session. In this example, TOR3 has three multi-paths for the anycast subnet, so the EVPN IP prefix route it advertises includes an EVPN link-bandwidth extended community with a weight of 3. The other TORs send a weight of 1. Note that TOR1 and TOR2 are multi-homed to the same CNF1, yet each sends a weight of 1.

Figure 1. EVPN using weighted ECMP

When this feature is enabled on the border leafs or data center gateways and the EVPN IP prefix route has an Ethernet Segment Identifier (ESI) of 0, the PE sprays the flows to the EVPN IP prefix route based on the received weight. In this example, one-fifth of the flows are sent to TOR4 and three-fifths are sent to TOR3.
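The fractions in this example follow from dividing each advertised weight by the sum of all advertised weights. The following Python sketch is illustrative arithmetic only, not SR Linux code. The function name flow_shares is hypothetical, and the sketch assumes that the multi-homed pair TOR1/TOR2 contributes a single weight of 1 to the total, which gives the total of 5 used in the text.

```python
from fractions import Fraction


def flow_shares(weights):
    """Return the fraction of flows sent to each destination.

    `weights` maps a destination name to its advertised weight.
    """
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}


# Weights from the example figure. Assumption: the aliased pair TOR1/TOR2
# counts once, with a weight of 1.
weights = {"ES1 (TOR1/TOR2)": 1, "TOR3": 3, "TOR4": 1}
shares = flow_shares(weights)

for name, share in shares.items():
    print(f"{name}: {share}")
```

With these inputs, TOR3 receives 3/5 of the flows and TOR4 receives 1/5, matching the example.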

When the EVPN IP prefix route has a non-zero ESI and carries a weight, the following rules apply:
  • The EVPN link-bandwidth extended community received in the EVPN IP prefix route indicates the weight for the EVPN IP prefix route. The EVPN link-bandwidth extended community may also be received in the IP A-D per ES routes for each PE attached to the Ethernet segment (ES), but the system ignores it in this case.

  • The PE sprays the flows to the EVPN IP prefix route based on the received weight, dividing the flows to an ES among the number of PEs attached to the ES.

    In the example above, one-fifth of the flows are sent to the aliased pair TOR1/TOR2 (either one is selected because TOR1 and TOR2 are assumed to send equal weights in the A-D per ES routes).

  • The system rounds up when the advertised weight for the ESI, divided by the number of PEs in the ES, is not an integer.

    For example, if ES1 (TOR1/TOR2) advertises BW=3, and TOR4 advertises BW=1, then 3 (BW) / 2 (PEs in ES1) = 1.5. The system rounds up, and the remote nodes install weight=2 for TOR1, weight=2 for TOR2, and weight=1 for TOR4.

  • If the weight received in a non-zero ESI IP prefix route exceeds 128, the system caps it at 128, then divides the capped weight by the number of PEs in the ES.

  • If two EVPN IP prefix routes are received for the same prefix and the same ESI but with different route distinguishers (RDs), they should carry the same weight. If their weights differ, the system selects the weight from the first EVPN IP prefix route.

  • If the EVPN link-bandwidth extended community is missing from any of the PEs in an ECMP set, or the Value Units field of the extended community is inconsistent, the receiving PE ignores the weight and performs regular ECMP forwarding. The Value Units field can indicate "bandwidth" or a "generalized weight"; SR Linux supports the latter.
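The capping, division, and rounding rules above can be sketched in a few lines of Python. This is an illustrative model of the arithmetic only, not SR Linux code. The names MAX_ES_WEIGHT, per_pe_weight, and installed_weights are hypothetical, and the sketch models regular ECMP as a weight of 1 for every PE. It does not model the Value Units consistency check.

```python
MAX_ES_WEIGHT = 128  # cap applied to a weight received with a non-zero ESI


def per_pe_weight(advertised_weight, pes_in_es):
    """Weight installed for each PE attached to an Ethernet segment.

    The advertised weight is capped first, then divided by the number
    of PEs in the ES, rounding up when the result is not an integer.
    """
    if pes_in_es < 1:
        raise ValueError("an Ethernet segment needs at least one PE")
    capped = min(advertised_weight, MAX_ES_WEIGHT)
    # Integer ceiling division, avoids floating-point rounding.
    return (capped + pes_in_es - 1) // pes_in_es


def installed_weights(weights):
    """Fall back to equal weights if any PE in the ECMP set has no weight.

    `weights` maps a PE name to its weight, or to None when the
    link-bandwidth extended community is missing.
    """
    if any(weight is None for weight in weights.values()):
        return {pe: 1 for pe in weights}  # regular ECMP (assumed model)
    return dict(weights)


# Example from the text: ES1 (TOR1/TOR2) advertises 3, TOR4 advertises 1.
es1 = per_pe_weight(3, 2)
result = installed_weights({"TOR1": es1, "TOR2": es1, "TOR4": 1})
print(result)
```

For the example in the text, this yields a weight of 2 for TOR1, 2 for TOR2, and 1 for TOR4. A received weight of 200 on a two-PE ES would be capped at 128 and result in 64 per PE.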

Configuring weighted ECMP for received PE-CE BGP routes

When the system is enabled to process the link-bandwidth extended community, you can configure a weight that is internally added to the received PE-CE BGP routes for the purpose of EVPN unequal ECMP. The following example sets add-next-hop-count-to-received-bgp-routes to 100 for BGP group g1:

--{ * candidate shared default }--[  ]--
# info network-instance VRF1 protocols bgp group g1 link-bandwidth
    network-instance VRF1 {
        protocols {
            bgp {
                group g1 {
                    link-bandwidth {
                        add-next-hop-count-to-received-bgp-routes 100
                    }
                }
            }
        }
    }

Displaying normalized ECMP weights

The system normalizes the weights used in weighted ECMP using the max-ecmp-hash-buckets-per-next-hop-group setting and the advertised weights, according to the algorithm described in "Normalizing datapath weights" in the SR Linux Routing Protocols Guide.

You can display the normalized weights for a next-hop-group.

In the following example, the same prefix 19.1.1.1/32 is received from two PEs with weights 40 and 60, respectively. The maximum number of supported ECMP paths is 8, and the max-ecmp-hash-buckets-per-next-hop-group setting is 4. The system programs the next-hops with normalized weights.

To display the next-hop-group ID for prefix 19.1.1.1/32:

--{ running }--[  ]--
# info from state network-instance VRF1 route-table ipv4-unicast route 19.1.1.1/32 id 0 route-type bgp-evpn route-owner bgp_evpn_mgr origin-network-instance VRF1
    network-instance VRF1 {
        route-table {
            ipv4-unicast {
                route 19.1.1.1/32 id 0 route-type bgp-evpn route-owner bgp_evpn_mgr origin-network-instance VRF1 {
                    leakable false
                    metric 0
                    preference 170
                    active true
                    last-app-update 2023-05-31T18:29:25.207Z
                    next-hop-group 94413408183
                    next-hop-group-network-instance VRF1
                    resilient-hash false
                    fib-programming {
                        suppressed false
                        last-successful-operation-type modify
                        last-successful-operation-timestamp 2023-05-31T18:29:25.207Z
                        pending-operation-type none
                        last-failed-operation-type none
                    }
                }
            }
        }
    }

To display the received weights for the individual next hops in the next-hop group:

--{ running }--[  ]--
# info from state network-instance VRF1 route-table next-hop-group 94413408183
    network-instance VRF1 {
        route-table {
            next-hop-group 94413408183 {
                backup-next-hop-group 0
                fib-programming {
                    last-successful-operation-type add
                    last-successful-operation-timestamp 2023-05-31T18:29:25.207Z
                    pending-operation-type none
                    last-failed-operation-type none
                }
                next-hop 0 {
                    next-hop 94413408157
                    weight 40
                    resolved true
                }
                next-hop 1 {
                    next-hop 94413408152
                    weight 60
                    resolved true
                }
            }
        }
    }

The system normalizes these received weights, taking into account the maximum number of supported ECMP paths and the max-ecmp-hash-buckets-per-next-hop-group setting.

You can display the normalized weights with the following command:

--{ running }--[  ]--
# info from state platform linecard 1 forwarding-complex 0 fib-table next-hop-group 94413408183
    platform {
        linecard 1 {
            forwarding-complex 0 {
                fib-table {
                    next-hop-group 94413408183 {
                        oper-state down
                        backup-next-hop-group 0
                        backup-active false
                        next-hop 0 {
                            next-hop 94413408157
                            oper-state up
                            normalized-weight 1
                        }
                        next-hop 1 {
                            next-hop 94413408152
                            oper-state up
                            normalized-weight 2
                        }
                    }
                }
            }
        }
    }