Unequal ECMP for EVPN IP prefix routes
SR Linux supports unequal Equal Cost Multi Path (ECMP) for EVPN IP prefix IFL (interface-less) routes. To do this, SR Linux makes use of the EVPN link-bandwidth extended community (EC) defined in draft-ietf-bess-evpn-unequal-lb. This extended community indicates the weight for a specific IP prefix; that is, the number of PE-CE multi-paths for an IP prefix that is re-advertised into an EVPN IP prefix route.
The following figure shows an example of weighted ECMP. Assuming each Container Network Function (CNF) advertises the anycast subnet 10.1.1.0/24 from a different next-hop, each TOR ends up with a different number of multi-paths in its PE-CE session. In the example below, TOR3 has three multi-paths for the anycast subnet, and the advertised EVPN IP prefix route includes an EVPN link-bandwidth extended community with a weight of 3. The rest of the TORs send a weight of 1. Note that TOR1 and TOR2 are multi-homed to the same CNF1, yet they send a weight of 1.
On the border leafs / data center gateways, when this feature is enabled, the following applies:
- If the EVPN IP prefix route has an Ethernet Segment Identifier (ESI) of 0, the PE sprays the flows to the EVPN IP prefix route based on the received weight; in this example, one-fifth of the flows are sent to TOR4, and three-fifths are sent to TOR3.
- If the EVPN IP prefix route has a non-zero ESI:
  - The EVPN link-bandwidth extended community received in the EVPN IP prefix route indicates the weight for the EVPN IP prefix route. The EVPN link-bandwidth extended community may also be received in the IP A-D per ES routes for each PE attached to the Ethernet segment (ES), but the system ignores it in this case.
  - The PE sprays the flows to the EVPN IP prefix route based on the received weight, dividing the flows to an ES among the PEs attached to the ES. In the example above, one-fifth of the flows are sent to the aliased pair TOR1/TOR2 (either one is selected because TOR1 and TOR2 are assumed to send equal weights in the IP A-D per ES routes).
  - The system rounds up when the advertised weight for the ESI, divided by the number of PEs in the ES, is not an integer. For example, if ES1 (TOR1/TOR2) advertises BW=3, and TOR4 advertises BW=1, then 3 (BW) / 2 (PEs in ES1) = 1.5. The system rounds up, and the remote nodes install weight=2 for TOR1, weight=2 for TOR2, and weight=1 for TOR4.
  - If the weight received in a non-zero ESI IP prefix route exceeds 128, the system caps it at 128, then divides the weight by the number of PEs in the ES.
  - If two EVPN IP prefix routes are received for the same prefix and the same ESI but with different route distinguishers (RDs), they should have the same weight. If they have different weights, the system selects the weight from the first EVPN IP prefix route.
- If the EVPN link-bandwidth extended community is missing from any of the PEs in an ECMP set, or the Value Units field of the extended community is inconsistent, the receiving PE ignores the weight and performs regular ECMP forwarding. The Value Units field can indicate "bandwidth" or a "generalized weight"; SR Linux supports the latter.
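The division, rounding, and capping rules above can be sketched in a few lines of Python. This is a minimal illustration and not SR Linux code: the function name `installed_weights`, the shape of the `routes` input, and the use of `None` to signal a fallback to regular ECMP are all assumptions made for this example. The sketch does not model the Value Units check or the handling of duplicate routes with different RDs.

```python
MAX_WEIGHT = 128  # cap applied to a weight received with a non-zero ESI


def installed_weights(routes):
    """Return {pe_name: weight} for one ECMP set, or None for regular ECMP.

    `routes` is a hypothetical list of dicts, one per received EVPN IP
    prefix route:
        {"weight": int or None, "pes": [PE names], "nonzero_esi": bool}
    For a non-zero ESI, "pes" lists every PE attached to the ES.
    """
    # If any route lacks the link-bandwidth extended community,
    # the weights are ignored and regular ECMP is used.
    if any(r["weight"] is None for r in routes):
        return None

    result = {}
    for r in routes:
        weight, pes = r["weight"], r["pes"]
        if r["nonzero_esi"]:
            weight = min(weight, MAX_WEIGHT)   # cap first
            weight = -(-weight // len(pes))    # then divide, rounding up
        for pe in pes:
            result[pe] = weight
    return result


# Example from the text: ES1 (TOR1/TOR2) advertises 3, TOR4 advertises 1.
example = [
    {"weight": 3, "pes": ["TOR1", "TOR2"], "nonzero_esi": True},
    {"weight": 1, "pes": ["TOR4"], "nonzero_esi": False},
]
print(installed_weights(example))  # {'TOR1': 2, 'TOR2': 2, 'TOR4': 1}
```

The integer expression `-(-weight // len(pes))` is a ceiling division, which avoids floating-point rounding for the "round up" step.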
Advertising the EVPN link-bandwidth extended community
To configure weighted ECMP, you enable advertisement of the EVPN link-bandwidth extended community and specify the weight to be advertised in the extended community.
You can configure the following parameters for the advertised weight in the extended community:
- weight: specifies the weight to be advertised in the EVPN link-bandwidth extended community for the advertised EVPN IP prefix routes for the service. If set to dynamic (the default value), the weight is dynamically set based on the number of BGP PE-CE paths for the IP prefix that is advertised in an EVPN IP prefix route. The dynamic weight only considers BGP PE-CE paths. Alternatively, the weight can be set to a fixed integer value in the range 1 to 128.
- maximum-dynamic-weight: specifies the maximum weight to be advertised in the EVPN link-bandwidth extended community for the advertised EVPN IP prefix routes for the service. If weight dynamic is configured, the actual advertised weight is the minimum of the number of BGP PE-CE paths for the prefix and the configured maximum-dynamic-weight.
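The way the two parameters combine can be written as a small Python function. This is only a sketch of the rule described above: the function name `advertised_weight` and its arguments are hypothetical and do not correspond to any SR Linux API.

```python
def advertised_weight(weight, pe_ce_paths, maximum_dynamic_weight):
    """Return the weight placed in the EVPN link-bandwidth extended community.

    weight                 -- the string "dynamic" or a fixed integer (1 to 128)
    pe_ce_paths            -- number of BGP PE-CE paths for the IP prefix
    maximum_dynamic_weight -- configured ceiling for the dynamic weight
    """
    if weight == "dynamic":
        # Dynamic weight: number of BGP PE-CE paths, bounded by the maximum.
        return min(pe_ce_paths, maximum_dynamic_weight)
    if not 1 <= weight <= 128:
        raise ValueError("fixed weight must be in the range 1 to 128")
    # Fixed weight: advertised as configured, regardless of the path count.
    return weight


print(advertised_weight("dynamic", 3, 128))  # 3
print(advertised_weight("dynamic", 5, 2))    # 2
print(advertised_weight(60, 3, 128))         # 60
```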
The following example enables advertisement of the EVPN link bandwidth extended community and specifies the weight to be included in the extended community for the advertised EVPN IP prefix routes.
--{ * candidate shared default }--[ ]--
# info network-instance VRF1 protocols bgp-evpn bgp-instance 1
network-instance VRF1 {
    protocols {
        bgp-evpn {
            bgp-instance 1 {
                routes {
                    route-table {
                        ip-prefix {
                            evpn-link-bandwidth {
                                advertise {
                                    weight 60
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
Enabling weighted ECMP
When weighted ECMP is enabled, the system takes into account the EVPN link-bandwidth extended community when installing an ECMP set for an EVPN IP prefix route in the IP-VRF route table.
Flows to an IP prefix received with a weight and a zero ESI are sprayed according to the weight. If the EVPN IP prefix route received with a weight has a non-zero ESI, the weight is divided by the number of PEs attached to the ES, and rounded up if the result is not an integer.
The weighted-ecmp command also enables weighted ECMP for BGP CEs that are configured with a weight specified with the link-bandwidth.add-next-hop-count-to-received-bgp-routes setting.
Configuring max-ecmp-hash-buckets-per-next-hop-group preserves the datapath resources used for the weighted next-hops. The normalization algorithm also refers to this number of hash buckets. See Displaying normalized ECMP weights for an example of how the weights are normalized using the max-ecmp-hash-buckets-per-next-hop-group setting.
The following example enables weighted ECMP and specifies the maximum number of ECMP hash buckets per next-hop-group. The weights for weighted ECMP are normalized based on this number of hash buckets.
--{ * candidate shared default }--[ ]--
# info network-instance VRF1 protocols bgp-evpn bgp-instance 1
network-instance VRF1 {
    protocols {
        bgp-evpn {
            bgp-instance 1 {
                routes {
                    route-table {
                        ip-prefix {
                            evpn-link-bandwidth {
                                weighted-ecmp {
                                    admin-state enable
                                    max-ecmp-hash-buckets-per-next-hop-group 4
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
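To see why the number of hash buckets matters, it can help to picture a generic weighted ECMP scheme in which each next hop occupies a number of hash buckets equal to its weight, and each flow is hashed to one bucket. The Python sketch below illustrates that general idea only; it is not a description of the SR Linux datapath. The helper names `build_buckets` and `pick_next_hop`, and the use of CRC32 as the flow hash, are assumptions chosen for the example.

```python
import zlib


def build_buckets(weights):
    """Expand {next_hop: weight} into a flat list of hash buckets.

    A next hop with weight N occupies N buckets, so the size of the
    list is the sum of the weights.
    """
    buckets = []
    for next_hop, weight in weights.items():
        buckets.extend([next_hop] * weight)
    return buckets


def pick_next_hop(buckets, flow_key):
    """Map a flow (identified by bytes) to a next hop.

    The same flow key always lands in the same bucket, so packets of
    one flow follow one path.
    """
    return buckets[zlib.crc32(flow_key) % len(buckets)]


buckets = build_buckets({"TOR3": 3, "TOR4": 1})
print(buckets)  # ['TOR3', 'TOR3', 'TOR3', 'TOR4']
print(pick_next_hop(buckets, b"10.0.0.1:1234->10.1.1.5:80"))
```

In a scheme like this, large weights translate directly into large bucket tables, which is one reason to normalize the weights against a configured bucket limit before programming them.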
Configuring weighted ECMP for received PE-CE BGP routes
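When the system is enabled to process the link-bandwidth extended community, you can configure a weight to be internally added to the received PE-CE BGP routes for the purpose of EVPN unequal ECMP.
The following example configures a weight of 100 to be added to the BGP routes received from the peers in group g1.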
--{ * candidate shared default }--[ ]--
# info network-instance VRF1 protocols bgp group g1 link-bandwidth
network-instance VRF1 {
    protocols {
        bgp {
            group g1 {
                link-bandwidth {
                    add-next-hop-count-to-received-bgp-routes 100
                }
            }
        }
    }
}
Displaying normalized ECMP weights
The system normalizes the weights used in weighted ECMP using the max-ecmp-hash-buckets-per-next-hop-group setting and the advertised weights, according to the algorithm described in "Normalizing datapath weights" in the SR Linux Routing Protocols Guide.
You can display the normalized weights for a next-hop-group.
In the following example, the same prefix 19.1.1.1/32 is received from two PEs with weights 40 and 60, respectively. The maximum number of supported ECMP paths is 8. The max-ecmp-hash-buckets-per-next-hop-group setting is 4. The system programs the next-hops with normalized weights.
To display the number of the next-hop-group for prefix 19.1.1.1/32:
--{ running }--[ ]--
# info from state network-instance VRF1 route-table ipv4-unicast route 19.1.1.1/32 id 0 route-type bgp-evpn route-owner bgp_evpn_mgr origin-network-instance VRF1
network-instance VRF1 {
    route-table {
        ipv4-unicast {
            route 19.1.1.1/32 id 0 route-type bgp-evpn route-owner bgp_evpn_mgr origin-network-instance VRF1 {
                leakable false
                metric 0
                preference 170
                active true
                last-app-update 2023-05-31T18:29:25.207Z
                next-hop-group 94413408183
                next-hop-group-network-instance VRF1
                resilient-hash false
                fib-programming {
                    suppressed false
                    last-successful-operation-type modify
                    last-successful-operation-timestamp 2023-05-31T18:29:25.207Z
                    pending-operation-type none
                    last-failed-operation-type none
                }
            }
        }
    }
}
To display the received weights for the individual next hops in the next-hop group:
--{ running }--[ ]--
# info from state network-instance VRF1 route-table next-hop-group 94413408183
network-instance VRF1 {
    route-table {
        next-hop-group 94413408183 {
            backup-next-hop-group 0
            fib-programming {
                last-successful-operation-type add
                last-successful-operation-timestamp 2023-05-31T18:29:25.207Z
                pending-operation-type none
                last-failed-operation-type none
            }
            next-hop 0 {
                next-hop 94413408157
                weight 40
                resolved true
            }
            next-hop 1 {
                next-hop 94413408152
                weight 60
                resolved true
            }
        }
    }
}
The system normalizes the weights using the algorithm described in "Normalizing datapath weights" in the SR Linux Routing Protocols Guide, taking into account the received weights, the maximum supported ECMP paths, and the max-ecmp-hash-buckets-per-next-hop-group setting.
You can display the normalized weights with the following command:
--{ running }--[ ]--
# info from state platform linecard 1 forwarding-complex 0 fib-table next-hop-group 94413408183
platform {
    linecard 1 {
        forwarding-complex 0 {
            fib-table {
                next-hop-group 94413408183 {
                    oper-state down
                    backup-next-hop-group 0
                    backup-active false
                    next-hop 0 {
                        next-hop 94413408157
                        oper-state up
                        normalized-weight 1
                    }
                    next-hop 1 {
                        next-hop 94413408152
                        oper-state up
                        normalized-weight 2
                    }
                }
            }
        }
    }
}
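For intuition about how received weights of 40 and 60 can become normalized weights of 1 and 2 with four hash buckets, the Python sketch below scales each weight to the bucket budget and rounds down. This is an assumption made for illustration: it is one simple proportional scheme that happens to be consistent with the values in this example, and it is not the algorithm from the SR Linux Routing Protocols Guide, which also takes the maximum supported ECMP paths into account. The function name `normalize` is hypothetical.

```python
def normalize(weights, max_buckets):
    """Scale weights so they fit in roughly `max_buckets` hash buckets.

    Illustrative only: each weight is scaled by max_buckets / sum(weights)
    and rounded down, with a floor of 1 so that no next hop is dropped.
    Because of that floor, the result can add up to more than
    `max_buckets` when many small weights are present; a real
    implementation has to handle that case.
    """
    total = sum(weights)
    return [max(1, w * max_buckets // total) for w in weights]


print(normalize([40, 60], 4))  # [1, 2]
```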