BGP Weighted ECMP
This chapter provides information about BGP weighted ECMP.
Applicability
The information and configuration in this chapter were originally based on SR OS Release 15.0.R4. The CLI in the current edition is based on SR OS Release 23.3.R2.
Overview
Equal-cost multipath (ECMP) is a routing strategy that allows the installation of multiple next hops for an IP destination in the routing table. When used in conjunction with BGP multipath, the ingress router can forward traffic to an IP prefix destination in a load-balanced fashion across the available ECMP next hops. For more information about the implementation, see the BGP Multipath chapter.
In the standard implementation, ECMP distributes traffic as evenly as possible across all the ECMP next hops. Standard ECMP - Equal Bandwidth Links shows an example scenario where CE-4 is dual-homed to two PE routers and advertises the prefix 10.0.0.0/8. This prefix is then advertised within AS 64496 and received by PE-3, which in turn advertises it to CE-6 in AS 64501. PE-3 has BGP multipath and ECMP enabled, so the traffic toward destinations in 10.0.0.0/8 sent by CE-6 is load-balanced toward PE-1 and PE-2 as evenly as possible.
The behavior of equally distributing across the ECMP next hops may not be suitable under specific circumstances. Consider the same topology with the connection between CE-4 and PE-1 replaced with a 10GE link, while the CE-4 to PE-2 connection still is a 1GE link, as shown in Standard ECMP - Unequal Bandwidth Links. In standard ECMP operation, when PE-3 sends 50% of traffic to PE-1 and 50% to PE-2, this may result in an under-utilization of the link between CE-4 and PE-1 or an over-utilization of the link between CE-4 and PE-2.
BGP Weighted ECMP, also known as Unequal-Cost Multipath (UCMP), allows for the distribution of traffic in proportion to the relative bandwidth of each equal-cost path. This feature uses a BGP community called the Link Bandwidth Extended Community. Link Bandwidth Extended Community Advertisement shows that PE-1 and PE-2, with this functionality, can add a Link Bandwidth Extended Community to the BGP routes advertised toward other routers within AS 64496 that indicates the bandwidth of their PE-CE link.
PE-3 can use the information in the Link Bandwidth Extended Community to distribute the traffic according to the relative bandwidth, or the "weight" of each path. Weighted ECMP - Unequal Bandwidth Links shows that 91% of traffic is sent toward PE-1 with the 10GE link and 9% is sent toward PE-2 with the 1GE link.
Weighted ECMP - Link Aggregation Group shows another example where the CE-4-to-PE-1 link is composed of four 1GE links that are part of a Link Aggregation Group (LAG) and the CE-4-to-PE-2 link is 1GE. Weighted ECMP can be used here to achieve an 80% to 20% distribution of traffic sent from PE-3 to PE-1 and PE-2, respectively.
Standard ECMP - Unequal Bandwidth Links with eBGP shows an example where PE-1 is connected to two eBGP routers in neighbor AS 64500. Using the weighted ECMP functionality, 91% of traffic is sent to CE-4 and 9% to CE-5, according to the relative bandwidth values.
Weighted ECMP - Unequal Bandwidth Links with VPRN shows an example with a Layer 3 VPRN service. PE-1 receives prefix 10.0.0.0/8 from CE-4 via eBGP, and also from PE-2 via iBGP. PE-1 sets the Link Bandwidth Extended Community indicating 3GE on the route received from CE-4. PE-2 sets the community value indicating 1GE on the route it advertises to PE-1. With Exterior Interior Border Gateway Protocol (EIBGP) multipath (described in the BGP Multipath chapter) and ECMP within the VPRN, PE-1 can send 75% of traffic on the direct LAG link to CE-4 and 25% to PE-2, which then forwards that traffic to CE-4.
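The percentages quoted in the preceding scenarios follow from dividing each path's bandwidth by the total bandwidth of all eligible paths. The following Python sketch reproduces that arithmetic; the function name traffic_shares is illustrative only and is not part of SR OS.

```python
def traffic_shares(bandwidths):
    """Return each path's traffic share in whole percent, proportional to bandwidth."""
    total = sum(bandwidths.values())
    if total <= 0:
        raise ValueError("at least one path must have a positive bandwidth")
    return {path: round(bw / total * 100) for path, bw in bandwidths.items()}


# Bandwidths in Gb/s, taken from the scenarios described above.
ten_ge_vs_one_ge = traffic_shares({"PE-1": 10, "PE-2": 1})
four_member_lag = traffic_shares({"PE-1": 4, "PE-2": 1})
vprn_eibgp = traffic_shares({"CE-4": 3, "PE-2": 1})

print(ten_ge_vs_one_ge)
print(four_member_lag)
print(vprn_eibgp)
```

The three dictionaries correspond to the 10GE/1GE, four-member LAG, and VPRN examples, respectively.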
Link Bandwidth Extended Community is defined in draft-ietf-idr-link-bandwidth-06 and has the following characteristics:
Signals the link bandwidth of a BGP path
Has the following format: bandwidth:<as-number>:<value>
bandwidth is the community type
<as-number> is the local AS number
<value> is a fixed/static bandwidth in Mb/s (converted to IEEE floating point format in a BGP Update message)
Optional and non-transitive attribute (not sent to other eBGP peers upon receipt)
If a router changes the route next hop, it does not propagate the Link Bandwidth Extended Community
A route can only have a single Link Bandwidth Extended Community
SR OS routers automatically perform weighted load balancing if all the BGP updates received for a destination contain the Link Bandwidth Extended Community
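As an illustration of the format described above, the following Python sketch packs and unpacks the 8-octet community. It assumes the layout given in draft-ietf-idr-link-bandwidth: type 0x40 (non-transitive, two-octet AS specific), sub-type 0x04, a 2-octet AS number, and a 4-octet IEEE single-precision value expressed in bytes per second. The conversion from Mb/s and the function names are assumptions made for this sketch; the exact on-the-wire behavior of SR OS is not verified here.

```python
import struct

TYPE_NON_TRANSITIVE_TWO_OCTET_AS = 0x40
SUBTYPE_LINK_BANDWIDTH = 0x04
WIRE_FORMAT = "!BBHf"  # network byte order: type, sub-type, AS, float32


def encode_link_bandwidth(as_number, mbps):
    """Pack a Link Bandwidth Extended Community into its 8-octet form."""
    if not 0 <= as_number <= 0xFFFF:
        raise ValueError("the AS field holds a 2-octet AS number")
    bytes_per_second = mbps * 1_000_000 / 8  # assumed unit conversion
    return struct.pack(
        WIRE_FORMAT,
        TYPE_NON_TRANSITIVE_TWO_OCTET_AS,
        SUBTYPE_LINK_BANDWIDTH,
        as_number,
        bytes_per_second,
    )


def decode_link_bandwidth(raw):
    """Return (as_number, mbps) from an 8-octet Link Bandwidth Extended Community."""
    type_high, sub_type, as_number, bytes_per_second = struct.unpack(WIRE_FORMAT, raw)
    if (type_high, sub_type) != (
        TYPE_NON_TRANSITIVE_TWO_OCTET_AS,
        SUBTYPE_LINK_BANDWIDTH,
    ):
        raise ValueError("not a Link Bandwidth Extended Community")
    return as_number, bytes_per_second * 8 / 1_000_000


raw = encode_link_bandwidth(64496, 1000)
print(raw.hex(), decode_link_bandwidth(raw))
```

Because the bandwidth is carried as a single-precision float, very large or odd values may be rounded slightly when they are decoded.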
Link Bandwidth Extended Community can be added to a BGP route with the following methods:
link-bandwidth command
BGP import policy action
VRF import policy action
BGP export policy action
The link-bandwidth command has the following characteristics:
Configurable per BGP group or neighbor in base router or VPRN
Adds a Link Bandwidth Extended Community to all (IPv4, IPv6, VPN-IPv4, VPN-IPv6, label-IPv4, label-IPv6) routes received from directly connected eBGP peers
Bandwidth value is based on the speed of port or active LAG members
Bandwidth is automatically adjusted for LAG interfaces based on the number of active LAG member ports
SR OS uses the following rules when BGP paths are received with Link Bandwidth Extended Communities:
If BGP multipath and ECMP are configured and all the eligible multipaths have a Link Bandwidth Extended Community, then weighted ECMP is performed on the relative bandwidth of each path.
If EIBGP multipath and ECMP are enabled in a VPRN and all the eligible next hops have a Link Bandwidth Extended Community, then weighted ECMP is performed based on the relative bandwidth of each path.
The Link Bandwidth Extended Community is not used as a criterion for two or more paths to be considered equal for BGP/EIBGP multipath purposes.
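The rules above can be summarized as an all-or-nothing check across the multipaths that BGP has already selected. The following Python sketch is a simplified model, not SR OS code: the Multipath type and the function name are hypothetical, and the fallback to equal distribution when a community is missing is an assumption inferred from the word "all" in the rules.

```python
from typing import NamedTuple, Optional


class Multipath(NamedTuple):
    next_hop: str
    link_bandwidth_mbps: Optional[int]  # None when the community is absent


def load_balancing_mode(multipaths):
    """Return 'weighted' only if every eligible multipath carries a bandwidth."""
    if not multipaths:
        raise ValueError("no eligible multipaths")
    # Multipath eligibility is decided beforehand; the community plays no part in it.
    if all(path.link_bandwidth_mbps is not None for path in multipaths):
        return "weighted"
    return "equal"  # assumed fallback: standard ECMP


both_tagged = [Multipath("192.0.2.1", 200000), Multipath("192.0.2.2", 100000)]
one_untagged = [Multipath("192.0.2.1", 200000), Multipath("192.0.2.2", None)]
print(load_balancing_mode(both_tagged), load_balancing_mode(one_untagged))
```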
Configuration
This chapter covers two configuration examples for BGP weighted ECMP for the IPv4 family: one using the link-bandwidth command and one using a BGP import policy.
Example Topology - BGP Weighted ECMP for IPv4 Family shows the example topology for BGP Weighted ECMP for IPv4 family with the following characteristics:
CE-4 in AS 64500 advertises both prefixes 10.1.2.3/32 and 10.2.4.6/32 to its eBGP peers PE-1 and PE-2 in AS 64496.
RR-5 is route reflector for all PEs in AS 64496.
add-paths is configured on all PE routers and RR-5 with a send limit of 2.
CE-6 in AS 64501 advertises both prefixes 10.3.4.5/32 and 10.4.6.8/32 to its eBGP peer PE-3 in AS 64496.
Initial Configuration
The initial configuration on all nodes includes:
Cards, MDAs, ports
LAG configured for the link between CE-4 and PE-1 with two member links
Router interfaces
IS-IS as IGP on all interfaces within AS 64496 (alternatively, OSPF can be used)
BGP is configured on all the nodes. CE-4 peers with PE-1 and PE-2 and exports the 10.1.2.3/32 and 10.2.4.6/32 loopback prefixes to both eBGP peers, as follows:
# on CE-4:
configure
router Base
interface "int-loopback-1"
address 10.1.2.3/32
loopback
no shutdown
exit
interface "int-loopback-2"
address 10.2.4.6/32
loopback
no shutdown
exit
autonomous-system 64500
policy-options
begin
prefix-list "10.0.0.0/8"
prefix 10.0.0.0/8 longer
exit
policy-statement "policy-export-bgp"
entry 10
from
prefix-list "10.0.0.0/8"
exit
action accept
exit
exit
exit
commit
exit
bgp
rapid-withdrawal
split-horizon
group "eBGP"
export "policy-export-bgp"
peer-as 64496
neighbor 172.16.14.1
exit
neighbor 172.16.24.1
exit
exit
no shutdown
exit
exit all
The BGP configuration on CE-6 is identical, except for the loopback interface addresses.
PE-1 peers with CE-4 in AS 64500 and RR-5 in AS 64496. add-paths is enabled on the iBGP group to advertise redundant BGP paths to the route reflector. The BGP configuration on PE-1 is as follows:
# on PE-1:
configure
router Base
autonomous-system 64496
bgp
rapid-withdrawal
split-horizon
group "eBGP"
peer-as 64500
neighbor 172.16.14.2
exit
exit
group "iBGP"
family ipv4
next-hop-self
peer-as 64496
add-paths
ipv4 send 2 receive
exit
neighbor 192.0.2.5
exit
exit
no shutdown
exit
exit all
The BGP configuration on PE-2 and PE-3 is similar to that on PE-1.
RR-5 acts as a route reflector to all the PEs in AS 64496 with a cluster ID of 5.5.5.5. add-paths is enabled to advertise redundant BGP paths to the PEs. The configuration on RR-5 is as follows:
# on RR-5:
configure
router Base
autonomous-system 64496
bgp
rapid-withdrawal
split-horizon
group "iBGP"
family ipv4
cluster 5.5.5.5
peer-as 64496
add-paths
ipv4 send 2 receive
exit
neighbor 192.0.2.1
exit
neighbor 192.0.2.2
exit
neighbor 192.0.2.3
exit
exit
no shutdown
exit
exit all
BGP Weighted ECMP for IPv4 Family using the link-bandwidth command
PE-3 receives the prefixes 10.1.2.3/32 and 10.2.4.6/32 from PE-1 and PE-2 via the route reflector and indicates the ones received from PE-1 as the "used" or active routes, as follows:
*A:PE-3# show router bgp routes
===============================================================================
BGP Router ID:192.0.2.3 AS:64496 Local AS:64496
===============================================================================
Legend -
Status codes : u - used, s - suppressed, h - history, d - decayed, * - valid
l - leaked, x - stale, > - best, b - backup, p - purge
Origin codes : i - IGP, e - EGP, ? - incomplete
===============================================================================
BGP IPv4 Routes
===============================================================================
Flag Network LocalPref MED
Nexthop (Router) Path-Id IGP Cost
As-Path Label
-------------------------------------------------------------------------------
u*>i 10.1.2.3/32 100 None
192.0.2.1 1 10
64500 -
*i 10.1.2.3/32 100 None
192.0.2.2 11 10
64500 -
u*>i 10.2.4.6/32 100 None
192.0.2.1 2 10
64500 -
*i 10.2.4.6/32 100 None
192.0.2.2 12 10
64500 -
u*>i 10.3.4.5/32 None None
172.16.36.2 None 0
64501 -
u*>i 10.4.6.8/32 None None
172.16.36.2 None 0
64501 -
-------------------------------------------------------------------------------
Routes : 6
===============================================================================
ECMP and BGP multipath are enabled on PE-3 with the following commands:
# on PE-3:
configure router ecmp 2
configure router bgp multi-path maximum-paths 2
As a result, PE-3 installs the routes from PE-2 as active, in addition to those from PE-1:
*A:PE-3# show router bgp routes
===============================================================================
BGP Router ID:192.0.2.3 AS:64496 Local AS:64496
===============================================================================
Legend -
Status codes : u - used, s - suppressed, h - history, d - decayed, * - valid
l - leaked, x - stale, > - best, b - backup, p - purge
Origin codes : i - IGP, e - EGP, ? - incomplete
===============================================================================
BGP IPv4 Routes
===============================================================================
Flag Network LocalPref MED
Nexthop (Router) Path-Id IGP Cost
As-Path Label
-------------------------------------------------------------------------------
u*>i 10.1.2.3/32 100 None
192.0.2.1 1 10
64500 -
u*>i 10.1.2.3/32 100 None
192.0.2.2 11 10
64500 -
u*>i 10.2.4.6/32 100 None
192.0.2.1 2 10
64500 -
u*>i 10.2.4.6/32 100 None
192.0.2.2 12 10
64500 -
u*>i 10.3.4.5/32 None None
172.16.36.2 None 0
64501 -
u*>i 10.4.6.8/32 None None
172.16.36.2 None 0
64501 -
-------------------------------------------------------------------------------
Routes : 6
===============================================================================
The multiple next hops are also visible in the route table of PE-3:
*A:PE-3# show router route-table protocol bgp
===============================================================================
Route Table (Router: Base)
===============================================================================
Dest Prefix[Flags] Type Proto Age Pref
Next Hop[Interface Name] Metric
-------------------------------------------------------------------------------
10.1.2.3/32 Remote BGP 00h00m40s 170
192.168.13.1 10
10.1.2.3/32 Remote BGP 00h00m40s 170
192.168.23.1 10
10.2.4.6/32 Remote BGP 00h00m40s 170
192.168.13.1 10
10.2.4.6/32 Remote BGP 00h00m40s 170
192.168.23.1 10
10.3.4.5/32 Remote BGP 00h05m12s 170
172.16.36.2 0
10.4.6.8/32 Remote BGP 00h05m12s 170
172.16.36.2 0
-------------------------------------------------------------------------------
No. of Routes: 6
Flags: n = Number of times nexthop is repeated
B = BGP backup route available
L = LFA nexthop available
S = Sticky ECMP requested
===============================================================================
The following command shows that the routes received on PE-3 have no community added. The keyword "expression" must follow the match pattern:
*A:PE-3# show router bgp routes 10.1.2.3/32 hunt brief | match "^Nexthop |Community" expression
Nexthop : 192.0.2.1
Community : No Community Members
Nexthop : 192.0.2.2
Community : No Community Members
The following command output shows the ECMP-Weight values assigned to next hops 192.0.2.1 and 192.0.2.2. Both have a value of 1:
*A:PE-3# show router fib 1 10.1.2.3/32 extensive
===============================================================================
FIB Display (Router: Base)
===============================================================================
Dest Prefix : 10.1.2.3/32
Protocol : BGP
Installed : Y
Indirect Next-Hop : 192.0.2.1
QoS : Priority=n/c, FC=n/c
Source-Class : 0
Dest-Class : 0
ECMP-Weight : 1
Resolving Next-Hop : 192.168.13.1
Interface : int-PE-3-PE-1
ECMP-Weight : 1
Indirect Next-Hop : 192.0.2.2
QoS : Priority=n/c, FC=n/c
Source-Class : 0
Dest-Class : 0
ECMP-Weight : 1
Resolving Next-Hop : 192.168.23.1
Interface : int-PE-3-PE-2
ECMP-Weight : 1
===============================================================================
Total Entries : 1
===============================================================================
The following command is executed on both PE-1 and PE-2 to automatically add a Link Bandwidth Extended Community on routes received from their eBGP neighbor CE-4:
# on PE-1 and on PE-2:
configure
router Base
bgp
group "eBGP"
link-bandwidth
add-to-received-ebgp ipv4
exit all
PE-3 now receives the routes from PE-1 and PE-2 with Link Bandwidth Extended Communities corresponding to the interface bandwidth for each CE-PE link:
*A:PE-3# show router bgp routes 10.1.2.3/32 hunt brief | match "^Nexthop |Community" expression
Nexthop : 192.0.2.1
Community : bandwidth:64496:200000
Nexthop : 192.0.2.2
Community : bandwidth:64496:100000
The following command output now shows that the ECMP-Weight value assigned to next hop 192.0.2.1 is 2, relative to its two member interfaces in the LAG, whereas the ECMP-Weight value of 192.0.2.2 is still 1, because it has a single interface to CE-4:
*A:PE-3# show router fib 1 10.1.2.3/32 extensive
===============================================================================
FIB Display (Router: Base)
===============================================================================
Dest Prefix : 10.1.2.3/32
Protocol : BGP
Installed : Y
Indirect Next-Hop : 192.0.2.1
QoS : Priority=n/c, FC=n/c
Source-Class : 0
Dest-Class : 0
ECMP-Weight : 2
Resolving Next-Hop : 192.168.13.1
Interface : int-PE-3-PE-1
ECMP-Weight : 1
Indirect Next-Hop : 192.0.2.2
QoS : Priority=n/c, FC=n/c
Source-Class : 0
Dest-Class : 0
ECMP-Weight : 1
Resolving Next-Hop : 192.168.23.1
Interface : int-PE-3-PE-2
ECMP-Weight : 1
===============================================================================
Total Entries : 1
===============================================================================
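The ECMP-Weight values in the preceding output are consistent with reducing the advertised bandwidths to their smallest integer ratio. The following Python sketch shows one way to obtain such a ratio by dividing by the greatest common divisor. This is an assumption made for illustration; the normalization that SR OS actually applies is not described in this chapter.

```python
from functools import reduce
from math import gcd


def ecmp_weights(bandwidths_mbps):
    """Reduce per-next-hop bandwidths to the smallest equivalent integer weights."""
    values = list(bandwidths_mbps.values())
    if not values or any(bw <= 0 for bw in values):
        raise ValueError("every next hop needs a positive bandwidth")
    divisor = reduce(gcd, values)
    return {next_hop: bw // divisor for next_hop, bw in bandwidths_mbps.items()}


# Bandwidth values as received on PE-3 in the communities shown above.
weights = ecmp_weights({"192.0.2.1": 200000, "192.0.2.2": 100000})
print(weights)
```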
If a tester tool is available, it can replace CE-4 and CE-6 in the topology to test the traffic load-balancing behavior. This is the preferred option, because the effect of weighted ECMP is easier to observe with a large number of flows: several hundred or several thousand flows should be created and sent between the tester ports. For a simple test, the SR OS rapid ping tool can be used to create traffic between the loopback interfaces of CE-6 and CE-4.
At least three flows need to be created to see traffic distributed over the two LAG links between CE-4 and PE-1 and the single link between CE-4 and PE-2. The loopback IP addresses on CE-4 and CE-6 have been specifically chosen to demonstrate the expected load balancing. The behavior may be different if different loopback IP addresses are used, because it affects the load-balancing algorithm.
To facilitate the test, two more Telnet or SSH sessions are initiated to CE-6 (three in total) and the following commands are executed in each separate session:
First session:
*A:CE-6# ping 10.1.2.3 source 10.3.4.5 size 1200 count 100000 rapid
Second session:
*A:CE-6# ping 10.1.2.3 source 10.3.4.5 size 1200 count 100000 rapid
Third session:
*A:CE-6# ping 10.1.2.3 source 10.4.6.8 size 1200 count 100000 rapid
The monitor command outputs on PE-1 and PE-2 show the traffic from CE-6 to CE-4 is being distributed over the two LAG links on PE-1 and the single link on PE-2. In the ideal case, PE-1 would receive 67% and PE-2 would receive 33% of total traffic; however, it may not be possible to observe this effectively with only three ICMP flows.
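The following Python sketch illustrates why a handful of flows cannot reproduce the ideal 67%/33% split while a large number of flows approaches it. It is a simplified model with a hypothetical hash (SHA-256 over a flow tuple) and a bucket list in which each next hop appears as many times as its weight; it does not represent the SR OS load-balancing algorithm.

```python
import hashlib

NEXT_HOPS = [("192.0.2.1", 2), ("192.0.2.2", 1)]  # (next hop, ECMP weight)


def pick_next_hop(flow, weighted_next_hops):
    """Map a flow to a next hop; each next hop owns 'weight' hash buckets."""
    buckets = [nh for nh, weight in weighted_next_hops for _ in range(weight)]
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return buckets[int.from_bytes(digest[:4], "big") % len(buckets)]


def share_per_next_hop(flows, weighted_next_hops):
    """Return the fraction of flows that each next hop receives."""
    counts = {nh: 0 for nh, _ in weighted_next_hops}
    for flow in flows:
        counts[pick_next_hop(flow, weighted_next_hops)] += 1
    return {nh: count / len(flows) for nh, count in counts.items()}


# Hypothetical flows: (source, destination, flow identifier).
three_flows = [("10.3.4.5", "10.1.2.3", i) for i in range(3)]
many_flows = [("10.3.4.5", "10.1.2.3", i) for i in range(30000)]

print(share_per_next_hop(three_flows, NEXT_HOPS))
print(share_per_next_hop(many_flows, NEXT_HOPS))
```

With three flows, each next hop can only receive 0, 1, 2, or 3 flows, so the observed split depends entirely on how those particular flows hash. With many flows, the shares converge toward the 2:1 weight ratio.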
On the PE-1 LAG link to CE-4, the following traffic is monitored. In each interval of 3 seconds, the number of output bytes is 250000 (or more if other traffic is sent in parallel).
*A:PE-1# monitor lag 1 interval 3 repeat 999 rate
===============================================================================
Monitor statistics for LAG ID 1
===============================================================================
Port-id Input packets Output packets
Input bytes Output bytes
Input errors [Input util %] Output errors [Output util %]
-------------------------------------------------------------------------------
---snip---
-------------------------------------------------------------------------------
At time t = 6 sec (Mode: Rate)
-------------------------------------------------------------------------------
1/1/c2/1 301 201
375128 250128
0 ~0.00 0 ~0.00
1/1/c5/1 1 1
128 128
0 ~0.00 0 ~0.00
-------------------------------------------------------------------------------
Totals 302 202
375256 250256
0 ~0.00 0 ~0.00
-------------------------------------------------------------------------------
At time t = 9 sec (Mode: Rate)
-------------------------------------------------------------------------------
1/1/c2/1 301 201
375128 250128
0 ~0.00 0 ~0.00
1/1/c5/1 1 1
128 128
0 ~0.00 0 ~0.00
-------------------------------------------------------------------------------
Totals 302 202
375256 250256
0 ~0.00 0 ~0.00
On the PE-2 to CE-4 link, the following traffic is monitored. In each interval of 3 seconds, the number of output bytes is 125000 (or more if other traffic is sent in parallel):
*A:PE-2# monitor port 1/1/c1/1 interval 3 repeat 999 rate
===============================================================================
Monitor statistics for Port 1/1/c1/1
===============================================================================
Input Output
-------------------------------------------------------------------------------
---snip---
-------------------------------------------------------------------------------
At time t = 6 sec (Mode: Rate)
-------------------------------------------------------------------------------
Octets 0 125000
Packets 0 100
Errors 0 0
Bits 0 1000000
Utilization (% of port capacity) 0.00 ~0.00
-------------------------------------------------------------------------------
At time t = 9 sec (Mode: Rate)
-------------------------------------------------------------------------------
Octets 0 125000
Packets 0 100
Errors 0 0
Bits 0 1000000
Utilization (% of port capacity) 0.00 ~0.00
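The two monitor outputs can be compared against the expected 67%/33% distribution by dividing each output byte rate by the sum of both. The following Python sketch uses the LAG total from PE-1 and the port value from PE-2 shown above; the function name is illustrative only.

```python
def observed_shares(output_bytes):
    """Return each node's fraction of the total monitored output bytes."""
    total = sum(output_bytes.values())
    if total <= 0:
        raise ValueError("no traffic was monitored")
    return {node: count / total for node, count in output_bytes.items()}


# Output byte values taken from the monitor outputs above.
shares = observed_shares({"PE-1": 250256, "PE-2": 125000})
for node, share in shares.items():
    print(f"{node}: {share:.1%}")
```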
BGP Weighted ECMP for IPv4 Family using BGP Import Policy
The link-bandwidth command, which was enabled in the previous step, is removed on PE-1 and PE-2:
# on PE-1 and on PE-2:
configure router bgp group "eBGP" link-bandwidth no add-to-received-ebgp
The following policy is configured on PE-1 to manually add the Link Bandwidth Extended Community "bandwidth:64500:4000" to routes received from CE-4:
# on PE-1:
configure
router Base
policy-options
begin
prefix-list "10.0.0.0/8"
prefix 10.0.0.0/8 longer
exit
community "bandwidth-4G" members "bandwidth:64500:4000"
policy-statement "policy-import-bandwidth-4G"
entry 10
from
prefix-list "10.0.0.0/8"
exit
action accept
community add "bandwidth-4G"
exit
exit
exit
commit
exit all
The policy is applied on PE-1 for the eBGP group in the import direction:
# on PE-1:
configure router bgp group "eBGP" import "policy-import-bandwidth-4G"
The following policy is configured on PE-2 to manually add the Link Bandwidth Extended Community "bandwidth:64500:2000" to routes received from CE-4:
# on PE-2:
configure
router Base
policy-options
begin
prefix-list "10.0.0.0/8"
prefix 10.0.0.0/8 longer
exit
community "bandwidth-2G" members "bandwidth:64500:2000"
policy-statement "policy-import-bandwidth-2G"
entry 10
from
prefix-list "10.0.0.0/8"
exit
action accept
community add "bandwidth-2G"
exit
exit
exit
commit
exit all
The policy is applied on PE-2 for the eBGP group in the import direction:
# on PE-2:
configure router bgp group "eBGP" import "policy-import-bandwidth-2G"
PE-3 receives the routes from PE-1 and PE-2 with Link Bandwidth Extended Communities as configured in the previous step:
*A:PE-3# show router bgp routes 10.1.2.3/32 hunt brief | match "^Nexthop |Community" expression
Nexthop : 192.0.2.1
Community : bandwidth:64500:4000
Nexthop : 192.0.2.2
Community : bandwidth:64500:2000
Again, the following command output shows that the ECMP-Weight value assigned to next hop 192.0.2.1 is 2, while the ECMP-Weight value of 192.0.2.2 is 1:
*A:PE-3# show router fib 1 10.1.2.3/32 extensive
===============================================================================
FIB Display (Router: Base)
===============================================================================
Dest Prefix : 10.1.2.3/32
Protocol : BGP
Installed : Y
Indirect Next-Hop : 192.0.2.1
QoS : Priority=n/c, FC=n/c
Source-Class : 0
Dest-Class : 0
ECMP-Weight : 2
Resolving Next-Hop : 192.168.13.1
Interface : int-PE-3-PE-1
ECMP-Weight : 1
Indirect Next-Hop : 192.0.2.2
QoS : Priority=n/c, FC=n/c
Source-Class : 0
Dest-Class : 0
ECMP-Weight : 1
Resolving Next-Hop : 192.168.23.1
Interface : int-PE-3-PE-2
ECMP-Weight : 1
===============================================================================
Total Entries : 1
===============================================================================
Unlike the link-bandwidth command, a policy sets a static value: the Link Bandwidth Extended Community is not adjusted dynamically when a LAG member link fails or the LAG bandwidth changes.
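The following Python sketch contrasts the two methods when one of the two LAG member links between CE-4 and PE-1 fails. It is a model of the behavior described in this chapter, not SR OS code. The member speed of 100000 Mb/s is inferred from the community values shown earlier (200000 for the two-member LAG, 100000 for the single link), and the function names are hypothetical.

```python
from fractions import Fraction

MEMBER_SPEED_MBPS = 100_000  # inferred from the outputs in this chapter


def with_link_bandwidth_command(active_lag_members):
    """Bandwidths advertised when the value tracks the active LAG members."""
    return {
        "PE-1": MEMBER_SPEED_MBPS * active_lag_members,
        "PE-2": MEMBER_SPEED_MBPS,
    }


def with_import_policy(active_lag_members):
    """Bandwidths advertised by the static policies; LAG state is ignored."""
    return {"PE-1": 4000, "PE-2": 2000}


def weight_ratio(bandwidths):
    """Return the PE-1 to PE-2 weight ratio as a reduced fraction."""
    return Fraction(bandwidths["PE-1"], bandwidths["PE-2"])


for members in (2, 1):
    print(
        members,
        weight_ratio(with_link_bandwidth_command(members)),
        weight_ratio(with_import_policy(members)),
    )
```

In this model, the link-bandwidth command moves from a 2:1 to a 1:1 ratio after the failure, whereas the policy keeps advertising 2:1 even though PE-1 has lost half of its capacity toward CE-4.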
Similar tests can be run using the rapid ping facility or an external tester tool as described in the previous section to check the packet forwarding behavior.
Conclusion
BGP Weighted ECMP modifies the standard load-balancing behavior to take into account the relative link bandwidth of the different BGP next hops. This allows better utilization of links with different capacities. The bandwidth values are advertised by edge routers and carried within a BGP community called the Link Bandwidth Extended Community. SR OS routers automatically perform weighted load balancing if all the BGP routes to a destination contain this community.