BNG-UP resiliency

Resiliency based on Fate Sharing Group

The MAG-c groups the sessions in FSGs. All sessions in an FSG share their fate, that is, they become active or standby together. The MAG-c provides the following parameters to the BNG-UP per FSG:

When the MAG-c does not provide an FSG template, the template with the name default is used. If there is no default template, the setup of the FSG and any associated session fails.

Use the following command to configure FSG templates.

configure subscriber-mgmt up-resiliency fate-sharing-group-template

In the descriptions that follow, the terms active BNG-UP and standby BNG-UP apply in the context of a single FSG. Each BNG-UP can have multiple FSGs and can have a different status for each FSG.

To attract traffic from the access network, an active BNG-UP replies to ARP requests or ND messages for any IP gateway associated with the FSG. A standby BNG-UP never replies to those ARP or ND messages. To expedite convergence when switching from standby to active, the new active BNG-UP sends Gratuitous ARP (GARP) messages using the IP gateway address for the FSG, or the system IP address if no IP gateway is known. Afterward, the BNG-UP keeps sending periodic GARP messages to ensure traffic is attracted at all times to the correct BNG-UP.
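The ARP and GARP behavior described above can be sketched as follows. This is a minimal illustrative model, not product code: the `FsgView` class and both function names are hypothetical, and the sketch assumes that a standby BNG-UP sends no GARP messages at all.

```python
from dataclasses import dataclass


@dataclass
class FsgView:
    """Hypothetical per-FSG view on one BNG-UP (illustration only)."""
    active: bool
    gateway_ips: tuple[str, ...]  # IP gateways associated with the FSG
    system_ip: str                # system IP address of the BNG-UP


def replies_to_arp(fsg: FsgView, target_ip: str) -> bool:
    """Only an active BNG-UP answers ARP or ND for an FSG gateway address."""
    return fsg.active and target_ip in fsg.gateway_ips


def garp_source_ips(fsg: FsgView) -> tuple[str, ...]:
    """Addresses used in GARP messages: the gateways, else the system IP."""
    if not fsg.active:
        return ()  # assumption: a standby BNG-UP sends no GARP
    return fsg.gateway_ips or (fsg.system_ip,)
```

For example, an active FSG without any known IP gateway falls back to the system IP address, while a standby FSG neither replies to ARP requests nor announces any address.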

Use the following command to configure the granularity of GARP messages for QinQ SAPs.

configure subscriber-mgmt up-resiliency fate-sharing-group-template gratuitous-arp

You can configure the BNG-UP to send a single GARP message per SAP or per outer tag. Configure the fsg-active, fsg-active-path-restoration and fsg-standby options for the following command to correctly draw traffic to the active BNG-UP.

configure policy-options policy-statement entry from state

All routes received from PFCP, including per-session framed routes, have one of these values as an option. You can use this option to adjust values in routing export policies; for example, adjust a metric or a preference to suit the routing protocol in use.

The following reduced configuration example shows a simplified policy that sets a metric of 100 for active routes, a metric of 150 for active routes while the BNG-UP is headless, and a metric of 200 for standby routes.

Policy statement configuration

[ex:/configure policy-options policy-statement "upf_resiliency_aware_export"]
A:admin@BNG-UPF# info
    entry 20 {
        from {
            origin pfcp
            state fsg-active
        }
        action {
            action-type accept
            metric {
                set 100
            }
        }
    }
    entry 30 {
        from {
            origin pfcp
            state fsg-active-path-restoration
        }
        action {
            action-type accept
            metric {
                set 150
            }
        }
    }
    entry 40 {
        from {
            origin pfcp
            state fsg-standby
        }
        action {
            action-type accept
            metric {
                set 200
            }
        }
    }

An active BNG-UP always forwards traffic in both directions. It uses the FSG MAC as source MAC for downlink unicast traffic. A standby BNG-UP by default forwards downlink traffic using its local port MAC as source MAC and drops all received uplink traffic. You can modify the default behavior in the following ways:

To shunt downlink traffic from the standby to the active BNG-UP and have the active BNG-UP forward that downlink traffic, do the following:
- Use the following command to configure a redundant interface.
```
configure subscriber-mgmt up-resiliency fate-sharing-group-template redundant-interface
```
- Use the following command to configure the same shunt ID on the active and the standby BNG-UP for the applicable service.
```
configure service ies subscriber-mgmt multi-chassis-shunt-id
configure service vprn subscriber-mgmt multi-chassis-shunt-id
```
Use the following command to enable forwarding of uplink traffic by the standby BNG-UP.
```
configure subscriber-mgmt up-resiliency fate-sharing-group-template uplink-forwarding-while-standby
```
CAUTION: Enabling the uplink-forwarding-while-standby command can lead to packet replication toward the core network. To prevent the possibility of packet replication toward the core network, provision the access network not to replicate unknown unicast packets to the BNG-UP.
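The default and modified forwarding behavior can be summarized in a small decision function. This is a hedged sketch for illustration only: the function name, the `shunt` and `uplink_while_standby` flags, and the returned labels are hypothetical stand-ins for the redundant interface and uplink-forwarding-while-standby settings described above.

```python
def forwarding_action(role, direction, *, shunt=False, uplink_while_standby=False):
    """Return (action, source_mac) for unicast traffic on one FSG.

    role: "active" or "standby"; direction: "uplink" or "downlink".
    source_mac is only given for downlink traffic sent toward the access side.
    """
    if role not in ("active", "standby") or direction not in ("uplink", "downlink"):
        raise ValueError("unknown role or direction")
    if role == "active":
        # An active BNG-UP forwards both directions and uses the FSG MAC downlink.
        return ("forward", "fsg-mac" if direction == "downlink" else None)
    if direction == "downlink":
        # Standby: shunt to the active BNG-UP if configured, else forward locally.
        return ("shunt-to-active", None) if shunt else ("forward", "port-mac")
    # Standby uplink traffic is dropped unless uplink forwarding is enabled.
    return ("forward", None) if uplink_while_standby else ("drop", None)
```

With both flags left at their defaults, the function returns `("drop", None)` for standby uplink traffic and `("forward", "port-mac")` for standby downlink traffic, matching the default behavior.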

When the standby BNG-UP forwards uplink traffic, it can significantly lower packet loss during transition scenarios. The following examples illustrate this benefit:

FSG-based resiliency does not use SRRP, but the system internally consumes an SRRP instance for each unique combination of FSG, port, and group interface template. To avoid potential conflicts with pre-configured SRRP instance IDs, use the following command to define a range of SRRP instance IDs for the inter-BNG-UP resiliency functionality.

configure redundancy srrp auto-srrp-id-range

BNG-UP health reporting

The BNG-UP can send health reports to the MAG-c using PFCP Node Report messages. The MAG-c uses the health reports to determine the need for a BNG-UP status change (active or standby). Per FSG, the MAG-c selects the active and the standby BNG-UP. For example, the MAG-c can base its decision on link failures in the access network.

The BNG-UP supports health reports for the following contexts:

- per network instance
Use the commands in the following context to configure the health monitoring for the applicable service.
```
configure service ies subscriber-mgmt up-resiliency
configure service vprn subscriber-mgmt up-resiliency
```
The health reports per network instance can, for example, be used to indicate the status of the network where the subscriber is serviced.
- per Layer 2 access ID
Use the commands in the following context to configure health monitoring.
```
configure service vpls capture-sap pfcp up-resiliency
```
The health reports per Layer 2 access ID can, for example, be used to indicate the status of the access links.

Each health report carries a health value between 0 (unhealthy) and 100 (healthy). The base health value is 100 and decreases by the number of failed members in the operational group multiplied by the health drop number configured for that operational group.
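The calculation can be sketched as follows. The function name is hypothetical, and clamping the result at 0 is an assumption derived from the stated 0 to 100 range; the default health drop of 100 is the documented default.

```python
BASE_HEALTH = 100  # base health value of every report


def health_value(failed_members: int, health_drop: int = 100) -> int:
    """Health value for one monitored operational group.

    failed_members: number of group members that are currently down.
    health_drop: configured drop per failed member (default 100).
    """
    if failed_members < 0 or health_drop < 0:
        raise ValueError("inputs must be non-negative")
    # Assumption: the value never goes below 0, the lower bound of the range.
    return max(0, BASE_HEALTH - failed_members * health_drop)
```

For example, with a health drop of 20, one failed member gives `health_value(1, 20) == 80`, and five failed members give 0. With the default health drop, a single failed member already gives 0.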

Whenever a member of the operational group changes its state (fails or recovers), the BNG-UP calculates the health value and sends an updated report to the MAG-c.

To configure the operational group and the health drop number, use the monitor-oper-group and the monitor-oper-group health-drop commands in the previously mentioned contexts.

For more information about operational groups, see 7450 ESS, 7750 SR, 7950 XRS, and VSR Layer 3 Services Guide: IES and VPRN, sections Object grouping and state monitoring.

With the following example configuration, the BNG-UP sends health reports for Layer 2 access ID (port) lag-access. The operational group has five members (port 1/1/20 to port 1/1/24) and the health value decreases by 20 per failed member, that is, by 20% of the base health value.

Configuration of health reports for Layer 2 access ID (port) lag-access

[ex:/configure port 1/1/20]
A:admin@BNG-UPF# info
    oper-group "lag_access_health"
[ex:/configure port 1/1/21]
A:admin@BNG-UPF# info
    oper-group "lag_access_health"
[ex:/configure port 1/1/22]
A:admin@BNG-UPF# info
    oper-group "lag_access_health"
[ex:/configure port 1/1/23]
A:admin@BNG-UPF# info
    oper-group "lag_access_health"
[ex:/configure port 1/1/24]
A:admin@BNG-UPF# info
    oper-group "lag_access_health"
[ex:/configure service vpls "access" capture-sap lag-access:*.* pfcp up-resiliency]
A:admin@BNG-UPF# info
    monitor-oper-group "lag_access_health" {
        health-drop 20
    }

The BNG-UP sends a health report for every status change in the operational group. Additionally, it sends all health reports periodically (every 60 seconds) and when a PFCP audit is requested.

With the following example configuration, the BNG-UP sends health reports for network instance HSI. The health drop number is not configured, that is, the default value of 100 is used. The health is based on a BFD session that is used to check if the BNG-UP is isolated from the rest of the network. When the BFD session is up, the health value equals 100, otherwise, the health value equals 0.

Configuration of health reports for network instance HSI

[ex:/configure service oper-group "hsi-bfd"]
A:admin@BNG-UPF# info
    bfd-liveness {
        router-instance "to_uplink_router"
        interface-name "endpoint"
        dest-ip 203.0.113.10
    }
[ex:/configure service vprn "hsi" subscriber-mgmt up-resiliency]
A:admin@BNG-UPF# info
    monitor-oper-group "hsi-bfd" {
    }

Interaction with headless mode

If the PFCP path between the BNG-UP and the MAG-c fails and the BNG-UP becomes headless, use the following command to determine the behavior for active FSGs.

configure subscriber-mgmt up-resiliency fate-sharing-group-template path-restoration-state

When using the standby option for this command, the BNG-UP forces all active FSGs to become standby. This avoids any possibility of active/active UP behavior. However, if all the BNG-UPs become headless, for example, because of a routing issue to the MAG-c, all FSGs on all BNG-UPs become standby, and no forwarding is possible.

Note: Nokia recommends leaving this command set to the default (auto). Only use the standby option if the network cannot tolerate the active/active scenarios described in this section, even with the avoidance measures in place.

If you use the auto option, the BNG-UP uses a heuristic process to decide whether to keep the FSG as active or move it to standby. The BNG-UP autonomously changes FSGs to standby if any of the following conditions are met:

No single network instance is monitored for health.
At least one of the monitored network instances indicates health failure.
A GARP message is snooped from another BNG-UP.

This process distinguishes between a single isolated BNG-UP becoming headless and all BNG-UPs becoming headless. When the BNG-UP estimates that all BNG-UPs are headless, it keeps the FSGs active. Otherwise, the BNG-UP moves the FSGs to standby, because the MAG-c activates another BNG-UP.

The heuristic check of network health determines whether the failure is a more generic network failure, which is more likely to be local to the BNG-UP (for example, a network link failure). If the PFCP path fails but the local network is healthy, the failure is probably central, and all BNG-UPs have become headless. In addition, if the network link is down, the system probably cannot forward the session traffic anyway.

The heuristic check of GARP snooping determines whether another BNG-UP became active while this BNG-UP is headless. If another BNG-UP sends a GARP message, it was instructed by the MAG-c to become active, which in turn means it cannot be headless. The BNG-UP can therefore estimate that the headless condition is limited to a single node and that it is safe to become standby.
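The decision logic for a previously active FSG can be sketched as follows. This is a simplified model under stated assumptions, not the actual implementation: the function and parameter names are hypothetical, and each monitored network instance is reduced to a healthy or unhealthy flag because the exact threshold for a health failure is not specified here.

```python
from typing import Mapping


def headless_fsg_state(option: str,
                       network_instance_healthy: Mapping[str, bool],
                       garp_from_other_up: bool) -> str:
    """State of a previously active FSG once the BNG-UP becomes headless.

    option: "standby" or "auto" (the path-restoration-state setting).
    network_instance_healthy: monitored network instance name -> healthy flag.
    garp_from_other_up: True if a GARP from another BNG-UP was snooped.
    """
    if option == "standby":
        return "standby"  # forced standby, no heuristics applied
    if option != "auto":
        raise ValueError("option must be 'standby' or 'auto'")
    if not network_instance_healthy:
        return "standby"  # no network instance is monitored for health
    if not all(network_instance_healthy.values()):
        return "standby"  # failure is probably local to this BNG-UP
    if garp_from_other_up:
        return "standby"  # another BNG-UP was activated by the MAG-c
    return "active"       # probably all BNG-UPs are headless
```

In this sketch, a result of `"active"` corresponds to the FSG remaining active while headless, which is the state that the fsg-active-path-restoration routing option matches.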

These heuristic checks are best-effort and may fail to detect active/active conditions. However, by correctly setting routing metrics to differentiate between the fsg-active and fsg-active-path-restoration options, you can avoid the worst of active/active scenarios. By giving the headless BNG-UP a worse metric or preference, only the non-headless active BNG-UP draws downlink traffic. The headless BNG-UP may still erroneously answer ARP requests and update forwarding databases in the access node for the FSG virtual MAC. However, typically this is quickly corrected by downlink traffic using the vMAC as the source MAC address, coming from the non-headless BNG-UP.

The following are other risks of active/active scenarios:

When the aggregation network replicates unknown unicast packets to both BNG-UPs, both BNG-UPs forward these packets, leading to duplicate packets in the network. However, it is unlikely that the FSG virtual MAC is ever an unknown MAC. To further reduce the risk, disable unknown unicast replication in the aggregation network.
Exact QoS enforcement is not guaranteed in this situation.

The standby FSGs remain as-is during headless conditions. Whichever option you use (standby or auto), the BNG-UP reverts the FSG state to active after the headless condition is no longer valid.