BNG-UP resiliency

An overview of the BNG-UP resiliency function and capabilities, resiliency handling, and deployment use cases.

Terminology for BNG-UP resiliency

fate sharing group (FSG)
An FSG is a group of sessions that stay together when moved between BNG-UPs. This guarantees that any associated resources, such as ODSA allocated prefixes, are moved together with the sessions.
active BNG-UP
In the scope of a single FSG, the active BNG-UP is the BNG-UP on which the sessions are created and that actively forwards traffic for those sessions.
standby BNG-UP
In the scope of a single FSG, the standby BNG-UP indicates the BNG-UP that is ready to install sessions and forward traffic upon failure of the active BNG-UP. Whether sessions are proactively created on this BNG-UP depends on the chosen resiliency model.
hot standby
In the hot standby resiliency model, sessions are proactively created on a standby BNG-UP. The standby BNG-UP does not attract traffic but is ready to start forwarding as soon as the MAG-c instructs it to do so.
warm standby

In the warm standby resiliency model, sessions are created solely on the active BNG-UP. Sessions on the standby (new active) BNG-UP are only created after the active BNG-UP fails.

Introduction to MAG-c-driven BNG-UP resiliency

The Nokia MAG-c supports a MAG-c-driven BNG-UP resiliency scheme. In this scheme, the MAG-c selects the active and standby BNG-UPs and the BNG-UPs must follow this decision. The BNG-UPs do not communicate directly to negotiate the active or standby role or to synchronize session state. Instead, each BNG-UP sends its local status indicators to the MAG-c ; for example, whether it has full connectivity to the access network. The MAG-c aggregates these status indicators from all BNG-UPs and makes an informed decision that is sent to the BNG-UPs. The PFCP node messages of the PFCP association between the BNG-UP and MAG-c that are already in place for session management carry the status indicators and informed decisions.

Figure 1. High-level overview of communication for BNG-UP resiliency

It is possible and often wanted that a BNG-UP is active for a subset of the sessions and standby for another subset of the sessions. For example, when two BNG-UPs are fully available, making both BNG-UPs active for half of the sessions and standby for the other half of the sessions may be preferred. Similarly, two Layer 2 access IDs (ports) on the same BNG-UP can be backed up by two different BNG-UPs. The following figure shows the use case where the BNG-UP "central" is backed up by both the BNG-UPs "west" and "east" for two different Layer 2 access IDs.

Figure 2. Multiple backup BNG-UPs

To support all use cases, the MAG-c assigns sessions to a FSG. The MAG-c assigns the active or standby state to each FSG. The state applies to all sessions of the FSG, but not to any other session on the same BNG-UPs. ODSA is also FSG-aware and allocates micro-nets on an FSG basis, instead of a BNG-UP basis, to account for FSGs moving between BNG-UPs.

Modeling a resilient BNG-UP deployment using UP groups

The UP group configuration is a key component of the CUPS BNG-UP resiliency. This configuration serves as a high-level description of the BNG-UP access network so that the MAG-c knows which BNG-UPs are interconnected for BNG-UP resiliency. Based on the UP group configuration, the MAG-c automatically generates FSGs for the resiliency functionality. The UP group contains parameters to create the FSGs.

Use the following command to configure the UP group:
configure mobile-gateway pdn bng up-group
At the core of the UP group configuration is a list of BNG-UPs. The PFCP Node ID IE as signaled during the PFCP association setup procedure identifies each BNG-UP. The identifier can be either a name or an IP address. The BNG-UPs that form the UP group are interconnected and BNG-UP resiliency can occur between them.
Note:

If the identifier of the BNG-UP is an IP address, it must match the PFCP Node ID IE. It does not fall back to the PFCP source IP address.

If a BNG-UP uses a name in the PFCP Node ID IE, the MAG-c must be configured to use the name and not the PFCP source IP address.

Fate sharing group creation

The MAG-c creates a single FSG per configured UP group. The following configuration for the FSG is provisioned via the UP group:

  • reference to an FSG profile
    Use the following command to configure the FSG profile to reference to:
    configure mobile-gateway profile bng fsg-profile
    The profile contains detailed parameters on the resiliency behavior; for example, health calculation for each BNG-UP.
  • preferred indicator

    Per BNG-UP, a flag indicates whether the BNG-UP is active by preference. When the flag is set for a BNG-UP, the FSG prefers this BNG-UP to be active if all other parameters are equal.

  • drain indicator

    Per BNG-UP, a flag indicates whether the BNG-UP is in drain mode. When the flag is set for a BNG-UP, the FSG avoids selecting this BNG-UP as active. For example, this flag can be used before upgrading a BNG-UP to achieve a graceful switchover.

    Note: Changing the drain flag for an active BNG-UP acts as a BNG-UP reselection trigger for the linked FSGs. The MAG-c moves the sessions after changing the configuration.

Fixed access

Fixed access sessions require the Layer 2 circuit (Layer 2 access ID and VLAN parameters) that is learned from incoming IBCP packets. In a resilient setting, the Layer 2 circuits can differ between the BNG-UPs. For example, in Multiple backup BNG-UPs, Layer 2 access ID "central-A" on BNG-UP "central" is backed up by Layer 2 access ID "west-A" on BNG-UP "west". Because the MAG-c cannot rely on the initial IBCP messages to learn all the Layer 2 access IDs, the IDs must be configured manually.

A single Layer 2 access ID can be configured per BNG-UP in a UP group.. When setting up a new session for this UP group, the MAG-c learns the initial Layer 2 access ID from the incoming IBCP packet, but derives the Layer 2 access IDs for the other BNG-UPs from the configuration. A UP group-level default can be configured to simplify cases where the Layer 2 access IDs are identically named. See Example for a 1:1 hot standby resiliency with an S-tag per access node for this use case.

When all BNG-UPs use identical Layer 2 access IDs, it is possible to list multiple Layer 2 access IDs per UP group at the group level to avoid creating multiple UP groups for each Layer 2 access ID. When this is configured, the BNG-UP assumes that each Layer 2 access ID is backed up by the identically named Layer 2 access ID on other BNG-UPs. The MAG-c does not assume that there is one big broadcast domain shared between all ports. The following figure shows a UP group that covers two Layer 2 access IDs, named "link-1" and "link2". The sessions on "link-1" cannot be backed up on "link-2" because "link-2" connects to another access node.

Figure 3. Multiple Layer 2 access IDs per UP group
up-group "demo"
    fsg-profile "empty"
    l2-access-id link-1 link-2
    up "north"
    exit
    up "south"
    exit
    no shutdown
exit

Similarly, a VLAN range can be configured per BNG-UP for both S-tags and C-tags. A UP group-level default is also available. The VLAN range configuration serves the following purposes:

  • Split a single Layer 2 access ID in multiple FSGs and set a different preferred status on different BNG-UPs. In stable conditions, this achieves active-active behavior where some sessions are active on one BNG-UP while others are active on another BNG-UP. See Example for a 1:1 hot standby resiliency with an S-tag per access node for this use case.
  • Set different VLAN ranges on several BNG-UPs in more complex aggregation requirements. The MAG-c automatically adjusts the VLANs learned from IBCP for each UP based on the difference between the start values of the VLAN ranges of each BNG-UP. For example, if UP A is configured with range 100 to 200, and UP B with range 500 to 600, a session with VLAN 150 on UP A automatically uses VLAN 550 on UP B. While the start values of the VLAN range can be different, all ranges must have an equal size. For example, it is not possible to configure a range of 100 to 200 on one BNG-UP, and 100 to 300 on another BNG-UP in the same UP group.
    WARNING: VLAN ranges with a different offset over more BNG-UPs are an advanced use case and should be carefully validated against the deployed aggregation network. To avoid accidentally enabling different offsets when this functionality is not required, Nokia recommends only configuring a VLAN range on the UP group level.

The following subsections provide deployment use cases and example UP group configurations for the BNG-UP resiliency concepts.

Example for a 1:1 hot standby resiliency with an S-tag per access node

Four access nodes are connected to a pair of BNG-UPs using a shared broadcast domain. To simplify Layer 2 forwarding, each access node is assigned a unique S-tag. The broadcast domain is connected to each BNG-UP through an identically-named Layer 2 access ID on both BNG-UPs. The MAG-c makes abstraction of whether this connection is a port, LAG, BGP-VPLS, EVPN, or any similar construct.

Note: To achieve identical naming on a Nokia BNG-UP, provision an Layer 2 access ID alias using the following command:
  • MD-CLI
    configure service vpls capture-sap pfcp l2-access-id-alias
  • classic CLI
    configure service vpls sap pfcp l2-access-id-alias
The goal is to have hot standby resiliency, in stable conditions (both BNG-UPs are healthy), such that the active sessions are split between the two BNG-UPs. The following configurations achieve this goal:
  • Split the Layer 2 access IDs based on S-tag ranges in two UP groups, each serving half of the access nodes.
  • Configure a different BNG-UP as preferred in each group to make the associated FSG active on the preferred BNG-UP as long as that BNG-UP is healthy.
  • Link to an empty FSG profile to make the default session standby mode equal to hot (see Warm and hot standby for more information).
up-group "prefer-east"
    fsg-profile "empty"
    s-tag-range start 1 end 2
    l2-access-id to-access
    up "east"
        preferred
    exit
    up "west"
    exit
    no shutdown
up-group "prefer-west"
    fsg-profile "empty"
    s-tag-range start 3 end 4
    l2-access-id to-access
    up "east"
    exit
    up "west"
        preferred
    exit
    no shutdown
Figure 4. 1:1 hot standby resiliency example

Example for a per S-tag 1:1 hot standby resiliency with an S-tag per access node

This example extends the model in Example for a 1:1 hot standby resiliency with an S-tag per access node with two access nodes and two BNG-UPs.

Instead of splitting the BNG-UPs such that there are two pairs of 1:1 BNG-UPs, each S-tag range gets a different pair of standby BNG-UPs as follows:

  • S-tag 1 is backed by BNG-UP "north" and "east"
  • S-tag 2 is backed by BNG-UP "east" and "west"
  • S-tag 3 is backed by BNG-UP "west" and "south"
  • S-tag 4 is backed by BNG-UP "south" and "north"
  • S-tag 5 is backed by BNG-UP "north" and "west"
  • S-tag 6 is backed by BNG-UP "east" and "south"
up-group "s-tag-1"
    fsg-profile "empty"
    s-tag-range start 1 end 1
    l2-access-id to-access
    up "north"
    exit
    up "east"
    exit
    no shutdown
up-group "s-tag-2"
    fsg-profile "empty"
    s-tag-range start 2 end 2
    l2-access-id to-access
    up "east"
    exit
    up "west"
    exit
    no shutdown
up-group "s-tag-3"
    fsg-profile "empty"
    s-tag-range start 3 end 3
    l2-access-id to-access
    up "west"
    exit
    up "south"
    exit
    no shutdown
up-group "s-tag-4"
    fsg-profile "empty"
    s-tag-range start 4 end 4
    l2-access-id to-access
    up "south"
    exit
    up "north"
    exit
    no shutdown
up-group "s-tag-5"
    fsg-profile "empty"
    s-tag-range start 5 end 5
    l2-access-id to-access
    up "north"
    exit
    up "west"
    exit
    no shutdown
up-group "s-tag-6"
    fsg-profile "empty"
    s-tag-range start 6 end 6
    l2-access-id to-access
    up "east"
    exit
    up "south"
    exit
    no shutdown
Figure 5. Per S-tag 1:1 hot standby resiliency example
When using default FSGs, the MAG-c load-balances the FSGs and sessions as equal as possible by default:
  • Two BNG-UPs have two active FSGs.
  • Two BNG-UPs have one active FSG.
To improve the balance, you can add more S-tags or more BNG-UPs or both. For example, using 12 S-tags with a UP group each leads to a balance where each BNG-UP has three active FSGs.

The difference between a BNG-UP-level 1:1 model and an S-tag-level 1:1 model lies in the impact of multiple BNG-UP failures. For example, compare the deployment where "north" and "south" back up each other and "east" and "west" back up each other without overlap. We assume each S-tag range is responsible for about 1/6th of the traffic.

  • When two BNG-UPs fail in the per S-tag mode, it always impacts 1/6th of traffic because each pair of BNG-UPs is always uniquely responsible for one S-tag out of six. For example, if "north" and "south" fail, S-tag 4 completely fails.
  • When two BNG-UPs fail in the per-BNG-UP mode, the impact depends on which nodes fail and that can either impact 0% or 50% of the traffic. For example, if both "north" and "west" fail, there is no lasting traffic impact because they do not back up each other. If both "south" and "north" fail, all traffic of the two S-tags covered by these BNG-UPs fails.

This effect becomes stronger with more BNG-UPs and S-tags to distribute. For example, in a model with 10 BNG-UPs, the configuration can limit a failure of two BNG-UPs to only affect about 2% of the traffic versus potentially 20% of the traffic if five 1:1 pairs are used.

This model makes the following assumptions on the aggregation model:

  • A shared L2 broadcast domain must be available for all BNG-UPs.
  • A suitable granularity to differentiate UP groups must be available, such as S-tags in the example above.
  • The BNG-UP failures are unrelated. If the BNG-UP failures happen in bulk (for example, because they are co-located), it can be better to make sure no co-located BNG-UPs back up each other instead of to distribute resiliency as much as possible.

Fate sharing groups

Fate sharing groups (FSGs) are groups of sessions on which resiliency operations are performed. FSGs are automatically created based on configured UP groups. The FSGs are provisioned via the UP group.

When an FSG is created, the MAG-c performs the following operations:

  • Map new sessions to the FSG (see Session-to-FSG mapping).
  • Determine traffic management parameters to attract traffic only to the BNG-UP that serves the specific FSG (see Traffic steering parameters).
  • Determine an aggregated health value for each BNG-UP in the FSG.
  • Upon BNG-UP state and health changes, reselect an active and standby BNG-UP for the FSG. Any change triggers this reselection, which guarantees that no state change is lost. In many cases, the MAG-c selects the same active and standby BNG-UP as before.
  • Upon any active/standby change, update the FSG state on the BNG-UP and, if necessary, update the session state on the BNG-UP.

FSGs follow an intent-based processing model. The configuration specifies the edge conditions of resiliency behavior, expressing its intent. For example, the configuration specifies whether switchovers should be revertive and whether there is a preferred BNG-UP. The MAG-c monitors multiple parameters and, if necessary, changes active/standby decisions to better match the intent.

Session-to-FSG mapping

When setting up a fixed access session, the MAG-c uses the BNG-UP ID, the Layer 2 access ID, and the VLAN ranges of the triggering IBCP packet to look up a UP group. If a UP group contains this set of parameters, the MAG-c links the session automatically to the FSG created for that UP group.

Traffic steering parameters

FSGs specify the granularity for the session switchover from one BNG-UP to another. A BNG-UP must uniquely attract traffic for a specific FSG in both the uplink and downlink direction without affecting other FSGs. To achieve this, the MAG-c:
  • associates unique uplink and downlink parameters with each FSG
  • signals those parameters to the BNG-UP as part of creating the FSG when that BNG-UP is selected as active or standby BNG-UP for that specific FSG

ODSA allocates a unique set of per-FSG subnets (micro-nets). Because the subnets are unique per FSG, the active BNG-UP can announce these subnets. To achieve the uniqueness, a session that is linked to an FSG passes the FSG as an allocation context to ODSA. ODSA automatically makes the micro-nets unique in that context.

Note: A standby BNG-UP can also announce the subnet in routing messages but it should make sure that the subnet has lower priority. To achieve this, the standby BNG-UP appropriately sets metrics or preference values in the used routing protocol.
For fixed access sessions, the MAG-c generates a unique MAC address per FSG. When receiving ARP or ND requests in the scope of sessions or subnets linked to a specific FSG, only the active BNG-UP can respond to the requests with the unique MAC address. This makes sure that any MAC forwarding databases in the Layer 2 aggregation point to the correct active gateway. Each time the MAG-c signals a BNG-UP to become active, the BNG-UP can generate GARPs with the unique MAC address to expedite traffic convergence to the new active BNG-UP. The MAG-c bases the generation of the MAC addresses on a /32 prefix configuration. Use the following command to configure the prefix:
configure mobile-gateway profile bng fsg-profile mac-prefix
The default 02-00-5e-00 prefix is based on the MAC prefix used for VRRP, with the L bit flipped to remove its globally unique significance.
Example of the relationship between FSGs, MAC addresses, and subnets shows the MAC addressing for 6 FSGs with 2 subnets each, distributed over 3 BNG-UPs. The relationship between the FSGs, MAC addresses, and subnets is as follows:
  • FSG 1

    MAC 02-00-5e-00-00-01

    session subnet 10.1.1.0/24

    session subnet 10.1.2.0/24

  • FSG 2

    MAC 02-00-5e-00-00-02

    session subnet 10.2.1.0/24

    session subnet 10.2.2.0/24

  • FSG 3

    MAC 02-00-5e-00-00-03

    session subnet 10.3.1.0/24

    session subnet 10.3.2.0/24

  • FSG 4

    MAC 02-00-5e-00-00-04

    session subnet 10.4.1.0/24

    session subnet 10.4.2.0/24

  • FSG 5

    MAC 02-00-5e-00-00-05

    session subnet 10.5.1.0/24

    session subnet 10.5.2.0/24

  • FSG 6

    MAC 02-00-5e-00-00-06

    session subnet 10.6.1.0/24

    session subnet 10.6.2.0/24

Figure 7. Example of the relationship between FSGs, MAC addresses, and subnets

BNG-UP health determination

The BNG-UP health is the main criterion that the MAG-c uses to determine the active and standby BNG-UP. Health is a value between 0% and 100%; the -1 value indicates BNG-UP unavailability. The following rules determine the BNG-UP health:
  • When the PFCP path between the MAG-c and the BNG-UP is down or in headless mode, the health value is -1 (unavailable).
    Note: If a PFCP association is not set up, the BNG-UP is operationally not part of the UP group and has no health.
  • When the following command is set to true, the health value is -1 (unavailable).
    configure mobile-gateway pdn bng up-group up drain
  • In all other cases, the health value is based on an aggregation of the operational statuses received from the BNG-UP.

The BNG-UP can signal the following operational status values to the MAG-c:

  • per Layer 2 access ID

    A percentage value per Layer 2 access ID indicates the current forwarding capacity compared to the full forwarding capacity. For example, if the Layer 2 access ID represents a LAG with five members where one member failed, the expected capacity is 80%.

  • per Layer 3 service (also known as network instance or network realm)

    A binary connectivity status per Layer 3 service indicates whether the Layer 3 core network is reachable or not (connected or isolated). A Nokia BNG-UP additionally augments this value with a percentage value to cover partial failures. The MAG-c uses the more detailed percentage value if available; otherwise, the MAG-c interprets the binary connectivity status as 100% for the connected state and 0% for the isolated state.

Not all status values of a single BNG-UP apply to a specific FSG. For example, a UP group that only covers a single Layer 2 access ID is not impacted by any other Layer 2 access ID status. The MAG-c determines the applicable status values as follows:

  • By default, the MAG-c uses for the aggregation all Layer 2 access IDs configured for the BNG-UP in the UP group. The following commands configure the L2 access IDs:
    configure mobile-gateway pdn bng up-group l2-access-id 
    configure mobile-gateway pdn bng up-group up l2-access-id 
  • The MAG-c can exclude configured Layer 2 access IDs from the health calculation. This prevents the MAG-c from automatically setting the health value to 0 if the BNG-UP does not or cannot provide a status value for Layer 2 access IDs. The following command specifies whether to include L2 access IDs and is enabled by default:
    configure mobile-gateway profile bng fsg-profile health-calculation include-l2-access-ids
  • The MAG-c tracks a list of configured network realms for health aggregation. The following command configures the tracked network realms:
    configure mobile-gateway profile bng fsg-profile health-calculation network-realm
To calculate a single health value from the set of status values, the MAG-c applies an aggregation calculation that is configured using the following command:
configure mobile-gateway profile bng fsg-profile health-calculation aggregation-mode

The options for the aggregation mode are:

  • lowest

    This mode sets the per-BNG-UP health to the lowest value of any Layer 2 access ID and network realm value. A single failure aggressively decreases the health.

  • average

    This option sets the per-BNG-UP health to the arithmetic mean of all Layer 2 access ID and network realm values. A single failure less aggressively impacts the health.

If the BNG-UP does not signal a status value for a Layer 2 access ID or network realm that is configured to be tracked, the MAG-c sets the status value for the respective Layer 2 access ID or network realm to 0%. Because the MAG-c uses those values in the aggregation calculation, any missing status value sets the BNG-UP health to 0% for an aggregation mode that is equal to lowest.

Next to the BNG-UP health ranging from 0% to 100%, the MAG-c maintains a simplified BNG-UP failure state. A BNG-UP is considered failed if its health is below the failure threshold. To configure the failure threshold, use the following command:
configure mobile-gateway profile bng fsg-profile health-calculation failure-threshold
By default, the failure threshold is set to 1% , meaning that only a BNG-UP with a health value equal to 0% or -1 (unavailable) is considered failed.

The MAG-c maintains a special not-ready indicator for the current standby BNG-UP. This indicator is set in the following conditions:

  • The BNG-UP changes to standby, independent of its previous state or health.
  • The BNG-UP health becomes unavailable (-1).

The MAG-c removes the not-ready indicator each time an FSG change successfully completes (see Active/standby change or switchover) and the health of the BNG-UP at that time is 0% or higher.

The MAG-c avoids making a standby BNG-UP with the not-ready indicator active unless it has no other choice; for example. when the PFCP association for the active BNG-UP is released. This mechanism gives a failed or new standby BNG-UP a chance to go through one FSG change sequence to reinstall all the hot standby sessions before it can be made active.

The MAG-c can put a BNG-UP in a lockout state for an FSG. When a BNG-UP is in the lockout state, it cannot be made active or standby. Contrary to the other health values, the lockout state is intended to recover from hard failures where it is important that all FSG and related session state is removed from the BNG-UP before it is considered active or standby again. See UP Lockout for more information.

Table 1 provides an overview of the states that are kept for BNG-UPs that have an active association and that are linked to at least one FSG.

Table 1. Summary of BNG-UP states
State Description Sources
health

A value between 0% and 100% or the special value -1 (unavailable)

Indicates the health of the BNG-UP

Aggregation of the per-logical-port and per-network-realm health reports from the BNG-UP

PFCP path management state (for example, headless)

The drain mode configured with the following command:
configure mobile-gateway pdn bng up-group up drain
failed indicator

An indicator that considers the BNG-UP failed if its health is less than the failure threshold

Enables switchovers in more restrictive (for example, non-revertive) scenarios

Based on the health state and the threshold configured with the following command:
configure mobile-gateway profile bng fsg-profile health-calculation failure-threshold
not-ready indicator

An indicator on the standby BNG-UP that does not have all hot standby sessions installed

Kept until the standby BNG-UP has installed the hot standby sessions

Set for each new standby BNG-UP or a standby BNG-UP whose health becomes unavailable (-1)

Removed after the first successful FSG change when the health is 0% or higher

lockout

A failure state in which the BNG-UP cannot be made active or standby

Kept until the BNG-UP is no longer active or standby and a lockout timer has expired

Applied automatically for multiple failure scenarios, see UP Lockout for more information.

Active/standby selection triggers

The MAG-c monitors multiple triggers that can impact the active/standby selection and trigger a potential switchover. Most events are classified as one of the following:
  • recovery (for example, health up)
  • degradation (for example, health down)
When a trigger occurs, the MAG-c:
  • starts a hold timer
  • waits for the hold timer expiry
  • triggers the active/standby selection
A different hold time can be set for recovery and degradation using the following commands respectively:
configure mobile-gateway profile bng fsg-profile active-standby-selection hold-off-on-recovery
configure mobile-gateway profile bng fsg-profile active-standby-selection hold-off-on-degradation
By default, the degradation hold timer is disabled (0 ms) to immediately execute potential switchovers because of failure.
When a trigger occurs while the hold timer is running, the new hold timer is only applied if it is shorter than the one already running. For example, suppose the following events occur with 2 s in between:
  • A health increase triggers a recovery hold timer of 5 s.
  • A health decrease triggers the default degradation hold timer of 0 ms.
Because the second hold timer is shorter than the first one, the MAG-c immediately triggers the active/standby selection for the degradation.

When a trigger occurs while an active/standby change is in progress, the MAG-c ignores the hold timer of the new trigger and re-evaluates the active/standby selection as soon as the in-progress change completes.

The MAG-c treats the following events as a trigger:

  • Any health increase acts as a recovery trigger. The cause of the health increase is irrelevant and may be because of headless recovery, change of the drain configuration of the BNG-UP, or a BNG-UP health report.
  • Any health decrease acts as a degradation trigger.
  • A PFCP association setup acts as a recovery trigger, except if it is the first BNG-UP set up for the FSG.
  • A PFCP association release acts as a degradation trigger, except if it is already the active or standby BNG-UP.
  • A UP lockout acts as a degradation trigger.
  • A UP lockout removal acts as a recovery trigger.
  • The intended FSG state not matching the current FSG state after an FSG event acts as a recovery trigger (see Active/standby change or switchover).

The following exceptional triggers bypass the normal reselection mechanism because of their big impact:

  • The setup of the first PFCP association for an FSG triggers an immediate reselection. The MAG-c does not wait for the expiry of the recovery hold timer. If the PFCP association being set up is not the first association, it acts as a health increase and the MAG-c starts the recovery hold timer.
  • A PFCP association release for the active or standby BNG-UP triggers an immediate reselection, bypassing any hold timers. If an active/standby change is already in progress, the ongoing change is completed first. A PFCP association release for any other BNG-UP acts as a health decrease and the MAG-c starts the degradation hold timer.
  • If all BNG-UPs become headless, the MAG-c does not trigger any reselection. As soon as the first BNG-UP recovers from headless, the MAG-c ignores the recovery hold timer but starts a timer based on the configured path-management heartbeat intervals. The MAG-c triggers reselection of all BNG-UPs when one of the following occurs:
    • the timer based on the configured path-management heartbeat intervals expires
    • 5 s have passed after the last BNG-UP recovered
    Note: This mechanism ensures that after a full connectivity failure, all BNG-UPs have time to recover the PFCP communication. It makes sure that the MAG-c makes decisions based on the full set of recovered BNG-UPs and not on the first recovered BNG-UPs.

Active/standby selection

When an active/standby selection trigger occurs, the MAG-c re-evaluates the selection of the active and standby BNG-UPs for an FSG. If only one BNG-UP with an active association is available, that specific BNG-UP is always selected as the active BNG-UP. Otherwise, both the active and standby BNG-UP can be reselected.

Replacing the active BNG-UP with the current standby BNG-UP works in one of the following basic modes:

  • revertive

    The current standby BNG-UP can be selected as the active BNG-UP even if the active BNG-UP did not fail. The conditions in which the standby BNG-UP can become the active BNG-UP are the same as the conditions to select the standby BNG-UP. Additionally, the standby BNG-UP cannot have the not-ready indicator set.

  • non-revertive

    The current standby BNG-UP can only be selected as the active BNG-UP if the PFCP association of the current active BNG-UP is removed or if the BNG-UP is considered failed (see BNG-UP health determination), or if the BNG-UP is in lockout state (see UP Lockout). Otherwise, the current active BNG-UP is always reselected as the active BNG-UP.

To configure the mode, use the following command:
configure mobile-gateway profile bng fsg-profile active-standby-selection active-change-without-failure

The following command options are available:

  • always

    The MAG-c always uses the revertive mode.

  • never

    The MAG-c always uses the non-revertive mode.

  • initial-only

    The MAG-c uses the revertive behavior for a short period after the first BNG-UP PFCP association for the FSG was set up. After that short period, the MAG-c automatically switches to the non-revertive mode. This option is useful when the non-revertive mode is required but a predictable active/standby BNG-UP is expected during start-up of the BNG-UP and MAG-c; for example, to select the preferred BNG-UP at startup. When the never option is set, the first BNG-UP to come up is always selected as active (and that does not change), independent of its preferred state.

If the standby BNG-UP becomes active, the active BNG-UP automatically becomes standby. The MAG-c takes no further action.

The MAG-c selects a standby BNG-UP independent of the revertive mode configuration.

Both the revertive active BNG-UP and the standby BNG-UP are selected using the following criteria. This is a fall-through list that stops as soon as there is only one BNG-UP that meets all the criteria. Any BNG-UP for which the PFCP association is down or which is in lockout is not considered.

  1. the BNG-UP with the highest health (see BNG-UP health determination)
  2. the preferred BNG-UP
  3. the BNG-UP with the lowest number of sessions, simulated as if the FSG would move to that BNG-UP.
    Note: To avoid unnecessary FSG changes when the number of sessions on several BNG-UPs is very similar, the MAG-c applies a weight multiplier to the FSG session count when it simulates a move to a different BNG-UP than the current one.
  4. the BNG-UP with the lowest amount of FSGs, excluding the current FSG, with the goal to provide initial load-balancing when no sessions are set up.
  5. The current state of the BNG-UP, where the current active BNG-UP has priority over the current standby BNG-UP which then has priority on any backup BNG-UP . This avoids any unnecessary active or standby changes if all else is equal.
  6. the BNG-UP with the lowest IP used in PFCP signaling, with no specific goal other than to have a deterministic tiebreaker when all else is equal

If the result of the active/standby selection differs from the current active/standby selection, the MAG-c initiates an active/standby change.

If the result of the active/standby selection is the same as the current active/standby selection, but the health of any BNG-UP has changed from unavailable (-1) to 0% or higher, the MAG-c initiates an active/standby change.

Otherwise, the MAG-c takes no further action.

Note: The trigger to change the FSG for a recovered BNG-UP (even without an active/standby change) is to guarantee that a BNG-UP has all the PFCP state information after a potential communication failure between the BNG-UP and the MAG-c. The FSG change procedure guarantees that all the FSG states and PFCP session states are correctly downloaded if necessary. For example, when a standby BNG-UP becomes headless, it may miss the FSG updates and session installations and modifications for hot standby sessions. When the BNG-UP is recovered from headless, it becomes not ready (see section BNG-UP health determination). The active/standby state does not change, but the MAG-c triggers an FSG change procedure so that the latest FSG and session state are installed on the BNG-UP. After the FSG change, the MAG-c removes the not-ready indicator from the BNG-UP and the standby BNG-UP is again ready to fully take over.

Active/standby change or switchover

If the active/standby selection results in a new active or new standby BNG-UP, the MAG-c executes the change on the BNG-UPs as follows:

  1. The MAG-c updates the PFCP FSG state on all involved BNG-UPs.

    The change procedure ends if the active BNG-UP does not positively confirm. If the active BNG-UP change times out or explicitly returns an error, the MAG-c rolls back the changed FSG states and stops the active/standby change procedure.

    Changes to other BNG-UPs (for example, standby BNG-UPs) may fail. This is even expected in some cases; for example, in 1:1 deployments where the previously active BNG-UP has failed and becomes standby, the failed BNG-UP is not expected to respond.

    A BNG-UP that explicitly rejects an explicit FSG update is put into lockout. This triggers a degradation reselection, which is handled as soon as the change is completed. See UP Lockout for more information.

  2. When the active BNG-UP confirms the FSG change, the MAG-c starts updating the PFCP session states. The exact update for each session depends on the change and the session resiliency model as follows:
    • Warm standby, active/standby switch

      The MAG-c establishes the session on the new active BNG-UP and deletes it from the previous active BNG-UP.

    • Warm standby, new standby BNG-UP

      No updates to the BNG-UPs are needed.

    • Warm standby, health change only

      No updates to the BNG-UPs are needed.

    • Hot standby, active/standby switch

      No updates to the BNG-UPs are needed.

    • Hot standby, new standby BNG-UP

      The MAG-c establishes the session on the new standby BNG-UP and deletes it from the previous standby BNG-UP if there was one.

    • Hot standby, health change only

      This acts as a trigger to reinstall missing standby sessions on the standby BNG-UP.

  3. When the standby BNG-UP confirms the FSG change, the MAG-c sends a second FSG update message to the active BNG-UP without changing anything. This can be done in parallel with the previous step. The second FSG update message may seem redundant, but is required to resolve a rare race condition in the GARP/ARP signaling for fixed access connections.
  4. When the change is completed, the MAG-c evaluates whether the current active/standby state matches the expected active/standby state by running the selection logic again (see Active/standby selection). If the states do not match, the MAG-c automatically triggers a recovery reselection and starts the recovery hold timer (see Active/standby selection triggers).

GARP/ARP race conditions

Fixed access connections use per-FSG MAC addresses to attract traffic (see Traffic steering parameters). Most Layer 2 aggregation switches keep a forwarding database (FDB) that points each gateway MAC address to the correct BNG-UP to avoid broadcasting traffic. The FDBs are populated by snooping ARP and ND messages. To expedite updates of the FDBs during active/standby switchovers, the Nokia BNG-UP generates a gratuitous ARP (GARP) message with the FSG MAC address when the FSG is signaled to become active. However, in a very exceptional case, a single GARP is not enough when the following conditions apply:

  • The new standby BNG-UP has not yet processed the message that asks it to become standby.
  • A regular ARP is sent and broadcast as normal.
  • Both BNG-UPs answer, and the ARP response from the new standby BNG-UP comes later than the ARP response of the new active BNG-UP.

If the preceding conditions apply, the Layer 2 aggregation switch has a wrong FDB entry. Sending a second update to the new active BNG-UP can act as a new GARP trigger to correct the situation. GARP race conditions shows this case.

Note: The second update is a very lightweight operation as no actual FSG changes need to occur. It only acts as a GARP trigger. The BNG-UP may not have any action to perform if it does not need to send GARPs; for example, on aggregation networks where the FDBs are populated out-of-band such as EVPN networks.
Figure 8. GARP race conditions

UP Lockout

To handle FSG failure scenarios, the MAG-c can put a specific BNG-UP in lockout for that FSG. The following example scenarios trigger lockout:
  • an explicit FSG error from the BNG-UP when signaling an FSG create, modify, or delete
  • the path of a BNG-UP goes down, in addition to setting its health to -1 (unavailable) (see BNG-UP health determination)

The MAG-c sees putting a BNG-UP in lockout as a degradation trigger for the FSG (see Active/standby selection triggers). The MAG-c attempts to remove the locked out BNG-UP from being selected as either active or standby (see Active/standby selection).

Because many failure scenarios do not have an automatic recovery signal, the lockout is subject to a timer. For explicit FSG errors, use the following command to configure the lockout timer:
configure mobile-gateway profile bng fsg-profile active-standby-selection failure-lockout
For other scenarios, the lockout timer is set to a fixed value, typically equal to the minimal configurable value. When the lockout timer expires, the MAG-c takes one of the following actions:
  • If the BNG-UP is not active or standby for the FSG, the MAG-c removes the lockout state and triggers a recovery reselection for the FSG.
  • Otherwise, the MAG-c restarts the lockout timer with a fixed value and takes no further action. This guarantees that the BNG-UP is removed from the FSG at least once and starts from a clean slate before it can be made active or standby again.

Warm and hot standby

Warm and hot standby in BNG-UP resiliency is a per-session concept that defines how a session is handled on the standby BNG-UP:

  • Warm standby sessions are created on the standby BNG-UP when the BNG-UP becomes active. The sessions are not precreated on the standby BNG-UP. This saves resources on the standby BNG-UP, but it takes a significantly longer time during which there is no forwarding capability for those sessions.
  • Hot standby sessions are precreated on the standby BNG-UP. As soon as the BNG-UP becomes active, it can start forwarding traffic for those sessions. While this consumes more resources on the standby BNG-UP, it can offer significantly reduced forwarding loss during switchovers. Depending on the capabilities of the aggregation network, it may even be possible to achieve non-loss planned switchovers; for example, to seamlessly handle BNG-UP upgrades.

For hot standby, any procedure that interacts with a BNG-UP change (for example, a CoA with a QoS update) first applies the change on the active BNG-UP. If the change succeeds, the procedure continues as usual and updates the standby BNG-UP in parallel. In the unlikely event that only the standby BNG-UP update fails, the MAG-c does not fail the triggering procedure. Instead, it tries to reapply the update periodically in the background until the standby BNG-UP is realigned with the active BNG-UP. If this realignment is not resolved when the standby BNG-UP becomes active, the MAG-c does one final attempt to update the session state and if not successful, locally removes the full session.

By default, the MAG-c creates a session in an FSG scope always in the hot standby mode. To change the default at a per-FSG level, use the following command:
configure mobile-gateway profile bng fsg-profile default-standby-mode
To overwrite the default at a per-session level, use the following command:
configure mobile-gateway profile authentication-database entry resiliency standby-mode
See BNG EP and ADB lookup for more information about how the MAG-c chooses an ADB entry.
WARNING: At large scale and depending on the install rate of the involved BNG-UPs, it can take a long time for warm standby sessions to switch over. Timers such as the PPP keepalive, DHCP lease times, and RA lifetimes may time out before the switchover is completed if they are set too short.

Interaction with headless mode

BNG-UP resiliency is supported in combination with the BNG-UP headless mode (see Headless mode). When a BNG-UP becomes headless, its health becomes unavailable (-1) because the MAG-c cannot differentiate between a BNG-UP toward which communication failed (headless) or a BNG-UP that completely failed. See BNG-UP health determination for more information.

A BNG-UP becoming headless acts as a trigger to perform a potential switchover from active to standby. A switchover cannot be signaled to the headless BNG-UP, which operates on stale data. The Nokia BNG-UP, by default, keeps its FSG state from before becoming headless. As a result, it is possible that there is an active/active forwarding situation in which both the headless and non-headless BNG-UPs of an FSG have an active state. In this scenario, the following applies:

  • The MAG-c does not act on any LCP keepalive failure reports coming from the BNG-UP because it is possible that the headless BNG-UP is handling the keepalives. After the headless recovery, failures are again handled as normal.
  • QoS cannot always be guaranteed because traffic may switch from one BNG-UP to the other at any time. After headless recovery, the active/standby situation stabilizes and traffic flows through only one BNG-UP with normal QoS guarantees.
  • During headless, accounting reports may be off because traffic on the headless BNG-UP is not counted. After headless recovery, the MAG-c can fetch the missing statistics and the accounting is corrected.
  • If there is unicast replication in the access network, these packets may end up being replicated also in the data network.
If the consequences of the active/active state are not wanted, the Nokia BNG-UP can be configured to always automatically make any FSG standby when the headless conditions occur. This configuration avoids an active/active state, and one of following scenarios occurs:
  • When a single BNG-UP is headless, that BNG-UP makes its FSGs standby and the MAG-c makes the other BNG-UP active. This results in an active/standby state as expected.
  • When both BNG-UPs are headless; for example, because of a networking issue at the MAG-c, the FSG becomes standby on all BNG-UPs and all traffic is dropped.

As an exception, in case all BNG-UPs go headless simultaneously, the MAG-c does not take any action on the FSGs. This may occur when the common network infrastructure between all BNG-UPs and the MAG-c fails. To avoid moving all FSGs to the first BNG-UP to recover, the MAG-c initiates a custom hold timer after the first BNG-UP recovers. Only after this timer expires or after all BNG-UP paths recover and send their health, the MAG-c restarts the FSG selection procedure. This mechanism avoids needless FSG switches because BNG-UP failure is unlikely in this scenario.

Note: Because of the consequences of headless mode as described above, Nokia recommends reducing the path restoration time to the absolute minimum that is acceptable for the planned deployment.

Operational commands

The same operational commands can be used for resilient sessions as for non-resilient sessions. See Operational commands and debugging.

The MAG-c supports the following commands to display information about BNG-UPs, UP groups, FSGs, and the relationship between them. The commands are in the following context:
show mobile-gateway pdn bng
  • up-group

    This command provides an overview of all configured UP groups, and the number of sessions and BNG-UPs that are currently active in the UP groups.

  • up-group group-name

    This command provides an overview of the specific UP group, the associated list of FSGs, and the associated BNG-UPs.

  • fsg
    This command provides an overview of the UP group of the specific FSG and the current active/standby BNG-UPs. This is useful to quickly retrieve information when a specific FSG ID is known. The following are examples to get a specific FSG ID:
    • from a command in the following context:
      show mobile-gateway bng session
    • from a BNG-UP-specific operational command
  • up

    This command provides an overview of all BNG-UPs and their UP group participation. The applicable Layer 2 access ID, the applicable VLAN range, and the active/standby state for the BNG-UP in the UP group are displayed.