3G/4G MATRIXX Policy Failover

A MATRIXX Policy failover can be due to a hard failure or can be planned.

When a hard failure occurs, only one engine remains functional. During a planned failover, both engines remain functional.

Sy Hard Failure

During a hard failure of the primary site, the PCRF receives no responses from the MATRIXX Engine. Examples of a hard failure include a forced shutdown of the engine or a catastrophic failure. In this case, the Diameter Watchdog timers and session timers expire, and the Diameter client must direct new and existing Diameter messages to the secondary site. The secondary site holds the existing information from the failed primary site and responds to existing and new Diameter messages from the PCRF. While the primary site is failing or shutting down, the MATRIXX Engine might still send Diameter result code 5012 (DIAMETER_UNABLE_TO_COMPLY), and Diameter Watchdog messages might still be exchanged.

Figure 1. Sy Hard Failure
Figure 1 shows the following sequence of events for a hard failover:
  1. The PCRF sends a Diameter SLR command and establishes an Sy session for the subscriber. The request also includes the SL-Request-Type AVP which is set to the value INITIAL_REQUEST (0).
  2. The MATRIXX Engine sends a Diameter SLA command to the PCRF. The MATRIXX Engine at Site A fails and is not functional.
  3. The PCRF decides to modify the list of subscribed policy counters and sends a Diameter SLR command with the SL-Request-Type AVP set to the value INTERMEDIATE_REQUEST (1).
  4. The Tx timer expires with no response and the PCRF sends the Diameter SLR to the MATRIXX Engine at Site B.
  5. The MATRIXX Engine at Site B sends a successful Diameter SLA command to the PCRF. When repeated Device-Watchdog-Request (DWR) messages to Site A receive no response, the PCRF must initiate a failover to Site B.
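The client-side behavior in steps 3 to 5 can be sketched as follows. This is a minimal illustration, not PCRF or MATRIXX code: the `Slr` class, the `Transport` callable, and the site names are hypothetical, and a return value of `None` stands in for the Tx timer expiring without an answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# SL-Request-Type values named in the sequence above.
INITIAL_REQUEST = 0
INTERMEDIATE_REQUEST = 1
DIAMETER_SUCCESS = 2001  # standard Diameter success result code


@dataclass
class Slr:
    """Hypothetical stand-in for a Spending-Limit-Request."""
    session_id: str
    request_type: int


# Hypothetical transport: returns the SLA result code, or None when the
# Tx timer expires with no response from the site.
Transport = Callable[[str, Slr], Optional[int]]


def send_with_failover(
    slr: Slr, sites: Sequence[str], transport: Transport
) -> Tuple[str, int]:
    """Send the SLR to each site in order until one answers."""
    for site in sites:
        result_code = transport(site, slr)
        if result_code is not None:
            return site, result_code
        # No answer before the Tx timer expired: try the next site.
    raise TimeoutError("no site answered the SLR")


def site_a_failed(site: str, slr: Slr) -> Optional[int]:
    """Simulated network in which Site A is down and Site B is healthy."""
    return None if site == "site-a" else DIAMETER_SUCCESS


answering_site, code = send_with_failover(
    Slr("session-1", INTERMEDIATE_REQUEST), ["site-a", "site-b"], site_a_failed
)
print(answering_site, code)
```

A real client would also run the watchdog exchange in parallel and mark Site A as unavailable once its DWR messages go unanswered, so that later requests skip Site A without waiting for another timeout.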

Sy Planned Failover

During a planned failover, the secondary site (Site B) becomes the primary site and the primary site (Site A) becomes the secondary site. After the switch, the PCRF receives a Diameter 3002 (DIAMETER_UNABLE_TO_DELIVER) message from Site A. After receiving this Diameter result code, the PCRF must resend the Diameter message to Site B. The Site B MATRIXX Engine responds with a success message. Any SLR Initial messages sent to Site A for new sessions are rejected, so the PCRF should send all SLR Initial messages to Site B.

Figure 2. Sy Planned Failover
Figure 2 shows the following sequence of events for a planned failover:
  1. The PCRF sends a Diameter SLR command and establishes an Sy session for the subscriber. The request includes the SL-Request-Type AVP set to the value INITIAL_REQUEST (0).
  2. The MATRIXX Engine sends a Diameter SLA command to the PCRF. Site A changes roles and is now a secondary site. It will respond successfully to Watchdog requests and CER messages but will respond with Diameter result code 3002 (DIAMETER_UNABLE_TO_DELIVER) to Diameter policy requests.
  3. The PCRF modifies the list of subscribed policy counters and sends a Diameter SLR command with the SL-Request-Type AVP set to the value INTERMEDIATE_REQUEST (1).
  4. The Site A MATRIXX Engine returns an SLA with Diameter result code 3002 (DIAMETER_UNABLE_TO_DELIVER).
  5. The PCRF resends the SLR command to the Site B MATRIXX Engine.
  6. The Site B MATRIXX Engine successfully responds with an SLA.
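The resend logic in steps 4 to 6 can be sketched as follows, together with the recommendation to send later SLR Initial messages to Site B. This is an illustrative sketch only: the `SyRouter` class, the `Transport` callable, and the site names are hypothetical and do not come from any MATRIXX or PCRF API.

```python
from typing import Callable, List, Tuple

DIAMETER_SUCCESS = 2001
DIAMETER_UNABLE_TO_DELIVER = 3002

# Hypothetical transport: sends a request to a site and returns the
# result code carried in the answer.
Transport = Callable[[str, str], int]


class SyRouter:
    """Remembers which site currently serves policy requests."""

    def __init__(self, sites: List[str], transport: Transport) -> None:
        self.sites = list(sites)  # preferred site first
        self.transport = transport

    def send(self, request: str) -> Tuple[str, int]:
        for index, site in enumerate(self.sites):
            code = self.transport(site, request)
            if code == DIAMETER_UNABLE_TO_DELIVER:
                continue  # this site is now secondary: resend elsewhere
            # Promote the answering site so later requests go there first.
            self.sites.insert(0, self.sites.pop(index))
            return site, code
        raise RuntimeError("every site answered 3002")


attempts: List[str] = []


def after_role_switch(site: str, request: str) -> int:
    """Simulated network in which Site A has become the secondary site."""
    attempts.append(site)
    if site == "site-a":
        return DIAMETER_UNABLE_TO_DELIVER
    return DIAMETER_SUCCESS


router = SyRouter(["site-a", "site-b"], after_role_switch)
first = router.send("SLR intermediate")  # tries site-a, then site-b
second = router.send("SLR initial")      # goes straight to site-b
```

Promoting the answering site is one possible design choice; it avoids paying for a rejected request on every message after the role change.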

Diameter Gateway Disconnect

Configure the network connection to the Diameter Gateway to disconnect from the standby cluster during an engine switch-over operation, and then reconnect when the cluster becomes active. This disconnect/reconnect operation enables the Sy interface to re-establish a connection after an engine switch-over.

To enable this behavior, integrators must add a sed file to the MATRIXX Engine configuration. For more information, see Installation and Configuration.
Note: The disconnect/reconnect operation applies to all Diameter interfaces, including Diameter Gy and Gx. The default High Availability configuration allows Diameter connectivity to and from Diameter clients (Diameter Capabilities-Exchange-Request and Device-Watchdog-Request messages) for both primary and secondary engines. The secondary engine rejects application-level messages (such as an Sy Spending-Limit-Request or a Gy Credit-Control-Request) with a configurable result code. The default result code is 3002.
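The default behavior described in the note can be summarized in a few lines. This is a sketch of the documented rule, not the engine's implementation: the function name, the message labels, and the `reject_code` parameter are hypothetical.

```python
DIAMETER_SUCCESS = 2001
DEFAULT_REJECT_CODE = 3002  # DIAMETER_UNABLE_TO_DELIVER

# Connectivity messages that both primary and secondary engines answer.
CONNECTIVITY_MESSAGES = {"CER", "DWR"}


def secondary_result_code(message: str, reject_code: int = DEFAULT_REJECT_CODE) -> int:
    """Result code a secondary engine returns for a message type."""
    if message in CONNECTIVITY_MESSAGES:
        return DIAMETER_SUCCESS
    # Application-level messages (for example SLR or CCR) are rejected.
    return reject_code


results = {msg: secondary_result_code(msg) for msg in ("CER", "DWR", "SLR", "CCR")}
print(results)
```

Passing a different `reject_code` models the configurable result code mentioned in the note.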
Figure 3 shows what happens when a MATRIXX Engine transitions from ACTIVE to STANDBY mode. The Cluster Manager disconnects any active Diameter connections using a TCP RST and, while in STANDBY mode, rejects any subsequent requests from Diameter clients to establish Diameter connections. A PCRF automatically fails over from its configured primary Sy MATRIXX Engine to its configured secondary Sy MATRIXX Engine if the secondary engine assumes the ACTIVE role. If the primary engine becomes available but is still in STANDBY mode, it rejects all attempts by the PCRF to establish connectivity.
Note: During normal operation, the MATRIXX Engine in STANDBY mode allows Diameter connections.
Figure 3. Diameter Gateway Disconnect Failover

Figure 4 provides another illustration of the high availability Diameter Gateway disconnect failover. The black bar between the secondary MATRIXX Engine and the Traffic Routing Agent (TRA) indicates that all attempts to establish Diameter connections are rejected.

Figure 4. High Availability Diameter Gateway Disconnect Failover

When Diameter connections on the STANDBY engine are disabled, only the ACTIVE engine accepts Diameter connections from the network. In the event of a switch-over, the ACTIVE engine becomes the STANDBY engine, its existing connections are closed, and a TCP RST is sent to the Diameter client. The RST sent to the network client is generated by the TRA.
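From the Diameter client's point of view, a reset or refused connection is the signal to connect to the other engine. The sketch below shows one way a client might choose a peer after such an event. It is an assumption-laden illustration: the function names and host names are hypothetical, and the `connect` parameter exists only so that the logic can be exercised without a network. Port 3868 is the standard Diameter port.

```python
import socket
from typing import Callable, Sequence, Tuple

Peer = Tuple[str, int]
DIAMETER_PORT = 3868  # standard Diameter port


def tcp_connect(peer: Peer) -> socket.socket:
    """Default connector: open a TCP connection with a short timeout."""
    return socket.create_connection(peer, timeout=5)


def connect_to_active(
    peers: Sequence[Peer], connect: Callable[[Peer], object] = tcp_connect
) -> Tuple[Peer, object]:
    """Return (peer, connection) for the first peer that accepts a connection."""
    for peer in peers:
        try:
            return peer, connect(peer)
        except OSError:
            # Refused, reset, or timed out: this engine is not accepting
            # Diameter connections, so try the next configured peer.
            continue
    raise ConnectionError("no engine accepted a Diameter connection")


def fake_connect(peer: Peer) -> str:
    """Simulated network in which engine A is in STANDBY mode."""
    if peer[0] == "engine-a.example":
        raise ConnectionRefusedError("engine is in STANDBY mode")
    return f"connection to {peer[0]}"


peer, conn = connect_to_active(
    [("engine-a.example", DIAMETER_PORT), ("engine-b.example", DIAMETER_PORT)],
    fake_connect,
)
print(peer, conn)
```

A production client would normally retry the list on a back-off timer, because neither engine accepts connections for a short time during the switch-over itself.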