Configuring Health-Based Monitoring and Throttling for TRA

In load-balancer mode, the Traffic Routing Agent (TRA-PROC) pool monitoring and traffic routing functions can take the health status of downstream nodes into consideration.

The TRA-PROC can optionally exclude unhealthy MATRIXX Engine servers or nodes and throttle client requests based on information collected by the System Monitor service running on the engine.

Important: These features require comprehensive pre-production evaluation prior to deployment.

For more information, see the discussion about the System Monitor in MATRIXX Architecture.

These features are applicable to Diameter and MDC traffic only. They are optional and disabled by default.

Excluding Unhealthy Nodes

If usage level thresholds are exceeded, nodes can be excluded from traffic forwarding. When usage drops below threshold levels, a previously unhealthy node is considered healthy and is included again in forwarding.

Enable health monitoring in the tra_config.xml file or the tra_config_network_topology.xml file with a health profile similar to the following example:

<health-state-profiles>
    <health-state-profile name="foo" log-throttle-msec="1000" monitoring-serviceability="true">
        <threshold entry="60" exit="50"/>
    </health-state-profile>
</health-state-profiles>

In the <health-state-profile> element:

The name attribute is a unique identifier for the profile.
The log-throttle-msec attribute (optional) specifies how frequently, in milliseconds, health state transition-related messages can be logged, to prevent rapid state transitions from causing over-saturation of logs.
The monitoring-serviceability attribute, set to true by default, controls monitoring of the System Monitor serviceability state for the node.
The entry attribute of the threshold element is the usage level at which a node is considered unhealthy and excluded from traffic forwarding.
The exit attribute of the threshold element is the usage level at which a node currently considered unhealthy is considered healthy again and eligible for traffic forwarding.

A health profile must either include the monitoring-serviceability attribute set to true to enable serviceability state tracking or it must include a threshold element for usage tracking.
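
The rule above can be checked before a configuration is deployed. The following Python sketch is illustrative only and is not part of the TRA tooling: the function name profile_is_valid is hypothetical, and it assumes that an omitted monitoring-serviceability attribute takes the documented default of true.

```python
import xml.etree.ElementTree as ET


def profile_is_valid(profile_xml: str) -> bool:
    """Return True if the profile tracks serviceability state or usage."""
    profile = ET.fromstring(profile_xml)
    # Assumption: a missing attribute falls back to the documented default (true).
    flag = profile.get("monitoring-serviceability", "true")
    tracks_serviceability = flag.strip().lower() == "true"
    # A threshold child element enables usage tracking.
    tracks_usage = profile.find("threshold") is not None
    return tracks_serviceability or tracks_usage


VALID = (
    '<health-state-profile name="foo" monitoring-serviceability="false">'
    '<threshold entry="60" exit="50"/>'
    "</health-state-profile>"
)
INVALID = '<health-state-profile name="foo" monitoring-serviceability="false"/>'

print(profile_is_valid(VALID), profile_is_valid(INVALID))
```

The second profile is rejected because it disables serviceability tracking and defines no threshold element, so it would monitor nothing.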

A node is considered unhealthy for traffic forwarding if:

The monitoring-serviceability attribute is set to true and the serviceability state is seen to be out-of-service (OOS).
The usage level crosses the entry value specified in the threshold element.
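
The entry and exit values form a hysteresis band: a node that becomes unhealthy at the entry level stays unhealthy until usage falls to the exit level. The following Python sketch models that behavior under stated assumptions and does not describe the TRA-PROC implementation. The class name NodeHealthTracker is hypothetical, and the boundary comparisons (at or above entry, at or below exit) are assumptions, because the text does not say whether the thresholds are inclusive.

```python
from dataclasses import dataclass


@dataclass
class NodeHealthTracker:
    entry_level: int = 60           # threshold entry attribute
    exit_level: int = 50            # threshold exit attribute
    monitoring_serviceability: bool = True
    over_threshold: bool = False    # remembered between usage reports

    def eligible(self, usage: int, in_service: bool = True) -> bool:
        """Return True if the node may receive forwarded traffic."""
        if not self.over_threshold and usage >= self.entry_level:
            self.over_threshold = True      # assumed inclusive entry check
        elif self.over_threshold and usage <= self.exit_level:
            self.over_threshold = False     # assumed inclusive exit check
        # The OOS state only matters when serviceability monitoring is enabled.
        out_of_service = self.monitoring_serviceability and not in_service
        return not (self.over_threshold or out_of_service)


tracker = NodeHealthTracker()
history = [tracker.eligible(usage) for usage in (55, 65, 55, 45)]
print(history)
```

With the example profile values, a usage of 55 is acceptable before the node has crossed 60, but the same value keeps the node excluded after it has crossed 60, until usage falls to 50 or lower.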

Explicitly set the health profile for pools of monitor type cmi-node-active or cmi-node-active-cluster-active as shown in the following cmi-node-active example:

<pool balance-method="round-robin" health-state-profile="foo" monitor="cmi-node-active" monitor-port="4800" name="procPool">
      <node address="10.10.10.101" id="1" name="b1"/>
      <node address="10.10.10.102" id="2" name="b2"/>
</pool>

Client Request Throttling

The TRA-PROC supports throttling of client requests based on the processing pool usage level. Depending on the configuration, processing pool nodes are monitored for usage level, and an overall usage level is assigned to the pool. When a pool usage threshold is crossed, a virtual server (VS) is set to a throttling state, if so configured.

The throttling profile defines the throttling rate. All connection instances from this VS will throttle per the configured rate if throttling is triggered.

Enable throttling in the tra_config.xml file or the tra_config_network_topology.xml file with a throttling profile similar to the following example:

<pool-usage-throttle-profiles>
   <pool-usage-throttle-profile name="foo">
     <threshold entry="90" exit="80" request-drop-every="5"/>
   </pool-usage-throttle-profile>
</pool-usage-throttle-profiles>

where name is the name of the profile.

In the <pool-usage-throttle-profile> element:

The name attribute is a unique identifier for the profile.
The entry attribute of the threshold element is the pool usage level at which throttling is triggered.
The exit attribute of the threshold element is the pool usage level at which throttling is stopped.
The request-drop-every attribute of the threshold element defines the throttling rate. For example, a value of 5 means every fifth request is discarded.
The use_pause_for_throttle attribute of the threshold element (not shown above) is an alternative to the behavior configured by the request-drop-every attribute. When use_pause_for_throttle is set to true and the pool usage level specified with the entry attribute is reached, the upstream connection is paused until the pool usage has dropped to the level specified with the exit attribute.
Note: The request-drop-every and use_pause_for_throttle attributes are mutually exclusive.
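
The interaction between the entry, exit, and request-drop-every attributes can be sketched in Python as follows. This is a simplified model for illustration, not the TRA-PROC implementation. The class name PoolThrottle is hypothetical, the inclusive threshold comparisons are assumptions, and restarting the request count each time throttling begins is also an assumption. The use_pause_for_throttle alternative is not modeled.

```python
class PoolThrottle:
    """Drops every Nth request while pool usage is in the throttling band."""

    def __init__(self, entry_level=90, exit_level=80, request_drop_every=5):
        self.entry_level = entry_level
        self.exit_level = exit_level
        self.request_drop_every = request_drop_every
        self.throttling = False
        self.count = 0

    def update_usage(self, usage):
        """Apply a new pool usage level to the throttling state."""
        if not self.throttling and usage >= self.entry_level:
            self.throttling = True
            self.count = 0          # assumption: counting restarts on entry
        elif self.throttling and usage <= self.exit_level:
            self.throttling = False

    def admit(self):
        """Return True to forward the request, False to discard it."""
        if not self.throttling:
            return True
        self.count += 1
        return self.count % self.request_drop_every != 0


throttle = PoolThrottle()
throttle.update_usage(95)
decisions = [throttle.admit() for _ in range(10)]
print(decisions)
```

With request-drop-every set to 5, the fifth and tenth requests in this run are discarded, which corresponds to 20 percent of requests while throttling is active.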

Identify the pool usage throttle profile in the tra_config.xml file or the tra_config_network_topology.xml file in a new or existing VS options configuration (vsopt) element with the pool-usage-throttle-profile-name attribute.

<virtual-servers-options>
      <vsopt name="bar"
             pool-usage-throttle-profile-name="foo"/>
</virtual-servers-options>

Set the VS options when defining a VS.

<vs name="procDiam" pool="procPool" port="3868" protocol="diameter" vip="vip1" vsopt="bar"/>

For more information on <virtual-servers-options> and <vsopt> elements, see the discussion about TRA VS Protocol Elements.

Note: Throttling is supported only for the diameter and mdc protocols, and the associated pool must be of cmi-node-active monitor type.

SNMP Statistics for Health-Based Routing and Throttling

The print_snmp_stats.py script reports the current usage level of each TRA node.

If a pool for a node does not have a health state profile defined, and the virtual server for the node does not use throttling, then the usage level is reported as not-applicable.

If a pool for a node has a health state profile defined and the virtual server for the node uses throttling, but the node reports inactive, then the usage level is reported as not-applicable.

If a pool for a node has a health state profile defined or the virtual server for the node uses throttling, and the node reports a valid usage level, then the usage level is reported.
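
The three reporting rules above can be summarized as a single decision function. The following Python sketch is one reading of those rules, not the logic of the print_snmp_stats.py script. The function and parameter names are hypothetical, and it assumes that an inactive node is reported as not-applicable when either feature is configured, not only when both are.

```python
from typing import Optional, Union

NOT_APPLICABLE = "not-applicable"


def reported_usage(
    has_health_profile: bool,
    uses_throttling: bool,
    node_active: bool,
    usage: Optional[int],
) -> Union[int, str]:
    """Return the usage level to report for a node, or not-applicable."""
    # Neither feature is configured, so usage is not tracked for this node.
    if not (has_health_profile or uses_throttling):
        return NOT_APPLICABLE
    # Assumption: an inactive node, or one with no valid level, has no usage to report.
    if not node_active or usage is None:
        return NOT_APPLICABLE
    return usage


print(reported_usage(True, False, True, 42))
print(reported_usage(False, False, True, 42))
```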

Additionally, if Health State Monitoring is enabled, an SNMP ServiceState statistic reports, for each TRA node, whether the node is unhealthy. There are also two global SNMP counters that report the total number of Diameter and MDC messages that have been throttled. For more information on reported SNMP statistics, see the discussion about the print_snmp_stats.py script.