Configuring Health-Based Monitoring and Throttling for TRA
In load-balancer mode, the Traffic Routing Agent (TRA-PROC) pool monitoring and traffic routing functions can take the health status of downstream nodes into consideration.
The TRA-PROC can optionally exclude unhealthy MATRIXX Engine servers or nodes and throttle client requests based on information collected by the System Monitor service running on the engine.
For more information, see the discussion about the System Monitor in MATRIXX Architecture.
These features are applicable to Diameter and MDC traffic only. They are optional and disabled by default.
Excluding Unhealthy Nodes
If usage level thresholds are exceeded, nodes can be excluded from traffic forwarding. When usage drops below threshold levels, a previously unhealthy node is considered healthy and is included again in forwarding.
Enable health monitoring in the tra_config.xml file or the tra_config_network_topology.xml file with a health profile similar to the following example:
<health-state-profiles>
<health-state-profile name="foo" log-throttle-msec="1000" monitoring-serviceability="true">
<threshold entry="60" exit="50"/>
</health-state-profile>
</health-state-profiles>
In the <health-state-profile> element:
- The name attribute is a unique identifier for the profile.
- The log-throttle-msec attribute (optional) specifies how frequently, in milliseconds, messages about health state transitions can be logged, so that rapid state transitions do not flood the logs.
- The monitoring-serviceability attribute, set to true by default, controls monitoring of the System Monitor serviceability state for the node.
- The entry attribute of the threshold element is the usage level at which a node is considered unhealthy and excluded from traffic forwarding.
- The exit attribute of the threshold element is the usage level at which a node currently considered unhealthy is considered healthy again and eligible for traffic forwarding.
A health profile must either include the monitoring-serviceability attribute set to true to enable serviceability state tracking, or include a threshold element for usage tracking.
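The rule above can be checked before a configuration is deployed. The following is a minimal sketch using the Python standard library, not a MATRIXX tool. The function names and the second sample profile ("empty") are hypothetical, and the sketch assumes that an absent monitoring-serviceability attribute counts as true, because the attribute is described as true by default.

```python
import xml.etree.ElementTree as ET

# Sample input: the first profile mirrors the example in this section;
# the second ("empty") is a hypothetical profile that tracks nothing.
SAMPLE = """
<health-state-profiles>
  <health-state-profile name="foo" log-throttle-msec="1000" monitoring-serviceability="true">
    <threshold entry="60" exit="50"/>
  </health-state-profile>
  <health-state-profile name="empty" monitoring-serviceability="false"/>
</health-state-profiles>
"""


def profile_is_valid(profile):
    """Return True if the profile enables at least one kind of tracking."""
    # Assumption: a missing attribute is treated as the default value "true".
    value = profile.get("monitoring-serviceability", "true")
    serviceability = value.strip().lower() == "true"
    # Compare with None explicitly: an element without children is falsy.
    has_threshold = profile.find("threshold") is not None
    return serviceability or has_threshold


def invalid_profiles(xml_text):
    """Return the names of profiles that track neither state nor usage."""
    root = ET.fromstring(xml_text)
    return [
        p.get("name")
        for p in root.iter("health-state-profile")
        if not profile_is_valid(p)
    ]


print(invalid_profiles(SAMPLE))  # prints ['empty']
```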
A node is considered unhealthy for traffic forwarding if:
- The monitoring-serviceability attribute is set to true and the serviceability state is seen to be out-of-service (OOS).
- The usage level crosses the entry value specified in the threshold element.
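The entry and exit values form a hysteresis band: a node that becomes unhealthy at the entry level stays unhealthy until usage falls back to the exit level. The following is a minimal sketch of that decision logic, not the TRA-PROC implementation. The class and field names are hypothetical, and treating a usage level exactly equal to a threshold as having reached it is an assumption.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Hypothetical model of one node's health under a health profile."""
    entry_level: int = 60            # threshold entry attribute
    exit_level: int = 50             # threshold exit attribute
    monitoring_serviceability: bool = True
    over_threshold: bool = False     # remembered between updates

    def update(self, usage, in_service):
        """Record a new reading and return True if the node is healthy."""
        # Assumption: reaching a threshold exactly counts as crossing it.
        if not self.over_threshold and usage >= self.entry_level:
            self.over_threshold = True
        elif self.over_threshold and usage <= self.exit_level:
            self.over_threshold = False
        # An out-of-service node is unhealthy only if serviceability
        # monitoring is enabled in the profile.
        out_of_service = self.monitoring_serviceability and not in_service
        return not (self.over_threshold or out_of_service)


node = NodeHealth()
for usage in (55, 65, 55, 45):
    print(usage, node.update(usage, in_service=True))
```

With the defaults above, a reading of 55 after a reading of 65 still leaves the node unhealthy, because usage has not yet dropped to the exit level of 50.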
Explicitly set the health profile for pools of monitor type cmi-node-active or cmi-node-active-cluster-active, as shown in the following cmi-node-active example:
<pool balance-method="round-robin" health-state-profile="foo" monitor="cmi-node-active" monitor-port="4800" name="procPool">
<node address="10.10.10.101" id="1" name="b1"/>
<node address="10.10.10.102" id="2" name="b2"/>
</pool>
Client Request Throttling
The TRA-PROC supports throttling of client requests based on processing pool usage level. Depending on user configuration, processing pool nodes are monitored for usage level, and an overall usage level is set to the pool. When a pool usage threshold is crossed, a virtual server (VS) is set to a throttling state, if so configured.
The throttling profile defines the throttling rate. All connection instances from this VS will throttle per the configured rate if throttling is triggered.
Enable throttling in the tra_config.xml file or the tra_config_network_topology.xml file with a throttling profile similar to the following example:
<pool-usage-throttle-profiles>
<pool-usage-throttle-profile name="foo">
<threshold entry="90" exit="80" request-drop-every="5"/>
</pool-usage-throttle-profile>
</pool-usage-throttle-profiles>
where name is the name of the profile.
In the <pool-usage-throttle-profile> element:
- The name attribute is a unique identifier for the profile.
- The entry attribute of the threshold element is the pool usage level at which throttling is triggered.
- The exit attribute of the threshold element is the pool usage level at which throttling is stopped.
- The request-drop-every attribute of the threshold element defines the throttling rate. For example, a value of 5 means every fifth request is discarded.
- The use_pause_for_throttle attribute of the threshold element (not shown above) is an alternative to the behavior configured by the request-drop-every attribute. When use_pause_for_throttle is set to true and the pool usage level specified with the entry attribute is reached, the upstream connection is paused until the pool usage has dropped to the level specified with the exit attribute.
  Note: The request-drop-every and use_pause_for_throttle attributes are mutually exclusive.
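The request-drop-every behavior can be pictured as a counter that runs only while throttling is active. The following is a minimal sketch of that idea, not the TRA-PROC implementation. The class and method names are hypothetical, and the sketch assumes one counter that restarts each time throttling is triggered; how the TRA-PROC actually counts requests across connection instances is not stated here.

```python
class PoolThrottle:
    """Hypothetical model of a pool usage throttle profile."""

    def __init__(self, entry_level=90, exit_level=80, request_drop_every=5):
        self.entry_level = entry_level
        self.exit_level = exit_level
        self.request_drop_every = request_drop_every
        self.throttling = False
        self.seen = 0  # requests seen since throttling was triggered

    def update_usage(self, usage):
        """Apply a new pool usage reading to the throttling state."""
        if not self.throttling and usage >= self.entry_level:
            self.throttling = True
            self.seen = 0  # assumption: counting restarts on each trigger
        elif self.throttling and usage <= self.exit_level:
            self.throttling = False

    def admit(self):
        """Return True to forward the request, False to discard it."""
        if not self.throttling:
            return True
        self.seen += 1
        # Every Nth request is discarded while throttling is active.
        return self.seen % self.request_drop_every != 0


throttle = PoolThrottle()
throttle.update_usage(95)  # crosses entry=90, throttling starts
print([throttle.admit() for _ in range(10)])
```

With request_drop_every set to 5, the fifth and tenth requests in the loop above are discarded and the other eight are forwarded.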
Identify the pool usage throttle profile in the tra_config.xml file or the tra_config_network_topology.xml file in a new or existing VS options configuration (vsopt) element with the pool-usage-throttle-profile-name attribute.
<virtual-servers-options>
<vsopt name="bar" pool-usage-throttle-profile-name="foo"/>
</virtual-servers-options>
Set the VS options when defining a VS.
<vs name="procDiam" pool="procPool" port="3868" protocol="diameter" vip="vip1" vsopt="bar"/>
For more information on <virtual-servers-options> and <vsopt> elements, see the discussion about TRA VS Protocol Elements.
Throttling is supported only for virtual servers that use the diameter and mdc protocols, and the associated pool must be of cmi-node-active monitor type.
SNMP Statistics for Health-Based Routing and Throttling
The print_snmp_stats.py script reports the current usage level of each TRA node.
If a pool for a node does not have a health state profile defined, and the virtual server for the node does not use throttling, then the usage level is reported as not-applicable.
If a pool for a node has a health state profile defined and the virtual server for the node uses throttling, but the node reports inactive, then the usage level is reported as not-applicable.
If a pool for a node has a health state profile defined or the virtual server for the node uses throttling, and the node reports a valid usage level, then the usage level is reported.
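The three reporting rules above can be read as a single decision function. The following is a minimal sketch of one interpretation, not the logic of print_snmp_stats.py. The function and parameter names are hypothetical, and the sketch assumes that an inactive node is reported as not-applicable whenever either feature is configured.

```python
NOT_APPLICABLE = "not-applicable"


def reported_usage(has_health_profile, vs_throttling, node_active, usage=None):
    """Return the usage level to report for a node, or not-applicable."""
    # Neither feature is configured: usage is not tracked for this node.
    if not (has_health_profile or vs_throttling):
        return NOT_APPLICABLE
    # Assumption: an inactive node, or one without a valid reading,
    # has no usage level to report.
    if not node_active or usage is None:
        return NOT_APPLICABLE
    return usage


print(reported_usage(False, False, True, 40))   # prints not-applicable
print(reported_usage(True, True, False))        # prints not-applicable
print(reported_usage(True, False, True, 40))    # prints 40
```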
Additionally, if Health State Monitoring is enabled, an SNMP ServiceState statistic reports, for each TRA node, whether the node is unhealthy. There are also two global SNMP counters that represent the total number of Diameter and MDC messages that have been throttled.
For more information on reported SNMP statistics, see the discussion about the print_snmp_stats.py script.