sysmon_config.xml Reference

You configure the MATRIXX System Monitor by editing the sysmon_config.xml file.

Defining Global Resource Logging

The optional global resource logger element (<global-monitored-resource-logger>) collects system messages for all monitored objects in a MATRIXX server, except those monitored by their own individual <monitored-resource-logger> element. You define both entry and exit processing capacity percentages for the global resource logger element as thresholds to start, stop, or change the level of logging, and provide a custom string that is returned with each message.

Define only one global resource logger element.

The global resource logger uses these parameters:

global-monitored-resource-logger — The containing element for the global resource logger.
entry-threshold-percent="entry_%" — Integer between 1-100. The input stream processing capacity that triggers the logging level set by entry-log-level. Must be larger than exit-threshold-percent.
exit-threshold-percent="exit_%" — Integer between 0-99. The input stream processing capacity below which logging is set to exit-log-level.
entry-log-level="logging_level" — (Optional) The logging level to use when the system is above the entry-threshold-percent threshold. The default value is warning. Supported values are: trace, info, warning, error, and critical.
exit-log-level="logging_level" — (Optional) The logging level to use when the system is below the exit-threshold-percent level. The default value is info. Supported values are: trace, info, warning, error, and critical.
custom-message="msg_text" — A custom message string that is included in each entry and exit system message.

This example defines global resource monitoring. These thresholds apply to all monitored objects that do not themselves have a logging element defined. It starts logging at a warning level when the usage percentage reaches 95% of its processing capacity, and then switches back to an info level when the usage percentage drops below 75% of capacity:

<global-monitored-resource-logger
    entry-threshold-percent="95"
    exit-threshold-percent="75"
    entry-log-level="warning"
    exit-log-level="info"
    custom-message="This is a custom message"
 />
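
Because entry-log-level and exit-log-level are optional, a shorter definition is possible. The following is a minimal sketch, assuming that omitting both attributes falls back to the documented defaults (warning on entry, info on exit). The threshold values and message text are illustrative only:

<global-monitored-resource-logger
    entry-threshold-percent="90"
    exit-threshold-percent="60"
    custom-message="Global resource threshold event"
 />

In this sketch the entry threshold (90) is kept larger than the exit threshold (60).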

Defining Logging for Individual Monitored Objects

In addition to, or instead of, global system logging, you can add logging thresholds to individual monitored objects by adding <monitored-resource-logger> elements. The parameters are the same as those for global resource logging. Define as many logging elements for individual monitored objects as your MATRIXX environment requires.

This example adds monitored logging thresholds to a hugeMtxBufPool element. Logging is changed to the default warning level when the processing capacity reaches 50% and then changed to the default info level if the capacity drops back below 30%:

<monitored-object
        name="hugeMtxBufPool"
        factory="StaticMtxBufPoolsCollectFactory">
        <mtxbuf-pools>
            <mtxbuf-pool>
                <object_name>hugepool</object_name>
                <params mtxbuf-pool-type="huge"/>
            </mtxbuf-pool>
        </mtxbuf-pools>
        <monitored-resource-logger
            entry-threshold-percent="50"
            exit-threshold-percent="30"
            custom-message="msg_text"/>
</monitored-object>

In the example above, the system returns warning and info messages like these:

LM_WARN 27466|27541 2021-03-17 13:44:58.768752 [system_monitor_1:1:1:1(5220.81199)] | MonitoredResourceLogger::logMsgInternal: Monitored resource 'hugeMtxBufPool' entry threshold crossed, entry threshold set at 50%, current usage is calculated at 75%. User message: msg_text

LM_INFO 27466|27541 2021-03-17 13:45:00.272974 [system_monitor_1:1:1:1(5220.81199)] | MonitoredResourceLogger::logMsgInternal: Monitored resource 'hugeMtxBufPool' exit threshold cleared, exit threshold set at 30%, current usage is calculated at 0%. User message: msg_text

Defining General-Usage SNMP Traps

Create any SNMP trap thresholds by adding the <general-usage-traps> elements that your environment requires. The general-usage trap parameters include:

index — The threshold ID (index).
entry — The input stream percentage level that triggers an SNMP trap indicating that the node shared memory queue is near or at an overload level. If necessary, change this to a percentage that is appropriate for your MATRIXX environment.
user-text-entry — A 64-character text string sent with the SNMP trap when the entry level is reached. If necessary, replace this with any text that your MATRIXX environment requires.
exit — The input stream percentage level that triggers an SNMP trap indicating that the node queue is below overload level. If necessary, change this to a percentage that is appropriate for your MATRIXX environment.
user-text-exit — A 64-character text string sent with the SNMP trap when the exit level is reached. If necessary, replace this with any text that your MATRIXX environment requires.

Defining Resource-specific Threshold-based SNMP Monitoring Parameters

To define resource-specific usage threshold traps, add the following XML element to the appropriate <monitored-object> section:

<snmptrap entry-level=[entry level percentage] entry-text=[user text up to 64 char]
entry-id=[optional user provided int] exit-level=[exit level percentage]
exit-text=[user text up to 64 char] exit-id=[optional user provided int]/>

Use these parameters to define resource-specific SNMP usage traps:

entry-level="entry_%" — Integer between 1-100. The node usage value (percentage) threshold crossing trigger. Must be larger than exit-level.
exit-level="exit_%" — Integer between 0-99. The node usage value (percentage) threshold clearing trigger.
entry-text="entry_text" — Text to include in the entry trap. You can optionally include the object identifiers below.
exit-text="exit_text" — Text to include in the exit trap. You can optionally include the object identifiers below.
entry-id="integer" — (Optional) An integer to include in entry trap (default is 0).
exit-id="integer" — (Optional) An integer to include in exit trap (default is 0).

The entry-text and exit-text parameters above may contain these optional object identifiers:

%n — The name of the monitored object.
%v — The current value.
%e — The engine number.
%c — The clusterID.
%b — The engine bladeID.

This example entry-text, for a monitored object named Obj1 with a current value of 15, running on engine 1, cluster 2, blade 1:

entry-text="Entry trap. Name: %n. Value: %v for %e%c%b"

returns this line in the SNMP trap:

"Entry trap. Name: Obj1. Value: 15 for e1c2b1"

This example defines a resource-specific usage threshold for the allServiceQueues object:

<monitored-object
 name="allServiceQueues"
 factory="StaticQueuesCollectFactory"
 selector="cfg-all-auto"
 giveup-timeout-ms="1000">
   <exclude-queue-list>
     <queue>QueueToExclude</queue>
   </exclude-queue-list>
   <!-- example SNMP trap for monitored resource -->
   <snmptrap entry-level="95" entry-text="inter service queues almost full"
    exit-level="75" exit-text="inter service queues usage level dropped"/>
 </monitored-object>
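
The optional entry-id and exit-id attributes and the object identifiers can be combined in one <snmptrap> element. The following is a sketch only: the ID values, threshold levels, and message text are illustrative assumptions, and the name, factory, selector, and giveup-timeout-ms values are copied from the example above rather than chosen for any particular deployment:

<monitored-object
 name="allServiceQueues"
 factory="StaticQueuesCollectFactory"
 selector="cfg-all-auto"
 giveup-timeout-ms="1000">
   <!-- illustrative values only; entry-level is kept larger than exit-level -->
   <snmptrap entry-level="90" entry-text="%n usage high: %v"
    entry-id="101"
    exit-level="70" exit-text="%n usage back to normal: %v"
    exit-id="102"/>
 </monitored-object>

With these settings, %n in each trap text is replaced by the name of the monitored object and %v by the current value, and the integers 101 and 102 are included in the entry and exit traps respectively.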

Defining CPU Monitoring Parameters

Use these parameters to define monitored objects that monitor CPU usage:

factory="StaticCumulativeCpuFactory" — Specifies a factory to monitor CPU usage.
name="string" — A name, unique within sysmon_config.xml, for the monitored object.
skip-initial-timeout="timeout_ms" — (Optional) The number of milliseconds to delay monitoring. Designed to avoid the false positives encountered when the engine experiences high CPU usage at startup. The default is 180000 (180 seconds).
enter="node_usage_%" — Integer between 1-100. The node usage level that defines critically high CPU usage.
enter-timeout="timeout_ms" — Time in milliseconds. Specifies how long the CPU usage must remain above the enter level before the CPU enters the critically high CPU usage state and the condition is reported in an SNMP trap. The default is 5000 (5 seconds).
leave="node_usage_%" — Integer between 1-99. Defines the CPU usage level at which a CPU leaves the node degraded state. The default is 96.
leave-timeout="timeout_ms" — Time in milliseconds. Specifies how long the CPU usage must remain below the leave level before the CPU leaves the node degraded state. The default is 2000 (2 seconds).

This example defines a monitored object that monitors CPU usage:

<monitored-objects>
  <monitored-object
    name="Test_CPU_Monitor"
    factory="StaticCumulativeCpuFactory"
    skip-initial-timeout="0"
    enter="75"
    enter-timeout="2000"
    leave="50"
    leave-timeout="2000">
  </monitored-object>
</monitored-objects>
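
Several of the CPU parameters have documented defaults. The following is a minimal sketch, assuming that attributes with defaults can simply be omitted. The object name and the enter value are illustrative:

<monitored-objects>
  <!-- Assumption: omitted attributes take the defaults documented above
       (skip-initial-timeout 180000, enter-timeout 5000, leave 96,
       leave-timeout 2000) -->
  <monitored-object
    name="Default_CPU_Monitor"
    factory="StaticCumulativeCpuFactory"
    enter="98"/>
</monitored-objects>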

Defining Disk Monitoring Parameters

Use these parameters to define monitored objects that monitor individual disk usage. Remember that System Monitor translates the available disk space into a numerical percentage value.

Note: Reserve at least 400 MB of disk space with the enter_mb parameter so that there is at least a 10-second window before a disk is shut down for a disk full condition. During this window the disk has a "node degraded" status, which gives you time to prevent the disk from shutting down.

These parameters specify disk usage traps:

factory="StaticDiskUsageFactory" — Specifies a factory to monitor disk usage.
name="<name>" — A name, unique within sysmon_config.xml, that identifies the disk.
path="[ $local | $shared ]" — The type of disk storage to monitor. $local specifies pod-level SSD storage; $shared specifies the engine's SAN or shared storage.
check-timeout="timeout_ms" — (Optional) How often in milliseconds the System Monitor checks the disk. Every disk usage monitored object can use a different value. This value is also used if a print-resolution-error-timeout setting is absent. The default is 15000 (15 seconds).
skip-initial-timeout="timeout_ms" — (Optional) The number of milliseconds to delay monitoring. Designed to avoid the false positives encountered when the engine experiences high CPU usage at startup. The default is 180000 (180 seconds).
enter="node_usage_%" — (Optional) Integer between 1-100. The node usage level threshold that defines critically high disk usage. enter and enter_mb can be used concurrently; both work with enter-timeout.
enter_mb="space_avail_MB" — (Optional) The remaining disk space available threshold on the disk below which the disk is put into a critical state. enter and enter_mb can be used concurrently; both work with enter-timeout.
enter-timeout="timeout_ms" — Time in milliseconds. Specifies how long the disk usage must remain above the enter level before the disk enters the critically high disk usage state and the condition is reported in an SNMP trap. Use this parameter to suppress extraneous critical states caused by short periods of high or low disk usage. If both enter and enter_mb are specified in the monitored object, both must exceed their specified levels to enter a critical state. The default is 5000 (5 seconds).
leave="node_usage_%" — (Optional) Integer between 1-99. Defines the disk space usage percentage at which a disk leaves the node degraded state. The default is 96%.
leave_mb="min_disk_space_avail" — An integer. Specifies the minimum disk space available for a disk to leave a critical state. Works with leave-timeout.
leave-timeout="timeout_ms" — Time in milliseconds. Specifies how long the disk usage must remain below the leave level before the disk leaves the node degraded state. The default is 2000 (2 seconds).
print-resolution-error-timeout="timeout_ms" — (Optional) Sets the time between INFO-level access error messages. Useful to avoid extraneous messages while the disk is being dynamically mounted or remounted, or during network file system updates or journaling. No default value; uses the value for check-timeout if absent.

This example defines a monitored object that monitors local disk usage:

<monitored-objects>
  <monitored-object
    name="Monitor_Local_disk"
    factory="StaticDiskUsageFactory"
    path="$local"
    check-timeout="1000"
    skip-initial-timeout="500"
    enter="5.5"
    enter-mb="50"
    enter-timeout="2000"
    leave="10.0"
    leave-mb="100.5"
    leave-timeout="1000">
  </monitored-object>
</monitored-objects>
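
A monitored object for shared storage follows the same pattern. The following sketch is illustrative only: the object name and the print-resolution-error-timeout value are assumptions, the enter, enter-timeout, leave, and leave-timeout values are copied from the local disk example above, and the megabyte-based thresholds are left out:

<monitored-objects>
  <!-- illustrative values only; thresholds copied from the local disk example -->
  <monitored-object
    name="Monitor_Shared_disk"
    factory="StaticDiskUsageFactory"
    path="$shared"
    check-timeout="15000"
    print-resolution-error-timeout="60000"
    enter="5.5"
    enter-timeout="2000"
    leave="10.0"
    leave-timeout="1000"/>
</monitored-objects>

Here print-resolution-error-timeout is set to 60000 so that 60 seconds pass between INFO-level access error messages, for example while the shared file system is being remounted. Without that attribute, the check-timeout value (15000) would be used instead.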