sysmon_config.xml Reference
You can configure the MATRIXX System Monitor by editing the sysmon_config.xml file.
Defining Global Resource Logging
The optional global resource logger element (<global-monitored-resource-logger>) collects system messages for all monitored objects in a MATRIXX server, except those monitored by their own individual <monitored-resource-logger> element. You define entry and exit processing capacity percentages for the global resource logger element as thresholds to start, stop, or change the level of logging, and provide a custom string that is returned with each message.
Define only one global resource logger element.
- global-monitored-resource-logger: The containing element for the global resource logger.
- entry-threshold-percent="entry_%": Integer between 1-100. The input stream processing capacity that triggers the logging level set by entry-log-level. Must be larger than exit-threshold-percent.
- exit-threshold-percent="exit_%": Integer between 0-99. The input stream processing capacity below which logging is set to exit-log-level.
- entry-log-level="logging_level": (Optional) The logging level to use when the system is above the entry-threshold-percent threshold. The default value is warning. Supported values are: trace, info, warning, error, and critical.
- exit-log-level="logging_level": (Optional) The logging level to use when the system is below the exit-threshold-percent level. The default value is info. Supported values are: trace, info, warning, error, and critical.
- custom-message="msg_text": A custom message string that is included in each entry and exit system message.
<global-monitored-resource-logger
entry-threshold-percent="95"
exit-threshold-percent="75"
entry-log-level="warning"
exit-log-level="info"
custom-message="This is a custom message"
/>
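Because entry-log-level and exit-log-level are optional, a shorter definition can rely on the documented defaults. The following is a minimal sketch under that assumption; the threshold values and the message text are illustrative only and are not recommendations:

```xml
<!-- Omits entry-log-level and exit-log-level, so the documented
     defaults (warning on entry, info on exit) would apply. -->
<global-monitored-resource-logger
    entry-threshold-percent="90"
    exit-threshold-percent="70"
    custom-message="Global capacity threshold"/>
```

Note that the entry threshold (90) is larger than the exit threshold (70), as the parameter descriptions require.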
Defining Logging for Individual Monitored Objects
In addition to, or instead of, global system logging, you can optionally add logging thresholds to individual monitored objects by adding <monitored-resource-logger> elements. The parameters are the same as for global resource logging. Define as many logging elements for individual monitored objects as your MATRIXX environment requires.
This example adds logging thresholds to the hugeMtxBufPool element. Logging changes to the default warning level when the processing capacity reaches 50%, and then changes to the default info level if the capacity drops back below 30%:
<monitored-object
name="hugeMtxBufPool"
factory="StaticMtxBufPoolsCollectFactory">
<mtxbuf-pools>
<mtxbuf-pool>
<object_name>hugepool</object_name>
<params mtxbuf-pool-type="huge"/>
</mtxbuf-pool>
</mtxbuf-pools>
<monitored-resource-logger
entry-threshold-percent="50"
exit-threshold-percent="30"
custom-message="msg_text"/>
</monitored-object>
In the example above, the system returns warning and info messages like these:
LM_WARN 27466|27541 2021-03-17 13:44:58.768752 [system_monitor_1:1:1:1(5220.81199)] | MonitoredResourceLogger::logMsgInternal: Monitored resource 'hugeMtxBufPool' entry threshold crossed, entry threshold set at 50%, current usage is calculated at 75%. User message: msg_text
LM_INFO 27466|27541 2021-03-17 13:45:00.272974 [system_monitor_1:1:1:1(5220.81199)] | MonitoredResourceLogger::logMsgInternal: Monitored resource 'hugeMtxBufPool' exit threshold cleared, exit threshold set at 30%, current usage is calculated at 0%. User message: msg_text
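Since the individual logger takes the same parameters as the global logger, it should also be possible to set the log levels explicitly rather than relying on the defaults. The following is a sketch under that assumption; the thresholds, the error level, and the message text are illustrative choices, not values from a tested configuration:

```xml
<monitored-object
    name="hugeMtxBufPool"
    factory="StaticMtxBufPoolsCollectFactory">
  <mtxbuf-pools>
    <mtxbuf-pool>
      <object_name>hugepool</object_name>
      <params mtxbuf-pool-type="huge"/>
    </mtxbuf-pool>
  </mtxbuf-pools>
  <!-- Illustrative values: log at error above 80%, at info below 60% -->
  <monitored-resource-logger
      entry-threshold-percent="80"
      exit-threshold-percent="60"
      entry-log-level="error"
      exit-log-level="info"
      custom-message="huge MtxBuf pool usage"/>
</monitored-object>
```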
Defining General-Usage SNMP Traps
Define the <general-usage-traps> elements that your environment requires. The general-usage trap parameters include:
- index: The threshold ID (index).
- entry: The input stream percentage level that triggers an SNMP trap indicating that the node shared memory queue is near or at an overload level. If necessary, change this to a percentage that is appropriate for your MATRIXX environment.
- user-text-entry: A 64-character text string sent with the SNMP trap when the entry level is reached. If necessary, replace this with any text that your MATRIXX environment requires.
- exit: The input stream percentage level that triggers an SNMP trap indicating that the node queue is below the overload level. If necessary, change this to a percentage that is appropriate for your MATRIXX environment.
- user-text-exit: A 64-character text string sent with the SNMP trap when the exit level is reached. If necessary, replace this with any text that your MATRIXX environment requires.
Defining Resource-specific Threshold-based SNMP Monitoring Parameters
Define resource-specific SNMP thresholds by adding an <snmptrap> element with this syntax to a <monitored-object> section:
<snmptrap entry-level=[entry level percentage] entry-text=[user text up to 64 char]
entry-id=[optional user provided int] exit-level=[exit level percentage] exit-text=
[user text up to 64 char] exit-id=[optional user provided int]/>
- entry-level="entry_%": Integer between 1-100. The node usage value (percentage) threshold crossing trigger. Must be larger than exit-level.
- exit-level="exit_%": Integer between 0-99. The node usage value (percentage) threshold clearing trigger.
- entry-text="entry_text": Text to include in the entry trap. You can optionally include the object identifiers below.
- exit-text="exit_text": Text to include in the exit trap. You can optionally include the object identifiers below.
- entry-id="integer": (Optional) An integer to include in the entry trap (default is 0).
- exit-id="integer": (Optional) An integer to include in the exit trap (default is 0).
The entry-text and exit-text parameters above may contain these optional object identifiers:
- %n: The name of the monitored object.
- %v: The current value.
- %e: The engine number.
- %c: The clusterID.
- %b: The engine bladeID.
For example, entry-text="Entry trap. Name: %n. Value: %v for %e%c%b" returns this line in the SNMP trap:
"Entry trap. Name: Obj1. Value: 15 for e1c2b1"
This example defines a resource-specific usage threshold for the allServiceQueues object:
<monitored-object
name="allServiceQueues"
factory="StaticQueuesCollectFactory"
selector="cfg-all-auto"
giveup-timeout-ms="1000">
<exclude-queue-list>
<queue>QueueToExclude</queue>
</exclude-queue-list>
<!-- example SNMP trap for monitored resource -->
<snmptrap entry-level="95" entry-text="inter service queues almost full"
exit-level="75" exit-text="inter service queues usage level dropped"/>
</monitored-object>
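The optional entry-id and exit-id attributes and the object identifiers can be combined in the same <snmptrap> element. The following is a minimal sketch; the ID values 1001 and 1002 and the trap texts are hypothetical, chosen only to show where each attribute goes:

```xml
<monitored-object
    name="allServiceQueues"
    factory="StaticQueuesCollectFactory"
    selector="cfg-all-auto"
    giveup-timeout-ms="1000">
  <!-- Hypothetical IDs and texts; %n, %v, %e, %c, and %b are the
       object identifiers described above. -->
  <snmptrap entry-level="95" entry-text="%n usage at %v on %e%c%b"
            entry-id="1001"
            exit-level="75" exit-text="%n usage back to %v on %e%c%b"
            exit-id="1002"/>
</monitored-object>
```

Distinct entry and exit IDs may make it easier for a trap receiver to tell the two events apart without parsing the text, although how the IDs are consumed depends on your SNMP tooling.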
Defining CPU Monitoring Parameters
- factory="StaticCumulativeCpuFactory": Specifies a factory to monitor CPU usage.
- name="string": A name, unique within sysmon_config.xml, for the monitored object.
- skip-initial-timeout="timeout_ms": (Optional) The number of milliseconds to delay monitoring. Designed to avoid the false positives encountered when the engine enters high CPU usage at startup. The default is 180000 (3 minutes).
- enter="node_usage_%": Integer between 1-100. The node usage level that defines critically high CPU usage.
- enter-timeout="timeout_ms": Time in milliseconds. Specifies how long the CPU usage must remain above the enter level before the CPU enters the critically high CPU usage state, which is reported in an SNMP trap. The default is 5000 (5 seconds).
- leave="node_usage_%": Integer between 1-99. Defines the CPU usage level at which a CPU leaves the node degraded state. The default is 96.
- leave-timeout="timeout_ms": Time in milliseconds. Specifies how long the CPU usage must remain below the leave level before the CPU leaves the node degraded state. The default is 2000 (2 seconds).
<monitored-objects>
<monitored-object
name="Test_CPU_Monitor"
factory="StaticCumulativeCpuFactory"
skip-initial-timeout="0"
enter="75"
enter-timeout="2000"
leave="50"
leave-timeout="2000">
</monitored-object>
</monitored-objects>
Defining Disk Monitoring Parameters
Use these parameters to define monitored objects that monitor individual disk usage. Remember that System Monitor translates the available disk space into a numerical percentage value.
enter_mb parameter. This gives you a window in which the disk has a "node degraded" status to prevent the disk from shutting down.
- factory="StaticDiskUsageFactory": Specifies a factory to monitor disk usage.
- name="<name>": A name, unique within sysmon_config.xml, that identifies the disk.
- path="[ $local | $shared ]": The type of disk storage to monitor. $local specifies pod-level SSD storage; $shared specifies the engine's SAN or shared storage.
- check-timeout="timeout_ms": (Optional) How often, in milliseconds, the System Monitor checks the disk. Every disk usage monitored object can use a different value. This value is also used if a print-resolution-error-timeout setting is absent. The default is 15000 (15 seconds).
- skip-initial-timeout="timeout_ms": (Optional) The number of milliseconds to delay monitoring. Designed to avoid the false positives encountered when the engine enters high CPU usage at startup. The default is 180000 (3 minutes).
- enter="node_usage_%": (Optional) Integer between 1-100. The node usage level threshold that defines critically high disk usage. enter and enter_mb can be used concurrently; both work with enter-timeout.
- enter_mb="space_avail_MB": (Optional) The threshold of remaining available disk space below which the disk is put into a critical state. enter and enter_mb can be used concurrently; both work with enter-timeout.
- enter-timeout="timeout_ms": Time in milliseconds. Specifies how long the disk usage must remain above the enter level before the disk enters the critically high disk usage state, which is reported in an SNMP trap. Use this parameter to suppress extraneous critical states caused by short periods of high or low disk usage. If both enter and enter_mb are specified in the monitored object, both must exceed their specified levels to enter a critical state. The default is 5000 (5 seconds).
- leave="node_usage_%": (Optional) Integer between 1-99. Defines the disk space usage percentage at which a disk leaves the node degraded state. The default is 96%.
- leave_mb="min_disk_space_avail": An integer. Specifies the minimum disk space available for a disk to leave a critical state. Works with leave-timeout.
- leave-timeout="timeout_ms": Time in milliseconds. Specifies how long the disk usage must remain below the leave level before the disk leaves the node degraded state. The default is 2000 (2 seconds).
- print-resolution-error-timeout="timeout_ms": (Optional) Sets the time between INFO-level access error messages. Useful to avoid extraneous messages while the disk is being dynamically mounted or remounted, or during network file system updates or journaling. No default value; uses the value for check-timeout if absent.
<monitored-objects>
<monitored-object
name="Monitor_Local_disk"
factory="StaticDiskUsageFactory"
path="$local"
check-timeout="1000"
skip-initial-timeout="500"
enter="5.5"
enter-mb="50"
enter-timeout="2000"
leave="10.0"
leave-mb="100.5"
leave-timeout="1000">
</monitored-object>
</monitored-objects>