Fix monit false alarm in process_checker caused by a missing "disk-sleep" status check (#16907)
Fix a monit false alarm in process_checker: the script did not treat the "disk-sleep" status as healthy, so some 201911 SONiC boxes occasionally reported a "pmon|sensord" error.

#### Why I did it

The psutil library currently reports the following process statuses:

- running: The process is currently running.
- sleeping: The process is sleeping or waiting for an event to occur.
- disk-sleep: The process is waiting for I/O operations to complete.
- stopped: The process has been stopped (e.g. via the SIGSTOP signal).
- zombie: The process has terminated but is still listed in the process table.
- dead: The process has terminated and has been removed from the process table.

We should regard running/sleeping/disk-sleep as the normal case and not raise an alert from monit for them. At present, whenever disk-sleep occurs during a monit cycle, the syslog errors below are emitted and paged, so this change also gets rid of that syslog output.

    syslog.2.gz:Feb 24 06:12:17.394619 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
    syslog.2.gz:Feb 24 06:13:17.932531 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
    syslog.2.gz:Feb 24 06:14:18.502505 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host

I then reproduced the issue by triggering process_checker for sensord frequently, and observed that sensord is in the "disk-sleep" status whenever the alert is raised.

##### Work item tracking

- Microsoft ADO **(number only)**: 17663589

#### How I did it

Fixed the process_checker script by adding handling for the "disk-sleep" case.

#### How to verify it

Verified on a local DUT.
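The status rule described above can be sketched as a small standalone function. This is only an illustration, not the actual process_checker code: `HEALTHY_STATUSES` and `is_process_alive` are hypothetical names, and the process list is passed in as plain `(cmdline_args, status)` tuples so the sketch runs without psutil. In a real checker those values would come from `psutil.process_iter()`.

```python
# Status strings as reported by psutil, e.g. psutil.STATUS_DISK_SLEEP == "disk-sleep".
# A process in any of these states is considered alive and should not raise an alert.
HEALTHY_STATUSES = frozenset({"running", "sleeping", "disk-sleep"})


def is_process_alive(processes, expected_cmdline):
    """Return True if some process matches expected_cmdline and is healthy.

    `processes` is an iterable of (cmdline_args, status) tuples (hypothetical
    input shape for this sketch; a real checker would read them from psutil).
    """
    for cmdline_args, status in processes:
        cmdline = " ".join(cmdline_args)
        if cmdline.startswith(expected_cmdline) and status in HEALTHY_STATUSES:
            return True
    return False


# sensord blocked on I/O must still count as running.
snapshot = [(["/usr/sbin/sensord", "-f", "daemon"], "disk-sleep")]
print(is_process_alive(snapshot, "/usr/sbin/sensord -f daemon"))
```

Keeping the healthy statuses in one set makes it harder to miss a state again if the list needs to change later.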
parent 36e65035ba
commit a5043bfc84
@@ -51,7 +51,7 @@ def check_process_existence(container_name, process_cmdline):

             for process in psutil.process_iter(["cmdline", "status", "pid"]):
                 try:
-                    if ((' '.join(process.cmdline())).startswith(process_cmdline) and process.status() in ["running", "sleeping"]):
+                    if ((' '.join(process.cmdline())).startswith(process_cmdline) and process.status() in ["running", "sleeping", "disk-sleep"]):
                         process_namespace_found_set.add(multi_asic.get_current_namespace(process.info['pid']))
                 except psutil.NoSuchProcess:
                     pass
@@ -72,10 +72,6 @@ def check_process_existence(container_name, process_cmdline):
                            namespace_display_str += ", " + ns

                join_str = " and" if host_display_str and namespace_display_str else ""

                # If this script is run by Monit, then the following output will be appended to
                # Monit's syslog message.
                print("'{}' is not running{}{}{}".format(process_cmdline, host_display_str, join_str, namespace_display_str))
                sys.exit(1)
    else:
        syslog.syslog(syslog.LOG_ERR, "container '{}' is not included in SONiC image or the given container name is invalid!"