Fix monit false alarm issue, which locates in process_checker and it (#16907)

Fix a monit false alarm caused by process_checker, which did not check for the "disk-sleep" status; as a result, some SONiC boxes running 201911 occasionally reported a "pmon|sensord" error.

#### Why I did it
Currently, the psutil library reports the following process statuses:
- running: The process is currently running.
- sleeping: The process is sleeping or waiting for an event to occur.
- disk-sleep: The process is waiting for I/O operations to complete.
- stopped: The process has been stopped (e.g. via the SIGSTOP signal).
- zombie: The process has terminated but is still listed in the process table.
- dead: The process has terminated and has been removed from the process table.

We should treat running, sleeping, and disk-sleep as normal states and not raise a monit alert for them.
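
The classification above can be sketched as a small lookup. This is only an illustration, not the process_checker code: the names `HEALTHY_STATUSES` and `is_healthy_status` are hypothetical, and the strings are the status values psutil reports (for example, `psutil.STATUS_DISK_SLEEP` is `"disk-sleep"`).

```python
# Hypothetical helper for illustration; not part of process_checker.
# The strings match the status values psutil returns from Process.status().
HEALTHY_STATUSES = frozenset({"running", "sleeping", "disk-sleep"})


def is_healthy_status(status: str) -> bool:
    """Return True if a process in this status should not trigger an alert."""
    return status in HEALTHY_STATUSES
```

With this helper, `is_healthy_status("disk-sleep")` is true, while `"stopped"`, `"zombie"`, and `"dead"` are still treated as failures.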

Currently, if a process is in disk-sleep during a monit cycle, the syslog errors below are logged and paged, so this change also gets rid of that syslog output.

syslog.2.gz:Feb 24 06:12:17.394619 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:13:17.932531 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:14:18.502505 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host

I then reproduced the issue by triggering process_checker for sensord frequently and observed that sensord was in the "disk-sleep" status whenever the alert was raised.
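
One way to run this kind of reproduction is to sample the process status repeatedly and count what is seen. The sketch below is an assumption about how that could be done, not the procedure actually used: `sample_statuses` and `sensord_status` are hypothetical names, and the command line is taken from the syslog messages above.

```python
import time
from collections import Counter
from typing import Callable, Optional


def sample_statuses(get_status: Callable[[], Optional[str]],
                    samples: int,
                    interval_s: float = 0.0) -> Counter:
    """Call get_status `samples` times and count each status that was seen."""
    seen = Counter()
    for _ in range(samples):
        status = get_status()
        # None means no matching process was found in this sample.
        seen[status if status is not None else "not-found"] += 1
        if interval_s > 0:
            time.sleep(interval_s)
    return seen


def sensord_status() -> Optional[str]:
    """Return the status of the first process that looks like sensord, if any."""
    import psutil  # third-party; imported lazily so sample_statuses works without it

    for proc in psutil.process_iter(["cmdline", "status"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if cmdline.startswith("/usr/sbin/sensord -f daemon"):
            return proc.info["status"]
    return None
```

Calling `sample_statuses(sensord_status, samples=1000, interval_s=0.01)` on a device returns a counter of the statuses observed; a nonzero `"disk-sleep"` count would be consistent with the false alarm described here.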

##### Work item tracking
- Microsoft ADO **(number only)**: 17663589

#### How I did it
Updated the process_checker script to handle the "disk-sleep" status.

#### How to verify it
Verified on a local DUT.
Feng-msft 2023-10-27 09:23:24 +08:00 committed by GitHub
parent 36e65035ba
commit a5043bfc84

@@ -51,7 +51,7 @@ def check_process_existence(container_name, process_cmdline):
 for process in psutil.process_iter(["cmdline", "status", "pid"]):
     try:
-        if ((' '.join(process.cmdline())).startswith(process_cmdline) and process.status() in ["running", "sleeping"]):
+        if ((' '.join(process.cmdline())).startswith(process_cmdline) and process.status() in ["running", "sleeping", "disk-sleep"]):
             process_namespace_found_set.add(multi_asic.get_current_namespace(process.info['pid']))
     except psutil.NoSuchProcess:
         pass
@@ -72,10 +72,6 @@ def check_process_existence(container_name, process_cmdline):
namespace_display_str += ", " + ns
join_str = " and" if host_display_str and namespace_display_str else ""
# If this script is run by Monit, then the following output will be appended to
# Monit's syslog message.
print("'{}' is not running{}{}{}".format(process_cmdline, host_display_str, join_str, namespace_display_str))
sys.exit(1)
else:
syslog.syslog(syslog.LOG_ERR, "container '{}' is not included in SONiC image or the given container name is invalid!"