This repository has been archived on 2025-03-20. You can view files and clone it, but cannot push or open issues or pull requests.
sonic-buildimage/files/image_config/monit
Feng-msft a5043bfc84
Fix monit false alarm issue, which locates in process_checker and it (#16907)
Fix a monit false alarm in process_checker: the script missed the "disk-sleep" status check, so some 201911 SONiC boxes intermittently reported a "pmon|sensord" error.

#### Why I did it
Currently the psutil library reports the following process statuses:
running: The process is currently running.
sleeping: The process is sleeping or waiting for an event to occur.
disk-sleep: The process is waiting for I/O operations to complete.
stopped: The process has been stopped (e.g. via the SIGSTOP signal).
zombie: The process has terminated but is still listed in the process table.
dead: The process has terminated and has been removed from the process table.

We should treat running, sleeping, and disk-sleep as normal states and not raise a monit alert for them.
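The rule above can be sketched as follows. This is a minimal illustration, not the actual process_checker source: the names `HEALTHY_STATUSES`, `is_healthy_status`, and `find_healthy_cmdlines` are hypothetical. The status strings match the values psutil reports (`psutil.STATUS_RUNNING`, `psutil.STATUS_SLEEPING`, `psutil.STATUS_DISK_SLEEP`).

```python
# Hypothetical sketch; names are illustrative and not taken from process_checker.

# Statuses that should NOT trigger a monit alert.
HEALTHY_STATUSES = frozenset({"running", "sleeping", "disk-sleep"})


def is_healthy_status(status):
    """Return True if a psutil status string describes a live, normal process."""
    return status in HEALTHY_STATUSES


def find_healthy_cmdlines():
    """Collect command lines of all processes currently in a healthy status."""
    import psutil  # third-party dependency, imported lazily

    cmdlines = set()
    for proc in psutil.process_iter(["cmdline", "status"]):
        # proc.info values can be None when access to the process is denied.
        cmdline = proc.info["cmdline"]
        if cmdline and is_healthy_status(proc.info["status"]):
            cmdlines.add(" ".join(cmdline))
    return cmdlines
```

With a check like this, a process caught in "disk-sleep" during a monit cycle is still counted as running, while "stopped", "zombie", and "dead" are still reported.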

Currently, whenever a process is in disk-sleep during a monit cycle, the syslog errors below are logged and trigger a page. This fix removes that spurious syslog output as well.

syslog.2.gz:Feb 24 06:12:17.394619 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:13:17.932531 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:14:18.502505 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host

I then reproduced the issue by triggering process_checker for sensord frequently, and observed that sensord was in the "disk-sleep" status each time the alert was raised.
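A rough way to observe the same behavior is to sample the process status repeatedly and count how often each status appears. The sketch below polls psutil directly instead of invoking process_checker, which is a substitution made for illustration. The helpers `sample_statuses` and `sensord_status`, and the "not-found" placeholder, are hypothetical.

```python
# Hypothetical reproduction helper; not part of the SONiC code base.
import time
from collections import Counter


def sample_statuses(get_status, samples=100, interval=0.0):
    """Call get_status() `samples` times and count each returned status."""
    counts = Counter()
    for _ in range(samples):
        counts[get_status()] += 1
        if interval:
            time.sleep(interval)
    return counts


def sensord_status():
    """Return the psutil status of the first process named 'sensord'."""
    import psutil  # third-party dependency, imported lazily

    for proc in psutil.process_iter(["name", "status"]):
        if proc.info["name"] == "sensord":
            return proc.info["status"]
    return "not-found"  # placeholder value, an assumption of this sketch
```

For example, `sample_statuses(sensord_status, samples=1000, interval=0.01)` returns a counter in which any "disk-sleep" entries show how often a status check could have coincided with an I/O wait.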

##### Work item tracking
- Microsoft ADO **(number only)**:17663589

#### How I did it
Added handling for the "disk-sleep" status to the process_checker script.

#### How to verify it
Verified on a local DUT.
2023-10-26 18:23:24 -07:00
conf.d [201911][multi-asic] Monit changes to enable internal link monitoring script (#16393) 2023-09-12 15:57:13 -07:00
generate_monit_config [201911][Monit] Monitor critical processes in PMon contianer. (#7438) 2021-04-28 17:12:21 -07:00
generate_monit_config.service [201911][Monit] Monitor critical processes in radv and dhcp_relay containers. (#7340) 2021-04-16 08:40:06 -07:00
memory_checker [201911][Monit] Restart telemetry container if memory usage is beyond the threshold (#7618) 2021-05-17 16:51:13 -07:00
monitrc [Monit] Delay start of monitoring for 5 minutes (#4281) 2020-03-22 22:58:57 -07:00
process_checker Fix monit false alarm issue, which locates in process_checker and it (#16907) 2023-10-26 18:23:24 -07:00
restart_service [201911][Monit] Restart telemetry container if memory usage is beyond the threshold (#7618) 2021-05-17 16:51:13 -07:00