sonic-buildimage/files/image_config/monit/conf.d/sonic-host
yozhao101 04cd1d61e8
[Monit] Monitoring the running status of containers. (#6251)
**- Why I did it**
This PR monitors the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the flag is cleared manually with the command `sudo systemctl reset-failed <container_name>`.
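
The restart limit described above can be illustrated with a small sliding-window counter. This is only a sketch of the rule as stated in this description (3 restarts within 20 minutes), not the actual systemd implementation; the class name `RestartLimiter` and its methods are hypothetical.

```python
from collections import deque


class RestartLimiter:
    """Hypothetical model of the '3 restarts within 20 minutes' rule."""

    def __init__(self, max_restarts=3, window_seconds=20 * 60):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.restart_times = deque()  # timestamps of recent restarts
        self.failed = False           # set once the limit is hit

    def allow_restart(self, now):
        """Return True if a restart at time `now` (seconds) is permitted."""
        if self.failed:
            return False
        # Forget restarts that have fallen out of the time window.
        while self.restart_times and now - self.restart_times[0] >= self.window_seconds:
            self.restart_times.popleft()
        if len(self.restart_times) >= self.max_restarts:
            # Limit reached: stay down until the flag is cleared manually.
            self.failed = True
            return False
        self.restart_times.append(now)
        return True

    def reset_failed(self):
        """Models the effect of clearing the flag manually."""
        self.failed = False
        self.restart_times.clear()
```

In this model, a fourth restart attempt inside the window is refused and the container stays down until `reset_failed()` is called, which is the situation the monitoring in this PR is meant to surface.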

**- How I did it**
We use Monit to run and monitor a script. The script generates the list of containers that are expected to be running and compares it with the containers that are currently running. If any expected container is not running, an alert message is written to syslog.
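
The comparison step can be sketched as follows. This is a minimal illustration, not the actual `container_checker` script: the function names, the log message text, and the container names in the usage note are assumptions, and how the real script builds the two lists is not shown here.

```python
import syslog


def find_missing_containers(expected, running):
    """Return sorted names of containers expected to run but not running."""
    return sorted(set(expected) - set(running))


def check_containers(expected, running, log=syslog.syslog):
    """Log an error and return 1 if any expected container is not running.

    The non-zero return value is meant to be used as the process exit
    status, so that a Monit rule testing `status != 0` can raise an alert.
    """
    missing = find_missing_containers(expected, running)
    if missing:
        log(syslog.LOG_ERR,
            "Expected containers not running: " + ", ".join(missing))
        return 1
    return 0
```

For example, `check_containers(["swss", "bgp"], ["swss"])` logs one error that names `bgp` and returns 1, while a call in which every expected container is running returns 0 and logs nothing. Containers that are running but not expected are ignored in this sketch.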

**- How to verify it**
I tested this feature on two lab devices: `str-a7050-acs-3`, which has a single ASIC, and `str2-n3164-acs-3`, which is a multi-ASIC device. I first stopped a container manually by running the command `sudo systemctl stop <container_name>`, then checked whether an alert message appeared in syslog.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2021-01-07 19:52:22 -08:00

###############################################################################
## Monit configuration for SONiC host OS
##
## This includes system-level monitoring as well as processes which
## run in the host OS (i.e., not inside a Docker container)
###############################################################################
check filesystem root-overlay with path /
    if space usage > 90% for 10 times within 20 cycles then alert repeat every 1 cycles
check filesystem var-log with path /var/log
    if space usage > 90% for 10 times within 20 cycles then alert repeat every 1 cycles
check system $HOST
    if memory usage > 90% for 10 times within 20 cycles then alert repeat every 1 cycles
    if cpu usage (user) > 90% for 10 times within 20 cycles then alert repeat every 1 cycles
    if cpu usage (system) > 90% for 10 times within 20 cycles then alert repeat every 1 cycles
check process rsyslog with pidfile /var/run/rsyslogd.pid
    start program = "/bin/systemctl start rsyslog.service"
    stop program = "/bin/systemctl stop rsyslog.service"
    if totalmem > 800 MB for 10 times within 20 cycles then restart
# route_check.py verifies that routes in APPL-DB and ASIC-DB are in sync.
# On any discrepancy, it logs the details and returns a non-zero exit code,
# which triggers a Monit alert.
# Hence, for any discrepancy, there will be "ERR"-level log messages
# from both route_check.py and Monit.
#
check program routeCheck with path "/usr/local/bin/route_check.py"
    every 5 cycles
    if status != 0 for 3 cycles then alert repeat every 1 cycles
check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 5 times within 5 cycles then alert repeat every 1 cycles