04cd1d61e8
**- Why I did it** This PR monitors the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the flag is cleared manually with the command `sudo systemctl reset-failed <container_name>`. **- How I did it** Monit monitors a script. The script generates the list of containers that are expected to be running and compares it with the containers that are currently running. If any expected container is not running, an alert message is written to syslog. **- How to verify it** I tested this feature on the lab devices `str-a7050-acs-3`, which has a single ASIC, and `str2-n3164-acs-3`, which is multi-ASIC. I first stopped a container manually with the command `sudo systemctl stop <container_name>`, then checked whether an alert message appeared in syslog. Signed-off-by: Yong Zhao <yozhao@microsoft.com> |
||
---|---|---|
per_namespace | ||
share_image | ||
arp_update_vars.j2 | ||
buffers_config.j2 | ||
config-chassisdb.service.j2 | ||
config-setup.service.j2 | ||
database.service.j2 | ||
dhcp_relay.service.j2 | ||
docker_image_ctl.j2 | ||
gbsyncd.service.j2 | ||
iccpd.service.j2 | ||
init_cfg.json.j2 | ||
kube_cni.10-flannel.conflist | ||
lldp.service.j2 | ||
mgmt-framework.service.j2 | ||
mgmt-framework.timer | ||
nat.service.j2 | ||
organization_extensions.sh | ||
pcie-check.timer | ||
pmon.service.j2 | ||
qos_config.j2 | ||
radv.service.j2 | ||
restapi.service.j2 | ||
sflow.service.j2 | ||
snmp.service.j2 | ||
snmp.timer | ||
sonic_debian_extension.j2 | ||
swss_vars.j2 | ||
telemetry.service.j2 | ||
telemetry.timer | ||
updategraph.service.j2 |