05f1a5a31e
Add watchdog mechanism to swss service and generate an alert when swss has an issue.

**Work item tracking**

Microsoft ADO (number only): 16578912

**What I did**

Add an orchagent watchdog to monitor and alert on orchagent stuck issues.

**Why I did it**

Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit cannot detect or report it.

**How I verified it**

Pass all UT. Manually ran process_monitoring/test_critical_process_monitoring.py, which passes. Added a new UT, https://github.com/sonic-net/sonic-mgmt/pull/8306, to check that the watchdog works correctly. Manual test: after pausing orchagent with 'kill -STOP <pid>', check that a warning message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**

Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737

UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306

||
---|---|---|
base_image_files | ||
arp_update.conf | ||
buffermgrd.sh | ||
critical_processes.j2 | ||
docker-init.j2 | ||
Dockerfile.j2 | ||
enable_counters.py | ||
events_info.json | ||
ipinip.json.j2 | ||
ndppd.conf | ||
ndppd.conf.j2 | ||
orchagent.sh | ||
ports.json.j2 | ||
supervisord.conf.j2 | ||
switch.json.j2 | ||
swss_regex.json | ||
swssconfig.sh | ||
tunnel_packet_handler.conf | ||
tunnel_packet_handler.py | ||
vlan_vars.j2 | ||
vxlan.json.j2 | ||
wait_for_link.sh.j2 | ||
watchdog_processes.j2 |
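
The commit message above describes a watchdog that reports orchagent as stuck when it stops processing, even though the process still exists. A minimal sketch of that idea follows; it is not the code from this commit. The class name `HeartbeatWatchdog`, its methods, and the 60-second timeout are hypothetical, chosen only for illustration. The alert text mirrors the log line quoted in the commit message.

```python
import time


class HeartbeatWatchdog:
    """Hypothetical sketch: flag processes whose heartbeat has gone silent."""

    def __init__(self, timeout_s, namespace="host", clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed threshold, not taken from the commit
        self.namespace = namespace
        self.clock = clock           # injectable so the logic can be tested
        self.last_seen = {}          # process name -> time of last heartbeat

    def heartbeat(self, process):
        """Record that `process` reported progress just now."""
        self.last_seen[process] = self.clock()

    def check(self):
        """Return one alert string per process that has been silent too long."""
        now = self.clock()
        alerts = []
        for process, seen in self.last_seen.items():
            elapsed = now - seen
            if elapsed >= self.timeout_s:
                alerts.append(
                    "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
                        process, self.namespace, elapsed / 60
                    )
                )
        return alerts


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog(timeout_s=60.0)
    watchdog.heartbeat("orchagent")  # start tracking orchagent
    for alert in watchdog.check():   # nothing is reported until the timeout passes
        print(alert)
```

A process that exists but never calls `heartbeat` again is the case a plain "is the process running" check misses; here it surfaces once the silence exceeds the timeout, which is the situation `kill -STOP <pid>` reproduces in the manual test.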