Add a watchdog mechanism to the swss service and generate an alert when swss has an issue. **Work item tracking** Microsoft ADO (number only): 16578912 **What I did** Added an orchagent watchdog to monitor for, and alert on, a stuck orchagent. **Why I did it** Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit cannot detect or report it. **How I verified it** All UTs pass. Manually ran process_monitoring/test_critical_process_monitoring.py, which passes. Added a new UT, https://github.com/sonic-net/sonic-mgmt/pull/8306, to check that the watchdog works correctly. Manual test: after pausing orchagent with 'kill -STOP <pid>', a warning message appears in the log: Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes). **Details if related** Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737 UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306 |
---|---|---|
docker-base | ||
docker-base-bullseye | ||
docker-base-buster | ||
docker-base-stretch | ||
docker-basic_router | ||
docker-config-engine | ||
docker-config-engine-bullseye | ||
docker-config-engine-buster | ||
docker-config-engine-stretch | ||
docker-database | ||
docker-dhcp-relay | ||
docker-eventd | ||
docker-fpm-frr | ||
docker-fpm-gobgp | ||
docker-iccpd | ||
docker-lldp | ||
docker-macsec | ||
docker-mux | ||
docker-nat | ||
docker-orchagent | ||
docker-pde | ||
docker-platform-monitor | ||
docker-ptf | ||
docker-ptf-sai | ||
docker-router-advertiser | ||
docker-sflow | ||
docker-snmp | ||
docker-sonic-mgmt | ||
docker-sonic-mgmt-framework | ||
docker-sonic-p4rt | ||
docker-sonic-restapi | ||
docker-sonic-sdk | ||
docker-sonic-sdk-buildenv | ||
docker-sonic-telemetry | ||
docker-swss-layer-bullseye | ||
docker-swss-layer-buster | ||
docker-teamd | ||
dockerfile-macros.j2 |
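
The commit description above explains that an existence check cannot tell a running process from a stuck one, and that a heartbeat-based watchdog closes that gap. The following is a minimal sketch of that idea only, not the code from this change: the class `HeartbeatWatchdog`, the helper `format_alert`, and the 60-second timeout are hypothetical names and values chosen for illustration. The alert text is modeled on the log line quoted in the description.

```python
import time


class HeartbeatWatchdog:
    """Hypothetical heartbeat tracker; not the actual SONiC listener."""

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        # timeout_seconds is an assumed threshold, not a value from the source.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def watch(self, process_name):
        """Start tracking a process, treating 'now' as its first heartbeat."""
        self.last_seen[process_name] = self.clock()

    def heartbeat(self, process_name):
        """Record that the process reported it is still making progress."""
        self.last_seen[process_name] = self.clock()

    def stuck_processes(self):
        """Return {name: seconds_since_last_heartbeat} for overdue processes."""
        now = self.clock()
        return {
            name: now - seen
            for name, seen in self.last_seen.items()
            if now - seen >= self.timeout_seconds
        }


def format_alert(process_name, namespace, stuck_seconds):
    """Build an alert string shaped like the log line in the description."""
    return "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, stuck_seconds / 60.0
    )
```

A process that is paused (for example with `kill -STOP <pid>`) still exists, so an existence check passes, but it stops sending heartbeats and would show up in `stuck_processes()` once the timeout elapses. Injecting the clock keeps the sketch testable without real waiting.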