sonic-buildimage/dockers/docker-orchagent
Hua Liu 05f1a5a31e
Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429)

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add an orchagent watchdog that monitors orchagent and raises an alert when it gets stuck.

**Why I did it**
Currently the SONiC monitoring system only checks whether the orchagent process exists. If orchagent is stuck and stops processing, the current monitoring cannot detect or report it.
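
The gap described above is the difference between "the process exists" and "the process is making progress". A minimal sketch of a heartbeat-based check follows. It is not the code from this PR: `HeartbeatWatchdog`, `format_alert`, and the 60-second timeout are hypothetical names and values chosen for illustration, and only the alert wording is modeled on the log line quoted in this commit message.

```python
import time


class HeartbeatWatchdog:
    """Remember the last heartbeat per process and report the silent ones.

    Hypothetical illustration only; the real listener in swss may differ.
    """

    def __init__(self, timeout_seconds=60.0, clock=time.monotonic):
        # timeout_seconds is an assumed threshold, not a value taken from the PR.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_heartbeat = {}

    def heartbeat(self, process_name):
        """Record that process_name reported progress just now."""
        self.last_heartbeat[process_name] = self.clock()

    def stuck_processes(self):
        """Return {process_name: seconds_silent} for processes past the timeout."""
        now = self.clock()
        return {
            name: now - seen
            for name, seen in self.last_heartbeat.items()
            if now - seen >= self.timeout_seconds
        }


def format_alert(process_name, namespace, silent_seconds):
    """Build an alert string shaped like the log line in this commit message."""
    return "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, silent_seconds / 60.0
    )


if __name__ == "__main__":
    # A fake clock makes the demo deterministic: no real waiting is needed.
    fake_now = [0.0]
    watchdog = HeartbeatWatchdog(timeout_seconds=60.0, clock=lambda: fake_now[0])
    watchdog.heartbeat("orchagent")
    fake_now[0] = 60.0  # a minute passes with no further heartbeat
    for name, silent in watchdog.stuck_processes().items():
        print(format_alert(name, "host", silent))
```

A process paused with SIGSTOP still exists, so an existence check keeps passing, while a check of this kind flags it once heartbeats stop arriving.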

**How I verified it**
All unit tests pass.

Manually ran process_monitoring/test_critical_process_monitoring.py; it passes.

Added a new test in https://github.com/sonic-net/sonic-mgmt/pull/8306 to check that the watchdog works correctly.

Manual test: after pausing orchagent with 'kill -STOP <pid>', check that an alert message like the following appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
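
When repeating this manual check, it can help to search the log for the alert programmatically. The sketch below is an assumption-laden helper, not part of the PR: `STUCK_ALERT` and `find_stuck_alerts` are hypothetical names, and the regular expression is derived only from the single sample line above, so other message variants may not match.

```python
import re

# Pattern inferred from the one sample log line in this commit message.
STUCK_ALERT = re.compile(
    r"Process '(?P<process>[^']+)' is stuck in namespace '(?P<namespace>[^']*)' "
    r"\((?P<minutes>\d+(?:\.\d+)?) minutes\)"
)


def find_stuck_alerts(log_lines):
    """Return (process, namespace, minutes) for every stuck-process alert found."""
    alerts = []
    for line in log_lines:
        match = STUCK_ALERT.search(line)
        if match:
            alerts.append(
                (
                    match.group("process"),
                    match.group("namespace"),
                    float(match.group("minutes")),
                )
            )
    return alerts


if __name__ == "__main__":
    sample = [
        "Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: "
        "Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).",
    ]
    print(find_stuck_alerts(sample))
```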

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-12 17:53:54 -07:00
..

| File | Last commit | Date |
| --- | --- | --- |
| base_image_files | [Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676) | 2021-06-04 10:16:53 -07:00 |
| arp_update.conf | [docker-orchagent]: use service dependency in supervisord to start services | 2020-05-22 11:01:28 -07:00 |
| buffermgrd.sh | [Reclaim buffer] Common infrastructure update for reclaiming buffer (#9133) | 2021-11-24 15:00:23 +02:00 |
| critical_processes.j2 | [fabric] Disable unnecessary processes in swss and the orchagent-portsyncd dependency for fabric asic (#5569) | 2021-06-09 10:53:47 -07:00 |
| docker-init.j2 | Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429) | 2023-06-12 17:53:54 -07:00 |
| Dockerfile.j2 | [infra] Support syslog rate limit configuration (#12490) | 2022-12-20 10:53:58 +02:00 |
| enable_counters.py | Replace swsssdk with swsscommon (#11215) | 2022-07-11 10:01:10 +08:00 |
| events_info.json | Add rsyslog plugin regex for select operation failure (#12659) | 2022-11-13 21:41:33 -08:00 |
| ipinip.json.j2 | [IPinIP] Add Loopback2 interface, change dscp mode to uniform (#7234) | 2021-04-07 09:58:12 -07:00 |
| ndppd.conf | [swss]: Wait for vlan intf to start ndppd (#10119) | 2022-03-02 16:23:56 -08:00 |
| ndppd.conf.j2 | [docker-orchagent]: Increase ndppd kernel poll interval (#7456) | 2021-04-30 16:30:30 -07:00 |
| orchagent.sh | Introduce the asic_subtype field for adding the sub platform variants. (#10235) | 2022-03-28 11:22:32 -07:00 |
| ports.json.j2 | [cfggen] Make Jinja2 Template Python 3 Compatible | 2020-09-30 07:07:43 -07:00 |
| supervisord.conf.j2 | Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429) | 2023-06-12 17:53:54 -07:00 |
| switch.json.j2 | Add EPMS and MgmtTsToR (#10478) | 2022-04-07 21:49:42 -07:00 |
| swss_regex.json | Add rsyslog plugin regex for select operation failure (#12659) | 2022-11-13 21:41:33 -08:00 |
| swssconfig.sh | [Mellanox][VXLAN] add params to vxlan.json file in order to configure VXLAN src port range feature (#9658) | 2022-01-31 15:57:30 +02:00 |
| tunnel_packet_handler.conf | [swss]: Run tunnel_pkt_handler on dualtor only (#11627) | 2022-08-09 16:19:59 -07:00 |
| tunnel_packet_handler.py | [tunnel_pkt_handler]: Skip nonexistent intfs (#12424) | 2022-10-20 09:29:57 -07:00 |
| vlan_vars.j2 | [swss] Reduce Calls to SONiC Cfggen (#5177) | 2020-08-17 15:47:52 -07:00 |
| vxlan.json.j2 | [Mellanox][VXLAN] add params to vxlan.json file in order to configure VXLAN src port range feature (#9658) | 2022-01-31 15:57:30 +02:00 |
| wait_for_link.sh.j2 | [swss]: Wait for vlan intf to start ndppd (#10119) | 2022-03-02 16:23:56 -08:00 |
| watchdog_processes.j2 | Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429) | 2023-06-12 17:53:54 -07:00 |