Add watchdog mechanism to swss service and generate alert when swss have issue.
**Work item tracking**
Microsoft ADO (number only): 16578912
**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.
**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.
**How I verified it**
Pass all UT.
Manually test process_monitoring/test_critical_process_monitoring.py can pass.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
This reverts commit 44427a2f6b.
Docker image not updated during PR validation and caused PR check failures.
Force merge this revert. After cache is updated after this PR is merged, issue should be fixed.
This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737 merge first.
**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.
**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.
**How I verified it**
Pass all UT.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
At SWSS docker init time, check the device subtype and enable tunnel packet handler only if it is dualtor
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
- Why I did it
To provide an ability to suppress ASAN false positives and have a clean ASAN report for docker-sonic-vs/mlnx-syncd/orchagent docker
- How I did it
Added the "print_suppressions=0" to ASAN configs.
- How to verify it
add a suppression to some ASAN-enabled component (the suppression should catch some leak)
build with ENABLE_ASAN=y
run a test and see that the ASAN report is empty instead of having the suppression summary
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
- Use the `wait_for_link.sh` script to delay ndppd start until after the VLAN interface is ready
- Avoids issue where ndppd tries to change interface attributes before the interface is ready
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Implement infrastructure that allows enabling address sanitizer
for docker containers. Enable address sanitizer for SWSS container.
- Why I did it
To add a possibility to compile SONiC applications with address sanitizer (ASAN).
ASAN is a memory error detector for C/C++. It finds:
1. Use after free (dangling pointer dereference)
2. Heap buffer overflow
3. Stack buffer overflow
4. Global buffer overflow
5. Use after return
6. Use after the scope
7. Initialization order bugs
8. Memory leaks
- How I did it
By adding new ENABLE_ASAN configuration option.
- How to verify it
By default ASAN is disabled and the SONiC image is not affected.
When ASAN is enabled it inspects all allocation, deallocation, and memory usage that the application does in run time. To verify whether the application has memory errors tests that trigger memory usage of the application should be run. Ideally, the whole regression tests should be run. Memory leaks reports will be placed in /var/log/asan/ directory of SONiC host OS.
Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
- Create a script in the orchagent docker container which listens for these encapsulated packets which are trapped to CPU (indicating that they cannot be routed/no neighbor info exists for the inner packet). When such a packet is received, the script will issue a ping command to the packet's inner destination IP to start the neighbor learning process.
- This script is also resilient to portchannel status changes (i.e. interface going up or down). An interface going down does not affect traffic sniffing on interfaces which are still up. When an interface comes back up, we restart the sniffer to start capturing traffic on that interface again.