884dfa5427
7 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
yozhao101
|
e24fe9bc60
|
[Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com> Why I did it This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container. Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following: check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted. Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window. The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok. Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry: Program 'container_memory_telemetry' status Status ok monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Sat, 19 Mar 2022 19:56:26 Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service: Program 'container_memory_telemetry' status Status failed monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Tue, 01 Feb 2022 22:52:55 After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok. How I did it In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles. How to verify it I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review. |
||
yozhao101
|
1a3cab43ac
|
[Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676)
Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit. How I did it I removed the script process_checker and corresponding Monit configuration entries of critical processes. How to verify it I verified this on the device str-7260cx3-acs-1. |
||
yozhao101
|
37863ac854
|
[Monit] Restart telemetry container if memory usage is beyond the threshold (#7645)
Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold. How I did it I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container. How to verify it I verified this implementation on device str-7260cx3-acs-1. |
||
abdosi
|
dddf96933c
|
[monit] Adding patch to enhance syslog error message generation for monit alert action when status is failed. (#5720)
Why/How I did: Make sure first error syslog is triggered based on FAULT TOLERANCE condition. Added support of repeat clause with alert action. This is used as trigger for generation of periodic syslog error messages if error is persistent Updated the monit conf files with repeat every x cycles for the alert action |
||
yozhao101
|
13cec4c486
|
[Monit] Unmonitor the processes in containers which are disabled. (#5153)
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that Monit will not generate false alerting messages into the syslog. Signed-off-by: Yong Zhao <yozhao@microsoft.com> |
||
RayWang910012
|
f90bf8fe62
|
[monit]: monit_telemetry which will have error when telemetry is in secure mode (#4286)
When telemetry is in secure mode ,the monitor will have error log of the match string "--insecure". So I modify to be compatiable with insecure mode and secure mode. Co-authored-by: Ubuntu <ubuntu@ip-10-5-1-21.ap-south-1.compute.internal> |
||
yozhao101
|
b7e48b422f |
[Services] Allow monit system tool to monitor the critical processes status running in various SONiC containers. (#3940)
* Add a monit config file for teamd container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * Add a copy mechanism to put the monit config file in teamd container into base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * Add a monit config file for snmp container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * Add a copy mechanism to put the monit config file of snmp container into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * Add a monit config file for dhcp_relay container in the dir base_image_files. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * Add a copy mechanism to put the monit config file of dhcp_relay container into base image under /etc/monit/conf.d. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * Add a monit config file for router advertiser container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * Add a copy mechanism to put the monit config file of router advertiser contianer into base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-Pmon] Add a monit config file for pmon container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-Pmon] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-lldp] Add a monit config file for lldp container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-lldp] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-bgp] Add a monit config file for BGP container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-bgp] Add a copy mechanism to put monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-swss] Add a monit config file for the swss container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-swss] Add a copy mechanism to put monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on barefoot platform. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image on barefoot. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on broadcom. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image on broadcom. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on cavium. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-centec] Add a monit config file for syncd container on centen platform. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on centen platform. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on marvell. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit conifg file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on marvell-arm64. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image on marvell-arm64. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on marvell-armhf. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on mellanox. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a monit config file for syncd container on nephos. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-sflow] Add a monit config file for sflow container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-sflow] Add a copy mechanism to put the monit conifg file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-telemetry] Add a monit config file for telemetry container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-telemetry] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-database] Add a monit config file for database container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-database] Add a copy mechanism to put the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-Dhcprelay] Change a typo. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-Dhcprelay] Change the process name in monit config file to dhcrelay. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] There is no desserve process in syncd container on barefoot. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] There is no process desserve in syncd container on cavium. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] There is no process named desserve in syncd on centec. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] There is no process named desserve in syncd on marvell. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Should not delete the process desserve in syncd container on marvell. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Delete the process dsserve in syncd on marvell. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Delete the process dsserve in syncd container on marvell-arm64. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Delete the process dsserve in syncd container on marvell-armhf. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Delete the process dsserve in syncd container on mellanox. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-Radv] Change the process name to radvd. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-telemetry] Correct a typo in monit_telemetry. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-teamd] Delete the monit config file for teamd. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-teamd] Delete the mechanism to copy the monit config file into base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-dhcprelay] Delete the monit config file for dhcp_relay container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-dhcprelay] Delete the mechanism to copy the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-radv] Delete the monit config file foe radv container. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-radv] Delete the mechanism to copy the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-bgp] change the monit config file for BGP container such that monit only generates alert if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-snmp] Change the monit config file for snmp container such that monit only generates alret if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-pmon] Change the monit config file for pmon container such that monit only generates alert if the processes are not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-lldp] Change the monit config file for lldp container such that monit only generates alerts if some processes are not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-pmon] Delete the monit config file for pmon container since some of processes are not running depended on the type of box. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-pmon] Delete the copy mechanism to copy the monit config file into the base image. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-lldp] Change the matching name for the process lldpd. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-swss] Change the monit config file for swss container such that monit only generates alerts if the processes are not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container on barefoot such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Correct a typo in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container on broadcom such that monit only generates alerts if the processes are not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container on cavium such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container on marvell such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container on marvell-arm64 such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container on marvell-armhf such that monit will generate alert if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Change the monit config file for syncd container on mellanox such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-sycnd] Change the monit config file for syncd container such that monit only generates alerts if the processes are not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-sflow] Change the monit config file for sflow container such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-telemetry] Change the monit config file for telemetry container such that monit only generates alerts if the processes are not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-database] Change the monit config file for database container such that monit only generates alerts if the process is not running for 5 minutes. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-database] Use 4 spaces to replace 2 spaces in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-bgp] Use 4 spcess to replace 2 spaces in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-lldp] Use 4 spaces to replace 2 spaces in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-swss] Use 4 spaces to replace 2 space in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-sflow] Use 4 spaces to replace 2 spaces in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-snmp] Use 4 spaces to replace 2 spaces in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-telemetry] Use 4 spaces to replace 2 spaces in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Use 4 spaces to replace 2 spaces in the monit config file on barefoot. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Use 4 spaces to replace 2 spaces in the monit config file on broadcom. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Use 4 spaces to replace 2 spaces in the monit config file on cavium. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Use 4 spaces to replace 2 spaces in the monit config file on centec. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Use 4 spaces to replace 2 spaces in the monit config file on marvell. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Use 4 spaces to replace 2 spaces in the monit config file on mellanox. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-syncd] Use 4 spaces to repalce 2 spaces in the monit config file on nephos. Signed-off-by: Yong Zhao <yozhao@microsoft.com> * [Docker-bgp] Remove the trailing extra spaces in monit config file. Signed-off-by: Yong Zhao <yozhao@microsoft.com> |