sonic-buildimage/dockers
yozhao101 e24fe9bc60
[Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR fixes an issue where Monit cannot reset its counter when monitoring the memory usage of the telemetry container.

Specifically, the Monit configuration that monitors the memory usage of the telemetry container is as follows:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If the memory usage of the telemetry container is larger than 400 MB (419430400 bytes) for 10 times within 20 cycles (minutes), the container is restarted.
Recently we observed that, after the telemetry container was restarted, its memory usage kept increasing from 400 MB to 11 GB within 1 hour, but the container was not restarted again during this 1-hour sliding window.

The reason is that Monit did not reset its counter and start counting again. Monit resets its counter if and only if the status of the monitored service changes from Status failed to Status ok, and during this 1-hour sliding window the status never made that change.

Currently, for each service monitored by Monit, there is an entry showing the monitoring status, monitoring mode, etc. For example, the following output of the command sudo monit status shows the status of the service that monitors the memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every minute, Monit runs the script to check the memory usage of telemetry and updates the counter if the memory usage is larger than 400 MB. If Monit checks the counter and finds that the memory usage of telemetry was larger than 400 MB for 10 times within 20 minutes, the telemetry container is restarted. The following is an example status of the monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After the telemetry container was restarted, we found that its memory usage increased rapidly from around 100 MB to more than 400 MB within 1 minute, so the status of the monitored service did not have a chance to change from Status failed to Status ok.

How I did it
To work around this issue, Monit recently introduced another syntax, repeat every <n> cycles, for use with exec. This syntax lets Monit execute the background script again if the error persists for the given number of cycles.
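
Applied to the telemetry rule shown earlier, the new syntax might look like the following. This is a sketch rather than the exact committed line: the interval of 2 cycles is an assumed value chosen for illustration, and the comment line is added here.

  # Sketch only: the interval in "repeat every 2 cycles" is an assumed value.
  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles

With a rule of this shape, Monit would run restart_service again every 2 cycles for as long as the check keeps failing, rather than only once when the status first changes to Status failed.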

How to verify it
I verified this change on lab device str-s6000-acs-12. A separate pytest PR (Azure/sonic-mgmt#5492) has been submitted to the sonic-mgmt repo for review.
2022-04-20 18:08:06 -07:00
..
docker-base Add a config variable to override default container registry instead of dockerhub. (#10166) 2022-03-14 18:09:20 +08:00
docker-base-bullseye Image disk space reduction (#10172) 2022-03-15 18:12:49 -07:00
docker-base-buster Image disk space reduction (#10172) 2022-03-15 18:12:49 -07:00
docker-base-stretch Add a config variable to override default container registry instead of dockerhub. (#10166) 2022-03-14 18:09:20 +08:00
docker-basic_router [supervisord]: use abspath as supervisord entrypoint (#5995) 2020-11-22 21:18:44 -08:00
docker-config-engine [docker-base-buster][docker-config-engine-buster] No longer install Python 2 (#6162) 2020-12-25 21:29:25 -08:00
docker-config-engine-bullseye Image disk space reduction (#10172) 2022-03-15 18:12:49 -07:00
docker-config-engine-buster Image disk space reduction (#10172) 2022-03-15 18:12:49 -07:00
docker-config-engine-stretch [docker-base-buster][docker-config-engine-buster] No longer install Python 2 (#6162) 2020-12-25 21:29:25 -08:00
docker-database [redis] Upgrade redis version (#9757) 2022-02-15 16:43:01 -08:00
docker-dhcp-relay [dhcp_relay] Remove dhcp6mon (#10467) 2022-04-12 10:44:17 -07:00
docker-fpm-frr [chassis][bgp] create v4 and v6 peer group for VoQ internal neighbors (#9693) 2022-02-24 11:21:26 -08:00
docker-fpm-gobgp [dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083) 2021-03-27 21:14:24 -07:00
docker-iccpd add platform to iccpd's env (#8945) 2021-12-08 09:21:44 -08:00
docker-lldp Dynamic port configuration - solve lldp issues when adding/removing ports (#9386) 2022-03-25 17:47:24 -07:00
docker-macsec [macsec]: Upgrade docker-macsec to bullseye (#10574) 2022-04-17 20:32:51 +08:00
docker-mux Upgrade mux container to Bullseye (#10498) 2022-04-19 09:27:45 -07:00
docker-nat Create a docker-swss-layer that holds the swss package. 2022-01-06 09:26:55 -08:00
docker-orchagent Add EPMS and MgmtTsToR (#10478) 2022-04-07 21:49:42 -07:00
docker-pde [PDE]: introduce the SONiC Platform Development Env (#7510) 2021-07-24 16:24:43 -07:00
docker-platform-monitor Removed python2 dependency for sonic-pcied in sonic-platform-daemons (#10421) 2022-04-09 13:16:50 -07:00
docker-ptf Revert "[docker-ptf]: Upgrade scapy to 2.4.5 in docker-ptf (#10507)" (#10537) 2022-04-13 08:57:10 +08:00
docker-ptf-sai install xmlrunner python3 version (#10086) 2022-02-28 11:21:04 +08:00
docker-router-advertiser Update docker-router-advertiser.supervisord.conf.j2 (#10375) 2022-04-06 09:44:21 -07:00
docker-sflow Create a docker-swss-layer that holds the swss package. 2022-01-06 09:26:55 -08:00
docker-snmp updated jinja template for snmp contact python2 vs python3 issue (#9949) 2022-02-10 09:01:46 -08:00
docker-sonic-mgmt Add scapy support for python3 virtual environment in the sonic-mgmt docker container (#10234) 2022-03-16 12:00:51 +08:00
docker-sonic-mgmt-framework Image disk space reduction (#10172) 2022-03-15 18:12:49 -07:00
docker-sonic-p4rt [PINS] update sonic-p4rt docker to bullseye (#10182) 2022-03-23 17:21:36 -07:00
docker-sonic-restapi [restapi]: Don't use python/python2 for restapi start scripts (#10285) 2022-03-22 18:34:42 -07:00
docker-sonic-sdk [sonic-sdk] add sonic sdk and sonic sdk buildenv (#6712) 2021-05-28 10:16:02 -07:00
docker-sonic-sdk-buildenv [sonic-sdk] add sonic sdk and sonic sdk buildenv (#6712) 2021-05-28 10:16:02 -07:00
docker-sonic-telemetry [Monit] Fix the issue which shows Monit can not reset its counter. (#10288) 2022-04-20 18:08:06 -07:00
docker-swss-layer-buster Create a docker-swss-layer that holds the swss package. 2022-01-06 09:26:55 -08:00
docker-teamd Create a docker-swss-layer that holds the swss package. 2022-01-06 09:26:55 -08:00
dockerfile-macros.j2 [sonic-config-engine] Clean up dependencies, pin versions; install Python 3 package in Buster container (#5656) 2020-10-26 13:48:50 -07:00