Commit Graph

15 Commits

Author SHA1 Message Date
Joe LeVeque
c651a9ade4
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083)
To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-03-27 21:14:24 -07:00
yozhao101
be3c036794
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
2021-01-21 12:57:49 -08:00
vmittal-msft
a624aa01c7
Upgrade syncd to buster. (#6106)
- Why I did it
To upgrade brcm syncd to buster

- How I did it

Updated BCM SAI using kernel version 4.19.0-12 and debian 10 to support buster.
Updated syncd docker from stretch to buster in sonic-buildimage
- How to verify it

Ensured docker is running synd buster.
After upgrade, ensured all BGP peers and ip interfaces are up.
Ping to BGP neighbors is working fine.
2020-12-17 12:46:45 -08:00
Joe LeVeque
80bf8691e8
[Syncd] containers still based on Stretch must still use Python 2 (#6010)
Some syncd containers are still based on Debian Stretch, and thus do not have Python 3 available. For these containers, we must still rely on Python 2 to run supervisord_dependent_startup and supervisor-proc-exit-listener.
2020-11-23 22:35:58 -08:00
Joe LeVeque
7bf05f7f4f
[supervisor] Install vanilla package once again, install Python 3 version in Buster container (#5546)
**- Why I did it**

We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.

**- How I did it**

- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
    - Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
2020-11-19 23:41:32 -08:00
Joe LeVeque
5b3b4804ad
[dockers][supervisor] Increase event buffer size for dependent-startup (#5247)
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:

```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```

This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.

Resolves https://github.com/Azure/sonic-buildimage/issues/5241
2020-09-08 23:36:38 -07:00
Guohan Lu
bb40300cad [docker-syncd-brcm]: use service dependency in supervisord to start services 2020-05-22 11:01:28 -07:00
yozhao101
91e5fb5602
[Service] Enable/disable container auto-restart based on configuration. (#4073) 2020-02-07 12:34:07 -08:00
Joe LeVeque
85b0de3df1 [docker-syncd]: Restart SwSS, syncd and dependent services if a critical process in syncd container exits unexpectedly (#3534)
Add the same mechanism I developed for the SwSS service in #2845 to the syncd service. However, in order to cause the SwSS service to also exit and restart in this situation, I developed a docker-wait-any program which the SwSS service uses to wait for either the swss or syncd containers to exit.
2019-11-09 10:26:39 -08:00
Joe LeVeque
13e17d3a33 [docker-syncd-brcm] Add 'startsecs=0' to ledinit process (#2366) 2018-12-07 20:04:22 -08:00
Qi Luo
ff237aaf18
[syncd] Treat bcmcmd as a supervisor task so we could collect stdout/stderr (#1825) 2018-06-29 08:37:20 -07:00
Joe LeVeque
f49cac086f Remove extra trailing newlines at EOF (#804)
Files now end with a single newline
2017-07-12 20:54:37 -07:00
Joe LeVeque
a2eda30a03 [docker-syncd-*]: Properly manage syncd with supervisord (#617)
- This PR allows supervisord to log syncd exit events to syslog 
 - Syncd dockers now are built from docker-config-engine instead of docker-base
 - Supervisord in all syncd dockers now call syncd_start.s which is installed by sonic-sairedis repo
2017-05-24 11:53:38 -07:00
Joe LeVeque
d5c13c0a83 [dockers]: Disable autorestart on all supervisor processes inside containers (#580) 2017-05-09 17:37:08 -07:00
Joe LeVeque
8f348399f5 [Dockers]: Manage all Docker containers with Supervisord (#573)
- Consolidate config.sh and start.sh scripts into one script (start.sh)
 - Solve issue #435 - All dockers now run supervisord as their ENTRYPOINT
 - All stdout/stderr output from processes managed by supervisord is now sent to syslog instead of their own files
 - Supervisord log messages are now also sent to syslog
 - Removed unused smartmontools package from docker-platform-monitor
2017-05-08 15:43:31 -07:00