- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
Some syncd containers are still based on Debian Stretch, and thus do not have Python 3 available. For these containers, we must still rely on Python 2 to run supervisord_dependent_startup and supervisor-proc-exit-listener.
**- Why I did it**
We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.
**- How I did it**
- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
- Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
* Initial commit
* Add Ingrasys S9180-32X platform dirver.
Signed-off-by: Wade He <chihen.he@gmail.com>
* Add bfn.service for init barefoot.
Signed-off-by: Wade He <chihen.he@gmail.com>
* [Barefoot Beta] Add some functions and fixed some bugs.
1. Update sensors.conf.
2. Fixed IO expander init.
3. Fixed PSU EEPROM.
4. Fixed MB EEPROM.
5. Add fancontrol and fan init.
6. Add SYS LED control (sys, fan, fan tray).
7. 2.5V compute and setup max and min.
8. Fixed typo MB eeprom delete address.
9. Remove coretemp to BMC.
10. Add active CPLD.
11. Modify SFP+ GPIO slave address.
12. Modify tmp75 Near Port 32 slave address.
Signed-off-by: Wade He <chihen.he@gmail.com>
* Add bfn script in /etc/init.d/
Signed-off-by: Wade He <chihen.he@gmail.com>
* Add bfn service in debian
Signed-off-by: Wade He <chihen.he@gmail.com>
* Fixed CPLD switch LED behavior.
Signed-off-by: Wade He <chihen.he@gmail.com>
* [Barefoot Beta] Fixed sensors and hwmon order.
1. Fixed ignore sensors Vbat.
2. Reorg hwmon order.
Signed-off-by: Wade He <chihen.he@gmail.com>
* Fixed PSU1 and PSU2 EEPROM order.
Signed-off-by: Wade He <chihen.he@gmail.com>
* initial barefoot checkin october 2017
* update refpoint
* update refpoints
* update refpoints to bf-master
* update refpoint
* update refpoint to tested version
* change to platform from asic
* update refpoint for swss
* revert core creation setting
* update refpoints
* add telnet for debug shell
* update refpoints 11/17/17
* missed change in file on previous merge
* [CPLD] Fixed blink LED issue.
* Fixed blink LED mask set error.
Signed-off-by: Wade He <chihen.he@gmail.com>
* Update bf_kdrv.c for 6.0.2.39
* Update bf kernel driver
* Add bf_fun kernel module.
* Update bf_tun for fixed build error
* merge with Azure master (12/12/17)
* update swss refpoint
* update refpoint of swss
* library dependency for stack unroll
* update refpoint to bf-master
* [DHCP relay]: Fix circuit ID and remote ID bugs (#1248)
* [DHCP relay]: Fix circuit ID and remote ID bugs
* Set circuit_id_len after setting circuit_id_len to ip->name
* [Platform] Add Psuutil and update sensors.conf for S9100-32X, S8810-32Q and S9200-64X (#1272)
* Add I2C CPLD kernel module for psuutil.
* Support psuutil script.
* Add voltage min and max threshold.
* Update sensors.conf for tmp75.
Signed-off-by: Wade He <chihen.he@gmail.com>
* Allow multi platform support - infra (more changes to follow)
* update relative path to include platform for clarity
* [Platform] Add Ingrasys S9130-32X and S9230-64X with Nephos Switch ASIC for "branch 201712" (#1274)
- What I did
Add switch ASIC vendor: Nephos
Add Nephos platforms: Ingrasys S9130-32X, Ingrasys S9230-64X
- How I did it
Add platform/nephos files
Add platform/nephos/sonic-platform-modules-ingrasys submodule
Add device/ingrasys/x86_64-ingrasys_s9130_32x-r0 files
Add device/ingrasys/x86_64-ingrasys_s9230_64x-r0 files
Add SONiC to support Nephos platform
Update Head of submodule src/sonic-sairedis to "3b817bb"
- How to verify it
To build SONiC installer image and docker images, run the following commands:
make configure PLATFORM=nephos
make target/sonic-nephos.bin
Check system and network feature is worked as well
- Description for the changelog
Add switch ASIC vendor and platforms for Nephos
- A picture of a cute animal (not mandatory but encouraged)
Signed-off-by: Sam Yang <yang.kaiyu@gmail.com>
* change source of files to github (from dropbox), update sairedis refpoint
* update refpoint of sairedis
* [centec] support CENTEC SAI 1.0 on 201712 branch and update e582-48x6q board (#1269)
* [marvel]: Marvell's updates for SONiC.201712 & SAI v1.0 (#1287)
* update sairedis (fast-boot refpoint)
* fix syncd rpc make files
* update refpoint to handle Makefile change (no functional change)
* [Marvell]: Add support for SLM5401-54x device (#1307)
* Marvell's updates for SONiC.201712 & SAI v1.0
* [Platform] Add Marvell's SLM5401-54x for branch 201712
* [Broadcom]: Update Boradcom SAI package to 3.0.3.3-3 (#1312) (#1321)
- update Arista 7050-QX32S config.bcm file
- update Accton th-as771*-32x100G.config.bcm files
* update refpoint for Makefile chnage in sairedis
* update refpoint - sairedis
* update sairedis to older refpoint till we debug clean build
* export asic platform for build
* update refpoint for makefiles
* [PLATFORM] Centec update E582 driver fan/epprom/sensor (#1332)
* Upload wnc-osw1800
* Modify for Barefoot suggest
* Revert bfn-platform.mk
* Update bfn-platform-wnc.mk
Update parameter name
* Update parameter name
* initial support for WNC platform
* change switch name to "switch"
* Delete bf modules for rel_7_0
* Add Ingrasys S9180 platform
Signed-off-by: Wade He <chihen.he@gmail.com>
* Modify bfnsdk for Ingrasys S9180 platform
Signed-off-by: Wade He <chihen.he@gmail.com>
* Resolved the conflict.
* Resolved the conflict.
* Update submodule path and url.
* Delete unused file.
* Update PSU GPIO and EEPROM for psuutil.
* Add psuutil in S9180-32X
Signed-off-by: Wade He <chihen.he@gmail.com>
* update refpoint
* update refpoint
* change contact email, update refpoint
* cleanup and update kernel modules
* updates based on review
* update refpoint
* update refpoint
* fix typo in config script to check for platforms
* remove stale file
* resolve conflicts
* cleanup diffs with Azure repo and update SDK debs
* update refpoints to Azure
* address review comments
* revert refpoint of swss-common
* porting the build fix from master
* porting build fix from master
* Minor Fix
* Minor fix
* Temp to sde deb packages url
* Update sonic - sairedis,swss & swss-common refpoints
* Update git modules url path to bfn repo
* updated paths for swss, swss-common & sairedis
* Update refpoint for sonic-swss to local bfn repo
* Update URL for downloading sde debian packages
* porting fix links of debian git server from master
* porting fix links of debian git server from master
* [Ingrasys] Add platform support for S9280-64X with Barefoot ASIC
* Update ref points for swss, swss-common and sairedis repos
* Add sonic platform scripts for bfn montara/maverick
* Call sh scripts instead of calling py scripts
* Address upstream PR Comments (#10)
* Update bf-master with azure/master
* Undo changes to some files
* Revert "Address upstream PR Comments (#10)"
This reverts commit a7fddb83ca.
* Address upstream comments (#11)
* Remove all non bfn specific changes from upstream PR
* Revert "Address upstream comments (#11)"
This reverts commit 559132103e.
* Undo non bfn changes
* Little more cleanup
* Add back code removed in merge
* export CONFIGURED_PLATFORM
* Update sairedis and swss refpoints
* Address Upstream PR comment
* change deb pkg dependency from 3.16.0-4-amd64 to 3.16.0-5-amd64
* Set default tx queue len for usb0 interface to 64
* Update sairedis refpoint
* Update swss ref point
* Add bfn buffer cfg files for montara/maverick as per new design
* Update buffer cfg templates for bfn montara
* add non zero size to buffer profile
* add macro to generate port lists
* Update buffer cfg templates for bfn mavericks
* add non zero size for buffer profiles
* add port generation macro
* Add missing psmisc package
* BGP docker seems to be missing killall utility being used by fast-reboot script. This is causing non graceful termination of BGP sessions.
Adding psmisc to resolve this issue.
* Update swss ref point
* Update swss ref point
* Update sairedis refpoint
* Update sairedis refpoint
* Update sairedis refpoint
* Update sairedis refpoint
* Update refpoint for sairedis and swss
* sairedis to azure master
* swss to latest bfn bf-master
* Update gitmodules
Update url for sairedis to azure master
* Correct typo in bfn platform script
* Update swss and sairedis ref points
* Update swss ref point
* Address Review comments
* Update swws path in gitmodules to azure master
* update swss refpoint
* update base docker j2 file -remove psmisc package (could be a concern, would cause fast reboot to not work correctly will fix in another PR)
* Fix sairedis refpoint broken in by previous merge
* Remove psmisc from docker base image
* This will break fast reboot as killall is required for killing bgp process and initiating graceful termination of BGP session.
Will fix this in a seperate PR. Need this for SONIC upstreaming
* Address upstream comments
* Remove bmc interface from interface jinja template and sample output interfaces file
* Add bmc interface at boot time to network interfaces for bfn bmc based platforms
* Remove autogen ingrasys debian files
* Revert "Remove autogen ingrasys debian files"
* Buffer and qos config template fix for bfn platforms (#21)
SWI-1509 Buffer and qos config template fix for bfn platforms
* Fix qos config files for montara & mavericks (#22)
* Reference only ppg 3,4 in qos files as no profiles are attached to 0,1 in buffer configs
* Fix vs test (#23)