Commit Graph

26 Commits

Author SHA1 Message Date
Kebo Liu
9c4a7c2fed
[PMON] Skip chassis_db_init task on Mellanox simx platform (#9017)
Why I did it
The "chassis_db_init" task of PMON should be skipped on the Mellanox simx platform: the hardware info this task tries to access is not available on simx platforms, so running it produces error logs.

How I did it
Add the capability in the template for "chassis_db_init" to be skipped via configuration in "pmon_daemon_control.json".
Add the "skip_chassis_db_init" configuration for simx platforms.
Use a symbolic link for "pmon_daemon_control.json", since all the simx platforms share the same configuration.
How to verify it
Build an image and install it on simx platform to check whether "chassis_db_init" task is skipped.
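
As a rough illustration of the skip mechanism described in this commit, the Python sketch below reads a `pmon_daemon_control.json`-style document and decides whether a task should be launched. Only the key name `skip_chassis_db_init` comes from the commit message; the helper `should_run_task`, the sample JSON, and the rule that a missing key means "run the task" are assumptions made for illustration, not the actual template logic.

```python
import json


def should_run_task(control_json: str, task: str) -> bool:
    """Return True unless the control document sets skip_<task> to true."""
    config = json.loads(control_json) if control_json.strip() else {}
    # Assumption: a missing key means the task is not skipped.
    return not bool(config.get(f"skip_{task}", False))


# Hypothetical content of a simx platform's pmon_daemon_control.json.
simx_control = '{"skip_chassis_db_init": true}'

print(should_run_task(simx_control, "chassis_db_init"))  # False
print(should_run_task("{}", "chassis_db_init"))          # True
```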

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-10-24 09:10:41 -07:00
Kostiantyn Yarovyi
6530f93881
[Pcied] run by python 3
Why I did it
Pcied was running under Python 2.

How I did it
Dropped Python 2 support and added Python 3 support for pcied in the file docker-pmon.supervisord.conf.j2.

How to verify it
docker exec pmon supervisorctl status
2021-08-23 03:30:12 +00:00
Sujin Kang
447f0c64da
[pmon]: Enable Autorestart of the daemons in PMON for unexpected exit cases (#8326)
Remove the daemon list from the critical_process file, which prevents PMON
from restarting when an individual daemon crashes.
2021-08-04 09:57:54 -07:00
Alexander Allen
21b9fccd75
[dockers][platform-monitor] Add chassis_db_init to platform monitor tasks (#7596)
I added `chassis_db_init` to the startup tasks for the `docker-platform-monitor` docker so that the script is run on startup of the switch and the chassis info is correctly provisioned to STATE_DB.

Depends on https://github.com/Azure/sonic-platform-daemons/pull/183
2021-05-28 12:01:03 -07:00
Joe LeVeque
c651a9ade4
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083)
To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
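
A quick way to sanity-check such a change is to parse a supervisor configuration and list the `buffer_size` of every event listener. The sketch below does this with Python's standard `configparser`. The sample configuration text and its `command=` lines are illustrative assumptions, not copied from the repository; the fallback of 10 mirrors the default buffer size mentioned above.

```python
import configparser

# Illustrative configuration fragment; the commands are placeholders.
SAMPLE_CONF = """
[eventlistener:dependent-startup]
command=/usr/bin/example-dependent-startup
events=PROCESS_STATE
buffer_size=1024

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/example-proc-exit-listener
events=PROCESS_STATE_EXITED
buffer_size=1024
"""


def listener_buffer_sizes(conf_text: str) -> dict:
    """Map each event listener name to its configured buffer size."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    sizes = {}
    for section in parser.sections():
        if section.startswith("eventlistener:"):
            name = section.split(":", 1)[1]
            # Fall back to the default of 10 when buffer_size is not set.
            sizes[name] = parser.getint(section, "buffer_size", fallback=10)
    return sizes


print(listener_buffer_sizes(SAMPLE_CONF))
```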
2021-03-27 21:14:24 -07:00
Samuel Angebault
0464d15b18
[pmon]: Run ledd using python3 unless excluded (#6528)
**- Why I did it**

Ledd is the last daemon that is not enabled to run in python3.
Even though there is a plan to deprecate this daemon and replace it with something else, this is one simple step toward python2 deprecation.

**- How I did it**

Changed the `command=` line for `ledd` in the `supervisord` configuration of `pmon`.
Copied what was done for other daemons.

**- How to verify it**

Booting a product that has a `led_control.py` should now show ledd running in python3.
I ran `python3 -m pylint` on all `led_control.py` plugins, which suggests that most of them should be python3 compliant.
There is, however, still a risk that some might not work.
2021-01-22 07:12:01 -08:00
yozhao101
be3c036794
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of the critical processes was not running
or had crashed for some reason, Monit would periodically write an alerting message to syslog. If we added a new process
to a container, the corresponding Monit configuration file also needed to be updated, which made maintenance harder.

We now employ the event listener of Supervisord to do this monitoring. Since the processes in each container are managed by
Supervisord, we can focus solely on the monitoring logic.

- How I did it
We use the event listener mechanism of Supervisord to monitor critical processes in containers. The event listener takes
the following steps when it is notified that one of the critical processes exited unexpectedly:

The event listener first checks whether the auto-restart mechanism is enabled for this container. If it is enabled, the event listener kills the Supervisord process, which should cause the container to exit and subsequently be restarted.

If the auto-restart mechanism is not enabled for this container, the event listener enters a loop that first sleeps for 1 minute and then checks whether the process is running. If yes, the event listener exits. If no, an alerting message is written to syslog.
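
The decision logic in the steps above can be sketched in Python as a pure function. Supervisor delivers each event as a header line and a payload made of space-separated `key:value` tokens; everything else here is an assumption for illustration: the function names, the returned action labels, and the way the critical process list and auto-restart state are passed in. It is not the listener shipped in the image, which also has to read events from stdin and acknowledge them.

```python
def parse_tokens(line: str) -> dict:
    """Parse supervisor's space-separated key:value tokens into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def decide_action(header_line, payload, critical_processes, autorestart_enabled):
    """Return a hypothetical action label for one supervisor event."""
    headers = parse_tokens(header_line)
    if headers.get("eventname") != "PROCESS_STATE_EXITED":
        return "ignore"
    data = parse_tokens(payload)
    # expected:1 means the exit code was one supervisor considers expected.
    if data.get("expected") == "1":
        return "ignore"
    if data.get("processname") not in critical_processes:
        return "ignore"
    # Auto-restart enabled: kill supervisord so the container exits and restarts.
    # Otherwise: sleep, re-check the process, and alert through syslog.
    return "kill_supervisord" if autorestart_enabled else "poll_and_alert"


header = ("ver:3.0 server:supervisor serial:21 pool:listener "
          "poolserial:10 eventname:PROCESS_STATE_EXITED len:70")
payload = "processname:xcvrd groupname:xcvrd from_state:RUNNING expected:0 pid:42"

print(decide_action(header, payload, {"xcvrd"}, True))   # kill_supervisord
print(decide_action(header, payload, {"xcvrd"}, False))  # poll_and_alert
```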

- How to verify it
First, we need to check whether the auto-restart mechanism of a container is enabled by running the command `show feature status`. If it is enabled, one critical process should be selected and killed manually; then we need to check whether the container is restarted.

Second, we can disable the auto-restart mechanism, if it was enabled in step 1, by running the command `sudo config feature autorestart <container_name> disabled`. Then one critical process should be selected and killed. After that, the alerting message should appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

[ ] 201811
[ ] 201911
[x] 202006
2021-01-21 12:57:49 -08:00
Junchao-Mellanox
51f896b33e
Add pmon daemons python3 build support (#6176)
**- Why I did it**

python2 is end of life and SONiC is going to support python3. This PR is going to support:

1. Build pmon daemons with python3
2. Install and run python3 version pmon daemons

**- How I did it**

1. Change pmon daemons make files to build both python2 and python3 whl
2. Change docker-platform-monitor make files to install both python2 and python3 whl
3. Change pmon docker startup files to start pmon daemons according to the supported platform API version
2020-12-28 10:19:24 -08:00
mprabhu-nokia
00cea080af
Chassisd to monitor cards in a modular chassis (#5523)
HLD: Azure/SONiC#646

Introducing chassisd process to monitor status of the control, line and fabric cards in a modular chassis.

- Why I did it
Modular Chassis has control-cards, line-cards and fabric-cards along with other peripherals. Chassisd will be a central entity that has visibility of the entire chassis.

- How I did it
Chassisd process will monitor cards in the main thread. Another configuation_handling_task is created to listen to CONFIG_DB for admin_status up/down events. The monitored status is persisted in REDIS-DB.
2020-12-15 16:28:58 -08:00
Junchao-Mellanox
68464381bc
Add a configuration to delay start xcvrd for fast-reboot (#5643)
2020-12-02 21:28:18 +02:00
Joe LeVeque
7bf05f7f4f
[supervisor] Install vanilla package once again, install Python 3 version in Buster container (#5546)
**- Why I did it**

We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9); therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.

**- How I did it**

- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
    - Also install Python 3 version of supervisord-dependent-startup in Buster base container
- The Debian package installed the binary in `/usr/bin/`, but the pip package installs it in `/usr/local/bin/`. Rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable, so these references will not need to change again if we ever switch back to building a Debian package.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
2020-11-19 23:41:32 -08:00
Junchao-Mellanox
781188f549
[thermalctld] Enlarge startretries value to avoid thermalctld not able to restart during regression test (#5633)
Increase the startretries value from the default of 10 to 50 to prevent supervisor from placing thermalctld in the FATAL state during regression testing. This also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to preventing the device from overheating.
2020-10-30 12:01:17 -07:00
Joe LeVeque
5b3b4804ad
[dockers][supervisor] Increase event buffer size for dependent-startup (#5247)
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:

```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```

This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
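
To see why a burst of events overflows a small buffer, the Python sketch below models the event pool as a bounded queue that drops its oldest entry once it is full. This is a simplified model written for illustration, with a hypothetical `simulate_burst` helper; it assumes no events are consumed during the burst and is not supervisor's own code.

```python
from collections import deque


def simulate_burst(num_events: int, buffer_size: int) -> list:
    """Return the serials discarded when a burst arrives with no consumer."""
    buffer = deque()
    discarded = []
    for serial in range(num_events):
        if len(buffer) >= buffer_size:
            # Buffer is full: drop the oldest event to make room.
            discarded.append(buffer.popleft())
        buffer.append(serial)
    return discarded


# A burst of 50 events against the default size of 10, then against 100.
print(len(simulate_burst(50, 10)))   # 40
print(len(simulate_burst(50, 100)))  # 0
```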

Resolves https://github.com/Azure/sonic-buildimage/issues/5241
2020-09-08 23:36:38 -07:00
Joe LeVeque
6132ae34fe
[build] Build/install remaining platform daemons as Python wheel packages (#5188)
As part of migrating all Python-based package installers to wheel format rather than Debian packages. Also to allow for easily building a Python 3 version of the package in the near future. ledd and psud were converted in earlier PRs. This PR converts the remainder:

- pcied
- syseepromd
- thermalctld
- xcvrd
2020-08-15 08:42:11 -07:00
Joe LeVeque
c3202d8982
[build] Build/install sonic-psud as a Python wheel package (#5182)
As part of migrating all Python-based package installers to wheel format rather than Debian packages. Also to allow for easily building a Python 3 version of the package in the near future.
2020-08-14 11:11:45 -07:00
Joe LeVeque
fc9e97fc3d
[build] Build/install sonic-ledd as a Python wheel package (#5168)
As part of migrating all Python-based package installers to wheel format rather than Debian packages. Also to allow for easily building a Python 3 version of the package in the near future.

- Also remove some references to sonic-daemon-base which I previously missed and add missing sonic-py-common dependency for sonic-pcied.
2020-08-13 11:26:43 -07:00
Sujin Kang
02a98add92
Add pcied to PMON docker to monitor the PCIe device status (#5000)
* Add pcied to PMON container

* remove trailing spaces

* update pmon submodule

* review comments

* rebase to the latest
2020-07-29 11:27:49 -07:00
Guohan Lu
8da46d26c3
[docker-pmon]: use service dependency in supervisord to start services
2020-05-22 11:01:28 -07:00
Sujin Kang
cbc75fe4c8
[pmon]: Fix the continuous syseepromd autorestart issue on 201911 (#4478)
- Remove syseepromd from the critical process of pmon docker
- Fix supervisor autorestart configuration of syseepromd
2020-04-30 15:51:34 -07:00
Kebo Liu
860cb265ac
[PMON] Extend pmon daemon start control to lm-sensors and fancontrol (#4447)
2020-04-21 08:00:48 -07:00
Sujin Kang
01f3f9286f
[fancontrol] Restart process upon unexpected exit, not entire pmon container (#4101)
* fancontrol restart

* Cleanup the default setting for exitcodes

* Remove the unnecessary stopwaitsecs default setting
2020-03-19 17:24:22 -07:00
Junchao-Mellanox
be549db395
Add thermal control support for SONiC (#3949)
2020-03-09 10:41:10 -07:00
yozhao101
91e5fb5602
[Service] Enable/disable container auto-restart based on configuration. (#4073)
2020-02-07 12:34:07 -08:00
yozhao101
4fa3a1e27e
[Services] Restart Platform-monitor service upon unexpected critical process exit. (#3689)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2019-11-04 17:44:01 -08:00
Kebo Liu
8a08595006
[Pmon] Add new daemon "syseepromd" to pmon docker (#2866)
2019-06-18 11:02:24 -07:00
Kebo Liu
84b46bb0e0
[Pmon] dynamically load pmon daemons (#2654)
* dynamically load pmon daemons
2019-03-22 02:49:35 -07:00