Commit Graph

58 Commits

Author SHA1 Message Date
yozhao101
a8d2d0b5cd
[201911][Monit] Monitor critical processes in PMon container. (#7438)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR uses Monit to monitor the critical processes in the PMon container on the 201911 branch.

How I did it
I created a Monit configuration template. The generate_monit_config.service service renders it
to generate the Monit configuration file for the PMon container.

How to verify it
I verified this on a Mellanox device str-msn2700-03 and an Arista device str-a7050-acs-1.

Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [x] 201911
- [ ] 202006
- [ ] 202012
2021-04-28 17:12:21 -07:00
Joe LeVeque
72b32a96fc
[201911][dockers][supervisor] Increase event buffer size for process exit listener (#7106)
Backport of https://github.com/Azure/sonic-buildimage/pull/7083 to the 201911 branch.

#### Why I did it

To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
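
A minimal sketch of how such a setting could be checked, assuming the listeners are declared as `[eventlistener:x]` sections with a `buffer_size` option. The fragment below, including its `command=` lines, is illustrative and not copied from the repository.

```python
import configparser

# Hypothetical supervisord.conf fragment; commands are illustrative only.
SAMPLE_CONF = """
[eventlistener:dependent-startup]
command=python3 -m supervisord_dependent_startup
events=PROCESS_STATE
buffer_size=1024

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name pmon
events=PROCESS_STATE_EXITED
buffer_size=1024
"""

# Default mentioned in the commit message when buffer_size is omitted.
DEFAULT_BUFFER_SIZE = 10


def listener_buffer_sizes(conf_text):
    """Return {listener_name: buffer_size} for every [eventlistener:x] section."""
    # Interpolation is disabled so that supervisor-style %(...)s values
    # would not be expanded by configparser.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    sizes = {}
    for section in parser.sections():
        if section.startswith("eventlistener:"):
            name = section.split(":", 1)[1]
            sizes[name] = parser.getint(
                section, "buffer_size", fallback=DEFAULT_BUFFER_SIZE
            )
    return sizes


sizes = listener_buffer_sizes(SAMPLE_CONF)
```

A check like this could flag any listener that still falls back to the default of 10.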
2021-03-29 10:07:43 -07:00
Junchao-Mellanox
547ec0a905 Add a configuration to delay start xcvrd for fast-reboot (#5643) 2020-12-22 09:51:54 -08:00
Junchao-Mellanox
1070d024bc [thermalctld] Enlarge startretries value to avoid thermalctld not able to restart during regression test (#5633)
Increase startretries value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
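
The effect of a larger retry budget can be illustrated with a deliberately simplified model. It assumes a process is marked FATAL once its consecutive start failures exceed `startretries`. The exact boundary and the back-off timing in supervisor are not modelled here, and `final_state` is a hypothetical helper.

```python
def final_state(consecutive_failures, startretries):
    """Simplified model: the process ends up FATAL once its consecutive
    start failures exceed the startretries budget; otherwise it recovers."""
    attempts = 0
    while attempts < consecutive_failures:
        attempts += 1
        if attempts > startretries:
            return "FATAL"
    return "RUNNING"


# A test run that makes the daemon fail to start 20 times in a row:
with_small_budget = final_state(20, startretries=10)
with_large_budget = final_state(20, startretries=50)
```

Under this model the same burst of failures leaves the process FATAL with the smaller budget but lets it recover with the larger one.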
2020-11-03 08:19:19 -08:00
Joe LeVeque
b70c6f72b2 [dockers][supervisor] Increase event buffer size for dependent-startup (#5247)
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:

```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```

This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
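
The overflow can be pictured with a small simulation of a bounded event buffer. This sketch assumes that when the buffer is full the oldest queued event is the one discarded, and that no event is consumed during the burst. `deliver_burst` is a hypothetical helper, not supervisor code.

```python
from collections import deque


def deliver_burst(num_events, buffer_size):
    """Queue a burst of events into a bounded buffer.

    Returns (queued, discarded), where discarded lists the serial numbers
    of the events dropped because the buffer was already full.
    """
    buffer = deque()
    discarded = []
    for serial in range(num_events):
        if len(buffer) >= buffer_size:
            # Buffer full: drop the oldest event to make room.
            discarded.append(buffer.popleft())
        buffer.append(serial)
    return list(buffer), discarded


# A burst of about 50 events, as described for the swss container:
_, dropped_default = deliver_burst(50, buffer_size=10)
_, dropped_enlarged = deliver_burst(50, buffer_size=100)
```

With the default size of 10, most of the burst is lost; with a size of 100, nothing is dropped.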

Resolves https://github.com/Azure/sonic-buildimage/issues/5241
2020-09-28 16:12:53 +00:00
Joe LeVeque
802e77c3f1 [docker-pmon] Fix copy of fancontrol config file (#5037)
Copy proper fancontrol config file to the proper destination. Also some minor refactoring for code reuse to help prevent issues like this in the future.

Fixes a bug introduced by #4599
2020-08-15 22:35:02 -07:00
Guohan Lu
763673993e [docker-pmon]: use service dependency in supervisord to start services 2020-08-15 22:23:50 -07:00
yozhao101
c2364cf03e
[201911][dockers] Update critical_processes file syntax (#4854)
Backport of https://github.com/Azure/sonic-buildimage/pull/4831 to the 201911 branch
2020-06-26 11:37:05 -07:00
Junchao-Mellanox
0a70571011
[201911][thermal control] Backport feature from master branch (#4677)
Backport thermal control feature from master branch to 201911 branch by cherry-picking commits and manually resolving conflicts.
2020-06-08 11:20:43 -07:00
Nazarii Hnydyn
c266435d40
Revert "Add thermal control support for SONiC (#3949)" (#4527)
This reverts commit 109a13cc03.

Conflicts:
	dockers/docker-platform-monitor/docker-pmon.supervisord.conf.j2
2020-05-04 21:20:47 +03:00
Sujin Kang
9cbc07996e [pmon]: Fix the continuous syseepromd autorestart issue on 201911 (#4478)
- Remove syseepromd from the critical process of pmon docker
- Fix supervisor autorestart configuration of syseepromd
2020-04-30 22:40:46 -07:00
Junchao-Mellanox
109a13cc03 Add thermal control support for SONiC (#3949) 2020-04-30 22:39:17 -07:00
Kebo Liu
e12d2e8bee
[PMON] Extend pmon daemon start control to lm-sensors and fancontrol for 201911 (#4487)
Extend the PMON daemon start control to lm-sensors and fancontrol.

Change the docker-pmon.supervisord.conf.j2 and start.sh.j2 templates so that the lm-sensors and fancontrol start scripts and supervisord configuration are controlled by pmon_daemon_control.json.

The intention is to avoid a wrong daemon status in the "supervisorctl status" output. For example, if a platform has no fancontrol config file and fancontrol is not excluded from the supervisord conf file and start.sh, fancontrol appears in "STOPPED" status in the "supervisorctl status" output, which fails the platform test that checks that daemon statuses are as expected.
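
A rough sketch of the control idea, assuming the JSON file holds boolean `skip_<daemon>` flags. Both the key names and the daemon list are hypothetical here and may not match the real pmon_daemon_control.json schema.

```python
import json

# Hypothetical contents of pmon_daemon_control.json; key names are assumed.
SAMPLE_CONTROL = '{"skip_fancontrol": true, "skip_ledd": false}'

# Illustrative daemon list; the real set depends on the platform.
ALL_DAEMONS = ["lm-sensors", "fancontrol", "ledd", "xcvrd", "psud", "syseepromd"]


def daemons_to_start(control_json, all_daemons=ALL_DAEMONS):
    """Return the daemons that are not switched off by a skip_<name> flag."""
    control = json.loads(control_json)
    enabled = []
    for daemon in all_daemons:
        flag = "skip_" + daemon.replace("-", "_")
        if not control.get(flag, False):
            enabled.append(daemon)
    return enabled


enabled = daemons_to_start(SAMPLE_CONTROL)
```

Only the daemons in `enabled` would be written into the rendered supervisord configuration and start script, so a skipped daemon never appears in the status output.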
2020-04-28 18:39:10 -07:00
Sujin Kang
c34dcbec4c [fancontrol] Restart process upon unexpected exit, not entire pmon container (#4101)
* fancontrol restart

* Cleanup the default setting for exitcodes

* Remove the unnecessary stopwaitsecs default setting
2020-04-27 06:23:19 +00:00
Kebo Liu
e4bd7ab189 [Mellanox] Extend mellanox platform API to report SFP error event (#4365)
* extend mellanox platform API to report SFP error event
* remove unnecessary loop code
* install enum34 to pmon to support using Enum
2020-04-15 13:11:59 -07:00
yozhao101
71225ea4cc [Service] Enable/disable container auto-restart based on configuration. (#4073) 2020-02-13 16:20:21 -08:00
yozhao101
4fa3a1e27e [Services] Restart Platform-monitor service upon unexpected critical process exit. (#3689)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2019-11-04 17:44:01 -08:00
Andriy Moroz
976850fc00 [submodule update] Add SSD Health tools (#3218)
Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>
2019-10-04 10:52:58 -07:00
sridhar-ravindran
56608bf06b [devices]: DELL Platform 2.0 API Infra and Reboot Reason support in Z9100 & S6100 (#3063) 2019-07-03 06:52:35 -07:00
Stepan Blyshchak
81cf33231f [build]: Improve dockerfile instructions (#3048)
- create a dockerfile-macros.j2 file with all common operations
  written as j2 macros
- use single dockerfile instruction for COPY and RUN commands
  when possible to improve build time
- reorganize dockerfile instructions to make more cache friendly
  (in case someday we will remove --no-cache to build docker images)

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2019-06-22 11:26:23 -07:00
Kebo Liu
8a08595006 [Pmon] Add new daemon "syseepromd" to pmon docker (#2866) 2019-06-18 11:02:24 -07:00
Stephen Sun
95452b7385 [docker-pmon] install dmidecode tool to pmon (#2990) 2019-06-12 12:10:43 +03:00
Kebo Liu
f5d3ee71a2 [pmon]: Add ethtool to pmon docker (#2943) 2019-05-25 17:59:56 -07:00
Samuel Angebault
77cde50541 [device/Arista] Improvements to the boot of Arista devices. (#2898)
* Fix showing systemd shutdown sequence when verbose is set

* Fix creation of kernel-cmdline file

Sometimes boot0 prints error
"mv: can't preserve ownership of '/mnt/flash/image-arsonic.xxxx/kernel-cmdline': Operation not permitted"

* Improve flash space usage during installation

Some older systems only have 2GB of flash available. Installing a second
image on these can prove to be challenging.
The new installation process moves the installer swi to memory in order
to free up space on the flash before uncompressing it there.
This removes the spike in flash space usage and also improves I/O,
since the installation no longer reads from and writes to the flash at
the same time.

* Add support of 7060CX-32S-SSD

* 7260CX3: use inventory powerCycle procedures

* 7050QX-32S: use inventory powerCycle procedures

* 7050QX-32: use inventory powerCycle procedures

* platform: arista: add common platform_reboot

Replace platform_reboot by a link to new common for devices already
using a similar script.

* 7060CX-32S: use inventory powerCycle procedures

* Install python smbus in pmon

Some platform plugins need the python smbus library to perform some actions.
This installs the dependency.
2019-05-15 12:45:05 -07:00
Qi Luo
6b3a26f0cc
Remove unused packages in docker images and host (#2807)
* Remove unneeded packages in docker images and host
* Remove libpython3.6 from snmp docker image
2019-04-29 17:21:24 -07:00
Wirut Getbamrung
27803ec603 [docker-platform-monitor]: Add smartmontools 6.6-1 (#2703) 2019-04-10 21:55:54 -07:00
Mykola F
3826ffd30f [pmon] move platform monitor docker to stretch (#2680)
Signed-off-by: Mykola Faryma <mykolaf@mellanox.com>
2019-03-22 16:42:56 -07:00
Kebo Liu
84b46bb0e0 [Pmon] dynamically load pmon daemons (#2654)
* dynamically load pmon daemons
2019-03-22 02:49:35 -07:00
Nazarii Hnydyn
b22fe37670 [mellanox]: Upgraded hw-management V.2.0.0160. (#2643)
Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
2019-03-06 18:51:46 -08:00
lguohan
f20665008c
[build]: put stretch debian packages under target/debs/stretch/ (#2519)
* [build]: put stretch debian packages under target/debs/stretch/

* in stretch build phase, all debian packages built in that stage are placed under target/debs/stretch directory.
* for python-based debian packages, since they are really the same for jessie and stretch, they are placed under target/python-debs directory.

Signed-off-by: Guohan Lu <gulv@microsoft.com>
2019-02-04 22:06:37 -08:00
Kevin(Shengkai) Wang
b3abf9af7f [docker-platform-monitor] add psud daemon to Pmon (#2423)
* Add psud daemon to pmon container
* Update submodule sonic-platform-daemons

Submodule update sonic-platform-daemons:

e5d8155 - [sonic-psud] add a new daemon sonic-psud to platform monitor (#20)

Signed-off-by: Kevin Wang <kevinw@mellanox.com>
2019-01-15 21:24:47 -08:00
lguohan
f3ca7c422f
[rsyslog]: use # to separate container name and program name in syslog message (#1918)
Previously, / was used to separate the container name and program name.

However, in rsyslogd:

Precisely, the programname is terminated by either (whichever occurs first):

end of tag
nonprintable character
‘:’
‘[‘
‘/’
The above definition has been taken from the FreeBSD syslogd sources.
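
The termination rule quoted above can be sketched as a small function. This is an illustration of the rule as stated, not rsyslog's actual parser, and `programname` is a hypothetical helper.

```python
# Characters that end the program name, per the rule quoted above.
TERMINATORS = {":", "[", "/"}


def programname(tag):
    """Return the part of a syslog tag up to the first terminator,
    the first nonprintable character, or the end of the tag."""
    for index, char in enumerate(tag):
        if char in TERMINATORS or not char.isprintable():
            return tag[:index]
    return tag


with_slash = programname("swss/orchagent[123]:")
with_hash = programname("swss#orchagent[123]:")
```

With `/` as the separator the program name is cut down to the container name alone, while `#` keeps both the container and program names.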

Signed-off-by: Guohan Lu <gulv@microsoft.com>
2018-08-12 22:23:58 -07:00
paavaanan
ecfca8bf23 [devices]: DellEMC new platform support for z9264f - 64x100 (#26)
* Added new platform support DellEMC - Z9264f - 64x100

* Includes changes with Makefiles, sfputil, eeprom and default minigraph

* Led support for Z9264f platform

* Includes changes on default minigraph

* ipmitool implementation in pmon docker. platform_sensors script is included in pmon startup
2018-08-11 09:09:03 +00:00
Kebo Liu
38beca654c [docker-platform-monitor] make file and supervisord conf change for new xcvrd daemon (#1840)
* [docker-platform-monitor] make file and supervisord conf change for new xcvrd daemon

* make file change for the new daemon
* supervisord conf change for the new daemon

signed-off-by Liu Kebo kebol@mellanox.com

* make xcvrd start sequence aligned with the supervisord conf

* update submodules to include xcvrd modification
2018-08-03 16:33:56 -07:00
Qi Luo
7ba08e5bf6
Prefix docker container name to syslog syslogtag (program name) (#1810) 2018-06-25 10:48:42 -07:00
Joe LeVeque
1102acec48 [ledd] Exit with code 0 if we fail to find a platform-specific led_control module; no autorestart (#1688) 2018-05-10 01:20:22 -07:00
Joe LeVeque
1df7c9a993
[docker-platform-monitor] Convert ledd from polling-based to subscription-based model (#1623) 2018-04-20 10:42:19 -07:00
Joe LeVeque
e1cb2ace36 [base image files] All 'docker exec' wrapper scripts now dynamically adjust their flags depending on whether or not they are run on a terminal (#1507) 2018-03-17 00:43:29 -07:00
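
A minimal sketch of the flag-selection idea in that commit, assuming the wrappers always keep stdin open and allocate a TTY only when run on a terminal. The function names and the container and command in the example are hypothetical; the real wrappers are shell scripts.

```python
import sys


def docker_exec_flags(stdin_is_tty):
    """Pick docker exec flags: interactive always, TTY only on a terminal."""
    return ["-it"] if stdin_is_tty else ["-i"]


def build_command(container, args, stdin_is_tty=None):
    """Build the docker exec argument list for a wrapper script."""
    if stdin_is_tty is None:
        # Detect at run time whether stdin is attached to a terminal.
        stdin_is_tty = sys.stdin.isatty()
    return ["docker", "exec", *docker_exec_flags(stdin_is_tty), container, *args]


on_terminal = build_command("pmon", ["sensors"], stdin_is_tty=True)
in_pipeline = build_command("pmon", ["sensors"], stdin_is_tty=False)
```

Dropping the TTY flag when no terminal is attached lets the same wrapper work from cron jobs, pipes, and remote command execution.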
Joe LeVeque
ab26a5c589
Install sonic-platform-common package in platform-monitor docker for ledd (#1330)
* Install sonic-platform-common package in platform-monitor docker for ledd

* Specify Python wheel dependencies in docker-platform-monitor.mk; Remove explicit specifications from Dockerfile.j2
2018-01-22 10:52:52 -08:00
Joe LeVeque
def0f2e4de [sensors]: Workaround for apparent bug in lm-sensors (#1058) 2017-10-20 11:01:26 -07:00
Joe LeVeque
bbf1d6624b [docker-platform-monitor]: Remove stale fancontrol.pid file (if exists) before starting fancontrol (#1002) 2017-09-30 10:55:03 -07:00
Joe LeVeque
f938f3ecaf [docker-platform-monitor]: Prevent supervisor from logging unexpected exits from processes known to exit in < 1 second (#889) 2017-08-15 10:38:22 -07:00
Joe LeVeque
f49cac086f Remove extra trailing newlines at EOF (#804)
Files now end with a single newline
2017-07-12 20:54:37 -07:00
Joe LeVeque
22819d9983 [docker-platform-monitor]: Add fancontrol (#735) 2017-06-23 15:23:00 -07:00
Joe LeVeque
d094ceecc2 [docker-platform-monitor]: Add LED control daemon and plugin for x86_64-arista_7050_qx32 platform (#691)
* Add files for building ledd package; add ledd to docker-platform-monitor; Control platform monitor docker using supervisord

* Add sonic-platform-daemons submodule

* Rename ledd.mk -> sonic-ledd.mk

* Add led_control.py plugin for x86_64-arista_7050_qx32 platform

* Rename Dockerfile -> Dockerfile.j2

* Fix build

* Remove blank line
2017-06-10 22:05:11 -07:00
Joe LeVeque
d5c13c0a83 [dockers]: Disable autorestart on all supervisor processes inside containers (#580) 2017-05-09 17:37:08 -07:00
Joe LeVeque
8f348399f5 [Dockers]: Manage all Docker containers with Supervisord (#573)
- Consolidate config.sh and start.sh scripts into one script (start.sh)
 - Solve issue #435 - All dockers now run supervisord as their ENTRYPOINT
 - All stdout/stderr output from processes managed by supervisord is now sent to syslog instead of their own files
 - Supervisord log messages are now also sent to syslog
 - Removed unused smartmontools package from docker-platform-monitor
2017-05-08 15:43:31 -07:00
pavel-shirshov
814fd87e63 Remove /var/run/rsyslogd.pid before starting rsyslog (#453) 2017-03-29 18:07:25 -07:00
Taoyu Li
f08874db36 [platform-monitor]: Fix sensors.conf file path (#426)
sensors.conf file was moved in #316.
2017-03-22 16:59:12 -07:00
pavel-shirshov
a845740543 [All Dockerfiles]: Prevent apt asking questions on the console (#300)
Add noninteractive setting into every Dockerfile in the repo

Signed-off-by: Pavel Shirshov <pavelsh@microsoft.com>
2017-02-16 21:48:49 -08:00