Commit Graph

53 Commits

Author SHA1 Message Date
Joe LeVeque
dd9be59cd1
[202012][dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7203)
#### Why I did it

Backport of https://github.com/Azure/sonic-buildimage/pull/7083 to the 202012 branch.

To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-04-01 12:52:19 -07:00
yozhao101
cc9c3f567e [supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
2021-01-28 09:28:27 -08:00
Joe LeVeque
905a5127bb
[Python] Align files in root dir, dockers/ and files/ with PEP8 standards (#6109)
**- Why I did it**

Align style with slightly modified PEP8 standards (extend maximum line length to 120 chars). This will also help in the transition to Python 3, where it is more strict about whitespace, plus it helps unify style among the SONiC codebase. Will tackle other directories in separate PRs.

**- How I did it**

Using `autopep8 --in-place --max-line-length 120` and some manual tweaks.
2020-12-03 15:57:50 -08:00
lguohan
4d3eb18ca7
[supervisord]: use abspath as supervisord entrypoint (#5995)
use abspath makes the entrypoint not affected by PATH env.

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-11-22 21:18:44 -08:00
Joe LeVeque
7bf05f7f4f
[supervisor] Install vanilla package once again, install Python 3 version in Buster container (#5546)
**- Why I did it**

We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.

**- How I did it**

- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
    - Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
2020-11-19 23:41:32 -08:00
Tamer Ahmed
6754635010 [cfggen] Make Jinja2 Template Python 3 Compatible
Jinja2 templates rendered using Python 3 interpreter, are required
to conform with Python 3 new semantics.

singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-09-30 07:07:43 -07:00
Tamer Ahmed
a10c5bfd02
[frr] Reduce Calls to SONiC Cfggen (#5176)
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to two calls during startup when starting frr service.

singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-08-17 15:47:42 -07:00
yozhao101
4fa81b4f8d
[dockers] Update critical_processes file syntax (#4831)
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a  container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.

Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.

**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
2020-06-25 21:18:21 -07:00
yozhao101
23ff55a709
[Services] Restart BGP service upon unexpected critical process exit. (#4207) 2020-03-03 16:50:32 -08:00
pavel-shirshov
d5af096f41
[TSA]: Add community to the loopback prefix, when isolated (#3708)
* Rename asn/deployment_id_asn_map.yaml to constants/constants.yaml

* Fix bgp templates

* Add community for loopback when bgpd is isolated

* Use correct community value
2019-11-06 16:07:28 -08:00
pavel-shirshov
7af546908f
[bgp]: Fix isolate/unisolate command for ipv6 peers (#3183)
* Fix isolate/unisolate command for ipv6 peers
2019-07-18 16:34:26 -07:00
Prince Sunny
231d309b69
Generate interface table to have an entry designated to default VRF. (#2848)
* Generate default VRF table for router interfaces

* Updated jinja2 template to have prefix filter
2019-06-10 14:02:55 -07:00
pavel-shirshov
602369126c [docker-fpm-quagga]: Add support for PeerAsn and UpdateAddress (#2766) 2019-04-10 21:50:36 -07:00
Ying Xie
af64fd66d2 [bgp quagga] increase BGP graceful restart timeout to 240 seconds (#2754)
There are some platforms with less powerful CPU/hard-drive could take
longer to get ready for BGP. For these platforms, 240 seconds would be
a safer threshold.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-04-10 19:13:03 -07:00
Ying Xie
d9c076dada [quagga bgp] set quagga graceful restart timeout to 180 seconds (#2362)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2018-12-08 11:38:31 -08:00
Rodny Molina
196d9f5f8a [quagga]: Adjusting bgp jinja template and quagga's supervisord (#2291)
There are two minor changes in this PR:

* Adjust quagga's jinja template to enable bgp-gr functionality by default. Currently is only applicable to those devices tagged as TOR/T0.

* Ensure that no bgp-notification is sent out to remote-peers during bgpd shutdown events. The goal here is to make sure that remote-peers kick off bgp-gr-helper logic (i.e. retain restarting-router state), which can be only achieved if an ungraceful-shutdown (tcp pipe/socket down) is perceived. There are other approaches to accomplish this goal, such as draft-ietf-idr-bgp-gr-notification, but this one hasn't been implemented yet by Quagga/FRR.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>
2018-11-27 00:39:38 -08:00
zhenggen-xu
51a76614a3 Restore neighbor table to kernel during system warm-reboot (#2213)
* Restore neighbor table to kernel during system warm-reboot

Added a service: "restore_neighbors" to restore neighbor table into
kernel during system warm reboot. The service is started by supervisord
in swss docker when the docker is started.

In case system warm reboot is enabled, it will try to restore the neighbor
table from appDB into kernel through netlink API calls and update the neighbor
table by sending arp/ns requests to all neighbor entries, then it sets the
stateDB flag for neighsyncd to continue the reconciliation process.

-- Added tcpdump python-scapy debian package into orchagent and vs dockers.
-- Added python module: pyroute2 netifaces into orchagent and vc dockers.
-- Workarounded tcpdump issue in the vs docker

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>

* Move the restore_neighbors.py to sonic-swss submodule
Made changes to makefiles accordingly

Make dockerfile.j2 changes and supervisord config changes

Add python monotonic lib for time access

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>

* Added PYTHON_SWSSCOMMON as swss runtime dependency

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
2018-11-09 17:06:09 -08:00
Taoyu Li
6a37365d93 [zebra.conf] Avoid zebra crash upon empty configuration (#2203) 2018-10-28 22:32:31 -07:00
lguohan
f3ca7c422f
[rsyslog]: use # to separate container name and program name in syslog message (#1918)
Previously use / to separate container name and program name.

However, in rsyslogd:

Precisely, the programname is terminated by either (whichever occurs first):

end of tag
nonprintable character
‘:’
‘[‘
‘/’
The above definition has been taken from the FreeBSD syslogd sources.

Signed-off-by: Guohan Lu <gulv@microsoft.com>
2018-08-12 22:23:58 -07:00
Qi Luo
7ba08e5bf6
Prefix docker container name to syslog syslogtag (program name) (#1810) 2018-06-25 10:48:42 -07:00
pavel-shirshov
a2a6aead4c [bgp]: Enable bgp soft-reconfiguration inbound for quagga templates (#1803)
* Enable bgp soft-reconfiguration inbound for quagga templates
2018-06-22 18:04:18 -07:00
pavel-shirshov
bbca58329b
Manually send SIGHUP to vtysh when the current session was disconnected (#1801)
* Manually send SIGHUP to vtysh when the current session was disconnected

* Address comments
2018-06-20 12:15:09 -07:00
Qi Luo
1c8bacb007
Fix comment typos (#1794)
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
2018-06-14 21:53:31 -07:00
pavel-shirshov
fae346f586
Don't create a pty to run vtysh inside of the docker container (#1792) 2018-06-14 12:11:29 -07:00
Joe LeVeque
832be7b8f4
[dockers] Prevent apt-get from installing suggested and recommended packages by default (#1666)
* [docker-base] Instruct apt-get to NOT install 'recommended' or 'suggested' packages

* Modify docker-fpm-quagga, docker-snmp-sv2 and docker-sonic-vs Dockerfile templates in order to properly install .deb dependencies

* REDIS_SERVER depends on REDIS_TOOLS; ensure REDIS_TOOLS is always installed before REDIS_SERVER
2018-05-02 11:46:21 -07:00
Taoyu Li
bebb7a0fa2 [zebra.conf] Fix template issue with multiple lo addresses (#1662)
* [zebra.conf] Fix template issue with multiple lo addresses

* Add unitest for Loopback1
2018-05-01 20:53:47 -07:00
pavel-shirshov
1ae4db3af7
Quagga: Use bgp keepalive and holdtime timers from configdb (#1661) 2018-04-30 16:52:22 -07:00
pavel-shirshov
f43580492f
quagga container processes could be restarted within a second (#1541) 2018-03-28 12:34:46 -07:00
Joe LeVeque
e1cb2ace36 [base image files] All 'docker exec' wrapper scripts now dynamically adjust their flags depending on whether or not they are run on a terminal (#1507) 2018-03-17 00:43:29 -07:00
Taoyu Li
dd7e9240c8 [dockers] Remove dependency to minigraph (#1179)
* Remove dependency to minigraph

* Remove -m in swssconfig.sh
2017-11-23 16:31:37 -08:00
nikos-li
f18ed0d35c [bgp]: Auto-completion, help (?), cmd navigation (up arrow) not working in vtysh on host system. (#1124) 2017-11-13 09:39:10 -08:00
Taoyu Li
8bc6b55331 [bgpd.conf] Fix template issue with multiple lo addresses (#1060) 2017-10-20 07:15:11 -07:00
Taoyu Li
7a0a2ea5d0 [bgpd.conf] Advertise /64 prefix for ipv6 lo addresses (#1050) 2017-10-17 18:28:27 -07:00
pavel-shirshov
9139c7fe64 Always start with Forwarding State flag set for bgpd (#963) 2017-09-19 12:27:18 -07:00
Shuotian Cheng
aa549f208c [bgp]: Fix the deployment_id with DEVICE_METADATA (#962) 2017-09-18 13:04:29 -07:00
Taoyu Li
2e3975d6ed [config] Fix an issue that bgp asn data type is not consistent (#953)
* Fix an issue that bgp asn data type is not consistent from minigraph parser and DB

* Fix test typo
2017-09-13 21:23:06 -07:00
Taoyu Li
c9cc7aea41 [configdb] Migrate minigraph configurations to DB (#942)
Modify minigraph parser output format so it fit DB schema
Modify configuration templates to fit new schema
Systemd services dependencies are modified so database starts before any configuration consumer
2017-09-12 14:13:27 -07:00
sihuihan88
127a73aac3 [quagga]: Disable ipv4 over ipv6 and enable ipv6 over ipv4 peer group (#922)
* [bgpd]:disable ipv4 over ipv6 and enable ipv6 over ipv4 peer group

* update as comments
2017-08-30 13:06:02 -07:00
Taoyu Li
e4502527d0 Revert "Migrate DEVICE_METADATA to db (#919)" (#928)
This reverts commit 44502b217b.
2017-08-29 17:03:31 -07:00
Taoyu Li
44502b217b Migrate DEVICE_METADATA to db (#919) 2017-08-29 10:47:25 -07:00
Joe LeVeque
ed66588473 [docker-fpm-quagga]: Manage Quagga processes (zebra, bgpd) using supervisor instead of watchquagga (#900) 2017-08-21 13:55:59 -07:00
zhenggen-xu
c52e876697 Fix the network command for ipv6 vlan interfaces (#894) 2017-08-16 21:12:32 -07:00
Taoyu Li
a2fe0212be [ConfigDB] Move all BGP configuration into DB (#861)
- BGP data read from minigraph.py now match DB schema
- BGP templates are updated
- bgpcfgd can now deal with runtime neighbor create/delete
2017-08-08 16:23:58 -07:00
Taoyu Li
b6efe438b5 Introduce ConfigDB (#808)
* [cfggen] Support reading from and writing to configdb
* [bgp] Move bgp_admin_state to configdb, support dynamic admin state change
* [sonic-utilities] Adapt configDB for admin status, support config save and config load
2017-08-01 19:02:00 -07:00
sihuihan88
1176508858 [bgpd]: support multiple peer range in single peer group (#807) 2017-07-13 15:03:10 -07:00
Joe LeVeque
f49cac086f Remove extra trailing newlines at EOF (#804)
Files now end with a single newline
2017-07-12 20:54:37 -07:00
lguohan
4bdcac8e4f [bgp]: move allowas-in into ipv6 section to enable allowas-in for ipv6 (#741) 2017-06-22 19:50:24 -07:00
sihuihan88
3268946de5 [BGPD]: add bgp dynamic neighbor configuration (#708)
* add bgp dynamic neighbor configuration

* [bgpd]: update as comments

* update as comment

* update to deployment_id_asn_map

* minor change
2017-06-21 18:52:50 -07:00
Taoyu Li
95906a6490 [installer] Copy old config files rather than only minigraph (#730) 2017-06-21 11:02:25 -07:00
Taoyu Li
5e6620e19e [bgp] Save bgp admin state (#690)
* [bgp] Save admin state and set default state to shutdown

* Set default behavior to no shutdown

* Add build option SHUTDOWN_BGP_ON_START

* Script change for default admin state to be on

* Address CR comments to bgp_neighbor script

* Fix script bug
2017-06-12 11:05:22 -07:00