- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
**- Why I did it**
Align style with slightly modified PEP8 standards (extend maximum line length to 120 chars). This will also help in the transition to Python 3, where it is more strict about whitespace, plus it helps unify style among the SONiC codebase. Will tackle other directories in separate PRs.
**- How I did it**
Using `autopep8 --in-place --max-line-length 120` and some manual tweaks.
**- Why I did it**
We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.
**- How I did it**
- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
- Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
Jinja2 templates rendered using Python 3 interpreter, are required
to conform with Python 3 new semantics.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to two calls during startup when starting frr service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.
Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.
**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
* Rename asn/deployment_id_asn_map.yaml to constants/constants.yaml
* Fix bgp templates
* Add community for loopback when bgpd is isolated
* Use correct community value
There are some platforms with less powerful CPU/hard-drive could take
longer to get ready for BGP. For these platforms, 240 seconds would be
a safer threshold.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
There are two minor changes in this PR:
* Adjust quagga's jinja template to enable bgp-gr functionality by default. Currently is only applicable to those devices tagged as TOR/T0.
* Ensure that no bgp-notification is sent out to remote-peers during bgpd shutdown events. The goal here is to make sure that remote-peers kick off bgp-gr-helper logic (i.e. retain restarting-router state), which can be only achieved if an ungraceful-shutdown (tcp pipe/socket down) is perceived. There are other approaches to accomplish this goal, such as draft-ietf-idr-bgp-gr-notification, but this one hasn't been implemented yet by Quagga/FRR.
Signed-off-by: Rodny Molina <rmolina@linkedin.com>
* Restore neighbor table to kernel during system warm-reboot
Added a service: "restore_neighbors" to restore neighbor table into
kernel during system warm reboot. The service is started by supervisord
in swss docker when the docker is started.
In case system warm reboot is enabled, it will try to restore the neighbor
table from appDB into kernel through netlink API calls and update the neighbor
table by sending arp/ns requests to all neighbor entries, then it sets the
stateDB flag for neighsyncd to continue the reconciliation process.
-- Added tcpdump python-scapy debian package into orchagent and vs dockers.
-- Added python module: pyroute2 netifaces into orchagent and vc dockers.
-- Workarounded tcpdump issue in the vs docker
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
* Move the restore_neighbors.py to sonic-swss submodule
Made changes to makefiles accordingly
Make dockerfile.j2 changes and supervisord config changes
Add python monotonic lib for time access
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
* Added PYTHON_SWSSCOMMON as swss runtime dependency
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
Previously use / to separate container name and program name.
However, in rsyslogd:
Precisely, the programname is terminated by either (whichever occurs first):
end of tag
nonprintable character
‘:’
‘[‘
‘/’
The above definition has been taken from the FreeBSD syslogd sources.
Signed-off-by: Guohan Lu <gulv@microsoft.com>
* [docker-base] Instruct apt-get to NOT install 'recommended' or 'suggested' packages
* Modify docker-fpm-quagga, docker-snmp-sv2 and docker-sonic-vs Dockerfile templates in order to properly install .deb dependencies
* REDIS_SERVER depends on REDIS_TOOLS; ensure REDIS_TOOLS is always installed before REDIS_SERVER
Modify minigraph parser output format so it fit DB schema
Modify configuration templates to fit new schema
Systemd services dependencies are modified so database starts before any configuration consumer
* [cfggen] Support reading from and writing to configdb
* [bgp] Move bgp_admin_state to configdb, support dynamic admin state change
* [sonic-utilities] Adapt configDB for admin status, support config save and config load
* [bgp] Save admin state and set default state to shutdown
* Set default behavior to no shutdown
* Add build option SHUTDOWN_BGP_ON_START
* Script change for default admin state to be on
* Address CR comments to bgp_neighbor script
* Fix script bug