- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
Mellanox already supports multiple destination IPs in IPinIP tunnel configuration, thus removing mellanox
exception for IPinIP configuration.
- How I did it
Removed "dst_ip" field generation in mellanox platform condition.
Sorted the "dst_ip" list, so that it is easier to test against sample configuration in unit tests.
Aligned unit test sample.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Introduce tunnel manager daemon. Start the process as part of swss container
Submodule update for swss:
9ed3026 - 2020-12-24 : [NAT] ACL Rule with DO_NOT_NAT action is getting failed. (#1502) [Akhilesh Samineni]
c39a4b1 - 2020-12-23 : Mux/IPTunnel orchagent changes (#1497) [Prince Sunny]
bc8df0e - 2020-12-23 : Add support for headroom pool watermark (#1567) [Neetha John]
**- Why I did it**
As part of migrating SONiC codebase from Python 2 to Python 3
**- How I did it**
- No longer install Python 2 in docker-base-buster or docker-config-engine-buster.
- Install Python 2 and pip2 in the following containers until we can completely eliminate it there:
- docker-platform-monitor
- docker-sonic-mgmt-framework
- docker-sonic-vs
- Pin pip2 version <21 where it is still temporarily needed, as pip version 21 will drop support for Python 2
- Also preform some other cleanup, ensuring that pip3, setuptools and wheel packages are installed in docker-base-buster, and then removing any attempts to re-install them in derived containers
Install the necessary python3 dependent packages to convert restore_neighbor.py
to support python3 as python2 is EOL. See: Azure/sonic-swss#1542
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
**- Why I did it**
To support dynamic buffer calculation.
This PR also depends on the following PRs for sub modules
- [sonic-swss: [buffermgr/bufferorch] Support dynamic buffer calculation #1338](https://github.com/Azure/sonic-swss/pull/1338)
- [sonic-swss-common: Dynamic buffer calculation #361](https://github.com/Azure/sonic-swss-common/pull/361)
- [sonic-utilities: Support dynamic buffer calculation #973](https://github.com/Azure/sonic-utilities/pull/973)
**- How I did it**
1. Introduce field `buffer_model` in `DEVICE_METADATA|localhost` to represent which buffer model is running in the system currently:
- `dynamic` for the dynamic buffer calculation model
- `traditional` for the traditional model in which the `pg_profile_lookup.ini` is used
2. Add the tables required for the feature:
- ASIC_TABLE in platform/\<vendor\>/asic_table.j2
- PERIPHERAL_TABLE in platform/\<vendor\>/peripheral_table.j2
- PORT_PERIPHERAL_TABLE on a per-platform basis in device/\<vendor\>/\<platform\>/port_peripheral_config.j2 for each platform with gearbox installed.
- DEFAULT_LOSSLESS_BUFFER_PARAMETER and LOSSLESS_TRAFFIC_PATTERN in files/build_templates/buffers_config.j2
- Add lossless PGs (3-4) for each port in files/build_templates/buffers_config.j2
3. Copy the newly introduced j2 files into the image and rendering them when the system starts
4. Update the CLI options for buffermgrd so that it can start with dynamic mode
5. Fetches the ASIC vendor name in orchagent:
- fetch the vendor name when creates the docker and pass it as a docker environment variable
- `buffermgrd` can use this passed-in variable
6. Clear buffer related tables from STATE_DB when swss docker starts
7. Update the src/sonic-config-engine/tests/sample_output/buffers-dell6100.json according to the buffer_config.j2
8. Remove buffer pool sizes for ingress pools and egress_lossy_pool
Update the buffer settings for dynamic buffer calculation
**- Why I did it**
Align style with slightly modified PEP8 standards (extend maximum line length to 120 chars). This will also help in the transition to Python 3, where it is more strict about whitespace, plus it helps unify style among the SONiC codebase. Will tackle other directories in separate PRs.
**- How I did it**
Using `autopep8 --in-place --max-line-length 120` and some manual tweaks.
**- Why I did it**
We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.
**- How I did it**
- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
- Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
Treat devices that are ToRRouters (ToRRouters and BackEndToRRouters) the same when rendering templates
Except for BackEndToRRouters belonging to a storage cluster, since these devices have extra sub-interfaces created
Treat devices that are LeafRouters (LeafRouters and BackEndLeafRouters) the same when rendering templates
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
**- Why I did it**
As part of moving all SONiC code from Python 2 (no longer supported) to Python 3
**- How I did it**
- Convert enable_counters.py script to Python 3
- Reorganize imports per PEP8 standard
- Two blank lines precede functions per PEP8 standard
Why/How I did:
Make sure first error syslog is triggered based on FAULT TOLERANCE condition.
Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent
Updated the monit conf files with repeat every x cycles for the alert action
As part of the transition from Python 2 to Python 3, we are installing both pip2 and pip3 in the slave and config-engine containers. This PR replaces calls to `pip` in these containers with an explicit call to `pip2` to ensure the proper version of pip is executed, no matter which version of pip is aliased to `pip`, as we no longer rely on that alias.
Also some other pip-related cleanup
The orchagent and syncd need to have the same default synchronous mode configuration. This PR adds a template file to translate the default value in CONFIG_DB (empty field) to an explicit mode so that the orchagent and syncd could have the same default mode.
* Install ndppd during image build, and copy config files to image
* Configure proxy settings based on config DB at container start
* Pipe ndppd output to logger inside container to log output in syslog
Jinja2 templates rendered using Python 3 interpreter, are required
to conform with Python 3 new semantics.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* buildimage: Add gearbox phy device files and a new physyncd docker to support VS gearbox phy feature
* scripts and configuration needed to support a second syncd docker (physyncd)
* physyncd supports gearbox device and phy SAI APIs and runs multiple instances of syncd, one per phy in the device
* support for VS target (sonic-sairedis vslib has been extended to support a virtual BCM81724 gearbox PHY).
HLD is located at b817a12fd8/doc/gearbox/gearbox_mgr_design.md
**- Why I did it**
This work is part of the gearbox phy joint effort between Microsoft and Broadcom, and is based
on multi-switch support in sonic-sairedis.
**- How I did it**
Overall feature was implemented across several projects. The collective pull requests (some in late stages of review at this point):
https://github.com/Azure/sonic-utilities/pull/931 - CLI (merged)
https://github.com/Azure/sonic-swss-common/pull/347 - Minor changes (merged)
https://github.com/Azure/sonic-swss/pull/1321 - gearsyncd, config parsers, changes to orchargent to create gearbox phy on supported systems
https://github.com/Azure/sonic-sairedis/pull/624 - physyncd, virtual BCM81724 gearbox phy added to vslib
**- How to verify it**
In a vslib build:
root@sonic:/home/admin# show gearbox interfaces status
PHY Id Interface MAC Lanes MAC Lane Speed PHY Lanes PHY Lane Speed Line Lanes Line Lane Speed Oper Admin
-------- ----------- --------------- ---------------- --------------- ---------------- ------------ ----------------- ------ -------
1 Ethernet48 121,122,123,124 25G 200,201,202,203 25G 204,205 50G down down
1 Ethernet49 125,126,127,128 25G 206,207,208,209 25G 210,211 50G down down
1 Ethernet50 69,70,71,72 25G 212,213,214,215 25G 216 100G down down
In addition, docker ps | grep phy should show a physyncd docker running.
Signed-off-by: syd.logan@broadcom.com
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that
Monit will not generate false alerting messages into the syslog.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
SWSS config script restore ARP/FDB/Routes. Restore neighbor script
uses config DB ARP information to restore ARP entries and so needs
to be started after swssconfig exits.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
This PR limited the number of calls to sonic-cfggen to one call
per iteration instead of current 3 calls per iteration.
The PR also installs jq on host for future scripts if needed.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Arp update process was not being started due to an issue with
the directory name having an extra 'd' in supervisor as in
'/etc/supervisord/conf.d/arp_update.conf'.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* Support for multi-asic platform for swssloglevel command
admin@str-acs-1:~$ swssloglevel
Usage: /usr/bin/swssloglevel -n [0 to 3] [OPTION]...
* Update to use the env file to get the PLATFORM string.
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
https://github.com/Azure/sonic-buildimage/issues/5255
Root Cause: Waiting on Restore count != 0 can lead to race condition
between orchagent process and swssconfig.sh.
Ideally check of Restore count != 0 is not needed as the State DB
cannot be flushed as if it was flushed then Warm Restart or swss-restart
should not be true also.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
The following changes are done.
- Multi asic platform have 2 Loopback interfaces, Loopback0 and Loopback4096. IPinIP decap entries need to be added for both of them. Update the ipinip.json.j2 template to add decap entries for Loopback4096.
- Add corressponding unit test
Add a master switch so that the sync/async mode can be configured.
Example usage of the switch:
1. Configure mode while building an image
`make ENABLE_SYNCHRONOUS_MODE=y <target>`
2. Configure when the device is running
Change CONFIG_DB with `sonic-cfggen -a '{"DEVICE_METADATA":{"localhost": {"synchronous_mode": "enable"}}}' --write-to-db`
Restart swss with `systemctl restart swss`
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to one call during startup when starting swss service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* [platform] Add Support For Environment Variable
This PR adds the ability to read environment file from /etc/sonic.
the file contains immutable SONiC config attributes such as platform,
hwsku, version, device_type. The aim is to minimize calls being made
into sonic-cfggen during boot time.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* Changes to add template support for copp.json.
This is needed so that we can install differnt type of
Traps based on Device Role (Tor/Leaf/Mgmt/etc...).
Initial use case is to install DHCP/DHCPv6 tarp only
for tor router.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fixed based on review comments.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fixed based on review comment.
Optimizing number of calls made to sonic-cfggen during service
start up as it adds to total system boot up time.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
sonic-cfggen call is slow and it adds to system start up time
**- How I did it**
places all required variable into single template and called into sonic-cfggen using this template
**- How to verify it**
***-Test 1***
there is an average saving of .5 to 1 sec between old script and new script
```
root@str-s6000-acs-14:/# time ./orchagent_old.sh
/usr/bin/orchagent -d /var/log/swss -b 8192 -m f4:8e:38:16:bc:8d
real 0m3.546s
user 0m2.365s
sys 0m0.585s
root@str-s6000-acs-14:/# time ./orchagent_new.sh
/usr/bin/orchagent -d /var/log/swss -b 8192 -m f4:8e:38:16:bc:8d
real 0m2.058s
user 0m1.650s
sys 0m0.363s
```
***-Test 2***
Built an image with this change and orchagent is running with intended params:
```
admin@str-s6000-acs-14:~$ ps -ef | grep orchagent
root 2988 1901 1 02:09 pts/0 00:00:02 /usr/bin/orchagent -d /var/log/swss -b 8192 -m f4:8e:38:16:bc:8d
```
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
also update submodule
* 01f810f 2020-07-02 | fix compiling issue for gcc8.3 (#1339) [lguohan]
* 9b13120 2020-07-03 | Fix in script to avoid orchagent crash when port down followed by fdb delete (#1340) [rupesh-k]
* 9b01844 2020-07-01 | [qosorch] Update QoS scheduler params for shaping features (#1296) [Michael Li]
* 86b5e99 2020-07-02 | [mirrororch] Port Mirroring implementation (#1314) [rupesh-k]
* c05601c 2020-06-24 | [portsyncd]: add debug message if a port cannot be found in port able (#1328) [lguohan]
* a0b6412 2020-06-23 | COPP_DEL_fix: DEL for one trap group from SONIC is resetting all the trap IDs (#1273) [SinghMinu]
Signed-off-by: Guohan Lu <lguohan@gmail.com>
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.
Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.
**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
when portsyncd starts, it first enumerates all front panel ports
and marks them as old interfaces. Then, for new front panel ports
it checks if their indexes exist in previous sets. If yes, it will
treats them as old interfaces and ignore them.
The reason we have this check is because broadcom SAI only removes
front panel ports after sai switch init.
So, if portsyncd starts after orchagent, new interfaces could be
created before portsyncd and treated as old interface.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
FDB/ARP/Default routes files are deleted after swssconfig. This
makes debugging/validation of device conversion hard. This PR
saves those files in order to facilitate debugging of device conversion.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
We need RIF counters to be enabled by default. Flex Counter does probe for supported counters. If a platform does not support RIF counters, SAI will return NOT_SUPPORTED and Flex Counter will stop polling the counter.
**- How to verify it**
After fresh install rif counter gropup is enabled by default:
$ counterpoll show
Type Interval (in ms) Status
-------------------- ------------------ --------
QUEUE_STAT default (10000) enable
PORT_STAT default (1000) enable
RIF_STAT default (1000) enable
QUEUE_WATERMARK_STAT default (10000) enable
PG_WATERMARK_STAT default (10000) enable
Signed-off-by: Mykola Faryma <mykolaf@mellanox.com>
* Multi-ASIC platforms will have the ID field in the DEVICE_METADATA, which will be pulled and
will be used when starting the orchagent process with the new option [-i INST_ID]
This is currently added only for Broadcom ASIC based platforms
* Making the asic instance ID passing global across asics/platforms.
Also changed the config DB id field to asic_id
* Minor updates
* Advance sonic-swss submodule
* Advance swss_common submodule as well due to dependencies
libpython2.7, libdaemon0, libdbus-1-3, libjansson4 are common
across different containers. move them into docker-base-stretch
Signed-off-by: Guohan Lu <lguohan@gmail.com>