* [BGP]Adding configuration knob to allow advertise Loopback ipv6 /128 prefix
By default when IPv6 address is configured with /128 as subnet mask in Loopback0 interface, it will be advertised as prefix with /64 subnet.
To control this behavior a new field 'bgp_adv_lo_prefix_as_128' is introduced in DEVICE_METADATA table which when set to true will advertise prefix with /128 subnet as it is.
What/Why I did:
Issue1: By setting up of ipvlan interface in interface-config.sh we are not tolerant to failures. Reason being interface-config.service is one-shot and do not have restart capability.
Scenario: For example if let's say database service goes in fail state then interface-services also gets failed because of dependency check but later database service gets restart but interface service will remain in stuck state and the ipvlan interface nevers get created.
Solution: Moved all the logic in database service from interface-config service which looks more align logically also since the namespace is created here and all the network setting (sysctl) are happening here.With this if database starts we recreate the interface.
Issue 2: Use of IPVLAN vs MACVLAN
Currently we are using ipvlan mode. However above failure scenario is not handle correctly by ipvlan mode. Once the ipvlan interface is created and ip address assign to it and if we restart interface-config or database (new PR) service Linux Kernel gives error "Error: Address already assigned to an ipvlan device." based on this:https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978Reason being if we do not do cleanup of ip address assignment (need to be unique for IPVLAN) it remains in Kernel Database and never goes to free pool even though namespace is deleted.
Solution: Considering this hard dependency of unique ip macvlan mode is better for us and since everything is managed by Linux Kernel and no dependency for on user configured IP address.
Issue3: Namespace database Service do not check reachability to Supervisor Redis Chassis Server.
Currently there is no explicit check as we never do Redis PING from namespace to Supervisor Redis Chassis Server. With this check it's possible we will start database and all other docker even though there is no connectivity and will hit the error/failure late in cycle
Solution: Added explicit PING from namespace that will check this reachability.
Issue 4:flushdb give exception when trying to accces Chassis Server DB over Unix Sokcet.
Solution: Handle gracefully via try..except and log the message.
Why I did it
There is a bug that the Port attributes in CONFIG_DB will be cleared if using sudo config macsec port add Ethernet0 or sudo config macsec port del Ethernet0
How I did it
To fetch the port attributes before set/remove MACsec field in port table.
Signed-off-by: Ze Gan <ganze718@gmail.com>
Fixes#9279
- Why I did it
Part of larger effort to move all SONiC systems to bullseye
- How I did it
1. Update container makefiles with correct dependencies
2. Update container Dockerfile with correct base image
3. Update container Dockerfile with correct apt dependencies
4. Update any other makefiles with dependencies to remove python2 support
5. Minor changes to support bullseye / python3
- How to verify it
Run regression on the switch:
1. Verify PTF community tests work
2. Verify syncd runs and all ports come up / pass traffic
3. Verify all platform tests succeed
Why I did it
When lldpmgrd handled events of other tables besides PORT_TABLE, error message was printed to log.
How I did it
Handle event according to its file descriptor instead of looping all registered selectables for each coming event.
How to verify it
I verified same events are being handled by printing events key and operation, before and after the change.
Also, before the change, in init flow after config reload, when lldpmgrd handled events of other tables besides PORT_TABLE, error messages were printed to log, this issue is solved now.
Currently, the build dockers are created as a user dockers(docker-base-stretch-<user>, etc) that are
specific to each user. But the sonic dockers (docker-database, docker-swss, etc) are
created with a fixed docker name and common to all the users.
docker-database:latest
docker-swss:latest
When multiple builds are triggered on the same build server that creates parallel building issue because
all the build jobs are trying to create the same docker with latest tag.
This happens only when sonic dockers are built using native host dockerd for sonic docker image creation.
This patch creates all sonic dockers as user sonic dockers and then, while
saving and loading the user sonic dockers, it rename the user sonic
dockers into correct sonic dockers with tag as latest.
docker-database:latest <== SAVE/LOAD ==> docker-database-<user>:tag
The user sonic docker names are derived from 'DOCKER_USERNAME and DOCKER_USERTAG' make env
variable and using Jinja template, it replaces the FROM docker name with correct user sonic docker name for
loading and saving the docker image.
Why I did it
Migrate ptftests script to python3, in order to do an incremental migration, add python virtual environment firstly, install all required python packages in virtual env as well.
Then migrate ptftests scripts from python2 to python3 one by one avoid impacting non-changed scripts.
Signed-off-by: Zhaohui Sun zhaohuisun@microsoft.com
How I did it
Add python3 virtual environment for docker-ptf.
Add submodule ptf-py3 and install patched ptf 0.9.3 into virtual environment as well, two ptf issues were reported here:
p4lang/ptf#173p4lang/ptf#174
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Why I did it
Recirc port is used to only forward traffic from one asic to another asic. So it's not required to configure LLDP on it.
How I did it
Add interface prefix helper for recirc port. Similar to skip configuring LLDP on inband port, add check in lldpmgrd to skip recirc port by checking interface prefix.
Asic PCI ID (PCI address) is collected by chassisd (inside pmon -
Azure/sonic-platform-daemons#175) and saved in CHASSIS_STATE_DB (in
redis_chassis). CHASSIS_STATE_DB is accessible by swss containers.
At docker-init.sh (script is called after swss container is created and before
anything that could run in swss like orchagent...), we wait until asic PCI ID
of the corresponding asic is populated by chassisd. We then update asic_id in
CONFIG_DB of asic's database.
A system supporting dynamic asic PCI ID identification requires to have a file
(empty) use_pci_id_chassis in its platform dir.
When orchagent runs, it has correct asic PCI ID in its CONFIG_DB.
Together with this PR:
Azure/sonic-platform-daemons#175Azure/sonic-platform-common#185
Signed-off-by: Maxime Lorrillere <mlorrillere@arista.com>
Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.
Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.
The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.
Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:
Program 'container_memory_telemetry'
status Status ok
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 0
last output -
data collected Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:
Program 'container_memory_telemetry'
status Status failed
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 0
last output -
data collected Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.
How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.
How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
sign-off: Jing Zhang zhangjing@microsoft.com
#### Why I did it
As part of the process moving containers from buster to bullseye.
#### How I did it
1. change base image from buster to bullseye.
2. remove unused addition to orchagent run options
#### How to verify it
Tested building locally.
It upgraded scapy to 2.4.5 in docker-ptf container, after this upgrade, all scripts under ansible/roles/test/files/ptftests will import scapy 2.4.5, some test cases will fail because they are not upgraded accordingly.
Reverts #10507 to avoid breaking regression test.
This reverts commit 92efc01270.
Removed python2 support for sonic-platform-daemons that was causing unit
test errors in sonic_pcied.
* Removed config from docker supervisord jinja templates per VD review comment
* Removed space and python3 per QL comments
Why I did it
Existing dataplane tests cannot be tested under MACsec environment due to the traffic under MACsec link is encrypted. So, I will override the dp_poll of ptf to MACsec dp_poll to decrypt the MACsec packets on injected ports (PR: Azure/sonic-mgmt#5490). MACsec decryption library depends on scapy 2.4.5.
How I did it
Upgrade scapy library to 2.4.5 by pip.
How to verify it
Check the scapy version in docker-ptf by
python -c "import scapy; print(scapy.__version__)"
2.4.5
Signed-off-by: Ze Gan <ganze718@gmail.com>
Why I did it
Running warm-reboot in a loop for 500 times leads to this error on 318-th iteration:
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors Traceback (most recent call last):
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors File "/usr/bin/restore_neighbors.py", line 24, in <module>
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors from scapy.all import conf, in6_getnsma, inet_pton, inet_ntop, in6_getnsmac, get_if_hwaddr, Ether, ARP, IPv6, ICMPv6ND_NS, ICMPv6NDOptSrcLLAddr
Apr 2 15:56:27.346795 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/all.py", line 25, in <module>
Apr 2 15:56:27.346956 sonic INFO swss#/supervisord: restore_neighbors from scapy.route import *
Apr 2 15:56:27.346995 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/route.py", line 205, in <module>
Apr 2 15:56:27.347089 sonic INFO swss#/supervisord: restore_neighbors conf.iface = get_working_if()
Apr 2 15:56:27.347129 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/arch/linux.py", line 128, in get_working_if
Apr 2 15:56:27.347213 sonic INFO swss#/supervisord: restore_neighbors ifflags = struct.unpack("16xH14x", get_if(i, SIOCGIFFLAGS))[0]
Apr 2 15:56:27.347250 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/arch/common.py", line 31, in get_if
Apr 2 15:56:27.347345 sonic INFO swss#/supervisord: restore_neighbors return ioctl(sck, cmd, struct.pack("16s16x", iff.encode("utf8")))
Apr 2 15:56:27.347365 sonic INFO swss#/supervisord: restore_neighbors OSError: [Errno 19] No such device
The issue was reported to scapy devs secdev/scapy#3369, the fix is secdev/scapy#3371, however there is no released scapy version with this fix right now, thus decided to build scapy v2.4.5 from sources and apply the fix in a form of a patch.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Change the base image from `docker-config-engine-buster` to
`docker-config-engine-bullseye`, and remove the hardcoded
`radvd` version from the Dockerfile.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
Fastboot will delay all counters in CONFIG DB, it relies on enable_counters.py to recover the delayed counters. However, enable_counters.py does not recover those non-default counters.
- How I did it
For non-default counters, if it is in CONFIG DB, put delay status to false after the waiting.
- How to verify it
Manual test
#### Why I did it
when adding and removing ports after init stage we saw two issues:
first:
In several cases, after removing a port, lldpmgr is continuing to try to add a port to lldp with lldpcli command. the execution of this command is continuing to fail since the port is not existing anymore.
second:
after adding a port, we sometimes see this warning messgae:
"Command failed 'lldpcli configure ports Ethernet18 lldp portidsubtype local etp5b': 2021-07-27T14:16:54 [WARN/lldpctl] cannot find port Ethernet18"
we added these changes in order to solve it.
#### How I did it
port create events are taken from app db only.
lldpcli command is executed only when linux port is up.
when delete port event is received we remove this command from pending_cmds dictionary
#### How to verify it
manual tests and running lldp tests
#### Description for the changelog
Dynamic port configuration - solve lldp issues when adding/removing ports
#### Why I did it
SONiC is migrating to bullseye. This will update the sonic-pins container to bullseye.
#### How I did it
The [sonic-pins code](https://github.com/Azure/sonic-buildimage/blob/master/rules/p4rt.mk) isn't dependent on any architecture so it will already build successfully for bullseye. This PR updates the docker to use bullseye.
#### How to verify it
Today we cannot build the docker-sonic-p4rt.gz target (e.g. Issue #9885). With this change the docker will build successfully. The P4RT executable will not run, because of a missing runtime library, libgmpxx, which I'll address in a followup PR.
#### Description for the changelog
Update docker-sonic-p4rt.gz target to build with Bullseye instead of Buster.
Python 2 isn't installed by default in Buster and Bullseye containers,
and the scripts/modules can be used with Python 3, so make sure Python 3
is used.
Why I did it
After the Buster and Bullseye upgrade for the restapi container, processes will no longer start because supervisord is trying to call python and python2, both of which are unavailable.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
Migration of sonic-mgmt codebase from Python 2 to Python 3
How I did it
Added scapy dependencies to the env-python3 virtual environment.
How to verify it
Run test case:
py.test --testbed=testbed-t0 --inventory=../ansible/lab --testbed_file=../ansible/testbed.csv --host-pattern=testbed-t0 -- module-path=../ansible/library lldp
Signed-off-by: Oleksandr Kozodoi <oleksandrx.kozodoi@intel.com>
# Why I did it
Reduce the disk space taken up during bootup and runtime.
# How I did it
1. Remove python package cache from the base image and from the containers.
2. During bootup, if logs are to be stored in memory, then don't create the `var-log.ext4` file just to delete it later during bootup.
3. For the partition containing `/host`, don't reserve any blocks for just the root user. This just makes sure all disk space is available for all users, if needed during upgrades (for example).
* Remove pip2 and pip3 caches from some containers
Only containers which appeared to have a significant pip cache size are
included here.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Don't create var-log.ext4 if we're storing logs in memory
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Run tune2fs on the device containing /host to not reserve any blocks for just the root user
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
This fix is to address issue: Azure/sonic-mgmt#5280
In the sonic-mgmt Dockerfile, python package allure-pytest is installed after ENV USER $user.
Consequently the package is installed to path /home/$user/.local and is only available to the $user
account. If we use root account in sonic-mgmt docker container to run tests, any script importing
the allure package will fail with ImportError. We need to install the allure-pytest package to global
directory instead of user local directory.
How I did it
Update the sonic-mgmt Dockerfile to ensure that the allure-pytest package is installed to global directory
How to verify it
Build a new sonic-mgmt docker image based on the changes.
Use sonic-mgmt docker container of the newly built image to run test scripts that depend on the
allure-pytest package. No ImportError is raised.
This PR includes necessary changes for the setup of the Python3 virtual environment in the sonic-mgmt docker container.
How to activate Python3 virtual environment?
Connect to the sonic-mgmt container
$ docker exec -ti sonic-mgmt bash
Activate the virtual environment
$ source /var/user/env-python3/bin/activate
Why I did it
Migration of sonic-mgmt codebase from Python 2 to Python 3
How I did it
Added all necessary dependencies to the env-python3 virtual environment.
Signed-off-by: Oleksandr Kozodoi <oleksandrx.kozodoi@intel.com>
Why I did it
Code review was still in progress when #9858 was merged and upon further testing I have arrived at a better solution.
How I did it
Modified supervisord configuration j2 template for pmon to require no minimum uptime for chassisd_db_init and to remove the redundant exit_codes directive
How to verify it
Boot switch and verify in syslog that there are no errors related to chassis_db_init
- Use the `wait_for_link.sh` script to delay ndppd start until after the VLAN interface is ready
- Avoids issue where ndppd tries to change interface attributes before the interface is ready
- Use the `wait_for_link.sh` script to delay ndppd start until after the VLAN interface is ready
- Avoids issue where ndppd tries to change interface attributes before the interface is ready
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Why I did it
In the recent minigraph changes we add separate BGP session configuration for V4 and V6 internal VoQ neighbors.
This PR is adding different Peer groups for V4 and V6 neighbors
How I did it
Add VOQ_CHASSIS_V4_PEER and VOQ_CHASSIS_V6_PEER groups
Add extra Unit tests
How to verify it
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
* [PTF-SAIv2]Add ptf dockre for sai-ptf (saiv2)
Base on current ptf docker create a new docker for sai-ptf(saiv2)
upgrade related package
use the latest ptf and install it
test done:
NOJESSIE=1 NOSTRETCH=1 NOBULLSEYE=1 ENABLE_SYNCD_RPC=y make target/docker-ptf-sai.gz
BLDENV=buster make -f Makefile.work target/docker-ptf-sai.gz
* upgrade the thrift to 014
Why I did it
Radvd.conf.j2 template creates two copies of the vlan interface when there are more than one ipv6 address assigned to a single vlan interface. Changed the format to add prefixes under the same vlan interface block.
How I did it
Modifies radvd.conf.j2 and added unit tests
How to verify it
Configure multiple ipv6 address to the same vlan, start radvd
Unit test will check if radvd.conf with multiple ipv6 addresses is formed correctly
#### Why I did it
The current redis version of SONiC is `6.0.6`, which contains many high-risky security issues like CVEs that are fixed in the latest version. The Redis release notes also highly recommend to upgrade with SECURITY urgency.
```
================================================================================
Redis 6.0.16 Released Mon Oct 4 12:00:00 IDT 2021
================================================================================
Upgrade urgency: SECURITY, contains fixes to security issues.
Security Fixes:
* (CVE-2021-41099) Integer to heap buffer overflow handling certain string
commands and network payloads, when proto-max-bulk-len is manually configured
to a non-default, very large value [reported by yiyuaner].
* (CVE-2021-32762) Integer to heap buffer overflow issue in redis-cli and
redis-sentinel parsing large multi-bulk replies on some older and less common
platforms [reported by Microsoft Vulnerability Research].
* (CVE-2021-32687) Integer to heap buffer overflow with intsets, when
set-max-intset-entries is manually configured to a non-default, very large
value [reported by Pawel Wieczorkiewicz, AWS].
* (CVE-2021-32675) Denial Of Service when processing RESP request payloads with
a large number of elements on many connections.
* (CVE-2021-32672) Random heap reading issue with Lua Debugger [reported by
Meir Shpilraien].
* (CVE-2021-32628) Integer to heap buffer overflow handling ziplist-encoded
data types, when configuring a large, non-default value for
hash-max-ziplist-entries, hash-max-ziplist-value, zset-max-ziplist-entries
or zset-max-ziplist-value [reported by sundb].
* (CVE-2021-32627) Integer to heap buffer overflow issue with streams, when
configuring a non-default, large value for proto-max-bulk-len and
client-query-buffer-limit [reported by sundb].
* (CVE-2021-32626) Specially crafted Lua scripts may result with Heap buffer
overflow [reported by Meir Shpilraien].
Other bug fixes:
* Fix appendfsync to always guarantee fsync before reply, on MacOS and FreeBSD (kqueue) (#9416)
* Fix the wrong mis-detection of sync_file_range system call, affecting performance (#9371)
* Fix replication issues when repl-diskless-load is used (#9280)
```
#### How I did it
Edit `Dockerfile.j2` file
#### How to verify it
Check redis version
#### Description for the changelog
This PR will upgrade redis-server version to `6.0.16`.
- Why I did it
Error log was shown on switches during boot
pmon#supervisord 2021-12-22 04:27:16,709 INFO exited: chassis_db_init (exit status 0; not expected)
- How I did it
Add exit code zero as an expected exit code and also disable autorestart.
- How to verify it
Boot the switch and ensure the above log line does not appear.