Why I did it
Running warm-reboot in a loop for 500 times leads to this error on 318-th iteration:
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors Traceback (most recent call last):
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors File "/usr/bin/restore_neighbors.py", line 24, in <module>
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors from scapy.all import conf, in6_getnsma, inet_pton, inet_ntop, in6_getnsmac, get_if_hwaddr, Ether, ARP, IPv6, ICMPv6ND_NS, ICMPv6NDOptSrcLLAddr
Apr 2 15:56:27.346795 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/all.py", line 25, in <module>
Apr 2 15:56:27.346956 sonic INFO swss#/supervisord: restore_neighbors from scapy.route import *
Apr 2 15:56:27.346995 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/route.py", line 205, in <module>
Apr 2 15:56:27.347089 sonic INFO swss#/supervisord: restore_neighbors conf.iface = get_working_if()
Apr 2 15:56:27.347129 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/arch/linux.py", line 128, in get_working_if
Apr 2 15:56:27.347213 sonic INFO swss#/supervisord: restore_neighbors ifflags = struct.unpack("16xH14x", get_if(i, SIOCGIFFLAGS))[0]
Apr 2 15:56:27.347250 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/arch/common.py", line 31, in get_if
Apr 2 15:56:27.347345 sonic INFO swss#/supervisord: restore_neighbors return ioctl(sck, cmd, struct.pack("16s16x", iff.encode("utf8")))
Apr 2 15:56:27.347365 sonic INFO swss#/supervisord: restore_neighbors OSError: [Errno 19] No such device
The issue was reported to scapy devs secdev/scapy#3369, the fix is secdev/scapy#3371, however there is no released scapy version with this fix right now, thus decided to build scapy v2.4.5 from sources and apply the fix in a form of a patch.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Change the base image from `docker-config-engine-buster` to
`docker-config-engine-bullseye`, and remove the hardcoded
`radvd` version from the Dockerfile.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
Fastboot will delay all counters in CONFIG DB, it relies on enable_counters.py to recover the delayed counters. However, enable_counters.py does not recover those non-default counters.
- How I did it
For non-default counters, if it is in CONFIG DB, put delay status to false after the waiting.
- How to verify it
Manual test
#### Why I did it
when adding and removing ports after init stage we saw two issues:
first:
In several cases, after removing a port, lldpmgr is continuing to try to add a port to lldp with lldpcli command. the execution of this command is continuing to fail since the port is not existing anymore.
second:
after adding a port, we sometimes see this warning messgae:
"Command failed 'lldpcli configure ports Ethernet18 lldp portidsubtype local etp5b': 2021-07-27T14:16:54 [WARN/lldpctl] cannot find port Ethernet18"
we added these changes in order to solve it.
#### How I did it
port create events are taken from app db only.
lldpcli command is executed only when linux port is up.
when delete port event is received we remove this command from pending_cmds dictionary
#### How to verify it
manual tests and running lldp tests
#### Description for the changelog
Dynamic port configuration - solve lldp issues when adding/removing ports
#### Why I did it
SONiC is migrating to bullseye. This will update the sonic-pins container to bullseye.
#### How I did it
The [sonic-pins code](https://github.com/Azure/sonic-buildimage/blob/master/rules/p4rt.mk) isn't dependent on any architecture so it will already build successfully for bullseye. This PR updates the docker to use bullseye.
#### How to verify it
Today we cannot build the docker-sonic-p4rt.gz target (e.g. Issue #9885). With this change the docker will build successfully. The P4RT executable will not run, because of a missing runtime library, libgmpxx, which I'll address in a followup PR.
#### Description for the changelog
Update docker-sonic-p4rt.gz target to build with Bullseye instead of Buster.
Python 2 isn't installed by default in Buster and Bullseye containers,
and the scripts/modules can be used with Python 3, so make sure Python 3
is used.
Why I did it
After the Buster and Bullseye upgrade for the restapi container, processes will no longer start because supervisord is trying to call python and python2, both of which are unavailable.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
Migration of sonic-mgmt codebase from Python 2 to Python 3
How I did it
Added scapy dependencies to the env-python3 virtual environment.
How to verify it
Run test case:
py.test --testbed=testbed-t0 --inventory=../ansible/lab --testbed_file=../ansible/testbed.csv --host-pattern=testbed-t0 -- module-path=../ansible/library lldp
Signed-off-by: Oleksandr Kozodoi <oleksandrx.kozodoi@intel.com>
# Why I did it
Reduce the disk space taken up during bootup and runtime.
# How I did it
1. Remove python package cache from the base image and from the containers.
2. During bootup, if logs are to be stored in memory, then don't create the `var-log.ext4` file just to delete it later during bootup.
3. For the partition containing `/host`, don't reserve any blocks for just the root user. This just makes sure all disk space is available for all users, if needed during upgrades (for example).
* Remove pip2 and pip3 caches from some containers
Only containers which appeared to have a significant pip cache size are
included here.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Don't create var-log.ext4 if we're storing logs in memory
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Run tune2fs on the device containing /host to not reserve any blocks for just the root user
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
This fix is to address issue: Azure/sonic-mgmt#5280
In the sonic-mgmt Dockerfile, python package allure-pytest is installed after ENV USER $user.
Consequently the package is installed to path /home/$user/.local and is only available to the $user
account. If we use root account in sonic-mgmt docker container to run tests, any script importing
the allure package will fail with ImportError. We need to install the allure-pytest package to global
directory instead of user local directory.
How I did it
Update the sonic-mgmt Dockerfile to ensure that the allure-pytest package is installed to global directory
How to verify it
Build a new sonic-mgmt docker image based on the changes.
Use sonic-mgmt docker container of the newly built image to run test scripts that depend on the
allure-pytest package. No ImportError is raised.
This PR includes necessary changes for the setup of the Python3 virtual environment in the sonic-mgmt docker container.
How to activate Python3 virtual environment?
Connect to the sonic-mgmt container
$ docker exec -ti sonic-mgmt bash
Activate the virtual environment
$ source /var/user/env-python3/bin/activate
Why I did it
Migration of sonic-mgmt codebase from Python 2 to Python 3
How I did it
Added all necessary dependencies to the env-python3 virtual environment.
Signed-off-by: Oleksandr Kozodoi <oleksandrx.kozodoi@intel.com>
Why I did it
Code review was still in progress when #9858 was merged and upon further testing I have arrived at a better solution.
How I did it
Modified supervisord configuration j2 template for pmon to require no minimum uptime for chassisd_db_init and to remove the redundant exit_codes directive
How to verify it
Boot switch and verify in syslog that there are no errors related to chassis_db_init
- Use the `wait_for_link.sh` script to delay ndppd start until after the VLAN interface is ready
- Avoids issue where ndppd tries to change interface attributes before the interface is ready
- Use the `wait_for_link.sh` script to delay ndppd start until after the VLAN interface is ready
- Avoids issue where ndppd tries to change interface attributes before the interface is ready
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Why I did it
In the recent minigraph changes we add separate BGP session configuration for V4 and V6 internal VoQ neighbors.
This PR is adding different Peer groups for V4 and V6 neighbors
How I did it
Add VOQ_CHASSIS_V4_PEER and VOQ_CHASSIS_V6_PEER groups
Add extra Unit tests
How to verify it
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
* [PTF-SAIv2]Add ptf dockre for sai-ptf (saiv2)
Base on current ptf docker create a new docker for sai-ptf(saiv2)
upgrade related package
use the latest ptf and install it
test done:
NOJESSIE=1 NOSTRETCH=1 NOBULLSEYE=1 ENABLE_SYNCD_RPC=y make target/docker-ptf-sai.gz
BLDENV=buster make -f Makefile.work target/docker-ptf-sai.gz
* upgrade the thrift to 014
Why I did it
Radvd.conf.j2 template creates two copies of the vlan interface when there are more than one ipv6 address assigned to a single vlan interface. Changed the format to add prefixes under the same vlan interface block.
How I did it
Modifies radvd.conf.j2 and added unit tests
How to verify it
Configure multiple ipv6 address to the same vlan, start radvd
Unit test will check if radvd.conf with multiple ipv6 addresses is formed correctly
#### Why I did it
The current redis version of SONiC is `6.0.6`, which contains many high-risky security issues like CVEs that are fixed in the latest version. The Redis release notes also highly recommend to upgrade with SECURITY urgency.
```
================================================================================
Redis 6.0.16 Released Mon Oct 4 12:00:00 IDT 2021
================================================================================
Upgrade urgency: SECURITY, contains fixes to security issues.
Security Fixes:
* (CVE-2021-41099) Integer to heap buffer overflow handling certain string
commands and network payloads, when proto-max-bulk-len is manually configured
to a non-default, very large value [reported by yiyuaner].
* (CVE-2021-32762) Integer to heap buffer overflow issue in redis-cli and
redis-sentinel parsing large multi-bulk replies on some older and less common
platforms [reported by Microsoft Vulnerability Research].
* (CVE-2021-32687) Integer to heap buffer overflow with intsets, when
set-max-intset-entries is manually configured to a non-default, very large
value [reported by Pawel Wieczorkiewicz, AWS].
* (CVE-2021-32675) Denial Of Service when processing RESP request payloads with
a large number of elements on many connections.
* (CVE-2021-32672) Random heap reading issue with Lua Debugger [reported by
Meir Shpilraien].
* (CVE-2021-32628) Integer to heap buffer overflow handling ziplist-encoded
data types, when configuring a large, non-default value for
hash-max-ziplist-entries, hash-max-ziplist-value, zset-max-ziplist-entries
or zset-max-ziplist-value [reported by sundb].
* (CVE-2021-32627) Integer to heap buffer overflow issue with streams, when
configuring a non-default, large value for proto-max-bulk-len and
client-query-buffer-limit [reported by sundb].
* (CVE-2021-32626) Specially crafted Lua scripts may result with Heap buffer
overflow [reported by Meir Shpilraien].
Other bug fixes:
* Fix appendfsync to always guarantee fsync before reply, on MacOS and FreeBSD (kqueue) (#9416)
* Fix the wrong mis-detection of sync_file_range system call, affecting performance (#9371)
* Fix replication issues when repl-diskless-load is used (#9280)
```
#### How I did it
Edit `Dockerfile.j2` file
#### How to verify it
Check redis version
#### Description for the changelog
This PR will upgrade redis-server version to `6.0.16`.
- Why I did it
Error log was shown on switches during boot
pmon#supervisord 2021-12-22 04:27:16,709 INFO exited: chassis_db_init (exit status 0; not expected)
- How I did it
Add exit code zero as an expected exit code and also disable autorestart.
- How to verify it
Boot the switch and ensure the above log line does not appear.
Implement infrastructure that allows enabling address sanitizer
for docker containers. Enable address sanitizer for SWSS container.
- Why I did it
To add a possibility to compile SONiC applications with address sanitizer (ASAN).
ASAN is a memory error detector for C/C++. It finds:
1. Use after free (dangling pointer dereference)
2. Heap buffer overflow
3. Stack buffer overflow
4. Global buffer overflow
5. Use after return
6. Use after the scope
7. Initialization order bugs
8. Memory leaks
- How I did it
By adding new ENABLE_ASAN configuration option.
- How to verify it
By default ASAN is disabled and the SONiC image is not affected.
When ASAN is enabled it inspects all allocation, deallocation, and memory usage that the application does in run time. To verify whether the application has memory errors tests that trigger memory usage of the application should be run. Ideally, the whole regression tests should be run. Memory leaks reports will be placed in /var/log/asan/ directory of SONiC host OS.
Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
- Why I did it
Remove obsolete parameter that enables static VXLAN src port range
provide functionality no generate json config file according to appropriate parameter in config_db
Done for
SN3800:
• Mellanox-SN3800-D28C50
• Mellanox-SN3800-C64
• Mellanox-SN3800-D28C49S1 (New 10G SKU)
SN2700:
• Mellanox-SN2700-D48C8
- How I did it
Remove SAI_VXLAN_SRCPORT_RANGE_ENABLE=1 from appropriate sai.profile files
Created vxlan.json file and added few params that depends on DEVICE_METADATA.localhost.vxlan_port_range
- How to verify it
File /etc/swss/config.d/vxlan.json should be generated inside swss docker when it restart
[
{
"SWITCH_TABLE:switch": {
"vxlan_src": "0xFF00",
"vxlan_mask": "8"
},
"OP": "SET"
}
]
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
* [y_cable] Support for initialization of new Daemon ycable to support
ycables
This PR also adds the commit in sonic-platform-daemons
94fa239 [y_cable] refactor y_cable to a seperate logic and new daemon from xcvrd (#219)
Why I did it
This PR separates the logic of Y-Cable from xcvrd. Before this change we were utilizing xcvrd daemon to control all aspects of Y-Cable right from initialization to processing requests from other entities like orch,linkmgr.
Now we would have another daemon ycabled which will serve this purpose.
Logically everything still remains the same from the perspective of other daemons.
it also take care aspects like init/delete daemon from Y-Cable perspective.
How I did it
To serve the purpose we build a new wheel sonic_ycabled-1.0-py3-none-any.whl and install it inside pmon.
We also initalize the daemon ycabled which serves our purpose for refactor inside pmon
How to verify it
Ran the changes with an image for dualtor tests on a 7050cx3 platform
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Why I did it
Add TSA/B/C dualtor support
Signed-off-by: Longxiang Lyu lolv@microsoft.com
How I did it
For TSA, toggle all the mux to standby if the device type is dualtor and there are active mux ports.
For TSC, add mux status output.
How to verify it
Run TSA/B/C on a dualtor setup
As part of this, update the isc-dhcp package to match the Bullseye
version (this fixes some compile errors related to BIND), clean up some
of the build dependencies and runtime dependencies for debian packaging,
and use the default Boost version to compile against instead of
explicitly saying using 1.74.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
#### Why I did it
resolves https://github.com/Azure/sonic-buildimage/issues/8779
snmpd writes the below error message in syslog :
snmp#snmpd[27]: truncating integer value > 32 bits
This message is written in syslog when the hrSystemUptime(1.3.6.1.2.1.25.1.1.0 / system uptime) or sysUpTime(1.3.6.1.2.1.1.3 network management portion or snmpd uptime) is queried when either of these counters overflow beyond 32 bit value. This happens the device uptime or snmpd uptime is more than 497 days.
#### How I did it
Reference: https://access.redhat.com/solutions/367093 and https://linux.die.net/man/1/snmpcmd
To avoid seeing this message if the counter grows, the snmpd error log level is changed to display LOG_EMERG, LOG_ALERT, LOG_CRIT, and LOG_DEBUG.
Without this change, LOG_ERR and LOG_WARNING would also be logged in syslog.
#### How to verify it
On a device which is up for more than 497 days, modify supervisord.conf with the change and restart snmp.
Query 1.3.6.1.2.1.1.3 and verify that log message is not seen.
Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
Why I did it
Arista 7060 platform has a rare and unreproduceable PCIe timeout that could possibly be solved with increasing the switch PCIe timeout value. To do this we'll call a script for this platform to increase the PCIe timeout on boot-up.
No issues would be expected from the setpci command. From the PCIe spec:
"Software is permitted to change the value in this field at any
time. For Requests already pending when the Completion
Timeout Value is changed, hardware is permitted to use either
the new or the old value for the outstanding Requests, and is
permitted to base the start time for each Request either on when
this value was changed or on when each request was issued. "
How I did it
Add "platform-init" support in swss docker similar to how "hwsku-init" is called, only this would be for any device belonging to a platform. Then the script would reside in device data folder.
Additionally, add pciutils dependency to docker-orchagent so it can run the setpci commands.
How to verify it
On bootup of an Arista 7060, can execute:
lspci -vv -s 01:00.0 | grep -i "devctl2"
In order to check that the timeout has changed.
- Create a script in the orchagent docker container which listens for these encapsulated packets which are trapped to CPU (indicating that they cannot be routed/no neighbor info exists for the inner packet). When such a packet is received, the script will issue a ping command to the packet's inner destination IP to start the neighbor learning process.
- This script is also resilient to portchannel status changes (i.e. interface going up or down). An interface going down does not affect traffic sniffing on interfaces which are still up. When an interface comes back up, we restart the sniffer to start capturing traffic on that interface again.
Why I did it
To enable test support for BFD-related features, the PTF docker needs to have the proper support for BFD. This PR aims to add BFD support in ptf docker.
How I did it
Clone and build OpenBFDD for PTF docker.
How to verify it
Build locally and verify BFD is supported.
What I did:
Updated Jinja Template to enable BGP Graceful Restart based on device role. By default it will be enable only if the device role type is TorRouter.
Why I did:-
By default FRR is configured in Graceful Helper mode. Graceful Restart is needed on T0/TorRouter only since the device can go for warm-reboot. For T1/LeafRouter it need to be in Helper mode only