Why I did it
Fix PR merge failed because 'vstest' step does not install libyang.
How I did it
Install libyang in azure pipeline.
How to verify it
Pass vstest step.
- Why I did it
To update MFT package to the latest version.
- How I did it
Updated MFT_VERSION & MFT_REVISION in platform/mellanox/mft.mk.
- How to verify it
Build an image and deploy to the switch
Check MFT version by dpkg -l | grep mft
Verify that all the SONiC services up and running
Run regression testing using tests from sonic-mgmt
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
To include latest fixes and new functionality
SAI fixes and new features
fix#3205239, incorrect object type returned for SG child list
Fix VRF-VNI map entries remove issue
ECC health event and logging
[Port Buffers] restore default queue and pg configuration when all user pools are deleted
Fix EVPN type3 error on removal of uc/bc flood group
Fix EVPN type2 MAC move from local to remote results in SAI failure
Fix Disable learning on VXLAN tunnel
Fix error on VXLAN v6 tunnel removal
Fix port cannot apply schedule group when it is a lag member
Fix BFD add more detailed message on BFD packet not related to any existing session
gcc10 compilation fixes
Disable learning on VXLAN tunnel
Support BFD remote-disc exchange in negotiation stage
Tunnel Loopback packet action attribute implementation (for Dual TOR)
Add KVD resources MIN/MAX functionality (pending CRM issue with MIN only)
Support for CRC2 hash algorithm
Bulk counter support for PGs, queues
Support mirror sample rate attribute (SPC2+)
[Functional] [QoS] | Unable to remove SCHEDULE profile table even if there is no object referencing it
Next hop group optimized bulk API
Reduce verbosity of shared database already exists print
Span mirror policer (SPC2+), optimize pipeline for acl mirror action with policer on SPC2+
use same size descriptor pool for rx/tx
fix bfd - notify Sonic for admin-down event
2201 - empty list for supported fec for RJ45 ports
Fix don't disable used tunnel underlay interfaces
SDK fixes
100GbE FCI DAC (10137628-4050LF/HPE PN: 845408-B21) was recognized by mistake as supporting "cable burning' which caused the switch firmware to read page 0x9f (which unsupported in the cable) and to report this cable as having "bad eeprom".
Added remote peer UDP port information in BFD packet event.
After editing an ECMP, the resilient ECMP next-hop counter may not count correctly.
Fixed potential memory leaks in some APIs related to LPM
If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero).
In SN2201: When configuring Force mode, user should configure Speed and FEC on both sides
In Flex Tunnel encapsulation flow, if the encapsulation is with an IPv6 header, the flow label field may not be updated as expected.
In some cases, when changing speed to 400GbE over 8 lanes, the first few packets would be dropped.
In some traffic patterns involving small packets, the PortRcvErrors counter may mistakenly count events of local physical errors due to an internal flow in the hardware that involves link packets.
On Spectrum systems, sometimes during link failure, not all previous firmware indications cleared properly, potentially affecting the next link up attempt.
On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-thought mode, a pipeline might get stuck.
PCI calibration changes from a static to a dynamic mechanism.
SDK debug dump shows "Unknown" Counter in RFC3635 Counter Group.
SDK debug dump shows "Unknown" Counter in the PPCNT Traffic Class Counter Group.
SDK Dump missing column headers in some GC tables may result in difficulty understanding the dump.
SLL configuration is missing in SDK dump.
Spectrum-2 systems, do no support 1GbE on supported 40GbE modules.
When binding a UDP port which is already in use for BFD TX session, the error message appears incorrectly.
When Flex Tunnel was used, Flex Modifier sometimes experienced a brief mis-configuration during ISSU.
When many ports are active (e.g. 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed.
When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU.
While toggling the cable, and the low power mode is set to ON, an unexpected PMPE event error is received.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
get_rx_los and get_tx_fault is not supported via the exisitng interface used, need provide dummy implementation for them.
NOTE: in later releases we will get them back via different interface.
- How I did it
Return False * lane_num for get_rx_los and get_tx_fault
- How to verify it
Added unit test
Why I did it
With continueOnError: true, a failed job returns the result: partiallySuccess, which cause it can't be rerun, since AZP consider it as passed. Then we can't only rerun t0 jobs when it fails.
How I did it
Mark t0 part1 and part2 as continueOnError: false.
How to verify it
The pipeline will verify it.
* Move qsfp eeprom reading to new cached api
* provide reading multiple pages in recursive manner
* workaround with flat memory on cmis
* remove workaround with memory model
* Remove unused imports
Porting sonic_db_dump_load.py from sonic-py-swsssdk to sonic-py-common.
#### Why I did it
sonic-py-swsssdk will be deprecate, so porting sonic_db_dump_load.py to sonic-py-common.
#### How I did it
Copy sonic_db_dump_load.py to sonic-py-common, and fix minor API different.
#### How to verify it
Pass all E2E test.
The platform_tests/test_advanced_reboot.py::test_warm_reboot will cover this script.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Porting sonic_db_dump_load.py from sonic-py-swsssdk to sonic-py-common.
#### Ensure to add label/tag for the feature raised. example - [PR#2174](https://github.com/sonic-net/sonic-utilities/pull/2174) where, Generic Config and Update feature has been labelled as GCU.
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
- Why I did it
Fix a typo in chassis platform API which causes the following error
>>> import sonic_platform as P
>>> c = P.platform.Platform().get_chassis()
>>> sl = c.get_all_sfps()
>>> sl[0].get_lpmode()
Sep 28 07:48:33 INFO LOG: Initializing SX log with STDOUT as output file.
False
>>> del c
Exception ignored in: <function Chassis.__del__ at 0x7f1d166ef8b0>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/sonic_platform/chassis.py", line 126, in __del__
self.sfp_module.deinitialize_sdk_handle(sfp_module.SFP.shared_sdk_handle)
NameError: name 'sfp_module' is not defined
- How I did it
Use self while using the SDK handle
- How to verify it
Manual test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
This reverts commit 9750cb4.
There is a PR to handle 202205 branch revert: #12184
- Why I did it
The PR to be reverted introduced many notice logs every 1 minute if SFP is not plugged:
Cannot get module EEPROM information: Input/output error
Before the "bad" PR, the message format is like this:
INFO pmon#supervisord: xcvrd Cannot get module EEPROM information: Input/output error
It was truncated by rsyslog because every message is the same. However, the "bad" PR introduces SFP index to the message:
NOTICE pmon#xcvrd: Failed to get EEPROM data for sfp 39: Cannot get module EEPROM information: Input/output error
Rsyslog no longer truncate such log and many such messages are flooded to syslog.
- How I did it
Revert the PR
- How to verify it
Manual test
Why I did it
Fix the build unstable issue caused by the kvm 9000 port is not ready to use in 2 seconds.
2022-09-02T10:57:30.8122304Z + /usr/bin/kvm -m 8192 -name onie -boot order=cd,once=d -cdrom target/files/bullseye/onie-recovery-x86_64-kvm_x86_64_4_asic-r0.iso -device e1000,netdev=onienet -netdev user,id=onienet,hostfwd=:0.0.0.0:3041-:22 -vnc 0.0.0.0:0 -vga std -drive file=target/sonic-6asic-vs.img,media=disk,if=virtio,index=0 -drive file=./sonic-installer.img,if=virtio,index=1 -serial telnet:127.0.0.1:9000,server
2022-09-02T10:57:30.8123378Z + sleep 2.0
2022-09-02T10:57:30.8123889Z + '[' -d /proc/284923 ']'
2022-09-02T10:57:30.8124528Z + echo 'to kill kvm: sudo kill 284923'
2022-09-02T10:57:30.8124994Z to kill kvm: sudo kill 284923
2022-09-02T10:57:30.8125362Z + ./install_sonic.py
2022-09-02T10:57:30.8125720Z Trying 127.0.0.1...
2022-09-02T10:57:30.8126041Z telnet: Unable to connect to remote host: Connection refused
How I did it
Waiting more time until the tcp port 9000 is ready, waiting for 60 seconds in maximum.
Why I did it
The python packages azure-kusto-data and azure-kusto-ingest packages for python2 are too old and not really used. The python3 environment has newer version of these packages installed. This change is to deprecate these two packages for python2 in docker-sonic-mgmt image.
How I did it
Removed the lines for installing old version of packages azure-kusto-data and azure-kusto-ingest in python2 in the Dockerfile template.
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Build swss-common with libyang
#### Why I did it
sonic-swss-common lib add dependency to libyang recently, so need update make file before update sonic-swss-common submodule.
#### How I did it
Add dependency to libyang in rules/swss-common.mk
#### How to verify it
Pass all E2E test case.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Add new Redis database PROFILE_DB
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`getstatusoutput()` function from `subprocess` module has shell injection issue because it includes `shell=True` in the implementation
Eliminate duplicate code
#### How I did it
Reimplement `getstatusoutput_noshell()` and `getstatusoutput_noshell_pipe()` functions with `shell=False`
Add `check_output_pipe()` function
#### How to verify it
Pass UT
Improve SSHD config to use more secure settings
#### Why I did it
According to Sonic OS review result, SSHD config file /etc/ssh/sshd_config using insecure settings.
#### How I did it
Change build_debian.sh script to set following settings to /etc/ssh/sshd_config:
ClientAliveInterval is set to 300
MaxAuthTries is set to default of 3
Banner set to /etc/issue
LogLevel is set to VERBOSE
#### How to verify it
Pass all E2E test case.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Improve SSHD config to use more secure settings
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
Why I did it
Some Arista products do not have an SSD but use an eMMC instead.
The SsdUtil plugin is therefore extended to support both.
How I did it
Implemented ssd_util.py platform plugin loaded by ssdutil.
This plugin fallback to the default SONiC implementation if the arista one can't be found.
How to verify it
Run show platform ssdhealth on a product with an eMMC
- Why I did it
As part of Persistent log level HLD , LOGLEVEL_DB content is moved to CONFIG_DB.
In addition, it was decided to remove jinja2_cache which currently appear on LOGLEVEL_DB
This cache was added to speed up template rendering in start scripts. There were a lot of them rendered during system start. This caused a delay in warm boot LAG restore time. It was tested and verified that with and without the cache we don't see any difference in this timing now. It is probably due to a lot of other optimizations done to sonic-cfggen. Since there is no noticeable improvement made by j2 cache now it is safe to remove it.
- How I did it
Remove redis_bcc.py file and and remove the bytcode_cache from sonic-sfggen
- How to verify it
Warm boot was tested with \ without this jinja2_cache and it there is no difference in performance
- Why I did it
interfaces-config service restarts networking service, during the restart loopback interface address is being removed and reassigned back, leaving loopback without an ipv4 address for a while.
On SONiC startup and config reload interfaces-config and bgp services start in parallel and sometimes
fpmsyncd in bgp attempts bind to loopback while it does not have an address, fails with the log
Exception "Cannot assign requested address" had been thrown in daemon
and exits with rc 0.
root@sonic:/# supervisorctl status
fpmsyncd EXITED Jul 20 05:04 AM
zebra RUNNING pid 35, uptime 6:15:05
zsocket EXITED Jul 20 05:04 AM
docker logs bgp
INFO exited: fpmsyncd (exit status 0; expected)
With fpmsyncd dead, configured routes do not appear in the database.
- How I did it
Added ordering dependency on interfaces-config service into bgp.config
- How to verify it
Itself the issue reproduces quite rarely, but one can gain the time interval between networking down and networking up in interfaces-config.sh like this:
diff --git a/files/image_config/interfaces/interfaces-config.sh b/files/image_config/interfaces/interfaces-config.sh
index f6aa4147a..87caceeff 100755
--- a/files/image_config/interfaces/interfaces-config.sh
+++ b/files/image_config/interfaces/interfaces-config.sh
@@ -63,7 +63,11 @@ done
# Read sysctl conf files again
sysctl -p /etc/sysctl.d/90-dhcp6-systcl.conf
-systemctl restart networking
+# systemctl restart networking
+
+systemctl start networking
+sleep 10
+systemctl stop networking
# Clean-up created files
rm -f /tmp/ztp_input.json /tmp/ztp_port_data.json
with this change the issue reproduces on every config reload.
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
Why I did it
Replace configuration parameter for gnmi write, and we will add other gnmi write features in the future.
How I did it
Update rules/config and other Makefile.
How to verify it
Build sonic image.
Why I did it
test_sai_qos failed because of the following error:
"stderr_lines": [
"Traceback (most recent call last):",
" File \"/usr/bin/ptf\", line 522, in <module>",
" test_modules = load_test_modules(config)",
" File \"/usr/bin/ptf\", line 413, in load_test_modules",
" mod = imp.load_module(modname, *imp.find_module(modname, [root]))",
" File \"saitests/switch.py\", line 19, in <module>",
" import switch_sai_thrift",
"ImportError: No module named switch_sai_thrift"
],
It's because test_sai_qos runs ptf script which imports switch_sai_thrift, switch_sai_thrift is installed from python-saithrift_0.9.4_amd64.deb.
For master image, the deb file is for python3, but ptf only has virtual python3 environment, that's why we add --system-site-packages to allow virtual env to access system site-packeges.
Add thrift package in docker ptf virtual python3 env, because currently env-python3 doesn't have thrift module which is needed in switch_sai_thrift.
How I did it
Enable --system-site-packages for virtual py3 env in ptf docker and install thrift for test_qos_sai
How to verify it
load and login ptf conatiner
dpkg - i python-saithrift_0.9.4_amd64.deb
source /root/env-python3/bin/activate
python
import switch_sai_thrift.switch_sai_rpc
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Why I did it
Current isc-dhcp uses below code to remove DHCP option:
memmove(sp, op, op[1] + 2);
sp += op[1] + 2;
sp points to the option to be stripped, we can call it as option S.
op points to the option after options S, we can call it as option O.
DHCP option is a typical type-length-value structure, the first byte is type, the second byte is length, and remain parts are value.
In this case, option O length is bigger than option S, and more than 2 bytes, after the memmove, we will get this result:
Now Option S and Option O are overwritten, op[1] was the length of Option O, and it's modified after memmove.
But current implementation is still using op[1] as length to update sp (sp+=op[1]+2), so we get the wrong sp.
How I did it
Create patch from https://github.com/isc-projects/dhcp
The new impelementation use mlen to store the length of Option O before memmove, that's how it fixed the bug.
size_t mlen = op[1] + 2;
memmove(sp, op, mlen);
sp += mlen;
How to verify it
I have a PR for sonic-mgmt to cover this issue:
sonic-net/sonic-mgmt#6330
Signed-off-by: Gang Lv ganglv@microsoft.com
* Make client indentity by AME cert
* Join k8s cluster by ipv6
* Change join test cases
* Test case bug fix
* Improve read node label func
* Configure kubelet and change test cases
* For kubernetes version 1.22.2
* Fix undefine issue
Signed-off-by: Yun Li <yunli1@microsoft.com>
Multi-asic Docker instances are created behind Docker's default bridge
which doesn't allow talking to other Docker instances that are in the
host network (like database-chassis).
On linecards, we configure midplane interfaces to let per-asic docker
containers talk to CHASSIS_DB on the supervisor through internal chassis
network.
On the supervisor we don't need to use chassis internal network, but we
still need a similar setup in order to allow fabric containers to talk
to database-chassis
- Why I did it
Update SDK/FW version - 4.5.2320/2010_2320 in order to have the following fixes:
• Spectrum-3 | PCI calibration changes from a static to a dynamic mechanism.
• [VxLAN] TTL was set to 0 for non IP traffic (such as ARP)
- How I did it
Update pointer for the SDK/FW
- How to verify it
Run regression tests
- Why I did it
ethtool print error logs when EEPROM of a SFP is not available. It prints error like this:
INFO pmon#/supervisord: xcvrd Cannot get module EEPROM information: Input/output error
INFO pmon#/supervisord: xcvrd Cannot get Module EEPROM data: Invalid argument
However, this log does not contain the relevant SFP index which is hard for developer/qa to find the exactly SFP.
- How I did it
Redirect ethtool stderr to subprocess and log it better
- How to verify it
Manual test
Why I did it
Replace unsafe functions in iccpd
How I did it
Replace memset() by zero initialization
Replace strtok() by strtok_r()
Signed-off-by: maipbui <maibui@microsoft.com>
*The sonic-frr was upgraded to FRR 8.2.2 as part of PR #10691. However, sonic-frr/frr submodule was still referring to previous 7.5 version. Update the sonic-frr/frr submodule to 8.2.2 commit id. Fixes issue #11484.
* Adding support for get/set low pwer mode for QSFPs in PDDF common APIs
* Adding support for get/set low pwer mode for QSFPs in PDDF common APIs - Review comments