Backport of #15961
Why I did it
Added the fwtrace config files in order to be able to call mlxstrace utility during show techsupport dump.
Work item tracking
Microsoft ADO (number only):
How I did it
Added fwtrace config files. Added path to these files to sai.profile for each mlnx device.
How to verify it
Execute the show techsupport command and check if mlxstrace output is in system dump.
This is to backport #16096
Why I did it
SONiC changes:
Support Spectrum4 ASIC FW binary building.
Support new SDK sx-obj-desc lib building since new SAI need it.
Remove SX_SCEW debian package from Mellanox SDK build since we are no longer using it (we use libxml2 instead).
Update SAI, SDK, FW to version 4.6.1020/2012.1020/SAIBuild2211.25.1.0
SDK/FW bug fixes
In SPC-1 platforms: Fastboot mode is not operational for Split port with Force mode in 50G speed
SFP modules are kept in disabled state after set LPM (low power mode) on/off for at least 3 minutes.
When preforming fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.
SDK/FW Features
On SN2700 all ports can support y cable by credo
SAI bug Fixes
When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 will remain programmed for the rule Issue resolved after the fix
Allow the max scale of virtual routers to be configure for SPC-1, SPC-2, SPC-3 when fastboot enable
Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE
SAI features
Port init profile
Dual ToR Active-Standby | Additional MAC support
Work item tracking
Microsoft ADO (number only):
How I did it
Update SDK/FW/SAI make files
How to verify it
Run full sonic-mgmt regression on Mellanox platform
Backport #15728
Why I did it
To optimize Mellanox platform SAI build
Work item tracking
Microsoft ADO (number only):
How I did it
SAI debs are now downloaded as Spectrum-SDK-Drivers-SONiC-Bins release.
How to verify it
Configure/build for Mellanox platform, check the image and ensure that correct SAI debs are included.
- Why I did it
Add the patchwork link to the commit description for non-upstream patches if present
- How I did it
Parse the patchwork/<patch_name>.txt file from hw-mgmt
Why I did it
SDK patches for iproute2 were added to SONiC tree as a temporary solution.
Now that SDK with the patches is available, I have removed the patches from SONiC tree and we consume them from SDK github during compilation.
How I did it
During build we download SDK iproute2 patches from SDK github (or from the URL provided by user if compiling SDK from sources) and apply them before compilation.
How to verify it
Compile and load on switch, verify interfaces network devices created successfully.
Verify LLDP shows connections to neighbors.
Verify ping between 2 hosts over 2 router ports is successful.
Manual Cherrypick of #15260
Why I did it
Bug fix:
I2C bus is stuck - Unable to probe I2C bus 2-0048, which causes /var/run/hw-management/config/sfp_counter, module_counter to be zero and pmon docker unable to start.
Work item tracking
Microsoft ADO (number only):
How I did it
Update HW-MGMT package version in the make file
Update HW-MGMT submodule pointer
How to verify it
run full sonic-mgmt regression
- Why I did it
Mellanox syncd container will be based on Debian iproute2 plus patches instead of Nvidia internal version of iproute2
- How I did it
Download iproute2 from Debian repository, apply patches and compile to create a new target.
The target is then deployed in syncd container of Mellanox switches only.
The new target is called IPROUTE2_MLNX.
- How to verify it
Compile and load on switch, verify interfaces network devices created successfully.
Verify LLDP shows connections to neighbors.
Verify ping between 2 hosts over 2 router ports is successful.
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1
- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK
- How to verify it
Manual testing
- Why I did it
To include latest fixes:
Fix traffic loss on all routed traffic when moving from 4.4.3372/XX_2008_3388 to 4.5.4118-012/XX_2010_4120-010. Issue occurred after ISSU process in Spectrum 1 only, When upgrading from older version to a new one. Neighbor entries are overwritten.
Fix When using mirror session policer on SPC2/3, the actual CIR was 1.28 times more than the configured CIR value.
Fix Creation of router interface of type bridge may occasionally fail if create is performed immediately after delete.
Fix False errors during SDK deinitialization may be seen in the syslog
- How I did it
Updated SDK submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
- Why I did it
Sometimes Nvidia watchdog device isn't ready when watchdog-control service is up after first installation from ONIE
need to delay watchdog control service to go up after hw-mgmt which gets devices up and ready
- How I did it
Delay Nvidia watchdog-control service before hw-mgmt has started on Mellanox platform in order to avoid missing or not ready watchdog device.
- How to verify it
verification test of ONIE installation of image in a loop
making sure watchdog service is always up (not failed) after first installation from ONIE
Fixes#13568
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:
admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.
- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.
- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
Which release branch to backport (provide reason below if selected)
- Why I did it
On Mellanox platform, system EEPROM is a soft link provided by hw-management. There is chance that config-setup service accessing the EEPROM before hw-management creating it. It causes errors. The PR is aim to fix it.
- How I did it
Waiting EEPROM creation in platform API up to 10 seconds.
- How to verify it
Manual test
- Why I did it
Advance hw-mgmt service to V.7.0020.4100
Add missing thermal sensors that are supported by hw-mgmt package
Delay system health service before hw-mgmt has started on Mellanox platform in order to avoid reading some sensors before ready.
Depends on sonic-net/sonic-linux-kernel#305
- How I did it
1. Update hw mgmt version
2. Add missing sensors
3. Delay service
- How to verify it
Regression test.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Support per PSU slope value for PSU power threshold according to hardware team requirement
- How I did it
Pass the PSU number as a parameter when fetching the slope value of PSU.
- How to verify it
Running regression and manual test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker
- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin
- How to verify it
Manual UT and new sonic-mgmt tests
- Why I did it
To include latest fixes and new functionality
SDK/FW
1. Fixed bug in recovery mechanism in case of I2C error when trying to access the XSFP module.
2. On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-thought mode, a pipeline might get stuck.
3. On the Spectrum-2 and Spectrum-3 switch, if you enable ECN marking and the port is in split mode, traffic sent to the port under congestion (for example, when connecting two ports with a total speed of 50GbE to a single 25GbE port) is not marked.
4. Modifying existing entry/Adding new one when switch is at its maximum capacity (full by maximum allowed entries from any type such as routes, FDB, and so forth), will fail with an error.
5. When many ports are active (e.g., 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
6. When a system has more than 256 ACL rules, on rare occasion, removing/adding rules may cause some ACL rules not to work.
7. On SN2201 system, on RJ45 port, the link might appear in 'down' state even if it operations properly.
8. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event.
9. When setting LAG as a SPAN analyzer, the distributor mode of the LAG members was not taken into account. It may happen that the LAG member with distributor mode disabled will be set as a SPAN analyzer port.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
Add script usage and more information to script description being printed in help option.
- Why I did it
Missing information in script description in help option.
- How I did it
Expand script description and add script usage.
- How to verify it
Run the script with -h option.
- Why I did it
In order to prevent the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py test failing on the log analyzer step.
The mentioned test is performing the sfputil reset EthernetX for every interface on the SONiC switch, this action will flap the SFP device status (INSTERTED -> REMOVED -> INSTERTED).
The SONiC XCVRD daemon will catch this SFP device status change (because it is monitoring the presence status of the cable).
To judge the cable presence status, currently, we are still leveraging to read the first bytes of the EEPROM, and the EEPROM could be not ready at some moment and the SONiC XCVRD daemon will print the error log to Syslog:
ERR pmon#xcvrd: Error! Unable to read data for 'xx' port, page 'xx' offset 128, rc = 1, err msg: Sending access register
- How I did it
Change logging severity from ERR to WARNING
- How to verify it
Run the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py
OR much faster way to run the next script on the switch:
#!/bin/bash
START=0
END=248
for (( intf=$START; intf<=$END; intf+=8))
do
sfputil reset Ethernet"${intf}"
done
sfputil show presence
- Why I did it
Remove TODO comments which are no longer needed
- How I did it
Remove TODO comments which are no longer needed
- How to verify it
Only comment change
- Why I did it
Following code to judge whether a process is running inside a docker could get stuck on the simx platform
subprocess.Popen(["docker", "--version"],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
universal_newlines=True)
When it gets stuck, the config-chassisdb service can not be successfully started, thus the system can not be booted up.
root@sonic:/# service config-chassisdb status
config-chassisdb.service - Config chassis_db
Loaded: loaded (/lib/systemd/system/config-chassisdb.service; enabled; vendor preset: enabled)
Active: activating (start) since Thu 2022-12-15 09:23:02 UTC; 29min ago
Main PID: 571 (config-chassisd)
Tasks: 14 (limit: 9501)
Memory: 132.4M
CGroup: /system.slice/config-chassisdb.service
├─571 /bin/bash /usr/bin/config-chassisdb
├─575 /usr/bin/python3 /usr/local/bin/sonic-cfggen -H -v DEVICE_METADATA.localhost.platform
├─602 /bin/sh -c sudo decode-syseeprom -m
├─603 sudo decode-syseeprom -m
├─607 /usr/bin/python3 /usr/local/bin/decode-syseeprom -m
├─616 /bin/sh -c docker --version 2>/dev/null
└─617 docker --version
- How I did it
Use an alternative way to implement this function and issue can be avoided:
docker_env_file = '/.dockerenv'
return os.path.exists(docker_env_file) is False
- How to verify it
run regression on real hardware and simx platform.
- Why I did it
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.
- How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.
- How to verify it
a. Manual test:
- Perform a power loss
- Perform a warm/fast reboot
- Check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Backport of https://github.com/sonic-net/sonic-buildimage/pull/12490 into 202211
- Why I did it
Support syslog rate limit configuration feature
- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration
- How to verify it
Manual test
New sonic-mgmt regression cases