- Why I did it
Sometimes Nvidia watchdog device isn't ready when watchdog-control service is up after first installation from ONIE
need to delay watchdog control service to go up after hw-mgmt which gets devices up and ready
- How I did it
Delay Nvidia watchdog-control service before hw-mgmt has started on Mellanox platform in order to avoid missing or not ready watchdog device.
- How to verify it
verification test of ONIE installation of image in a loop
making sure watchdog service is always up (not failed) after first installation from ONIE
- Why I did it
To include latest fixes:
Fix traffic loss on all routed traffic when moving from 4.4.3372/XX_2008_3388 to 4.5.4118-012/XX_2010_4120-010. Issue occurred after ISSU process in Spectrum 1 only, When upgrading from older version to a new one. Neighbor entries are overwritten.
Fix When using mirror session policer on SPC2/3, the actual CIR was 1.28 times more than the configured CIR value.
Fix Creation of router interface of type bridge may occasionally fail if create is performed immediately after delete.
Fix False errors during SDK deinitialization may be seen in the syslog
- How I did it
Updated SDK submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1
- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK
- How to verify it
Manual testing
Fixes#13568
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:
admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.
- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.
- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
Which release branch to backport (provide reason below if selected)
- Why I did it
To optimize Mellanox platform build
- How I did it
sdk debs are now downloaded as Spectrum-SDK-Drivers-SONiC-Bins release
sx kernel is downloaded as zip from Spectrum-SDK-Drivers
- How to verify it
configure/build for Mellanox platform
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
- Why I did it
FW for Spectrum-4 ASIC not yet available
- How I did it
Remove in Mellanox fw make files to Spectrum-4 ASIC firmware binaries.
Remove from firmware upgrade scripts to be able Spectrum-4 ASIC.
- How to verify it
Run regression test
- Why I did it
On Mellanox platform, system EEPROM is a soft link provided by hw-management. There is chance that config-setup service accessing the EEPROM before hw-management creating it. It causes errors. The PR is aim to fix it.
- How I did it
Waiting EEPROM creation in platform API up to 10 seconds.
- How to verify it
Manual test
- Why I did it
Add support for systems 4600/4600C/2201 that are using sonic interface names aligned to 4 instead of 8 (which is the max number of lanes per port).
Improve DB access calls, now we use Python library functions.
- How I did it
Use addition information taken from Config DB in order to create map from SDK logical index to sonic interface name.
- How to verify it
Run ECMP calculator on 4600, 4600C and 2201 platforms.
- Why I did it
sfp_event.py gets a PMPE message when a cable event is available. In PMPE message, there is no label port available. Current sfp_event.py is using sx_api_port_device_get to get 64 logical ports attributes, and find the label port from those 64 attributes. However, if there are more than 64 ports, sfp_event.py might not be able to find the label port and drop the PMPE message.
- How I did it
Don't use hardcoded 64, get logical port number instead.
- How to verify it
Manual test
- Why I did it
Add PYTHON3_SWSSCOMMON as build time dependency to Mellanox platform API to avoid issue like:
19:34:11 ImportError while loading conftest '/sonic/platform/mellanox/mlnx-platform-api/tests/conftest.py'.
19:34:11 tests/conftest.py:28: in <module>
19:34:11 from sonic_platform import utils
19:34:11 sonic_platform/__init__.py:18: in <module>
19:34:11 from sonic_platform import *
19:34:11 sonic_platform/platform.py:28: in <module>
19:34:11 raise ImportError(str(e) + "- required module not found")
19:34:11 E ImportError: No module named 'swsscommon'- required module not found- required module not found
19:34:11 [ FAIL LOG END ] [ target/python-wheels/bullseye/mlnx_platform_api-1.0-py3-none-any.whl ]
The issue only happens when calling below command:
make target/python-wheels/bullseye/mlnx_platform_api-1.0-py3-none-any.whl
- How I did it
Add PYTHON3_SWSSCOMMON as build time dependency to Mellanox platform API
- How to verify it
Run build
- Why I did it
Add non-upstream kernel patches for the Nvidia platforms
These patches are not yet upstream but needed for new technology.
A flow to upstream them is in progress and once they will be approved they will be moved officially to sonic-linux-kernel.
Till then to include them in the build (not must) the build option INCLUDE_EXTERNAL_PATCH_TAR=y should be included
- How I did it
Zip all the patches in to a tar.gz tarball.
- How to verify it
Manually test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Currently, when building MFT, it can only download the source code from the official download site: http://www.mellanox.com/downloads/MFT/, it's not possible to integrate an internal version that has not been officially released yet.
The intention of this PR is to make it possible to download the source code from any valid link.
- How I did it
Add a new parameter "MLNX_MFT_INTERNAL_SOURCE_BASE_URL", if an URL is given, it will download the source code from the given URL, otherwise, it downloads from the default official site.
- How to verify it
Specify a valid URL in the make file, the MFT debs should be built successfully.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
Support per PSU slope value for PSU power threshold according to hardware team requirement
- How I did it
Pass the PSU number as a parameter when fetching the slope value of PSU.
- How to verify it
Running regression and manual test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.
To fix issue #13591.
- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected
- How to verify it
Manual test
- Why I did it
Advance hw-mgmt service to V.7.0020.4100
Add missing thermal sensors that are supported by hw-mgmt package
Delay system health service before hw-mgmt has started on Mellanox platform in order to avoid reading some sensors before ready.
Depends on sonic-net/sonic-linux-kernel#305
- How I did it
1. Update hw mgmt version
2. Add missing sensors
3. Delay service
- How to verify it
Regression test.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker
- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin
- How to verify it
Manual UT and new sonic-mgmt tests
- Why I did it
To include latest fixes and new functionality
SDK/FW
1. Fixed bug in recovery mechanism in case of I2C error when trying to access the XSFP module.
2. On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-thought mode, a pipeline might get stuck.
3. On the Spectrum-2 and Spectrum-3 switch, if you enable ECN marking and the port is in split mode, traffic sent to the port under congestion (for example, when connecting two ports with a total speed of 50GbE to a single 25GbE port) is not marked.
4. Modifying existing entry/Adding new one when switch is at its maximum capacity (full by maximum allowed entries from any type such as routes, FDB, and so forth), will fail with an error.
5. When many ports are active (e.g., 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
6. When a system has more than 256 ACL rules, on rare occasion, removing/adding rules may cause some ACL rules not to work.
7. On SN2201 system, on RJ45 port, the link might appear in 'down' state even if it operations properly.
8. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event.
9. When setting LAG as a SPAN analyzer, the distributor mode of the LAG members was not taken into account. It may happen that the LAG member with distributor mode disabled will be set as a SPAN analyzer port.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
To improve ASIC FW upgrade logging and have information about the cause of FW update failure in the log.
- How I did it
Added syslog logger support
In case the FW update has failed the update tool will give the cause of the failure in the output in the last line, starting with "Fail".
When running the tool, in case of a failed update, we will parse the output to retrieve the cause and log it.
Device #1:
----------
Device Type: ConnectX6DX
Part Number: MCX623106AN-CDA_Ax
Description: ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0/3.0 x16;
PSID: MT_0000000359
PCI Device Name: /dev/mst/mt4125_pciconf0
Base GUID: 0c42a103007d22d4
Base MAC: 0c42a17d22d4
Versions: Current Available
FW 22.32.0498 22.32.0498
PXE 3.6.0500 3.6.0500
UEFI 14.25.0015 14.25.0015
Status: Forced update required
---------
Found 1 device(s) requiring firmware update...
Device #1: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Fail : The Digest in the signature is wrong
- How to verify it
mlnx-fw-upgrade.sh --upgrade
Add script usage and more information to script description being printed in help option.
- Why I did it
Missing information in script description in help option.
- How I did it
Expand script description and add script usage.
- How to verify it
Run the script with -h option.
- Why I did it
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.
- How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.
- How to verify it
a. Manual test:
- Perform a power loss
- Perform a warm/fast reboot
- Check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
Support syslog rate limit configuration feature
- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration
- How to verify it
Manual test
New sonic-mgmt regression cases
- Why I did it
In order to prevent the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py test failing on the log analyzer step.
The mentioned test is performing the sfputil reset EthernetX for every interface on the SONiC switch, this action will flap the SFP device status (INSTERTED -> REMOVED -> INSTERTED).
The SONiC XCVRD daemon will catch this SFP device status change (because it is monitoring the presence status of the cable).
To judge the cable presence status, currently, we are still leveraging to read the first bytes of the EEPROM, and the EEPROM could be not ready at some moment and the SONiC XCVRD daemon will print the error log to Syslog:
ERR pmon#xcvrd: Error! Unable to read data for 'xx' port, page 'xx' offset 128, rc = 1, err msg: Sending access register
- How I did it
Change logging severity from ERR to WARNING
- How to verify it
Run the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py
OR much faster way to run the next script on the switch:
#!/bin/bash
START=0
END=248
for (( intf=$START; intf<=$END; intf+=8))
do
sfputil reset Ethernet"${intf}"
done
sfputil show presence
- Why I did it
Following code to judge whether a process is running inside a docker could get stuck on the simx platform
subprocess.Popen(["docker", "--version"],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
universal_newlines=True)
When it gets stuck, the config-chassisdb service can not be successfully started, thus the system can not be booted up.
root@sonic:/# service config-chassisdb status
config-chassisdb.service - Config chassis_db
Loaded: loaded (/lib/systemd/system/config-chassisdb.service; enabled; vendor preset: enabled)
Active: activating (start) since Thu 2022-12-15 09:23:02 UTC; 29min ago
Main PID: 571 (config-chassisd)
Tasks: 14 (limit: 9501)
Memory: 132.4M
CGroup: /system.slice/config-chassisdb.service
├─571 /bin/bash /usr/bin/config-chassisdb
├─575 /usr/bin/python3 /usr/local/bin/sonic-cfggen -H -v DEVICE_METADATA.localhost.platform
├─602 /bin/sh -c sudo decode-syseeprom -m
├─603 sudo decode-syseeprom -m
├─607 /usr/bin/python3 /usr/local/bin/decode-syseeprom -m
├─616 /bin/sh -c docker --version 2>/dev/null
└─617 docker --version
- How I did it
Use an alternative way to implement this function and issue can be avoided:
docker_env_file = '/.dockerenv'
return os.path.exists(docker_env_file) is False
- How to verify it
run regression on real hardware and simx platform.
Why I did it
Update ECMP calculator README file with new instructions how to run the calculator.
How I did it
Update README file.
How to verify it
Read README file.
- Why I did it
Remove TODO comments which are no longer needed
- How I did it
Remove TODO comments which are no longer needed
- How to verify it
Only comment change
- Why I did it
Added ECMP calculator tool.
- How I did it
New files were added.
- How to verify it
Manual tests performed according to tests chapter in HLD
Automated tests will be added by verification.