Commit Graph

545 Commits

Author SHA1 Message Date
Vivek
f49ae28948 [Mellanox] Fix the hw-mgmt intg tool case sensitivity for KConfig (#14709)
Fix the script to consider case sensitivity while writing the kconfig

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2023-05-06 12:32:19 +08:00
mssonicbld
10e635be93
[Mellanox] Facilitate automatic integration of new hw-mgmt (#14594) (#14966) 2023-05-06 09:08:54 +08:00
Lior Avramov
d7d8d7754d
[Mellanox] [202211] Replace iproute2 supplied by SDK to iproute2 downloaded from Debian repository (#14726) (#14724)
- Why I did it
Mellanox syncd container will be based on Debian iproute2 plus patches instead of Nvidia internal version of iproute2

- How I did it
Download iproute2 from Debian repository, apply patches and compile to create a new target.
The target is then deployed in syncd container of Mellanox switches only.
The new target is called IPROUTE2_MLNX.

- How to verify it
Compile and load on switch, verify interfaces network devices created successfully.
Verify LLDP shows connections to neighbors.
Verify ping between 2 hosts over 2 router ports is successful.
2023-05-02 10:29:02 +03:00
mssonicbld
6781c4a4fb
Made non-upstream patch design order aware (#14434) (#14650) 2023-04-14 03:29:35 +08:00
Sudharsan Dhamal Gopalarathnam
156189dbad [Mellanox]Fix lpmode set when logical port is larger than 64 (#14138)
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more ports than this, the SDK APIs fail with syslog errors such as the following:

Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1

- How I did it
Remove the hardcoded value of 64 and obtain the number of logical ports from the SDK.

- How to verify it
Manual testing
2023-03-19 20:50:58 +08:00
Dror Prital
ba14f728de Update SDK/FW to version 4.5.4206/4.5.4204 (#14164)
- Why I did it
To include latest fixes:

Fix traffic loss on all routed traffic when moving from 4.4.3372/XX_2008_3388 to 4.5.4118-012/XX_2010_4120-010. The issue occurred after the ISSU process on Spectrum 1 only, when upgrading from an older version to a new one. Neighbor entries are overwritten.
Fix When using mirror session policer on SPC2/3, the actual CIR was 1.28 times more than the configured CIR value.
Fix Creation of router interface of type bridge may occasionally fail if create is performed immediately after delete.
Fix False errors during SDK deinitialization may be seen in the syslog

- How I did it
Updated SDK submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".
2023-03-19 20:50:49 +08:00
dbarashinvd
d7ba89a95b [Mellanox] fix for watchdog device not found, adding dependency on hw-management (#14182)
- Why I did it
Sometimes the Nvidia watchdog device isn't ready when the watchdog-control service comes up after first installation from ONIE.
The watchdog-control service needs to be delayed so that it starts after hw-mgmt, which brings the devices up and ready.

- How I did it
Delay the Nvidia watchdog-control service until hw-mgmt has started on Mellanox platform, in order to avoid a missing or not-ready watchdog device.

- How to verify it
Run ONIE installation of the image in a loop,
making sure the watchdog service is always up (not failed) after first installation from ONIE.
2023-03-19 20:50:44 +08:00
Volodymyr Samotiy
cc5ed4b632 [Mellanox] Update MFT to 4.22.1-15 (#14133)
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2023-03-19 18:33:57 +08:00
Stepan Blyshchak
969166d769 [Mellanox] Place FW binaries under platform directory instead of squashfs (#13837)
Fixes #13568

Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:

admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb  8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa

- Why I did it
202211 and above use a different squashfs compression type that the 201911 kernel cannot handle. Therefore, we avoid mounting squashfs altogether with this change.

- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer the /host/image-/platform/mlnx location and fall back to /etc/mlnx in squashfs in case the new location does not exist. This is necessary for image downgrade.
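
The lookup order described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the real logic lives in the shell script mlnx-fw-upgrade.sh, and the function name `resolve_fw_path` and its parameters are hypothetical.

```python
import os


def resolve_fw_path(fw_name, platform_dir, squashfs_dir):
    """Return the FW binary path, preferring the platform directory.

    platform_dir and squashfs_dir are hypothetical parameters standing in
    for the new platform location and the legacy /etc/mlnx location.
    """
    preferred = os.path.join(platform_dir, fw_name)
    if os.path.exists(preferred):
        return preferred
    # Fall back to the legacy location when the new one does not exist,
    # which is the case the commit describes for image downgrade.
    return os.path.join(squashfs_dir, fw_name)
```

The fallback path is returned without checking that it exists, so a caller would still need to handle a missing file.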

- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
2023-03-08 13:50:18 +08:00
mssonicbld
aea96da04d
[Mellanox] Fix issue: cannot find label port for logical port when logical port number is larger than 64 (#13710) (#13962) 2023-03-06 16:47:31 +08:00
mssonicbld
1757f53290
[Mellanox] update sdk/fw build procedure (#14025) (#14059) 2023-03-03 02:43:19 +08:00
mssonicbld
18bc044179
Remove support to Mellanox SPC4 ASIC (#13932) (#13957) 2023-02-23 22:22:35 +08:00
mssonicbld
310827c26c
Add PYTHON3_SWSSCOMMON as build time dependency to Mellanox platform API (#13847) (#13959) 2023-02-23 20:32:15 +08:00
mssonicbld
50aaf92590
[Mellanox] Non upstream patches for hw-mgmt V.4.0020.4104 (#13792) (#13960) 2023-02-23 20:32:09 +08:00
Junchao-Mellanox
e8789a2e11 [Mellanox] Check system eeprom existence in a retry manner (#13884)
- Why I did it
On Mellanox platform, system EEPROM is a soft link provided by hw-management. There is a chance that the config-setup service accesses the EEPROM before hw-management creates it, which causes errors. This PR aims to fix it.

- How I did it
Wait for EEPROM creation in the platform API for up to 10 seconds.
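
A retry loop of this kind might look like the sketch below. It is not the platform API code: the helper name `wait_for_path`, the polling interval, and the return convention are assumptions made for illustration; only the 10 second limit comes from the description above.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.1):
    """Poll until path exists or timeout seconds elapse.

    Returns True if the path appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Because `os.path.exists` follows symbolic links, a soft link whose target has not been created yet is still reported as missing, which suits an EEPROM exposed as a soft link.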

- How to verify it
Manual test
2023-02-23 20:31:29 +08:00
mssonicbld
6a12ca9332
[Mellanox] [ECMP calculator] Add support for 4600/4600C/2201 platforms with different interface naming method (#13814) (#13931) 2023-02-22 22:14:09 +08:00
Stephen Sun
b0416a5c2c [Mellanox] Advance hw-mgmt to v.7.0020.4104 (#13372)
- Why I did it
Advance hw-mgmt service to V.7.0020.4100
Add missing thermal sensors that are supported by hw-mgmt package
Delay system health service until hw-mgmt has started on Mellanox platform in order to avoid reading some sensors before they are ready.
Depends on sonic-net/sonic-linux-kernel#305

- How I did it
1. Update hw mgmt version
2. Add missing sensors
3. Delay service 

- How to verify it
Regression test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-20 14:38:53 +08:00
Stephen Sun
4f3b649f8e [Mellanox] Support per PSU slope value for PSU power threshold (#13757)
- Why I did it
Support per PSU slope value for PSU power threshold according to hardware team requirement

- How I did it
Pass the PSU number as a parameter when fetching the slope value of PSU.

- How to verify it
Running regression and manual test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-20 12:38:20 +08:00
Sudharsan Dhamal Gopalarathnam
a993fc205f [Mellanox][sai_failure_dump]Added platform specific script to be invoked during SAI failure dump (#13533)
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker

- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin

- How to verify it
Manual UT and new sonic-mgmt tests
2023-02-18 06:34:29 +08:00
mssonicbld
94e59a841e
[Mellanox] Enhance MFT make file to download source code from any valid URL (#13801) (#13868) 2023-02-18 02:14:00 +08:00
Volodymyr Samotiy
e849455742 [Mellanox] Update SDK/FW to 4.5.4150/2010.4150 (#13480)
- Why I did it
To include latest fixes and new functionality

SDK/FW
1. Fixed bug in recovery mechanism in case of I2C error when trying to access the XSFP module.
2. On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-through mode, a pipeline might get stuck.
3. On the Spectrum-2 and Spectrum-3 switch, if you enable ECN marking and the port is in split mode, traffic sent to the port under congestion (for example, when connecting two ports with a total speed of 50GbE to a single 25GbE port) is not marked.
4. Modifying existing entry/Adding new one when switch is at its maximum capacity (full by maximum allowed entries from any type such as routes, FDB, and so forth), will fail with an error.
5. When many ports are active (e.g., 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
6. When a system has more than 256 ACL rules, on rare occasion, removing/adding rules may cause some ACL rules not to work.
7. On SN2201 system, on RJ45 port, the link might appear in 'down' state even if it operates properly.
8. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event.
9. When setting LAG as a SPAN analyzer, the distributor mode of the LAG members was not taken into account. It may happen that the LAG member with distributor mode disabled will be set as a SPAN analyzer port.

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2023-02-16 18:36:43 +08:00
Lior Avramov
e6b1ed366b [Mellanox] [ECMP calculator] Add script usage and more information to script description in help option (#13493)
Add script usage and more information to script description being printed in help option.

- Why I did it
Missing information in script description in help option.

- How I did it
Expand script description and add script usage.

- How to verify it
Run the script with -h option.
2023-02-16 18:36:36 +08:00
mssonicbld
8832ddd60b
[Mellanox] Improve FW upgrade logging (#13465) (#13681) 2023-02-12 23:53:33 +08:00
Vadym Hlushko
3530fdbea1 [SFP] Change logging severity when failed to read EEPROM (#13011)
- Why I did it
In order to prevent the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py test failing on the log analyzer step.

The mentioned test performs sfputil reset EthernetX for every interface on the SONiC switch; this action flaps the SFP device status (INSERTED -> REMOVED -> INSERTED).

The SONiC XCVRD daemon will catch this SFP device status change (because it is monitoring the presence status of the cable).
To judge the cable presence status, we currently still rely on reading the first bytes of the EEPROM. The EEPROM may not be ready at that moment, in which case the SONiC XCVRD daemon prints the following error log to syslog:

ERR pmon#xcvrd: Error! Unable to read data for 'xx' port, page 'xx' offset 128, rc = 1, err msg: Sending access register

- How I did it
Change logging severity from ERR to WARNING

- How to verify it
Run the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py

Or, much faster, run the following script on the switch:

#!/bin/bash

START=0
END=248

for (( intf=$START; intf<=$END; intf+=8))
do
    sfputil reset Ethernet"${intf}"
done

sfputil show presence
2023-02-04 02:36:51 +08:00
Junchao-Mellanox
cf6f31b215 [Mellanox] Remove TODO comments which are no longer needed (#13023)
- Why I did it
Remove TODO comments which are no longer needed

- How I did it
Remove TODO comments which are no longer needed

- How to verify it
Only comment change
2023-02-04 02:36:47 +08:00
Kebo Liu
9680479661 [Mellanox] change the implementation of is_host() to fix a stuck issue on simx platform (#13100)
- Why I did it
The following code, used to judge whether a process is running inside a docker, could get stuck on the simx platform:

subprocess.Popen(["docker", "--version"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
When it gets stuck, the config-chassisdb service can not be successfully started, thus the system can not be booted up.

root@sonic:/# service config-chassisdb status
     config-chassisdb.service - Config chassis_db
     Loaded: loaded (/lib/systemd/system/config-chassisdb.service; enabled; vendor preset: enabled)
     Active: activating (start) since Thu 2022-12-15 09:23:02 UTC; 29min ago
   Main PID: 571 (config-chassisd)
      Tasks: 14 (limit: 9501)
     Memory: 132.4M
     CGroup: /system.slice/config-chassisdb.service
                        ├─571 /bin/bash /usr/bin/config-chassisdb
			├─575 /usr/bin/python3 /usr/local/bin/sonic-cfggen -H -v DEVICE_METADATA.localhost.platform
			├─602 /bin/sh -c sudo decode-syseeprom -m
			├─603 sudo decode-syseeprom -m
			├─607 /usr/bin/python3 /usr/local/bin/decode-syseeprom -m
			├─616 /bin/sh -c docker --version 2>/dev/null
			└─617 docker --version

- How I did it
Use an alternative way to implement this function so that the issue is avoided:

docker_env_file = '/.dockerenv'
return os.path.exists(docker_env_file) is False
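
As a self-contained sketch, the two lines above could be wrapped as shown below. The function signature with a default argument is an assumption added so the check can be exercised with other paths; the actual function in the platform code may differ.

```python
import os


def is_host(docker_env_file='/.dockerenv'):
    """Return True when the process is not running inside a Docker container.

    Docker places a /.dockerenv file inside containers, so the absence of
    that file is taken to mean the host. No subprocess is spawned, so the
    check cannot hang the way `docker --version` did.
    """
    return not os.path.exists(docker_env_file)
```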

- How to verify it
run regression on real hardware and simx platform.
2023-02-04 02:36:43 +08:00
Kebo Liu
ab54549d53 [Mellanox] Skip the leftover hardware reboot cause in case of last boot is warm/fast reboot (#13246)
- Why I did it
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To avoid confusing the reboot cause determination logic, the leftover hardware reboot cause shall be skipped by the platform API, which will return 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.

- How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot; if so, skip checking the leftover hardware reboot cause.
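
A check of this kind could be sketched as follows. The key name `SONIC_BOOT_TYPE` and the accepted values are assumptions made for illustration, as are both function names; the commit message does not state which cmdline token the platform API inspects.

```python
def is_warm_or_fast_reboot(cmdline, key="SONIC_BOOT_TYPE"):
    """Return True if the kernel command line marks a warm or fast reboot.

    The key name and the accepted values are hypothetical.
    """
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == key:
            return value in ("warm", "fast", "fast-reboot")
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the current boot."""
    with open(path) as f:
        return f.read()
```

A caller would pass `read_cmdline()` to `is_warm_or_fast_reboot` and, on True, skip reading the leftover hardware reboot cause.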

- How to verify it
a. Manual test:
    - Perform a power loss
    - Perform a warm/fast reboot
    - Check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-01-31 18:34:36 +08:00
Junchao-Mellanox
e631f426f4
[infra] Support syslog rate limit configuration (#12490) (#13535)
Backport of https://github.com/sonic-net/sonic-buildimage/pull/12490 into 202211

- Why I did it
Support syslog rate limit configuration feature

- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration

- How to verify it
Manual test
New sonic-mgmt regression cases
2023-01-30 20:11:44 +02:00
Dror Prital
d12c3b79bc
[202211][Mellanox] Add ASIC simulation version tag to fw.mk (#13473)
Signed-off-by: dprital <drorp@nvidia.com>
2023-01-23 13:28:19 +02:00
mssonicbld
1dc71aa4ff
[Mellanox] Update ECMP calculator README (#13051) (#13362) 2023-01-14 11:46:42 +08:00
mssonicbld
1e522ff3a9
Add ECMP calculator tool (#12482) (#13301) 2023-01-09 00:48:56 +08:00
Kebo Liu
28f8da80ea [Mellanox] Add support to Mellanox Spectrum-4 ASIC Firmware compiling and upgrade (#12844)
- Why I did it
Add support for compiling Spectrum-4 ASIC firmware to the SONiC image
Add support for Spectrum-4 ASIC firmware upgrade

- How I did it
Update Mellanox fw make files to include Spectrum-4 ASIC firmware binaries.
Update firmware upgrade scripts to be able to detect Spectrum-4 ASIC.

- How to verify it
Run regression tests

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-12-10 10:33:21 +08:00
Lior Avramov
f3821c6d2f [Mellanox] Add SDK hash calculator debian and update SDK makefile to compile it (#12840)
- Why I did it
Add SDK hash calculator Debian and update SDK makefile to compile it.

- How I did it
SDK hash calculator Debian will be used by ECMP calculator (PR #12482)

- How to verify it
Compile sonic-buildimage and verify SDK hash calculator Debian exist in target folder.
2022-12-10 10:33:21 +08:00
Stephen Sun
91e12d7b49 [Mellanox] Support PSU power threshold checking (#11863)
* Support power threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* get_psu_power_warning_threshold => get_psu_power_warning_suppress_threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Fix comments

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-12-10 10:33:21 +08:00
Richard.Yu
c34e3ff86b
[submodule]Advance sairdis with sai 1.11 and add brcm and mlnx sai sdk (#12471) (#12820)
*Why I did it*
Advance submodule sairedis with SAI 1.11 and add brcm and mlnx SAI SDK

*How I did it*
Advance sairedis which contains
Todo: because the sairedis 202211 branch is blocked by some dependency repos, map to sairedis master for now; will move to 202211 when the branch is ready
[submodule][SAI]Advance SAI head pointer sonic-sairedis#1155
[Recorder]: Acquire lock for ofstream changes sonic-sairedis#1145
[SAI submodule update] Enable support for SAI v1.11.0 sonic-sairedis#1140
Add brcm sdk 7.1 which update with sai 1.11
Add mlnx sdk which update with sai 1.11
*How to verify it*
Test with pipeline which enable RPC build as well https://github.com/sonic-net/sonic-buildimage/pull/12770/files
Test with sonic smoke test cases
Test with sai test cases

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Co-authored-by: Kebo Liu <kebol@nvidia.com>
2022-11-24 23:30:54 +08:00
Junchao-Mellanox
20d885dbc2
[Mellanox] Add new thermal sensors for SN5600 (#12671)
- Why I did it
Add new thermal sensors for SN5600

- How I did it
Add new thermal sensors for SN5600: PCH and SODIMM

- How to verify it
Manual test
2022-11-14 11:10:33 -08:00
Kebo Liu
c8c2b7fc45
[Mellanox] [Platform API] Update SN2201 dynamic minimum fan speed table (#12602)
- Why I did it
Update SN2201 dynamic minimum fan speed table according to data provided by the thermal team.

- How I did it
Update the thermal table in device_data.py

- How to verify it
Run platform related regression

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-11-08 13:37:10 +02:00
Junchao-Mellanox
830b7d8cb4
[Mellanox] Use sdk sysfs instead of ethtool (#12480) 2022-11-03 11:17:44 -07:00
Vivek
5d83d424b1
Added BUILD flags to provision for building the kernel with non-upstream patches (#12428)
* Added ENV vars for non-upstream patches

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Made MLNX_PATCH_LOC an absolute path

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Added non-upstream-patches dir

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Update README.md

* Addressed comments

* Env vars updated

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Readme updated

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
2022-10-31 12:16:05 -07:00
Dror Prital
917ad1ffe0
[Mellanox] Update SDK/FW to version 4.5.3186/2010.3186 (#12542)
- Why I did it
Update SDK/FW version - 4.5.3186/2010_3186 in order to have the following changes:

New functionality:
1. Added support for 6.5W (Class 8) in ports 49-50, 53-54, 57-58, and 61-62 on SN4600 system

Fix the following issues:
1. On very rare occasion (~1/100K), during I2C transaction with MMS1V50-WM and MMS1V90-WR modules on SN4700 system, the module may send unexpected stop which violates the I2C specification, possibly affecting the link up flow
2. When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed
3. While toggling the cable with ‘sfputil lpmode on/off’, error msg like “ERR pmon#xcvrd: Receive PMPE error event on module 1: status {X} error type {y}” could be received
4. When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted
5. When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch returned an error even if the configuration was identical to that done before performing the ISSU
6. While moving from lossless to lossy mode while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized
7. SLL configuration is missing in SDK dump
8. If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero)
9. PCI calibration changes from a static to a dynamic mechanism
10. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event
11. SDK returned error when FEC mode is set on twisted pair, when FEC was set to None

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests

Signed-off-by: dprital <drorp@nvidia.com>
2022-10-30 09:31:09 +02:00
Stephen Sun
8c73e68468
Remove \n from the end of fs_path in ONIEUpdater (#12465)
This fixes the following error

```
admin@sonic:~$ sudo fwutil show status
mount: /mnt/onie-fs: special device /dev/sda2
 does not exist.
Error: Command '['mount', '-n', '-r', '-t', 'ext4', '/dev/sda2\n', '/mnt/onie-fs']' returned non-zero exit status 32.. Aborting...
Aborted!
admin@sonic:~$ sudo vi /usr/local/lib/python3.9/dist-packages/sonic_platform/

```
It seems that the rstrip('\n') was removed in #11877, probably by mistake.
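
The effect of the missing rstrip can be illustrated with a short sketch. The helper name `command_output` is hypothetical, and the child process here merely prints a device path in place of the real ONIEUpdater query.

```python
import subprocess
import sys


def command_output(cmd):
    """Run cmd (a list, no shell) and return stdout without the trailing newline."""
    output = subprocess.check_output(cmd, universal_newlines=True)
    # Without this rstrip the newline printed by the child process stays in
    # the string and ends up inside the next command's argument list.
    return output.rstrip('\n')


# Stand-in for a command that prints a device path followed by a newline.
fs_path = command_output([sys.executable, "-c", "print('/dev/sda2')"])
mount_cmd = ['mount', '-n', '-r', '-t', 'ext4', fs_path, '/mnt/onie-fs']
```

With the rstrip in place, `mount_cmd` contains '/dev/sda2' rather than '/dev/sda2\n' as in the error above.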

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-10-23 09:59:20 +03:00
Mai Bui
648ca075c7
[device/mellanox] Mitigation for security vulnerability (#11877)
Signed-off-by: maipbui <maibui@microsoft.com>
Dependency: [PR (#12065)](https://github.com/sonic-net/sonic-buildimage/pull/12065) needs to merge first.
#### Why I did it
`subprocess.Popen()` and `subprocess.check_output()` are used with `shell=True`, which is dangerous because it allows shell injection.
#### How I did it
Disable `shell=True`, enable `shell=False`
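
A minimal sketch of the difference, assuming a hypothetical helper `echo_argument` that is not part of the change itself: with `shell=False` and a list of arguments, each element reaches the child process as one argv entry, so shell metacharacters in it are never interpreted.

```python
import subprocess
import sys


def echo_argument(arg):
    """Pass arg to a child process as a single argv element, with no shell."""
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", arg]
    output = subprocess.check_output(cmd, shell=False, universal_newlines=True)
    return output.rstrip("\n")


# The semicolon is treated as plain text, not as a command separator.
result = echo_argument("Ethernet0; echo injected")
```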
#### How to verify it
Tested on DUT, compare and verify the output between the original behavior and the new changes' behavior.
[testresults.zip](https://github.com/sonic-net/sonic-buildimage/files/9550867/testresults.zip)
2022-10-06 17:51:31 -04:00
Dror Prital
44356fa8d7
[Mellanox] Add NVIDIA copyright header for NVIDIA added files (#12130)
- Why I did it
Add NVIDIA Copyright header for new "NVIDIA" files

- How I did it
Add the copyright header as remark at the head of the file
2022-10-02 11:34:24 +03:00
Volodymyr Samotiy
eea8ebd0a9
[Mellanox] Update MFT to v4.21.0-100 (#11758)
- Why I did it
To update MFT package to the latest version.

- How I did it
Updated MFT_VERSION & MFT_REVISION in platform/mellanox/mft.mk.

- How to verify it
Build an image and deploy to the switch
Check MFT version by dpkg -l | grep mft
Verify that all the SONiC services up and running
Run regression testing using tests from sonic-mgmt

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-09-30 09:48:40 +03:00
Volodymyr Samotiy
92bd6dae28
[Mellanox] Update SAI to v2205.22.1.19 and SDK/FW to v4.5.3168/v2010.3170 (#12205)
- Why I did it
To include latest fixes and new functionality

SAI fixes and new features
fix #3205239, incorrect object type returned for SG child list
Fix VRF-VNI map entries remove issue
ECC health event and logging
[Port Buffers] restore default queue and pg configuration when all user pools are deleted
Fix EVPN type3 error on removal of uc/bc flood group
Fix EVPN type2 MAC move from local to remote results in SAI failure
Fix Disable learning on VXLAN tunnel
Fix error on VXLAN v6 tunnel removal
Fix port cannot apply schedule group when it is a lag member
Fix BFD add more detailed message on BFD packet not related to any existing session
gcc10 compilation fixes
Disable learning on VXLAN tunnel
Support BFD remote-disc exchange in negotiation stage
Tunnel Loopback packet action attribute implementation (for Dual TOR)
Add KVD resources MIN/MAX functionality (pending CRM issue with MIN only)
Support for CRC2 hash algorithm
Bulk counter support for PGs, queues
Support mirror sample rate attribute (SPC2+)
[Functional] [QoS] | Unable to remove SCHEDULE profile table even if there is no object referencing it
Next hop group optimized bulk API
Reduce verbosity of shared database already exists print
Span mirror policer (SPC2+), optimize pipeline for acl mirror action with policer on SPC2+
use same size descriptor pool for rx/tx
fix bfd - notify Sonic for admin-down event
2201 - empty list for supported fec for RJ45 ports
Fix don't disable used tunnel underlay interfaces

SDK fixes
100GbE FCI DAC (10137628-4050LF/HPE PN: 845408-B21) was recognized by mistake as supporting "cable burning", which caused the switch firmware to read page 0x9f (which is unsupported in the cable) and to report this cable as having "bad eeprom".
Added remote peer UDP port information in BFD packet event.
After editing an ECMP, the resilient ECMP next-hop counter may not count correctly.
Fixed potential memory leaks in some APIs related to LPM
If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero).
In SN2201: When configuring Force mode, user should configure Speed and FEC on both sides
In Flex Tunnel encapsulation flow, if the encapsulation is with an IPv6 header, the flow label field may not be updated as expected.
In some cases, when changing speed to 400GbE over 8 lanes, the first few packets would be dropped.
In some traffic patterns involving small packets, the PortRcvErrors counter may mistakenly count events of local physical errors due to an internal flow in the hardware that involves link packets.
On Spectrum systems, sometimes during link failure, not all previous firmware indications cleared properly, potentially affecting the next link up attempt.
On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-through mode, a pipeline might get stuck.
PCI calibration changes from a static to a dynamic mechanism.
SDK debug dump shows "Unknown" Counter in RFC3635 Counter Group.
SDK debug dump shows "Unknown" Counter in the PPCNT Traffic Class Counter Group.
SDK Dump missing column headers in some GC tables may result in difficulty understanding the dump.
SLL configuration is missing in SDK dump.
Spectrum-2 systems do not support 1GbE on supported 40GbE modules.
When binding a UDP port which is already in use for BFD TX session, the error message appears incorrectly.
When Flex Tunnel was used, Flex Modifier sometimes experienced a brief mis-configuration during ISSU.
When many ports are active (e.g. 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed.
When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch returned an error even if the configuration was identical to that done before performing the ISSU.
While toggling the cable, and the low power mode is set to ON, an unexpected PMPE event error is received.

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-09-30 09:40:12 +03:00
Junchao-Mellanox
1d69f0916e
[Mellanox] Provide dummy implementation for get_rx_los and get_tx_fault (#12231)
- Why I did it
get_rx_los and get_tx_fault are not supported via the existing interface used, so a dummy implementation needs to be provided for them.
NOTE: in later releases we will get them back via different interface.

- How I did it
Return False * lane_num for get_rx_los and get_tx_fault
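
Reading "False * lane_num" as a list with one False entry per lane (an assumption; the literal Python expression `False * lane_num` evaluates to the integer 0), a dummy implementation could be sketched like this. The class name `DummySfp` is hypothetical and does not reflect the real platform SFP class.

```python
class DummySfp:
    """Hypothetical stand-in for the platform SFP class."""

    def __init__(self, lane_num):
        self.lane_num = lane_num

    def get_rx_los(self):
        # One False per lane: loss of signal is never reported.
        return [False] * self.lane_num

    def get_tx_fault(self):
        # One False per lane: TX fault is never reported.
        return [False] * self.lane_num
```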

- How to verify it
Added unit test
2022-09-30 09:38:05 +03:00
Stephen Sun
4d317aff94
[Mellanox] Fix typo in platform API (#12136)
- Why I did it
Fix a typo in chassis platform API which causes the following error

>>> import sonic_platform as P
>>> c = P.platform.Platform().get_chassis()
>>> sl = c.get_all_sfps()
>>> sl[0].get_lpmode()
Sep 28 07:48:33 INFO    LOG: Initializing SX log with STDOUT as output file.
False
>>> del c
Exception ignored in: <function Chassis.__del__ at 0x7f1d166ef8b0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sonic_platform/chassis.py", line 126, in __del__
    self.sfp_module.deinitialize_sdk_handle(sfp_module.SFP.shared_sdk_handle)
NameError: name 'sfp_module' is not defined

- How I did it
Use self while using the SDK handle

- How to verify it
Manual test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-09-28 11:09:18 +03:00
Junchao-Mellanox
f890606d82
Revert "[Mellanox] Redirect ethtool stderr to subprocess for better error log (#12038)" (#12183)
This reverts commit 9750cb4.

There is a PR to handle 202205 branch revert: #12184

- Why I did it
The PR to be reverted introduced many notice logs every 1 minute if SFP is not plugged:

Cannot get module EEPROM information: Input/output error
Before the "bad" PR, the message format is like this:

INFO pmon#supervisord: xcvrd Cannot get module EEPROM information: Input/output error
It was truncated by rsyslog because every message is the same. However, the "bad" PR introduces SFP index to the message:

NOTICE pmon#xcvrd: Failed to get EEPROM data for sfp 39: Cannot get module EEPROM information: Input/output error
Rsyslog no longer truncates such logs, so many such messages flood the syslog.

- How I did it
Revert the PR

- How to verify it
Manual test
2022-09-28 10:15:26 +03:00
Dror Prital
54b146f56c
[Mellanox] Update SDK/FW to version 4.5.2320/2010.2320 (#11990)
- Why I did it
Update SDK/FW version - 4.5.2320/2010_2320 in order to have the following fixes:
• Spectrum-3 | PCI calibration changes from a static to a dynamic mechanism.
• [VxLAN] TTL was set to 0 for non IP traffic (such as ARP)

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests
2022-09-14 20:43:38 +03:00
Junchao-Mellanox
9750cb48c6
[Mellanox] Redirect ethtool stderr to subprocess for better error log (#12038)
- Why I did it
ethtool prints error logs when the EEPROM of an SFP is not available. It prints errors like this:

INFO pmon#/supervisord: xcvrd Cannot get module EEPROM information: Input/output error
INFO pmon#/supervisord: xcvrd Cannot get Module EEPROM data: Invalid argument
However, this log does not contain the relevant SFP index, which makes it hard for developers/QA to find the exact SFP.

- How I did it
Redirect ethtool stderr to subprocess and log it better
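
The idea can be sketched as below, assuming a hypothetical helper `run_for_sfp` and a plain Python logger in place of the SONiC syslog helper. The log text follows the message quoted in the revert of this change; the severity is an approximation, since Python's logging module has no NOTICE level.

```python
import logging
import subprocess

logger = logging.getLogger("xcvrd-sketch")


def run_for_sfp(cmd, sfp_index):
    """Run cmd, capture its stderr, and log failures with the SFP index."""
    result = subprocess.run(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode != 0:
        # stderr is captured instead of leaking to the supervisor log, so
        # the message can carry the index of the SFP that failed.
        logger.warning("Failed to get EEPROM data for sfp %d: %s",
                       sfp_index, result.stderr.strip())
        return None
    return result.stdout
```

Because every message now carries a distinct SFP index, consecutive messages are no longer identical, which is relevant to how a syslog daemon collapses repeated lines.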

- How to verify it
Manual test
2022-09-14 20:41:43 +03:00