Commit Graph

435 Commits

Author SHA1 Message Date
Stepan Blyshchak
bc58e2d841
[202012][mlnx-ffb.sh] Update issu-version location (#14927)
BACKPORT OF https://github.com/sonic-net/sonic-buildimage/pull/14925

#### Why I did it

ISSU version check fails due to inability to mount squashfs from 202211 on 201911

#### How I did it

Put ISSU version file under platform directory

#### How to verify it

202012 (with [202012][mlnx-ffb.sh] Update issu-version location  #14927) to master
2023-07-01 23:43:51 -07:00
Yakiv Huryk
ab5115846d
[202012][Mellanox] update sdk/fw build procedure (#14025) (#14220)
- Why I did it
To optimize Mellanox platform build

- How I did it
sdk debs are now downloaded as Spectrum-SDK-Drivers-SONiC-Bins release
sx kernel is downloaded as zip from Spectrum-SDK-Drivers
2023-03-16 12:42:19 +02:00
Sudharsan Dhamal Gopalarathnam
79548e472d
[Mellanox]Fix lpmode set when logical port is larger than 64 (#14138) (#14202)
Manual cherry-pick of https://github.com/sonic-net/sonic-buildimage/pull/14138
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below

Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1

- How I did it 
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK

- How to verify it 
Manual testing
2023-03-14 10:19:02 -07:00
Sudharsan Dhamal Gopalarathnam
ca17198f04
[202012][Mellanox] Change MFT version to 4.21.0-100 (#13956)
- Why I did it
Update MFT version to 4.21.0-100 to include a fix for an issue reported using mlxlink on qsfp-dd

- How I did it
Update mft.mk

- How to verify it
Run regression on Mellanox platforms
2023-02-26 09:42:52 +02:00
Junchao-Mellanox
0f47c5be59
[202012] [Mellanox] Fix issue: cannot lable port for logical port is logical port number larger than 64 (#13709)
- Why I did it
sfp_event.py gets a PMPE message when a cable event is available. In PMPE message, there is no label port available. Current sfp_event.py is using sx_api_port_device_get to get 64 logical ports attributes, and find the label port from those 64 attributes. However, if there are more than 64 ports, sfp_event.py might not be able to find the label port and drop the PMPE message.

- How I did it
Don't use hardcoded 64, get logical port number instead.

- How to verify it
Manual test
2023-02-23 08:27:21 +02:00
Stepan Blyshchak
73c7ced753
[202012][Mellanox] Place FW binaries under platform directory instead of squashfs (#13890)
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:

admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb  8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa

- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.

- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.

- How to verify it
Upgrade from 201911 to 202012
202012 to 201911 downgrade
202012 -> 202012 reboot
ONIE -> 202012 boot (First FW burn)

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-02-22 17:38:54 +02:00
Junchao-Mellanox
7543993af3
[202012] [Mellanox] Fix issue: SFP eeprom corrupted after replacing cable with different sfp type (#13543)
- Why I did it
There are 3 tasks in xcvrd:

main task, run a loop to recover missing SFP static information to DB every 1 minute
SFP state task, a process which listens cable plug in/out event, insert SFP static information to DB while a cable is inserted
SFP DOM update task, a thread which handles cable DOM information update every 1 minute
Let assume user replaces QSFP with QSFP-DD. There are two issues:

Only SFP state task listens cable plug in/out event, main task and SFP DOM update task does not know SFP type has changed, they still “think” the SFP type is QSFP. So, main task and SFP DOM update task uses QSFP standard to parse QSFP-DD EEPROM which causes corrupted data.
There is a race condition between main task and SFP state task. They both insert SFP static information to DB. Depends on timing, it is possible that main task using wrong SFP type to override SFP static information.
The PR is to fix these two issues.

There is no such issue on 202205 and above because there is a refactor for xcvrd:

SFP state task was changed from process to thread, so that all 3 tasks share the same memory space, they always have correct SFP type.
Recover missing SFP information logical was moved from main task to SFP state task. There is no race condition anymore.

- How I did it
It is difficult to back port latest xcvrd because there are many refactor/new features in xcvrd after 202012 release. It will be huge effort to do so. Based on that, we decided to fix the issue on Nvidia platform API side. The fix is that: refreshing SFP type before any SFP API which accessing SFP EEPROM. Refreshing SFP type before any SFP API would cause a small performance down: Due to my test on 202012 branch, accessing transceiver INFO and DOM INFO for 32 ports takes 1.7 seconds before the change. The number changes to 2.4 seconds after the change. I suppose the performance down is acceptable.

- How to verify it
Manual test
Regression
2023-02-19 09:47:32 +02:00
Nazarii Hnydyn
83b6518ae2
[202012][mellanox]: Add BIOS upgrade infra (#13571)
- Why I did it
Added BIOS upgrade infra

- How I did it
Added new make target

- How to verify it
Copy msn3800_bios.tar.gz to platform/mellanox/bios
make configure PLATFORM=mellanox
make target/files/buster/msn3800_bios.tar.gz

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2023-02-02 10:07:03 +02:00
Junchao-Mellanox
46a774e294
[202012] [Mellanox] Fix select timeout in sfp event (#13347)
- Why I did it
Backport #9795
Python select.select accept a optional timeout value in seconds, however, the value passes to it is a value in millisecond.

- How I did it
Transfer the value to millisecond.

- How to verify it
Manual test
2023-01-19 17:29:41 +02:00
Kebo Liu
a569bfc9eb
skip hw reboot cause if warm/fast reboot found from the proc cmdline (#13378)
#### Why I did it
Backport https://github.com/sonic-net/sonic-buildimage/pull/13246 to 202012 branch.

In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.

#### How I did it

Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.

#### How to verify it

a. Manual test:
> 1. Perform a power loss
> 2. Perform a warm/fast reboot
> 3. check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"

b. Run reboot cause related regression test.
2023-01-17 13:21:31 -08:00
Nazarii Hnydyn
5193a96895
[202012][Mellanox]: Update ONiE FW tool: manual reboot control. (#13359)
Partial cherry-pick of: [Mellanox] Modified Platform API to support all firmware updates in single boot #9608

- Why I did it
To allow user manual reboot control over ONiE FW upgrade

- How I did it
Added a dedicated script argument handling

- How to verify it
mlnx-onie-fw-update.sh update --no-reboot

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2023-01-16 15:27:48 +02:00
Dror Prital
e8c7a7c61e
[202012][Mellanox] Update SDK/FW to version 4.5.3196/2010_3196 (#12989)
- Why I did it
Update SDK/FW version - 4.5.3196/2010_3196 in order to have the following fixes:

1. ON SPC2/3 in some cases, after many ACL region resize will corrupt internal DB that in return will fail future ACLs configuration
2.. Lag Port as Analyzer Port | when removing port from distributer list SDK does not reselect another port for mirroring
3. Due to critical race at initial configuration, SDK RDQ test may test RDQ configured for WJH and fail the test

Add support for new HW SKU of SN4700

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests
2022-12-08 12:16:54 +02:00
Kebo Liu
db03698ba5
fix DOM support caoability issues on QSFP and CMIS cables (#12500)
Signed-off-by: Kebo Liu <kebol@nvidia.com>

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-10-30 23:20:57 -07:00
Kebo Liu
78043b828c
[202012] [Mellanox] Read transceiver EEPROM via sdk sysfs (#12399)
- Why I did it
ethtool is not able to read certain pages(eg. page 11h) of CMIS cables.
SDK provides a set of sysfs to expose the transceiver EEPROM, now we migrate from using ethtool to read these sysfs for transceiver EEPROM reading.

- How I did it
replace ethtool with accessing the SDK sysfs for cable EEPROM reading.
Adjust the offset according to the SDK sysfs memory map.

- How to verify it
run sonic-mgmt sfp-related regression test case.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-10-30 09:34:39 +02:00
Dror Prital
5de7ae449a
Update SDK/FW to version 4.5.3186/2010_3186 (#12531)
- Why I did it
Update SDK/FW version - 4.5.3186/2010_3186 in order to have the following changes:

New functionality:
1. Added support for 6.5W (Class 8) in ports 49-50, 53-54, 57-58, and 61-62 on SN4600 system

Fix the following issues:
1. On very rare occasion (~1/100K), during I2C transaction with MMS1V50-WM and MMS1V90-WR modules on SN4700 system, the module may send unexpected stop which violate the I2C specification, possibly affecting the link up flow
2. When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed
3. While toggling the cable with ‘sfputil lpmode on/off’, error msg like “ERR pmon#xcvrd: Receive PMPE error event on module 1: status {X} error type {y}” could be received
4. When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted
5. When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU
6. While moving from lossless to lossy mode while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized
7. SLL configuration is missing in SDK dump
8. If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero)
9. PCI calibration changes from a static to a dynamic mechanism
10. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event
11. SDK returned error when FEC mode is set on twisted pair, when FEC was set to None

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests

Signed-off-by: dprital <drorp@nvidia.com>
2022-10-30 09:29:45 +02:00
Dror Prital
edc4485d30
[202012][Mellanox] Update SDK/FW to version 4.5.2320/2010_2320 (#11975)
Update SDK/FW version - 4.5.2320/2010_2320 in order to have the following fixes:
• Spectrum-3 | PCI calibration changes from a static to a dynamic mechanism.
• [VxLAN] TTL was set to 0 for non IP traffic (such as ARP)
2022-09-07 08:33:18 +03:00
Dror Prital
db37325f76
[202012][Mellanox] Update SAI version to 1.22.0.0 and SDK/FW to version 4.5.2318/2010_2318 (#11534)
- Why I did it
Update SAI version - 1.22.0.0
Update SDK/FW version - 4.5.2318/2010_2318

SAI Changes:
1. Port FEC fix for multiple speeds
2. Next hop group optimized bulk API
3. Support BFD remote-disc exchange in negotiation stage
4. Reduce verbosity of shared database already exists print

SDK/FW Fixes:
1. Cr space timeout on Hold and Release GW - at warmboot
2. SPC-1 Port in stuck PHY_UP after peer side rebooted
3. memory leak in sx_api_router_ecmp_update_set

- How I did it
Update pointer for the new SAI and SDK/FW

- How to verify it
Run regression tests
2022-07-26 21:01:36 +03:00
Kebo Liu
c60bf90590
[202012] [Mellanox] Update hw-mgmt package to V.7.0010.2349 (#11421)
- Why I did it
New changes in this new HW-MGMT package:

1. hw-mgmt: chassis events: Fix voltmon address conflict on connecting
2. hw-mgmt: topology: Add COMEX BRDWL respin support
  a. Removed A2D sensor from all COMEX BRDWL boards
  b. Add COMEX BRDWL boards with register defined (config3)

- How I did it
Advance the hw-mgmt repo pointer and update the hw-mgmt version number

- How to verify it
Run platform-related regression test cases on the new testbed.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-07-20 09:00:17 +03:00
Alexander Allen
851bd9bff8 [Mellanox] Add arch folder to SDK binary location (#11278)
- Why I did it
This is for the eventual support of multiple architectures for the mellanox platform.

- How I did it
Change the location of the binaries in Switch-SDK-drivers so that the path specifies the target architecture in addition to the target distribution that the debians are built for.

This is the most straightforward way to separate binaries built against different architectures and selectively target them for installation in the mellanox SONiC image.

- How to verify it
Build SONiC for mellanox and verify it compiles successfully.
2022-07-05 20:58:01 +00:00
Nazarii Hnydyn
05ff95fdfc
[Mellanox]: Advance SAI submodule. (#11164)
[Mellanox]: Advance SAI submodule. (#11164)
Fix #3074227 - don't disable used tunnel underlay interfaces
fix bfd - notify Sonic for admin-down event
2022-06-16 18:09:59 -07:00
Volodymyr Samotiy
6b029a613b
[202012] [Mellanox] Update SAI to 1.21.1.2 and SDK/FW to 4.5.2262/xx.2010.2262 (#10880)
- Why I did it
To include latest fixes:
1. Warmboot | When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU.
2. Link Up | When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
3. Shared buffer | While moving from lossless to lossy while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized.

- How I did it
Updated SAI & SDK submodules along with the relevant Makefiles

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-05-22 09:48:37 +03:00
Sudharsan Dhamal Gopalarathnam
2a232730b0
[202012][Mellanox] Update SDK/FW to 4.5.1500/2010.1500 and SAI version to 1.21.1.2 (#10464)
* [Mellanox] Update SDK/FW to 4.5.1500/2010.1500 and SAI version to 1.21.0.1

Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>

* Updating Switch-SDK-drivers submodule pointer

* Updating SAI version
2022-05-04 06:07:10 +03:00
Kebo Liu
68b38b325b
[202012][Mellanox] Change MFT version to 4.18.0-106 (#10305)
- Why I did it
With the previous MFT 4.18.1-16 there is a bug in mstdump tool accessing wrong address. it is confirmed this issue does not exist in official 4.18.0-106.

- How I did it
Update the MFT version to 4.18.0-106

- How to verify it
Run regression on Mellanox platforms
2022-03-21 19:37:34 +02:00
Junchao-Mellanox
0c859fb036
[Mellanox] [202012] Fix issue: 4600C is using wrong thermal profile (#10258)
- Why I did it
4600C is using wrong thermal profile and it displays 2 CPU core thermal in show platform temperature output, there should be 4 CPU core thermal.

- How I did it
Change 4600C to use thermal profile 10.

- How to verify it
Manual test
2022-03-20 10:31:59 +02:00
Dror Prital
6293a091a8 [Mellanox] Upgrade ASIC FW tool to 4.18.1-16 (#9981)
- Why I did it
Update MFT to version 4.18.1-16 for bugs fixes and new SN2201 support

- How I did it
Advance to MFT tool version to 4.18.1-16

- How to verify it
Manually tested on all Mellanox platforms (ASIC FW Upgrade, link debug tools, CPLD upgrade, etc.)
2022-02-15 23:56:58 +00:00
Volodymyr Samotiy
e6b22b1942
[Mellanox][202012] Update SAI to 1.20.2.6 and SDK/FW to 4.5.1208/2010.1218 (#9818)
- Why I did it
To include latest fixes.
1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting SN4600C, 100GbE port with CWDM4 module (Gen 3.0), link up time is 30 seconds.
3. Add T1 ECMP Overlay support

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-01-26 10:58:19 +02:00
Junchao-Mellanox
8e924b9a70
[Mellanox] Optimize thermal policies (#9665)
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.

- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable

- How to verify it
1. Manual test
2. Regression
2022-01-19 11:42:55 +02:00
Stepan Blyshchak
31065ccb93
[Mellanox] [202012] fail the build when hw-mgmt patches do not apply (#9566)
Taken from https://github.com/Azure/sonic-buildimage/pull/9539

####  Why I did it
To fix an issue that hw-mgmt patches were not applied. One patch was already in upstream hw-mgmt package thus applying it again caused an error and no other patches were applied. Also, I did it to improve the Makefile, so that the make will fail in case patches fail to apply.

####  How I did it
Removed obsolete patch, made applying patches a hard failure in the build.

####  How to verify it
Run the make and verify patches are applied.
2022-01-13 15:08:27 -08:00
DavidZagury
57abd5914e [Mellanox] Upgrade Mellanox firmware tools to 4.17.2-12 (#8978)
- Why I did it
Bug fix:
bad_param request due to missing parser rest command while running mlxlink

- How I did it
Advance to MFT tool version to 4.17.2-12.

- How to verify it
Manually tested on all mellanox platforms.
2022-01-12 22:36:11 +00:00
Kebo Liu
16a3929159
[202012][Mellanox] Update hw-mgmt package to V.7.0010.2347 (#9594)
- Why I did it
Update hw-mgmt to a new version to pick up support for the SN4600C A1 system.

- How I did it
Update the pointer of the hw-mgmt submodule
Update the hw-mgmt version number
Remove the staled code patch to hw-mgmt userspace code.

- How to verify it
Run platform regression on Mellanox platforms.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-12-28 09:40:58 +02:00
Stepan Blyshchak
bdf31a6556 [Mellanox][SDK] Build SDK with PRM sniffer support (#9500)
- Why I did it
To have an ability to use PRM sniffer.

- How I did it
Enabled the option in configure flags.

- How to verify it
Built and ran on switch. Enabled the feature in runtime and checked the sniffer recording.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-12-20 19:25:52 +00:00
Junchao-Mellanox
0197855d5d
[Mellanox] [202012] Allow user to set LED to orange (#9514)
Backport https://github.com/Azure/sonic-buildimage/pull/9259 to 202012

#### Why I did it

Nvidia platform API does not support set LED to orange. 

#### How I did it

Allow user to set LED to orange

#### How to verify it

Manual test
2021-12-13 16:04:06 -08:00
Stephen Sun
acac848858
[Reclaim buffer][202012] Reclaim unused buffers by applying zero buffer profiles (#9063)
- Why I did it
Support zero buffer profiles

1. Add buffer profiles and pool definition for zero buffer profiles
2. Support applying zero profiles on INACTIVE PORTS
3. Enable dynamic buffer manager to load zero pools and profiles from a JSON file

- How I did it
Add buffer profiles and pool definition for zero buffer profiles

If the buffer model is static:
 * Apply normal buffer profiles to admin-up ports
 * Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
 * Apply normal buffer profiles to all ports
 * buffer manager will take care when a port is shut down

Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists

Originally, all the macros to generate the above buffer objects took active ports only as an argument.
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports need to be added.
To be backward compatible, a new series of macros are introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called.
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:

The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/temlates folder in compiling time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:
One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbol link to the above files

Update sonic-cfggen test accordingly:
 * Adjust example output file of JSON template for unit test
 * Add unit test in for Mellanox's new buffer templates.

- How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.

Signed-off-by: stephens <stephens@nvidia.com>
2021-12-09 17:34:56 +02:00
Volodymyr Samotiy
0831635b1c
[Mellanox] Update SDK to v4.4.3360 and FW to v2008.3358 (#9403)
- Why I did it
To include latest fixes.

1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
3. On rare occasions, when working with port rates of 1GbE or 10GbE and congestion occurs, packets may get stuck in the chip and may cause switch to hang.
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. Using SN4600C with copper or optics loopback cables in NRZ speeds, link may raise in long link up times ( up to 70 seconds).
6. When connecting SN4600C to SN4600C after Fastboot in 50GbE No_FEC mode with a copper cable, the link up time may take ~20 seconds.

- How I did it
Updated SDK submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "soni-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-12-06 11:01:43 +02:00
Junchao-Mellanox
227f2f8aec [Mellanox] Fan speed should not be 100% when PSU is powered off (#9258)
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.

- How I did it
When PSU is powered of, don't treat it as absent.

- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
2021-12-01 02:28:37 +00:00
Junchao-Mellanox
d69564a1e7 [Mellanox] Change thermal recover threshold from temp_trip_norm to temp_trip_high (#8792)
- Why I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high, so that thermal algorithm would set fan speed to minimum allowed earlier and save power.

- How I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high

- How to verify it
Manual test
2021-10-05 22:17:30 +00:00
Nazarii Hnydyn
70b9ea5409 [Mellanox] Advance hw-mgmt to V.7.0010.2346. (#8667)
Commits on Sep 01, 2021
hw-mgmt: attributes: Add PSU power sensor attributes d8fce39

Commits on Sep 02, 2021
Remove MFT package flint tool from hw-management dump generation. 53d06b2
hw-mgmt: debug: Add timeout to generate-dump.sh b661fa3 

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-09-09 12:03:44 +00:00
shlomibitton
c0f9bb9720
[202012] [Mellanox] Update SDK\FW to version 4.4.3326\2008.3326 (#8602)
- Why I did it
Update SDK\FW version to 4.4.3326\2008.3326. This version contains:

New Features:
1. Add support for Fast Boot for SN3800

Bug Fixing:
1. In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 &L3)

2. On Spectrum systems, when using Async Router API with IPV6, an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.

3. On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.

4. Some Molex cables do not support speed after reboot

- How I did it
Update submodule and .mk files

- How to verify it
Verified by running regression tests that includes complete sonic-mgmt tests supported

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-09-03 10:59:18 +03:00
Junchao-Mellanox
49f4ef6438 [Mellanox] Read PSU fan max/min speed per PSU (#8563)
#### Why I did it
New PSU could install different type of fan, so fan max/min speed should be read per PSU

#### How I did it
The existing implementation read PSU max/min fan speed from a common file, change it to read from per PSU file

#### How to verify it
Manual test
2021-08-27 02:27:00 +00:00
Alexander Allen
196fcffb6f [Mellanox] Upgrade Mellanox firmware tools to 4.17.0 (#8299)
- Why I did it
New release of MFT has the following changelog / RN
 Fixed an issue that resulted in getting MVPD read errors from the mlxfwmanager during fast reboot.
 Fixed mlxuptime sometimes generating a time less than previous due the wrong frequency calculation

- How I did it
Update makefile pointer to new version.

- How to verify it
Manually tested on all Mellanox platforms.
2021-08-23 03:05:20 +00:00
Junchao-Mellanox
8285cf2329
[Mellanox] [202012] Upgrade hw-mgmt to 7.0100.2344 (#8408)
To support new PSU fan on Mellanox platforms
2021-08-11 02:04:55 -07:00
DavidZagury
0551fed754 [Mellanox][Pcie] Fix issue on pcied with an id that contains only decimal digits was treated as a decimal number (#8309)
A device that contains only decimal digits was mistreated as a decimal integer resulting in failure to find it in the id to bus map.
2021-08-05 15:22:48 +00:00
DavidZagury
45e100b61b [Mellanox][pcied] Ignore bus on pcie.yaml for Mellanox switches (#8063)
Why I did it
BIOS upgrade on rare cases cannot guarantee bus value remain the same on every BIOS release. Ignoring this field in order for pcied not to fail but still verify device id in a different way. The solution is future proof and will not require changes in code when new BIOS version is available

How I did it
Since bus is not a fixed value (it is determined by the bios version) we are ignoring this field, and instead checking if there is a device that match on all other fields that and in addition has a matching device id.

How to verify it
Verify no errors or failures in pcied on different BIOS version with the same code base.
2021-07-27 10:46:31 +00:00
Dror Prital
be6cd44ddf Update SDK\FW to version 4.4.3222\2008.3224 (#8247)
*Update SDK\FW Version to 4.4.3222\2008.3224.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-26 11:05:29 -07:00
tomer-israel
13a62666d9 [WARM-REBOOT] fix issue of watchdog on simx when executing warm-reboot command (#8132)
- Why I did it
to prevent python exception error when executing warm-reboot command on mellanox simulator platform

- How I did it
return None on the watchdog python script on cases that watchdog file is not exist

- How to verify it
warm-reboot is running well without the python error. error message will appear on log on these cases.
in order to avoid this error message we can simulate the watchdog on mellanox simulator platform
2021-07-20 10:18:17 +00:00
Vivek Reddy
1b6634765c
SAI fix (#8142)
[0e4f0b] Fix saisdkdump

#### Why I did it

Fix the saisdkdump failure when the vxlan src port flag is enabled in the sai.profile
2021-07-11 02:35:17 -07:00
Dror Prital
526dd3c4fb [Mellanox] Update FW version to 2008.3218 (#8079)
Update FW version to 2008.3218, fixing the following issues:
- 50G/100G links that are operationally down before warm-reboot are not coming up after warm-reboot
- 50G/100G links with admin shut / no shut commands are not coming up after warm-reboot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-07 09:41:35 +00:00
Dror Prital
fb89c28c95
[202012] [Mellanox] Update SDK\FW ver. 4.4.3216\2008.3216 (#8056)
- Changes and new features:

1. Added support in SN4600C systems for new module Finisar ET7402-CWDM4 (100G CWDM4 QSFP28 1310nm SM 2KM).
2. Added support for new module MMS1W50-HM (2km transceiver FR4) for 200GbE
3. Improved performance of "per-port-buffer" counters
4. Added support for Kernel 5.10

- Bug fix:
On rare occasions (0.5%), in SN4600C systems, when using 100GbE NRZ mode and Fastboot flow, the link up time may take up to 10 seconds

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-06 07:31:34 +03:00
shlomibitton
b9d21a5779
Update SAI submodule (#7926)
- Why I did it
Split and bulk counter bug fixes:
Init port auto neg to default on static (SAI XML) port split for 2nd+ port

- How I did it
Update submodule hash pointer.

- How to verify it
Verify the above is handled properly and reported issues are assumed to be fixed.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-06-23 20:44:33 +03:00
Junchao-Mellanox
ccb663c39b
[Mellanox] [202012] Backport 'Read EEPROM data from DB if possible'(7808) to 202012 (#7928)
- Why I did it
Remove EEPROM cache file and use DB instead

- How I did it
Read EEPROM data from DB if possible
If data is not ready in DB, read from hardware using a visitor pattern

- How to verify it
Manual test and regression
2021-06-23 18:09:53 +03:00