335 Commits

Author SHA1 Message Date
Stephen Sun
1b168c36c4
[Mellanox][201911] Upgrade Mellanox-SAI to 1.19.3 to support reclaiming reserved buffer on admin down ports (#8735)
#### Why I did it

Upgrade Mellanox-SAI to 1.19.3 to support reclaiming reserved buffer on admin down ports

#### How I did it

To support reclaiming reserved buffer on admin down ports.

#### How to verify it

Regression test and manual test.
2021-09-19 20:16:10 -07:00
Junchao-Mellanox
30f2503ab3
[Mellanox] Read PSU fan max/min speed per PSU (#8563) (#8728)
A new PSU may have a different type of fan installed, so the fan max/min speed should be read per PSU.
2021-09-13 08:32:28 -07:00
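
Reading the fan speed limits per PSU, as described in #8728 above, could look roughly like the sketch below. The directory layout, the file naming (`psu<N>_fan_max`, `psu<N>_fan_min`) and the fallback values are assumptions made for illustration; they are not taken from hw-mgmt or the Mellanox platform API.

```python
import os


def read_psu_fan_limit(base_dir, psu_index, kind, default):
    """Read the 'max' or 'min' fan speed for one PSU, falling back to a default.

    The file name pattern 'psu<N>_fan_<kind>' is hypothetical.
    """
    path = os.path.join(base_dir, "psu{}_fan_{}".format(psu_index, kind))
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # File missing or unreadable content: use the caller's default.
        return default
```

Because each PSU is looked up separately, two PSUs with different fan types can report different limits, while a PSU without a limit file still gets a usable value.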
shlomibitton
55f86768a6
[Mellanox] Update SDK\FW to version 4.4.3326\2008.3326 (#8568)
- Why I did it
Update SDK\FW version to 4.4.3326\2008.3326. This version contains:

New Features:
1. Add support for Fast Boot for SN3800

Bug Fixing:
1. In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 & L3)
2. On Spectrum systems, when using Async Router API with IPV6, an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.
3. On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.
4. Some Molex cables do not support speed after reboot

- How I did it

- How to verify it
Verified by running regression tests, including the complete set of supported sonic-mgmt tests.
2021-08-25 16:34:42 +03:00
Junchao-Mellanox
c647c7ce2b
[Mellanox] Upgrade hw-mgmt to 7.0100.2344 (#8378)
Why I did it
To support the new PSU fan on Mellanox platforms

How I did it
Upgrade hw-mgmt to 7.0100.2344
2021-08-19 18:07:58 -07:00
Dror Prital
2a34e8aca5
[mellanox]: Update SDK\FW to version 4.4.3228\2008_3224 (#8352)
Fix the following issue: Resource KVD hash Table tries to deallocate more resources than allocated.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-08-05 19:05:26 -07:00
Dror Prital
949fcd21a8
Update SDK\FW to version 4.4.3222\2008.3224 (#8248)
* Update SDK\FW version to 4.4.3222\2008.3224.
Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-22 18:17:05 -07:00
Vivek Reddy
fcc7d3102a
[201911][Mellanox] Update SDK\FW ver. 4.4.3216\2008.3218 (#8145)
Signed-off-by: Dror Prital <drorp@nvidia.com>
* [Mellanox] Update FW version to 2008.3218 (#8079)
Update FW version to 2008.3218, fixing the following issues:
- 50G/100G links that are operationally down before warm-reboot are not coming up after warm-reboot
- 50G/100G links with admin shut / no shut commands are not coming up after warm-reboot
2021-07-09 17:54:25 -07:00
Vivek Reddy
d958b6c664
Update SAI Commit (#8141)
[ba669c3] Fix saisdkdump
Co-authored-by: Vivek Reddy Karri <vkarri@r-build-sonic06.mtr.labs.mlnx>
2021-07-09 15:28:00 -07:00
Dror Prital
e2eb4e49ab
[Mellanox][201911] Update FW version to 2008_3110 (#7806)
- Why I did it
Update FW version to 2008_3110 fixing SN3800 specific warm boot scenario:

1. Disable interface
2. Warm Boot
3. Enable Interface --> link will remain down.

- How I did it
Use new FW that contains the fix for the problem mentioned above

- How to verify it
Run the scenario mentioned above and make sure that the link is up after warm boot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-06-08 14:06:43 +03:00
Kebo Liu
7fe493aca1
[201911][Mellanox] Align PSU name convention returned from psu.get_name platform API (#7793)
Make PSU name returned from platform API aligned with the convention "PSU {X}" instead of "PSU{X}".

This PR is to backport https://github.com/Azure/sonic-buildimage/pull/7783
2021-06-04 10:38:16 -07:00
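
The naming convention from #7793 above amounts to putting a space between the prefix and the index. A minimal sketch, assuming a 1-based PSU index; the helper name `psu_name` is hypothetical and is not the platform API's `psu.get_name` itself:

```python
def psu_name(index):
    """Return the display name for a 1-based PSU index, e.g. 'PSU 1'."""
    if index < 1:
        raise ValueError("PSU index is 1-based")
    # "PSU {X}" (with a space), not "PSU{X}".
    return "PSU {}".format(index)
```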
Volodymyr Samotiy
8405d2deef
[Mellanox] Update SDK to 4.4.3106 and FW to xx.2008.3106 (#7787)
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-06-03 10:02:23 -07:00
Kebo Liu
77ba74be04
update hw-mgmt package to version 2304 (#7726)
2021-05-27 10:39:16 -07:00
shlomibitton
b869ad1122
[Mellanox] Update FW to xx.2008.2526 (#7511)
- Why I did it
Updated FW to xx.2008.2526 version.

Fixed issues:
1. Spectrum-2, Spectrum-3 | sFlow | High CPU load and high on fully loaded switch.
2. Spectrum-2, Spectrum-3 | Fine grain LAG | in rare cases doesn’t update the right entry

- How I did it
Updated submodule pointer and version in a Makefile.

- How to verify it
Full regression and bugs validation

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-05-14 11:42:22 -07:00
Junchao-Mellanox
6128ff6def
[Mellanox] [201911] Upgrade hw-mgmt to 7.0100.2303 (#7418)
- Update hw-mgmt pointer
- Remove unused patches
- Fix existing patch to make sure it applies successfully
2021-05-01 10:30:51 -07:00
Volodymyr Samotiy
efe0515519
[Mellanox] [201911] Update SAI submodule pointer (#7498)
* Set monitoring VLAN hostif up by default (for VNET ping tool)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-05-01 10:23:57 -07:00
Dror Prital
b773412a96
[mellanox]: Integrate SAI version 1.18.3.1 into 201911 branch (#7426)
- Fix ACL ANY debug counter to correctly track ACL drops
- Add VXLAN source port hard coded range, controlled by K/V

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-04-28 23:01:58 -07:00
Kebo Liu
80f0836643
[Mellanox] Update SDK to 4.4.2522 and FW to 2008.2520 (#7391)
New features and fixes in the new SDK/FW:

SN4600C | AN/LT support
SN2700 | AN/LT bugs fixes
WJH | FID_MISS support

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-04-28 16:06:37 -07:00
Volodymyr Samotiy
d35c31b8fc
[Mellanox] Update SDK to 4.4.2508 and FW to xx.2008.2508 (#7141)
Fix the following issues:

Spectrum-2, Spectrum-3 | Port | Fix link issue when using 25 GbE rate between two ports while one is on Spectrum-2-based system and the other is on Spectrum-3-based system
All | warmboot | fail to upgrade from earlier SONiC versions with official SDK/FW 4.4.2306 (was on SONiC 201911)
All | What-Just-Happened | When enabling or disabling WJH under high traffic load to the host CPU, in very specific and low probability conditions, an error could occur, that may result in loss of data, channel failure or in extreme cases SW failure

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-04-07 18:06:46 -07:00
Joe LeVeque
72b32a96fc
[201911][dockers][supervisor] Increase event buffer size for process exit listener (#7106)
Backport of https://github.com/Azure/sonic-buildimage/pull/7083 to the 201911 branch.

#### Why I did it

To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-03-29 10:07:43 -07:00
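
A quick way to check the setting described in #7106 above is to scan a supervisord configuration for event listeners whose `buffer_size` is below the target. The sketch below is illustrative only: the sample configuration, including its `command` and `events` values, is made up, and the default of 10 is the value stated in the commit message.

```python
import configparser

# Hypothetical sample; the command and events lines are placeholders.
SAMPLE_CONF = """
[eventlistener:dependent-startup]
command=/usr/bin/example-dependent-startup
events=PROCESS_STATE
buffer_size=1024

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/example-proc-exit-listener
events=PROCESS_STATE_EXITED
buffer_size=1024

[program:example]
command=/usr/bin/example.sh
"""


def undersized_listeners(conf_text, minimum=1024):
    """Return eventlistener sections whose buffer_size is below `minimum`."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    small = []
    for section in parser.sections():
        if not section.startswith("eventlistener:"):
            continue
        # A listener without an explicit value gets the default of 10.
        size = parser.getint(section, "buffer_size", fallback=10)
        if size < minimum:
            small.append(section)
    return small
```

Calling `undersized_listeners(SAMPLE_CONF)` on the sample returns an empty list; removing a `buffer_size` line makes that listener show up in the result.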
Stephen Sun
746a64e483
[mellanox]: Integrate hw-mgmt V.7.0010.1002 (#7149)
Bug fixes

- Removing critical thermal zones to prevent unexpected software system shutdown:
   Kernel 4.9 -0071-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
   Kernel 4.19 -076-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch

- hw-mgmt: thermal: Add hardcoded critical trip point

- Removing redundant link for cpld3 for fixed systems (SN2100, SN2010).

- Fix an issue with a missed attribute for cpld3 (port CPLD) for SN2700, SN2410.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-03-28 19:23:02 -07:00
Volodymyr Samotiy
88de361f96
[Mellanox] Update FW to xx.2008.2424 (#7118)
Fixed issues:
* Mellanox SN-2700 breakout port not linking up with QSA

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-03-22 19:35:32 -07:00
Kebo Liu
f2cd1ee2db
update SDK/FW and SAI to new version (#7040)
- Why I did it
To pick up new features and fixes from SDK/FW and SAI

SDK/FW new Feature:

All | Added support for multiple modules and cable types. For full list contact Nvidia networking support
Spectrum-3 | SN4600C | Added support for up to 5W on ports 49 to 64.

SDK/FW bug fixes:

All | fast reboot | fast boot failure from latest 201811 to 201911 and above
Spectrum | 10GbE/1GbE Transceiver (FTLX8574D3BCV) stopped working after firmware upgrade
Spectrum-2 | When device is rebooted with locked Optical Transceivers in split mode, the firmware may get stuck
Spectrum-2 | SN3700 | When connecting at 200GbE to Ixia K400, Ixia receives CRC errors
Spectrum-2 | SN3800 | On rare occasions packets loss may be experienced due to signal integrity issues
Spectrum-2 | When the port is a member of a LAG, after a warmboot and port toggle on the peer-side, the port remains down
Spectrum-3 | SN4700 | While using Optic cable in Split 4x1 mode in PAM4, when two first ports are toggled, the other 2 ports go down
Spectrum-3 | SN4700 | When working in 400GbE, deleting the headroom configuration (changing buffer size to zero) on the fly may cause continual packet drops
SAI

All | Counters | Update tunnel decap counter to capture VNI miss
- How I did it
Update the related version number in the make files and update the submodule pointer accordingly.

- How to verify it
Run regression tests and verify that everything works as expected.
2021-03-14 08:36:03 +02:00
Kebo Liu
5ec164b694
[Mellanox] [platform API] Fix “local variable 'label_port' referenced before assignment” error (#6419)
In rare cases, xcvrd fails with "UnboundLocalError: local variable 'label_port' referenced before assignment".

Initialize "label_port" to None at the beginning of the function, so that it is always assigned before use.
2021-02-18 18:10:54 -08:00
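
The fix pattern in #6419 above is to bind the variable before any branch that might skip the assignment. A minimal sketch of that pattern; the function `find_label_port` and its event format are hypothetical and do not reproduce the actual xcvrd or platform API code:

```python
def find_label_port(events, target_module):
    """Return the label port for target_module, or None if no event matches.

    `events` is an iterable of (module_id, label_port) pairs (assumed format).
    """
    # Bind the name up front so every path below sees a defined variable.
    label_port = None
    for module_id, port in events:
        if module_id == target_module:
            label_port = port
            break
    # Without the initialization above, an empty or non-matching `events`
    # would raise UnboundLocalError on this line.
    return label_port
```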
Volodymyr Samotiy
dc5eaf618f
[201911][Mellanox][SAI] update submodule pointer (#6805)
* Open ACL Outer VLAN ID for egress for ports part of VLAN RIF

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-02-17 22:26:17 -08:00
Stepan Blyshchak
d328af4016
[mellanox] update FW to *.2008.2314 (#6790)
Bring in a fix for thermal shutdown observed while executing warm-reboot:

- All | prevent FW access during ISSU

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-02-17 16:23:25 -08:00
Volodymyr Samotiy
4742eaacc3
[201911][Mellanox] Update SDK to 4.4.2318, FW to *.2008.2312 (#6752)
To have the following fixes:
* All | Port status remains down after warm boot and flapping the port on peer side
* All | LAG HASH | IPv6 SRC_IP is not accounted in LAG hashing
* All | ASIC driver | Kernel crash observed when driver reload is initiated before it fully loaded
* Spectrum-3 | Buffer | In lossless configuration, headroom is evicted only when the shared buffer is free

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-02-10 23:28:33 -08:00
Stepan Blyshchak
313bfdfc4c
[Mellanox][SAI] update submodule pointer (#6730)
Include SAI bug fixes:

Apply device MAC on port host interface when port is removed from LAG.
[Shared Headroom]: fixed watermark handling for SHP flow
Decrease verbosity of policer unbind message when no policer is attached

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-02-09 14:47:41 -08:00
Eran Dahan
9c9f0453f9
[MLNX] update SAI submodule (#6666)
**Why I did it**
Disable SDK extended dump due to issue found

**How I did it**
Update SAI submodule

**How to verify it**
Verify the SDK extended dump is not called.

Signed-off-by: Eran Dahan <erand@nvidia.com>
2021-02-04 09:03:51 +02:00
lguohan
22a19e87aa
[build]: wait for conflicts package to be uninstalled (#5039)
When parallel build is enabled, both docker-fpm-frr and docker-syncd-brcm
are built at the same time. docker-fpm-frr requires swss, which requires
installing libsaivs-dev. docker-syncd-brcm requires the syncd package, which
requires installing libsaibcm-dev.

Since libsaivs-dev and libsaibcm-dev install the SAI headers in the same
location, these two packages cannot be installed at the same time. Therefore,
we need to serialize the build between these two packages. Simply uninstalling
the conflicting package is not enough to solve this issue. The correct solution
is to have one package wait for the other package to be uninstalled.

For example, if syncd is built first, then it will install libsaibcm-dev.
Meanwhile, if the swss build job starts and tries to install libsaivs-dev,
it will first query whether libsaibcm-dev is installed. If it is
installed, then it will wait until libsaibcm-dev is uninstalled. After the syncd
job is finished, it will uninstall libsaibcm-dev and the swss build job will be
unblocked.

To solve this issue, _UNINSTALLS is introduced to uninstall a package that
is no longer needed and to allow a blocked job to continue.

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-27 14:07:30 -08:00
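
The wait-for-uninstall behaviour described in #5039 above can be modelled with a condition variable. This is only an illustration of the serialization idea: the real change lives in the build system, and the `PackageRegistry` class and `demo` function below are hypothetical names.

```python
import threading


class PackageRegistry:
    """Tracks installed packages and blocks installs that conflict."""

    def __init__(self):
        self._installed = set()
        self._cond = threading.Condition()

    def install(self, package, conflicts=()):
        with self._cond:
            # Block until none of the conflicting packages is installed.
            self._cond.wait_for(
                lambda: not self._installed.intersection(conflicts))
            self._installed.add(package)

    def uninstall(self, package):
        with self._cond:
            self._installed.discard(package)
            # Wake any job that was waiting for this package to go away.
            self._cond.notify_all()


def demo():
    registry = PackageRegistry()
    order = []
    registry.install("libsaibcm-dev", conflicts=("libsaivs-dev",))
    order.append("syncd: installed libsaibcm-dev")

    def swss_job():
        registry.install("libsaivs-dev", conflicts=("libsaibcm-dev",))
        order.append("swss: installed libsaivs-dev")

    worker = threading.Thread(target=swss_job)
    worker.start()  # blocks inside install() until the uninstall below
    order.append("syncd: build finished")
    registry.uninstall("libsaibcm-dev")
    worker.join()
    return order
```

In `demo()`, the swss job cannot record its install until the syncd side has uninstalled libsaibcm-dev, which mirrors the ordering in the example from the commit message.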
lguohan
8bcdefbc34
[docker-orchagent]: make build depends only on sairedis package (#6467)
backport c4b5b002c3

Make the swss build depend only on libsairedis instead of syncd. This allows building swss without depending
on the vendor SAI library.

Currently, the libsairedis build also builds syncd, which requires the vendor SAI lib. This makes it difficult to build
the swss docker in buster while still keeping the syncd docker in stretch, as swss requires libsairedis, which also
builds syncd and requires the vendor to provide SAI for buster. As the swss docker does not really contain the syncd
binary, it is not necessary to build syncd for the swss docker.

[submodule]: update sonic-sairedis
1e42517996bfe41ac58d4c25ee3f93502befcb9d (HEAD -> 201911) [build]: add option to build without syncd

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-27 13:51:24 -08:00
Kebo Liu
35d93ff8a3
[201911][Mellanox] Add hw-mgmt patch to support SDK OFFLINE event handling during ISSU (#6551)
In order to prevent the "mlxsw_minimal" driver from accessing the ASIC during the in
service firmware upgrade flow, the SDK will raise an "OFFLINE" 'udev' event
at the very beginning of such a flow. When this event is received,
hw-management will remove the "mlxsw_minimal" driver.
There is no need to implement the opposite "ONLINE" event, since this flow
ends with "kexec".

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-01-26 16:49:13 -08:00
Kebo Liu
687e1b9931
[mellanox]: Update SDK to 4.4.2308, FW to *.2008.2308 (#6553)
Bugs fixes:
    All | Kernel | During system reload when CPU is loaded with heavy traffic, a Kernel Panic may occur.
    All | Modules, Port split | FW stuck when device rebooted with locked Optical Transceivers in split mode
    Spectrum-3 | PFC | On Spectrum-3 systems, slow reaction time to Rx pause packets on 40GbE ports may lead to buffer overflow on servers.
    Spectrum-3 | SN4700, Port Split | On rare occasions on SN4700, when conducting a 100G split (4x25G) in NRZ, if splitter port 1 or 2 is down, ports 3 and 4 will also go down.

Enhancements:
    All | Kernel | new notification on ISSU start, so other kernel drivers can disable any interface to ASIC

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-01-25 20:10:15 -08:00
lguohan
a90eac73bf
[mellanox]: fix mellanox hw-management build (#6471)
use dpkg-buildpackage build with fakeroot

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-25 12:44:50 -08:00
Kebo Liu
dea38d1558
Update Mellanox SDK to 4.4.2208 FW to *.2008.2208 (#6342)
2021-01-04 14:10:37 +02:00
shlomibitton
6d38654034
[Mellanox] PSU led platform API fixes (#6214)
- Why I did it
Fix setting PSU led to 'green' or 'red' states.
Return False when an unsupported color is requested.
Remove 'off' option for PSU led API since it is not supported in Mellanox.

- How I did it
Fix import missing information.
Return 'False' when unsupported led color is requested, preventing an exception.

- How to verify it
Try to set PSU LED to different status with Mellanox platform device.
Try to set PSU LED color to unsupported color with Mellanox platform device.
2020-12-24 01:11:48 -08:00
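
The behaviour described in #6214 above (accept 'green' and 'red', reject anything else by returning False instead of raising) can be sketched as follows. The `PsuLed` class is a hypothetical stand-in for illustration and does not write to real hardware or reproduce the Mellanox platform API.

```python
class PsuLed:
    # 'off' is deliberately absent: the commit states it is not supported.
    SUPPORTED_COLORS = ("green", "red")

    def __init__(self):
        self._color = None

    def set_status(self, color):
        """Set the LED color; return False for an unsupported color."""
        if color not in self.SUPPORTED_COLORS:
            # Report failure to the caller rather than raising an exception.
            return False
        self._color = color
        return True

    def get_status(self):
        return self._color
```

A rejected request leaves the previous color in place, so a caller can retry with a supported value.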
Volodymyr Samotiy
78c44d1808
[Mellanox] Update SAI submodule (#6235)
To add VNET route diff tool (SAI/SDK part) to 201911 release

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2020-12-17 09:11:50 -08:00
Volodymyr Samotiy
39e1c27525
update SDK to 4.4.2112, FW to *.2008.2112, SAI to 1.18.0.1 (#6147)
Co-authored-by: keboliu <kebol@mellanox.com>
2020-12-08 07:54:50 +02:00
Junchao-Mellanox
8f45bfa1be
[Mellanox] Remove eeprom cache file when first time init eeprom object (#6071)
The EEPROM cache file is not refreshed after installing a new ONIE version, even if the EEPROM data is updated. The current Eeprom class always tries to read from the cache file when the file exists. This PR aims to fix that.
2020-12-04 13:26:23 -08:00
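
One way to picture the approach in #6071 above: the first Eeprom object created in a process removes any existing cache file, so stale data cannot be served, while later objects keep whatever cache was rebuilt since. The class below is a simplified, hypothetical model; the cache path is passed in by the caller and no real EEPROM is read.

```python
import os


class Eeprom:
    # Class-level flag shared by all instances in this process.
    _cache_cleared = False

    def __init__(self, cache_path):
        self.cache_path = cache_path
        if not Eeprom._cache_cleared:
            # First init: drop a possibly stale cache file, if present.
            try:
                os.remove(cache_path)
            except FileNotFoundError:
                pass
            Eeprom._cache_cleared = True

    def has_cache(self):
        return os.path.exists(self.cache_path)
```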
Abhishek Dosi
8c0df39c96
Revert "Advance SDK/SAI (#6004)"
This reverts commit 33a6e56833.
2020-11-26 11:55:52 -08:00
Junchao-Mellanox
37eb088b74
[Mellanox] [201911] Fix issue: set fan led in certain order causes incorrect physical fan led color (#6019)
* Fix issue: fan led color status

* Fix LGTM warning

* Support fan led management for non-swappable fan
2020-11-26 10:09:48 +02:00
Stephen Sun
33a6e56833
Advance SDK/SAI (#6004)
SDK 4.4.2018
FW XX_2008_2018
SAI 1.17.9

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2020-11-26 09:43:50 +02:00
Junchao-Mellanox
ebc84bee94
Fix issue: fan.get_presence always return false (#5983)
2020-11-23 09:28:12 +02:00
Junchao-Mellanox
500395c56e
[Mellanox] Support max/min speed for PSU fan (#5682) (#5801)
As the new hw-mgmt exposes the sysfs for PSU fan max speed, we need to support max/min speed for the PSU fan in the Mellanox platform API.
Conflicts:
	platform/mellanox/mlnx-platform-api/sonic_platform/fan.py
2020-11-17 18:17:37 +02:00
shlomibitton
1b10f86554
[Mellanox] Fix for QSFP-DD channel status (#5900)
A wrong object initialization broke the API. Replace the object with one of the correct type.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2020-11-14 12:26:28 -08:00
shlomibitton
4088872bb5
[Mellanox] Enhance QSFP-DD DOM information (#5776)
The new driver supports fetching additional pages from the cable EEPROM.
There is additional information to parse now: RX/TX power, TX bias, TX fault and RX LOS.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2020-11-14 12:25:57 -08:00
Nazarii Hnydyn
781abed79e
[Mellanox] Update SAI to v.1.17.7. (#5766)
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2020-11-02 10:51:49 +02:00
Junchao-Mellanox
712d97f911
[Mellanox] Update SDK 4.4.1956 and FW *.2008.1956 for 201911 (#5769)
Update SDK 4.4.1956 and FW *.2008.1956

Bugs fixes:

1. Link | Clear operational speed when link is not active
2. Spectrum-2, SN3800 | On rare occasion, link flapping due to bad BER causes traffic loss
3. Spectrum-3 | On rare occasion, link flapping due to bad BER causes traffic loss as a result of new PAM4 link maintenance flow on Spectrum-3 devices
4. Shared Buffers | On rare occasion, modifying shared buffers on a system with split port while traffic is running may cause the firmware to get stuck
5. Spectrum-3, SN4700 | Fence may fail while running 400GbE 8x port when modifying mirror session configurations under traffic
2020-11-01 23:20:27 -08:00
abdosi
0fad6bdc7f
[monit] Adding patch to enhance syslog error message generation for monit alert action when status is failed. (#5720)
Why/How I did:

Make sure first error syslog is triggered based on FAULT TOLERANCE condition.

Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent

Updated the monit conf files with repeat every x cycles for the alert action
2020-11-01 10:27:10 -08:00
Junchao-Mellanox
06b5ad02ac
[Mellanox] Re-initialize SFP object when detecting a new SFP insertion (#5695)
When detecting a new SFP insertion, read its SFP type and DOM capability from EEPROM again.

An SFP object is initialized to a certain type even if no SFP is present. A possible case:

1. An SFP object is initialized to QSFP type by default when there is no SFP present
2. The user inserts an SFP with an adapter into this QSFP port
3. The SFP object fails to read the EEPROM because it still treats itself as QSFP.

This PR fixes this issue.
2020-10-30 09:04:26 -07:00
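
The re-initialization described in #5695 above can be sketched as an object that re-reads its type whenever an insertion is detected. Everything here is a simplified assumption: the `Sfp` class, the `read_identifier` callable and the `reinit` method are hypothetical, and the mapping treats identifier byte 0x03 as SFP and any other value as QSFP.

```python
SFP_TYPE = "SFP"
QSFP_TYPE = "QSFP"


class Sfp:
    def __init__(self, read_identifier):
        # read_identifier: hypothetical callable returning the EEPROM
        # identifier byte, or None when no module is present.
        self._read_identifier = read_identifier
        self.sfp_type = QSFP_TYPE  # default when nothing is plugged in
        self.reinit()

    def reinit(self):
        """Re-detect the module type; call this on every insertion event."""
        identifier = self._read_identifier()
        if identifier is None:
            self.sfp_type = QSFP_TYPE
        elif identifier == 0x03:
            self.sfp_type = SFP_TYPE
        else:
            self.sfp_type = QSFP_TYPE
        return self.sfp_type
```

Without the `reinit()` call on insertion, the object from step 1 of the list above would keep its QSFP default after an SFP is plugged in through an adapter.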
Stepan Blyshchak
f7d753fd70
[Mellanox] Configure SAI to log to syslog instead of stdout. (#5634)
Example of syslog message from Mellanox SAI:

"Oct  7 15:39:11.482315 arc-switch1025 INFO syncd#supervisord: syncd Oct 07 15:39:11 NOTICE  SAI_BUFFER: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:1"

There is an INFO log from supervisord which actually prints NOTICE and the
date again. This confusion happens because if SAI is not built to log
to syslog, it will log everything to stdout with the format "[date] [level]
[message]", so supervisord sends it to syslog with level INFO.

New logs look like:

"Oct  7 15:40:21.488055 arc-switch1025 NOTICE syncd#SDK  [SAI_BUFFER]: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:17"

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2020-10-30 08:57:21 -07:00