Commit Graph

412 Commits

Author SHA1 Message Date
Junchao-Mellanox
0c859fb036
[Mellanox] [202012] Fix issue: 4600C is using wrong thermal profile (#10258)
- Why I did it
The 4600C uses the wrong thermal profile, so the show platform temperature output displays 2 CPU core thermals when there should be 4.

- How I did it
Change 4600C to use thermal profile 10.

- How to verify it
Manual test
2022-03-20 10:31:59 +02:00
Dror Prital
6293a091a8 [Mellanox] Upgrade ASIC FW tool to 4.18.1-16 (#9981)
- Why I did it
Update MFT to version 4.18.1-16 for bug fixes and new SN2201 support

- How I did it
Advance the MFT tool version to 4.18.1-16

- How to verify it
Manually tested on all Mellanox platforms (ASIC FW Upgrade, link debug tools, CPLD upgrade, etc.)
2022-02-15 23:56:58 +00:00
Volodymyr Samotiy
e6b22b1942
[Mellanox][202012] Update SAI to 1.20.2.6 and SDK/FW to 4.5.1208/2010.1218 (#9818)
- Why I did it
To include latest fixes.
1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting SN4600C, 100GbE port with CWDM4 module (Gen 3.0), link up time is 30 seconds.
3. Add T1 ECMP Overlay support

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-01-26 10:58:19 +02:00
Junchao-Mellanox
8e924b9a70
[Mellanox] Optimize thermal policies (#9665)
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.

- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed are coordinated
Minimum allowed fan speed is now calculated as the max of the expected fan speeds among all policies
Move some logic from fan.py to thermal.py to make it more readable
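
A minimal sketch of the "max of the expected fan speed among all policies" rule, assuming each policy reports its expected speed as a percentage. The function name and the policy names are hypothetical and are not taken from the actual thermal.py.

```python
def minimum_allowed_fan_speed(expected_speeds):
    """Return the highest fan speed (percent) expected by any policy.

    expected_speeds maps a hypothetical policy name to its expected speed.
    """
    if not expected_speeds:
        raise ValueError("at least one policy must report an expected speed")
    # The most demanding policy decides the minimum allowed speed.
    return max(expected_speeds.values())


# Illustrative values only.
speeds = {"asic_temperature": 60, "psu_absent": 100, "fan_fault": 80}
print(minimum_allowed_fan_speed(speeds))
```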

- How to verify it
1. Manual test
2. Regression
2022-01-19 11:42:55 +02:00
Stepan Blyshchak
31065ccb93
[Mellanox] [202012] fail the build when hw-mgmt patches do not apply (#9566)
Taken from https://github.com/Azure/sonic-buildimage/pull/9539

####  Why I did it
To fix an issue that hw-mgmt patches were not applied. One patch was already in upstream hw-mgmt package thus applying it again caused an error and no other patches were applied. Also, I did it to improve the Makefile, so that the make will fail in case patches fail to apply.

####  How I did it
Removed obsolete patch, made applying patches a hard failure in the build.

####  How to verify it
Run the make and verify patches are applied.
2022-01-13 15:08:27 -08:00
DavidZagury
57abd5914e [Mellanox] Upgrade Mellanox firmware tools to 4.17.2-12 (#8978)
- Why I did it
Bug fix:
bad_param request due to missing parser rest command while running mlxlink

- How I did it
Advance the MFT tool version to 4.17.2-12.

- How to verify it
Manually tested on all Mellanox platforms.
2022-01-12 22:36:11 +00:00
Kebo Liu
16a3929159
[202012][Mellanox] Update hw-mgmt package to V.7.0010.2347 (#9594)
- Why I did it
Update hw-mgmt to a new version to pick up support for the SN4600C A1 system.

- How I did it
Update the pointer of the hw-mgmt submodule
Update the hw-mgmt version number
Remove the stale code patch to hw-mgmt userspace code.

- How to verify it
Run platform regression on Mellanox platforms.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-12-28 09:40:58 +02:00
Stepan Blyshchak
bdf31a6556 [Mellanox][SDK] Build SDK with PRM sniffer support (#9500)
- Why I did it
To have an ability to use PRM sniffer.

- How I did it
Enabled the option in configure flags.

- How to verify it
Built and ran on switch. Enabled the feature in runtime and checked the sniffer recording.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-12-20 19:25:52 +00:00
Junchao-Mellanox
0197855d5d
[Mellanox] [202012] Allow user to set LED to orange (#9514)
Backport https://github.com/Azure/sonic-buildimage/pull/9259 to 202012

#### Why I did it

Nvidia platform API does not support setting the LED to orange.

#### How I did it

Allow user to set LED to orange

#### How to verify it

Manual test
2021-12-13 16:04:06 -08:00
Stephen Sun
acac848858
[Reclaim buffer][202012] Reclaim unused buffers by applying zero buffer profiles (#9063)
- Why I did it
Support zero buffer profiles

1. Add buffer profiles and pool definition for zero buffer profiles
2. Support applying zero profiles on INACTIVE PORTS
3. Enable dynamic buffer manager to load zero pools and profiles from a JSON file

- How I did it
Add buffer profiles and pool definition for zero buffer profiles

If the buffer model is static:
 * Apply normal buffer profiles to admin-up ports
 * Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
 * Apply normal buffer profiles to all ports
 * buffer manager will take care when a port is shut down
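
The static/dynamic rule above can be sketched as a small selection function. This is only an illustration: the function name and the "normal"/"zero" labels are assumptions, and the real logic lives in the buffer templates and the buffer manager.

```python
def select_buffer_profile(buffer_model, admin_status):
    """Pick the profile kind applied to a port when the config is generated."""
    if buffer_model == "dynamic":
        # All ports get normal profiles; the buffer manager reacts later
        # when a port is shut down.
        return "normal"
    if buffer_model == "static":
        # Admin-down ports get zero profiles so their buffers are reclaimed.
        return "normal" if admin_status == "up" else "zero"
    raise ValueError("unknown buffer model: %r" % (buffer_model,))
```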

Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists

Originally, all the macros to generate the above buffer objects took active ports only as an argument.
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports needs to be added.
To be backward compatible, a new series of macros is introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called.
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:

The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/templates folder at compile time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:
One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbolic links to the above files

Update sonic-cfggen test accordingly:
 * Adjust example output file of JSON template for unit test
 * Add unit tests for Mellanox's new buffer templates.

- How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.

Signed-off-by: stephens <stephens@nvidia.com>
2021-12-09 17:34:56 +02:00
Volodymyr Samotiy
0831635b1c
[Mellanox] Update SDK to v4.4.3360 and FW to v2008.3358 (#9403)
- Why I did it
To include latest fixes.

1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
3. On rare occasions, when working with port rates of 1GbE or 10GbE and congestion occurs, packets may get stuck in the chip and may cause switch to hang.
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. Using SN4600C with copper or optics loopback cables in NRZ speeds, the link may take a long time to come up (up to 70 seconds).
6. When connecting SN4600C to SN4600C after Fastboot in 50GbE No_FEC mode with a copper cable, the link up time may take ~20 seconds.

- How I did it
Updated SDK submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-12-06 11:01:43 +02:00
Junchao-Mellanox
227f2f8aec [Mellanox] Fan speed should not be 100% when PSU is powered off (#9258)
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.

- How I did it
When PSU is powered off, don't treat it as absent.
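
A hedged sketch of the distinction, assuming the fan policy only needs a present flag and a powered-on flag per PSU. The function name is hypothetical.

```python
def psu_requires_full_fan_speed(present, powered_on):
    """Return True only when the PSU is physically absent.

    powered_on is accepted but deliberately ignored: a powered-off PSU is
    still in its slot, so the air flow through the switch is unchanged.
    """
    return not present
```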

- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
2021-12-01 02:28:37 +00:00
Junchao-Mellanox
d69564a1e7 [Mellanox] Change thermal recover threshold from temp_trip_norm to temp_trip_high (#8792)
- Why I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high, so that thermal algorithm would set fan speed to minimum allowed earlier and save power.

- How I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high

- How to verify it
Manual test
2021-10-05 22:17:30 +00:00
Nazarii Hnydyn
70b9ea5409 [Mellanox] Advance hw-mgmt to V.7.0010.2346. (#8667)
Commits on Sep 01, 2021
hw-mgmt: attributes: Add PSU power sensor attributes d8fce39

Commits on Sep 02, 2021
Remove MFT package flint tool from hw-management dump generation. 53d06b2
hw-mgmt: debug: Add timeout to generate-dump.sh b661fa3 

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-09-09 12:03:44 +00:00
shlomibitton
c0f9bb9720
[202012] [Mellanox] Update SDK\FW to version 4.4.3326\2008.3326 (#8602)
- Why I did it
Update SDK\FW version to 4.4.3326\2008.3326. This version contains:

New Features:
1. Add support for Fast Boot for SN3800

Bug Fixing:
1. In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 &L3)

2. On Spectrum systems, when using Async Router API with IPV6, an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.

3. On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.

4. Some Molex cables do not support speed after reboot

- How I did it
Update submodule and .mk files

- How to verify it
Verified by running regression tests that include the complete set of supported sonic-mgmt tests

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-09-03 10:59:18 +03:00
Junchao-Mellanox
49f4ef6438 [Mellanox] Read PSU fan max/min speed per PSU (#8563)
#### Why I did it
A new PSU may have a different type of fan installed, so fan max/min speed should be read per PSU

#### How I did it
The existing implementation reads PSU max/min fan speed from a common file; change it to read from a per-PSU file
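
One way the per-PSU read could look, as a sketch only. The file names psu{N}_fan_min and psu{N}_fan_max and the base_dir argument are assumptions made for illustration; the commit message does not name the actual files.

```python
import os


def read_psu_fan_speed_limits(psu_index, base_dir):
    """Return (min_speed, max_speed) read from files that belong to one PSU."""

    def read_int(file_name):
        with open(os.path.join(base_dir, file_name)) as f:
            return int(f.read().strip())

    # Hypothetical per-PSU file names; each PSU has its own pair of files.
    min_speed = read_int("psu%d_fan_min" % psu_index)
    max_speed = read_int("psu%d_fan_max" % psu_index)
    return min_speed, max_speed
```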

#### How to verify it
Manual test
2021-08-27 02:27:00 +00:00
Alexander Allen
196fcffb6f [Mellanox] Upgrade Mellanox firmware tools to 4.17.0 (#8299)
- Why I did it
New release of MFT has the following changelog / RN
 Fixed an issue that resulted in getting MVPD read errors from the mlxfwmanager during fast reboot.
 Fixed mlxuptime sometimes generating a time less than the previous one due to the wrong frequency calculation

- How I did it
Update makefile pointer to new version.

- How to verify it
Manually tested on all Mellanox platforms.
2021-08-23 03:05:20 +00:00
Junchao-Mellanox
8285cf2329
[Mellanox] [202012] Upgrade hw-mgmt to 7.0100.2344 (#8408)
To support new PSU fan on Mellanox platforms
2021-08-11 02:04:55 -07:00
DavidZagury
0551fed754 [Mellanox][Pcie] Fix issue on pcied with an id that contains only decimal digits was treated as a decimal number (#8309)
A device that contains only decimal digits was mistreated as a decimal integer resulting in failure to find it in the id to bus map.
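
The failure mode can be illustrated with a short sketch, assuming the id arrives as an int when it consists only of decimal digits and as a string otherwise. The map contents and the function name are hypothetical, and the actual fix may differ.

```python
def normalize_device_id(raw_id):
    """Return the device id as a lowercase string, whatever type it arrived as."""
    return str(raw_id).lower()


# Hypothetical id-to-bus map keyed by id strings.
id_to_bus = {"1003": "06", "cf6c": "01"}

# An all-digit id that was parsed as the integer 1003 is not a key of the
# map, but its normalized string form is.
print(1003 in id_to_bus)
print(normalize_device_id(1003) in id_to_bus)
```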
2021-08-05 15:22:48 +00:00
DavidZagury
45e100b61b [Mellanox][pcied] Ignore bus on pcie.yaml for Mellanox switches (#8063)
Why I did it
In rare cases a BIOS upgrade changes the bus value, so it cannot be guaranteed to stay the same across BIOS releases. This field is ignored so that pcied does not fail, while the device id is still verified in a different way. The solution is future-proof and will not require code changes when a new BIOS version is available.

How I did it
Since bus is not a fixed value (it is determined by the BIOS version), we ignore this field and instead check whether there is a device that matches on all other fields and, in addition, has a matching device id.
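
A sketch of the bus-agnostic match, assuming each entry is a dict with bus, dev, fn and id keys. The key names and both function names are assumptions for illustration, not the actual pcied code.

```python
def device_matches(expected, found):
    """Compare two device entries on every field except bus."""
    compared_keys = ("dev", "fn", "id")  # bus is intentionally left out
    return all(str(expected[key]) == str(found[key]) for key in compared_keys)


def is_device_present(expected, found_devices):
    """Return True if any discovered device matches the expected entry."""
    return any(device_matches(expected, found) for found in found_devices)
```

With this check, an entry whose bus moved from one value to another after a BIOS upgrade is still reported as present as long as dev, fn and id are unchanged.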

How to verify it
Verify no errors or failures in pcied on different BIOS version with the same code base.
2021-07-27 10:46:31 +00:00
Dror Prital
be6cd44ddf Update SDK\FW to version 4.4.3222\2008.3224 (#8247)
*Update SDK\FW Version to 4.4.3222\2008.3224.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-26 11:05:29 -07:00
tomer-israel
13a62666d9 [WARM-REBOOT] fix issue of watchdog on simx when executing warm-reboot command (#8132)
- Why I did it
To prevent a Python exception when executing the warm-reboot command on the Mellanox simulator platform

- How I did it
Return None from the watchdog Python script when the watchdog file does not exist
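
A minimal sketch of the guard, assuming the watchdog is looked up by a device file path. The function name and the returned dict are placeholders; the real script would construct a watchdog object instead.

```python
import os


def get_watchdog(device_path):
    """Return a watchdog description, or None when the device file is missing."""
    if not os.path.exists(device_path):
        # Platforms without a watchdog device (for example a simulator)
        # get None instead of an exception.
        return None
    return {"device": device_path}  # placeholder for a real watchdog object
```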

- How to verify it
warm-reboot runs without the Python error. An error message will still appear in the log in these cases.
To avoid this error message, the watchdog can be simulated on the Mellanox simulator platform.
2021-07-20 10:18:17 +00:00
Vivek Reddy
1b6634765c
SAI fix (#8142)
[0e4f0b] Fix saisdkdump

#### Why I did it

Fix the saisdkdump failure when the vxlan src port flag is enabled in the sai.profile
2021-07-11 02:35:17 -07:00
Dror Prital
526dd3c4fb [Mellanox] Update FW version to 2008.3218 (#8079)
Update FW version to 2008.3218, fixing the following issues:
- 50G/100G links that are operationally down before warm-reboot are not coming up after warm-reboot
- 50G/100G links with admin shut / no shut commands are not coming up after warm-reboot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-07 09:41:35 +00:00
Dror Prital
fb89c28c95
[202012] [Mellanox] Update SDK\FW ver. 4.4.3216\2008.3216 (#8056)
- Changes and new features:

1. Added support in SN4600C systems for new module Finisar ET7402-CWDM4 (100G CWDM4 QSFP28 1310nm SM 2KM).
2. Added support for new module MMS1W50-HM (2km transceiver FR4) for 200GbE
3. Improved performance of "per-port-buffer" counters
4. Added support for Kernel 5.10

- Bug fix:
On rare occasions (0.5%), in SN4600C systems, when using 100GbE NRZ mode and Fastboot flow, the link up time may take up to 10 seconds

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-06 07:31:34 +03:00
shlomibitton
b9d21a5779
Update SAI submodule (#7926)
- Why I did it
Split and bulk counter bug fixes:
Init port auto neg to default on static (SAI XML) port split for 2nd+ port

- How I did it
Update submodule hash pointer.

- How to verify it
Verify the above is handled properly and reported issues are assumed to be fixed.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-06-23 20:44:33 +03:00
Junchao-Mellanox
ccb663c39b
[Mellanox] [202012] Backport 'Read EEPROM data from DB if possible'(7808) to 202012 (#7928)
- Why I did it
Remove EEPROM cache file and use DB instead

- How I did it
Read EEPROM data from DB if possible
If data is not ready in DB, read from hardware using a visitor pattern
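
The DB-first lookup with a hardware fallback can be sketched as follows. Both callables are hypothetical stand-ins, and the visitor pattern mentioned above is not reproduced here.

```python
def read_eeprom(read_from_db, read_from_hardware):
    """Prefer EEPROM data from the DB and fall back to the hardware."""
    data = read_from_db()
    if data:
        return data
    # The DB has no EEPROM data yet, so read the hardware directly.
    return read_from_hardware()
```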

- How to verify it
Manual test and regression
2021-06-23 18:09:53 +03:00
Stephen Sun
346b916c0e
[Mellanox] Enhance Python3 support for platform API (#7410) (#7910)
- Why I did it
This is to back-port Azure 7410 to 202012 branch.
Enhance the Python3 support for platform API. Originally, some platform APIs called SDK APIs which didn't support Python 3. Now that the Python 3 APIs are supported in SDK 4.4.3XXX, Python3 is completely supported by platform API

- How I did it
Start all platform daemons from python3
1. Remove #/usr/bin/env python at the beginning of each platform API file as the platform API won't be started as daemons but be imported from other daemons.
2. Adjust SDK API calls accordingly

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-18 09:46:41 -07:00
DavidZagury
49388fd595 [Mellanox] Install MFT packages on Syncd container (#7844)
To have access to MFT tools in the Syncd container on Mellanox switches due to SAI dump API implementation enhancements
2021-06-17 07:09:50 +00:00
Stephen Sun
a2e729122d [Mellanox] Adjust Makefile for SDK/python-sdk-api to support both python2 and python3 (#7848)
- Why I did it
Adjust the Makefile for SDK/python-SDK-API to support both python2 and python3

- How to verify it
Build the image and check whether python2 and python3 are both supported by SDK API.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-16 12:41:07 +00:00
yozhao101
fb2c995f53
[202012][Monit] Deprecate the feature of monitoring the critical processes by Monit (#7823)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.

How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.

How to verify it
I verified this on the device str-7260cx3-acs-1.
2021-06-09 09:04:22 -07:00
Stepan Blyshchak
ce3bdaf697 [nvidia/mellanox] add MLNX_SDK_DEB_VERSION to SDK packages flags list. (#7747)
This is due to the fact that we use SONIC_OVERRIDE_BUILD_VARS internally
in our build jobs and this is not accounted in caching framework.
So we add MLNX_SDK_DEB_VERSION to force rebuild if we changed it via
SONIC_OVERRIDE_BUILD_VARS.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-06-09 08:28:13 +00:00
Dror Prital
9b5e0694e3
[Mellanox][202012] Update FW version to 2008_3110 (#7807)
- Why I did it
Update FW version to 2008_3110 fixing SN3800 specific warm boot scenario:

1. Disable interface
2. Warm Boot
3. Enable Interface --> link will remain down.

- How I did it
Use new FW that contains the fix for the problem mentioned above

- How to verify it
Run the scenario mentioned above and make sure that the link is up after warm boot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-06-08 14:06:14 +03:00
Kebo Liu
33cb83cbd1 [Mellanox] Align PSU name convention returned from psu.get_name platform API (#7783)
Make PSU name returned from platform API aligned with the convention "PSU {X}" instead of "PSU{X}".
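
For illustration, the aligned convention amounts to a space between "PSU" and the number. The helper name is hypothetical, and how the index is numbered is not stated in the commit.

```python
def psu_name(psu_number):
    """Format a PSU name following the "PSU {X}" convention."""
    return "PSU {}".format(psu_number)
```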
2021-06-07 06:02:32 +00:00
Volodymyr Samotiy
754e4fea17
[Mellanox] Update SDK to 4.4.3106 and FW to xx.2008.3106 (#7785)
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-06-03 19:18:07 -07:00
Junchao-Mellanox
74216f8710 [Mellanox] clear fan from chassis._fan_list (#7682)
#### Why I did it

According to thermalctld hld, each fan must belong to a fan drawer, if the fan drawer does not physically exist, put fan into a virtual fan drawer. This PR is to clear fan from chassis._fan_list

#### How I did it

1. Don't put fan to chassis._fan_list
2. Always query fan from fan_drawer
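
A simplified sketch of the resulting object layout, assuming plain Python classes in place of the real platform API base classes. The get_all_fans_from_drawers method and the is_virtual flag are illustrative additions, not part of the actual API.

```python
class Fan:
    def __init__(self, name):
        self.name = name


class FanDrawer:
    def __init__(self, name, fans, is_virtual=False):
        self.name = name
        self.fans = list(fans)
        # A virtual drawer stands in when no physical drawer exists.
        self.is_virtual = is_virtual


class Chassis:
    def __init__(self, fan_drawers):
        self._fan_drawer_list = list(fan_drawers)
        self._fan_list = []  # intentionally left empty

    def get_all_fans_from_drawers(self):
        """Collect fans by walking the drawers instead of _fan_list."""
        return [fan for drawer in self._fan_drawer_list for fan in drawer.fans]
```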
2021-05-31 04:32:40 +00:00
Kebo Liu
babaaaad6b [Mellanox] Add support for MSN4600 A1 system (#7732)
Add new sensor conf for MSN4600 A1 system
Add a Mellanox hw-management patch to support MSN4600 A1 system
2021-05-27 22:30:39 +00:00
Kebo Liu
7ca8b458ee update hw-mgmt version to 2304 (#7725)
- Why I did it
Pick up fix from new hw-management package:
Fix gearbox thermal zone name, which lacked the thermal zone number suffix

- How I did it
Update the hw-management version number in the make file
Update hw-management submodule pointer

- How to verify it
Run platform related test cases on Mellanox platform
2021-05-27 22:30:33 +00:00
Junchao-Mellanox
7a69e0ffb2 [mellaonox]: No need enable thermal zones in thermal_manager.deinitialize since they are enabled by default (#7556)
No need to enable thermal zones in thermal_manager.deinitialize since they are enabled by default. Removing this also makes thermalctld exit faster
2021-05-24 22:07:14 +00:00
shlomibitton
ad05c98d34 [Mellanox] Update FW to xx.2008.2526 (#7511)
- Why I did it
Updated FW to xx.2008.2526 version.

Fixed issues:
1. Spectrum-2, Spectrum-3 | sFlow | High CPU load and high on fully loaded switch.
2. Spectrum-2, Spectrum-3 | Fine grain LAG | in rare cases doesn’t update the right entry

- How I did it
Updated submodule pointer and version in a Makefile.

- How to verify it
Full regression and bugs validation

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-05-05 09:36:14 -07:00
Junchao-Mellanox
b9680e9e25 [Mellanox] Adjust PSU fan name to align with sysfs file name (#7490)
Change PSU fan name from psu_{psu_index}fan{fan_index} to psu{psu_index}_fan{fan_index}
2021-05-05 09:35:31 -07:00
Stephen Sun
a554ddc91d [Mellanox] Adopt single way to get fan direction for all ASIC types (#7386)
#### Why I did it
Adopt a single way to get fan direction for all ASIC types.
It depends on hw-mgmt V.7.0010.2000.2303. Depends on https://github.com/Azure/sonic-buildimage/pull/7419

#### How I did it
Originally, the get_direction was implemented by fetching and parsing `/var/run/hw-management/system/fan_dir` on the Spectrum-2 and the Spectrum-3 systems. It isn't supported on the Spectrum system.
Now, it is implemented by fetching `/var/run/hw-management/thermal/fanX_dir` for all the platforms.
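
A hedged sketch of reading the per-fan file named above. It returns the raw file content without interpreting it, because the commit message does not say how the value maps to intake or exhaust; the function name and the thermal_dir parameter are assumptions.

```python
import os


def read_fan_direction_raw(fan_index, thermal_dir="/var/run/hw-management/thermal"):
    """Return the stripped content of fan{fan_index}_dir as a string."""
    path = os.path.join(thermal_dir, "fan%d_dir" % fan_index)
    with open(path) as f:
        return f.read().strip()
```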

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-05-05 09:34:25 -07:00
Volodymyr Samotiy
4152c3e337
[Mellanox] [202012] Update SAI submodule pointer (#7499)
- Why I did it
To include below changes:
Set monitoring VLAN hostif up by default (for VNET ping tool)

- How I did it
Updated SAI submodule pointer

- How to verify it
Create VLAN hostif according to changes in PR: Azure/sonic-swss#1645
Verify it is admin up by default

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-05-04 17:29:47 +03:00
Stephen Sun
48908b1c5a Fix issue: exception occurred during chassis object being destroyed (#7446)
The following error message is observed during chassis object being destroyed

"Exception ignored in: <function Chassis.__del__ at 0x7fd22165cd08>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 83, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
The chassis tries to import deinitialize_sdk_handle during being destroyed for the purpose of releasing the sdk_handle.
However, importing another module during shutting down can cause the error because some of the fundamental infrastructures are no longer available."

This error occurs when a chassis object is created and then destroyed in the Python shell.

- How I did it
To fix it, record the deinitialize_sdk_handle in the chassis object when sdk_handle is being initialized and call the deinitialize handler when the chassis object is being destroyed
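
The idea can be sketched as below: keep a reference to the deinitialize function at initialization time so that __del__ never has to import anything. The class is a simplified stand-in for the real chassis, and the initialize/deinitialize callables are hypothetical.

```python
class Chassis:
    def __init__(self):
        self._sdk_handle = None
        self._deinitialize_sdk_handle = None

    def initialize_sdk_handle(self, initialize, deinitialize):
        """Create the handle and remember how to release it later."""
        self._sdk_handle = initialize()
        self._deinitialize_sdk_handle = deinitialize

    def __del__(self):
        # No import happens here; the handler was recorded earlier, so this
        # is safe even while the interpreter is shutting down.
        deinitialize = getattr(self, "_deinitialize_sdk_handle", None)
        handle = getattr(self, "_sdk_handle", None)
        if deinitialize is not None and handle is not None:
            deinitialize(handle)
            self._sdk_handle = None
```

Clearing the handle after release keeps a second call to __del__ from releasing it twice.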

- How to verify it
Manually test.
2021-04-29 10:10:25 -07:00
Junchao-Mellanox
d4e8c3f666 [Mellanox] Upgrade hw-mgmt to 7.0100.2303 (#7419)
- Why I did it
Upgrade hw-mgmt to 7.0100.2303

Bug fixes

1. Fan direction feature fix for fixed FAN system (using shell instead of binutils/strings)
2. Remove cpld 4th link on systems with only 3 CPLDs
3. hw-mgmt: thermal: Add hardcoded critical trip point. Follow-up after patch "Removing critical thermal zones to prevent unexpected software system shutdown".
4. Fix sensor attribute mapping to be label based instead of index based to allow common handling of voltage regulator names independently of hardware changes.
5. Update 'lm-sensors' custom configuration file. Relevant only for users utilizing sensors.conf files coming along with hw-management package.
6. For full feature list please follow https://github.com/Mellanox/hw-mgmt/blob/V.7.0010.2300_BR/debian/Release.txt

- How I did it
Update hw-mgmt pointer
Remove unused patches
Fix existing patch to make sure it applies successfully

- How to verify it
Full platform regression on all mellanox platforms
2021-04-29 10:09:58 -07:00
Dror Prital
16dc30b944
[Mellanox] [202012] Update SAI version 1.18.3.0 (#7427)
- Why I did it
Changes in the new release:
1. Fix 10G and 50G speeds in SAI XML to support all interface types
2. Enable SMAC=DMAC and SMAC MC in tunnel debug counter
3. Add tunnel statistics
4. Add isolation group API implementation
5. Fix ACL ANY debug counter to correctly track ACL drops
6. Add VXLAN source port hard coded range, controlled by K/V
7. FW dump me now feature
8. Add mlxtrace to saidump
9. Speed lane setting and AN control
10. Implement query stats API
11. VNI miss part of tunnel decap drop reason

- How I did it
Update the version number in SAI make file, update the mlnx-sai submodule pointer.

- How to verify it
Run full regression tests on Mellanox platforms

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-04-26 20:44:36 +03:00
Kebo Liu
2208c9212a [Mellanox] Update SDK to 4.4.2522 and FW to 2008.2520 (#7391)
New features and fixes in the new SDK/FW:

SN4600C | AN/LT support
SN2700 | AN/LT bug fixes
WJH | FID_MISS support

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-04-21 14:05:56 -07:00
Stephen Sun
1312feef1e Bug fix: Support dynamic buffer calculation on ACS-MSN3420 and ACS-MSN4410 (#7113)
- Why I did it
Add missed files for dynamic buffer calculation for ACS-MSN3420 and ACS-MSN4410

- How I did it
asic_table.j2: Add mapping from platform to ASIC
Add buffer_dynamic.json.j2 for ACS-MSN4410.

- How to verify it
Check whether the dynamic buffer calculation daemon starts successfully.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-04-08 18:36:27 +00:00
Junchao-Mellanox
d79a456a62
[Mellanox] Initialize PSU API on both host and docker side (#7075)
- Why I did it
There was a change to replace platform utils with sonic platform API in psuutil. However, psu API is not initialized on host side. The PR is to fix it.
Backport of #7016 to the 202012 branch.

- How I did it
Initialize PSU API on both host and non-host side

- How to verify it
Manual test
2021-04-07 20:36:02 +03:00
Joe LeVeque
dd9be59cd1
[202012][dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7203)
#### Why I did it

Backport of https://github.com/Azure/sonic-buildimage/pull/7083 to the 202012 branch.

To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-04-01 12:52:19 -07:00