Commit Graph

480 Commits

Author SHA1 Message Date
Stephen Sun
ba853348d5
[Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles (#8768)
Signed-off-by: Stephen Sun stephens@nvidia.com

Why I did it
Support zero buffer profiles

Add buffer profiles and pool definition for zero buffer profiles
Support applying zero profiles on INACTIVE PORTS
Enable dynamic buffer manager to load zero pools and profiles from a JSON file
Dependency: It depends on Azure/sonic-swss#1910 and on the submodule-advancing PR once the former is merged.

How I did it
Add buffer profiles and pool definition for zero buffer profiles

If the buffer model is static:
Apply normal buffer profiles to admin-up ports
Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
Apply normal buffer profiles to all ports
buffer manager will take care when a port is shut down
Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists

Originally, all the macros to generate the above buffer objects took active ports only as an argument
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports needs to be added
To be backward compatible, a new series of macros is introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:

The JSON file is provided on a per-platform basis
It is copied from the platform/<vendor> folder to the /usr/share/sonic/templates folder at compile time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:

One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbolic links to the above files

Update sonic-cfggen test accordingly:

Adjust example output file of JSON template for unit test
Add unit test for Mellanox's new buffer templates.
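
The profile selection rule described above (static model: zero profiles on admin-down ports; dynamic model: normal profiles on all ports) can be sketched roughly as follows. This is only an illustration: the function name `select_profile_kind` and the string labels are hypothetical and are not taken from the templates or the buffer manager.

```python
def select_profile_kind(buffer_model, admin_status):
    """Return which kind of buffer profile a port gets (illustrative only)."""
    if buffer_model == "dynamic":
        # Dynamic model: every port gets normal profiles up front; the
        # buffer manager takes care of a port that is shut down later.
        return "normal"
    if buffer_model == "static":
        # Static model: zero profiles reclaim buffers on admin-down ports.
        return "normal" if admin_status == "up" else "zero"
    raise ValueError("unknown buffer model: %r" % (buffer_model,))


for model in ("static", "dynamic"):
    for status in ("up", "down"):
        print(model, status, select_profile_kind(model, status))
```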

How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.
2021-11-29 08:04:01 -08:00
Junchao-Mellanox
e79c7155b7
[Mellanox] Fan speed should not be 100% when PSU is powered off (#9258)
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.

- How I did it
When PSU is powered off, don't treat it as absent.

- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
2021-11-24 14:56:00 +02:00
Stephen Sun
f9d0921d09
Support PSU voltage high/low thresholds and power max threshold (#8983)
- Why I did it
Support PSU voltage high/low thresholds and power max threshold
1. Add thresholds support for voltage and power.
2. As thresholds are not supported on all platforms, we need to check the capability first and fetch thresholds only if it is supported.

- How I did it

- How to verify it
Run regression test and manual test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-11-22 12:41:16 +02:00
Alexander Allen
53891c5602
Fix cache related mellanox bullseye build failures (#9234)
#### Why I did it

Mellanox builds were failing intermittently due to the `issue_version` file and MFT package not building correctly in the Azure pipeline environment (both of these packages were patched to build correctly with bullseye running on the host and buster running on the dockers)

#### How I did it

Fixed two problems:

1. BLDENV is not passed to the Makefiles so the references to this were replaced with correct logic
2. `issue_version` was not defined as a target for bullseye and as such was not cached. Altered the build such that it is defined as a target for bullseye (in the case of buster it builds the file, in the case of bullseye it copies from buster) 

The previous PR fixing this was reverted as it is no longer necessary for a passing build and was not a long-term fix. https://github.com/Azure/sonic-buildimage/pull/9235

#### How to verify it

Build on AZP and verify success.
2021-11-16 14:49:47 -08:00
Saikrishna Arcot
a980e836c1
[mellanox]: Disable caching for the issu-version file (#9235)
The issu-version file for Mellanox is generated from the Mellanox SDK
libraries. The SDK is installed into a Buster docker container, but the
issu-version file goes onto the base OS, which is Bullseye. To work
around this, the issu-version build rules explicitly copy the
issu-version file to target/files/bullseye/ during the Buster build.

Because of our build infra, if caching is enabled and a cache is being
used, then for issu-version, since it is technically built as part of
Buster, then only target/files/buster/issu-version is saved into the
cache, and target/files/bullseye/issu-version isn't cached. If this
cache gets used, then target/files/bullseye/issu-version is missing, and
the final image build fails.

This is to work around the current build issue where Mellanox builds are
failing. This is so that issu-version is always "built", so that copy is
made into the bullseye directory.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-11 19:39:25 -08:00
Saikrishna Arcot
d4bc0095a5 Fix Mellanox hw-mgmt package version
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Alexander Allen
cc5a2f3d54 Update pointer (#12)
Updated the hw-mgmt pointer to include some bugfixes related to power supply voltages.
2021-11-10 15:27:22 -08:00
Alexander Allen
857937d592 Mellanox bullseye merge (#1)
* Make necessary changes to Mellanox platform code to build on Debian 11

* Revert use of backported kernel to build mft and elect to only build kernel module under bullseye
2021-11-10 15:27:22 -08:00
Alexander Allen
2847265bfd Mellanox bullseye merge (#1)
Allow mellanox platform to build and successfully switch packets in
Debian 11

Upgraded

* Mellanox SDK
* Mellanox Hardware Management
* Mellanox Firmware
* Mellanox Kernel Patches

Adjusted build system to support host system running bullseye and
dockers running buster.
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
1d00613305 Add support for building Mellanox image
ISSU will likely be broken. As of right now, the issu-version file is
not being generated during build.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Stepan Blyshchak
2ef97bb5df
[dockers] change RPC, DBG dockers version: put RPC, DBG sign in build metadata part of the version (#8920)
- Why I did it
In case an app.ext requires a dependency syncd^1.0.0, the RPC version of syncd will not satisfy this constraint, since 1.0.0-rpc < 1.0.0. It is not correct to put 'rpc' as a prerelease identifier. Instead, put 'rpc' as build metadata in the version: 1.0.0+rpc, which satisfies the constraint ^1.0.0.
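
The precedence rule behind this can be illustrated with a small sketch. It hand-rolls only the part of Semantic Versioning needed here (a pre-release sorts below the plain version, and build metadata is ignored for precedence) and approximates `^1.0.0` as "same major version and not lower than 1.0.0". The helper names are hypothetical, pre-release tags are compared as plain strings for simplicity, and this is not the constraint checker used by the package manager.

```python
def parse(version):
    """Split 'MAJOR.MINOR.PATCH[-prerelease][+build]' into comparable parts."""
    core, _, _build = version.partition("+")   # build metadata is ignored
    core, _, prerelease = core.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch), prerelease


def precedence_key(version):
    """Sort key in which a pre-release sorts below the same core version."""
    core, prerelease = parse(version)
    rank = (0, prerelease) if prerelease else (1, "")
    return core, rank


def satisfies_caret(version, base):
    """Approximate '^base' for a base with major >= 1 (assumption)."""
    base_major = parse(base)[0][0]
    same_major = parse(version)[0][0] == base_major
    return same_major and precedence_key(version) >= precedence_key(base)


print(satisfies_caret("1.0.0-rpc", "1.0.0"))  # False: pre-release sorts below 1.0.0
print(satisfies_caret("1.0.0+rpc", "1.0.0"))  # True: build metadata is ignored
```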

- How I did it
Changed the way the versions of RPC and DBG images are constructed.

- How to verify it
Install app.ext with syncd^1.0.0 dependency on a switch with RPC syncd docker.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-11-01 19:02:57 +02:00
Junchao-Mellanox
e8b4c2a1f4
[Mellanox] Refactor Mellanox platform API to support dynamic port configuration (#8422)
- Why I did it
* To support systems with dynamic port configuration
* Apply lazy initialization to speed up loading of the platform API

- How I did it
* Add module.py to implement dynamic port configuration (aka line card model)
* Adjust chassis.py, platform.py, thermal.py, sfp.py to support dynamic port configuration
* Optimize existing code

- How to verify it
Platform regression on MSN4700, MSN3800 and MSN2700, 100% pass
Unit test covers all new changes.
2021-10-25 07:59:06 +03:00
vmittal-msft
8b5f33dbb7
[sonic-sairedis submodule] Update SAI header to ver 1.9.1 for MLNX SDK/SAI (#9012)
* Updated sonic-sairedis to point to SAI 1.9.1 and MLNX SAI to 1.19.5(API v1.9.1)
2021-10-22 13:49:55 -07:00
Vadym Hlushko
cd0d407690
[fan] Fixed dynamic minimum fan speed table for SN4410 (#8960)
Put new values to the dynamic minimum fan speed table for SN4410.
2021-10-20 18:12:58 -07:00
DavidZagury
9527cbe53b
[Mellanox] Upgrade Mellanox firmware tools to 4.17.2-12 (#8978)
- Why I did it
Bug fix:
bad_param request due to missing parser rest command while running mlxlink

- How I did it
Advance MFT tool version to 4.17.2-12.

- How to verify it
Manually tested on all mellanox platforms.
2021-10-20 09:24:25 +03:00
Dror Prital
5356244e53
[Mellanox] Add NVIDIA Copyright header to "mellanox" files (#8799)
- Why I did it
Add NVIDIA Copyright header to "mellanox" files

- How I did it
Add NVIDIA Copyright header as a comment for Mellanox files

- How to verify it
Sanity tests and PR checkers.
2021-10-17 19:03:02 +03:00
Alexander Allen
e6e6f414cd
[mellanox] Remove validation for fw filenames with no extension (#8956)
Why I did it
Currently the mellanox platform API is validating the file extensions of firmware packages to be installed for basic sanity checking. However, ONIE packages do not have an extension and as such if there is a "." in the name it is taken to be an extension and then fails the sanity check.

How I did it
I removed the check which ensures that ONIE images don't have a file extension.

How to verify it
Name the ONIE updater file 2021.onie and attempt to install it via fwutil install fw 2021.onie --yes
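
To see why a dot in an ONIE updater name can trip an extension check, consider how Python splits such a name. The sketch below uses only `os.path.splitext`; the wrapper name `split_extension` is hypothetical, and the code does not reproduce the platform API's actual validation.

```python
import os.path


def split_extension(filename):
    """Return (root, extension) the way a naive extension check would see it."""
    return os.path.splitext(filename)


# Whatever follows the last dot is reported as an "extension", even when the
# file is an extension-less ONIE updater that merely has a dot in its name.
print(split_extension("2021.onie"))
print(split_extension("onie-updater"))
```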
2021-10-15 13:29:24 -07:00
Alexander Allen
cefb9c1946
[platform] [mellanox] Use correct API call to update firmware in auto_update_firmware (#8961)
Why I did it
The fwutil update all utility expects the auto_update_firmware method in the Platform API to execute the update_firmware() call and not the install_firmware() call.

How I did it
Changed the method in the mellanox platform API component implementation.

How to verify it
Run fwutil update all with a CPLD update on a Mellanox platform and verify that it properly updates the firmware using the MPFA file.
2021-10-15 13:28:52 -07:00
Volodymyr Samotiy
ce7abad3ba
[Mellanox] Update SAI to v1.19.4 (#8929)
- Why I did it
Advance to new Nvidia SAI release with the following changes:
New features:
- Align with new SDK/FW version 4.5.1006 and above and in parallel to existing used SDK/FW bundle
- Implement timestamp and egress_queue_index hostif packet attributes.

Bug fixes:
- Fix compilation issues with gcc10
- Fix return code for buffer overflow for query enum values and query statistics capabilities
- Reduce verbosity of print in case packet ingress on invalid port
- Fix mirror Qos settings

- How I did it
Updated SAI version and submodule pointer

- How to verify it
Run regression tests from sonic-mgmt

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-10-12 10:33:35 +03:00
Junchao-Mellanox
552963ab0e
[Mellanox] Change thermal recover threshold from temp_trip_norm to temp_trip_high (#8792)
- Why I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high, so that thermal algorithm would set fan speed to minimum allowed earlier and save power.

- How I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high

- How to verify it
Manual test
2021-10-04 20:20:33 +03:00
Nazarii Hnydyn
f36952fea3
[Mellanox]: Update SAI to v1.19.2 (#8618)
- Why I did it
Advance to Mellanox SAI ver 1.19.2 to pick up dynamic Policy Based Hashing support.
From this version onward, static Policy Based Hashing is no longer supported.
For detailed release notes check https://github.com/Mellanox/SAI-Implementation/blob/sonic2111/release%20notes.txt

- How I did it
Updated SAI-Implementation submodule

- How to verify it
1. make configure PLATFORM=mellanox
2. make target/sonic-mellanox.bin
Run full regression as well as new dynamic PBH tests

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-09-10 17:30:00 +03:00
Nazarii Hnydyn
63ba489c6b
[Mellanox] Advance hw-mgmt to V.7.0010.2346. (#8667)
Commits on Sep 01, 2021
hw-mgmt: attributes: Add PSU power sensor attributes d8fce39

Commits on Sep 02, 2021
Remove MFT package flint tool from hw-management dump generation. 53d06b2
hw-mgmt: debug: Add timeout to generate-dump.sh b661fa3 

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-09-08 09:59:50 -07:00
Junchao-Mellanox
ed64eb94d9
[Mellanox] Read PSU fan max/min speed per PSU (#8563)
#### Why I did it
A new PSU may have a different type of fan installed, so fan max/min speed should be read per PSU

#### How I did it
The existing implementation reads PSU max/min fan speed from a common file; change it to read from a per-PSU file

#### How to verify it
Manual test
2021-08-26 01:03:55 -07:00
shlomibitton
5f04146a10
Upstream new FW/SDK (#8567)
- Why I did it
Update SDK\FW version to 4.4.3326\2008.3326. This version contains:

New Features:
1. Add support for Fast Boot for SN3800

Bug Fixing:
1. In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 &L3)
2. On Spectrum systems, when using Async Router API with IPV6, an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.
3. On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.
4. Some Molex cables do not support speed after reboot

- How I did it

- How to verify it
Verified by running regression tests that include all supported sonic-mgmt tests
2021-08-25 16:46:56 +03:00
Alexander Allen
d0c73b050d
[Mellanox] Upgrade Mellanox firmware tools to 4.17.0 (#8299)
- Why I did it
New release of MFT has the following changelog / RN
 Fixed an issue that resulted in getting MVPD read errors from the mlxfwmanager during fast reboot.
 Fixed mlxuptime sometimes generating a time less than the previous one due to a wrong frequency calculation

- How I did it
Update makefile pointer to new version.

- How to verify it
Manually tested on all Mellanox platforms.
2021-08-22 15:54:43 +03:00
Dror Prital
ca713e2fad
[Mellanox] Update SDK\FW to version 4.4.3320\2008.3324 (#8487)
Update SDK\FW version to 4.4.3320\2008.3324. This version contains: 

New Features:
* Add support for Fast Boot for SN3800

Bug Fixing:
* In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 &L3)
* On Spectrum systems, when using Async Router API with IPV6,  an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.
* On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.
* Some Molex cables do not support speed after reboot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-08-17 01:57:24 -07:00
Kebo Liu
59c13cb406
[Mellanox] Upgrade hw-mgmt to 7.0100.2344 (#8463)
To pick up new PSU fan support from new hw-mgmt release

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-08-13 03:25:53 -07:00
Saikrishna Arcot
c38b95c899 Remove --net=host from run options for containers
The start template script that this value is used in will determine what
network namespace to use, and will add --net=host if it needs to run in
the host namespace.

With Docker 20.10, if --net=host is specified twice in docker run, then
it errors out. Therefore, remove the explicit --net=host in the run
options setting and let the start template script specify it.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-08-12 23:18:01 -07:00
DavidZagury
d26307d80f
[Mellanox][Pcie] Fix issue in pcied where an id that contains only decimal digits was treated as a decimal number (#8309)
A device that contains only decimal digits was mistreated as a decimal integer resulting in failure to find it in the id to bus map.
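
A minimal sketch of this class of bug, assuming the id-to-bus map is keyed by hexadecimal strings and that the loaded configuration yields an `int` for an id made only of decimal digits (as a YAML loader typically does for an unquoted value). The map contents and the helper name `lookup_bus` are invented for illustration and are not the pcied code.

```python
ID_TO_BUS = {"1013": "01", "cf6c": "06"}  # made-up sample data


def lookup_bus(device_id, id_to_bus=ID_TO_BUS):
    """Look up a bus by device id, treating the id as text, never as a number."""
    return id_to_bus.get(str(device_id))


print(ID_TO_BUS.get(1013))  # None: an int key misses the string-keyed map
print(lookup_bus(1013))     # 01: the id is normalised to "1013" first
```

An id with leading zeros would need extra care, since converting it to an integer and back drops them.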
2021-08-03 15:25:28 -07:00
DavidZagury
67781abb97
[Mellanox][pcied] Ignore bus on pcie.yaml for Mellanox switches (#8063)
Why I did it
In rare cases, a BIOS upgrade cannot guarantee that the bus value remains the same across BIOS releases. This field is ignored so that pcied does not fail, while the device id is still verified in a different way. The solution is future-proof and will not require code changes when a new BIOS version is available.

How I did it
Since bus is not a fixed value (it is determined by the BIOS version), we ignore this field and instead check whether there is a device that matches on all other fields and also has a matching device id.
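
The matching idea can be sketched as follows, assuming each device is a dictionary with `bus`, `dev`, `fn`, and `id` keys. These key names, the sample values, and the helper `find_match` are assumptions for illustration, not the pcied implementation.

```python
def find_match(expected, detected_devices):
    """Return the first detected device equal to `expected` on every field except bus."""
    keys = [key for key in expected if key != "bus"]
    for device in detected_devices:
        if all(device.get(key) == expected[key] for key in keys):
            return device
    return None


expected = {"bus": "06", "dev": "00", "fn": "0", "id": "cf6c"}
detected = [
    {"bus": "00", "dev": "1f", "fn": "3", "id": "8c22"},
    {"bus": "07", "dev": "00", "fn": "0", "id": "cf6c"},  # bus moved after a BIOS upgrade
]
print(find_match(expected, detected))
```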

How to verify it
Verify no errors or failures in pcied on different BIOS version with the same code base.
2021-07-26 08:43:42 -07:00
Dror Prital
7698747028
Update SDK\FW to version 4.4.3222\2008.3224 (#8247)
Update SDK\FW version to 4.4.3222\2008.3224.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-22 17:00:16 -07:00
Dror Prital
644875712e
[Mellanox] Update SAI to version 1.19.1 (#8245)
- Why I did it
Update SAI version to 1.19.1. The following was changed:
1. Update license
2. Do not remove and re-apply the same SDK mirror session on LAG
3. FEC fix to support all speeds
4. Improve PG counters performance
5. Fix number of switch priorities for port mirroring

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-22 18:13:49 +03:00
tomer-israel
950c24c5ae
[PMON] [Mellanox] fix syseepromd issue on simx (#8131)
Avoid initializing sfp/thermal/components/fan/psu/leds on simx and create vpd_info file on hw_management when we use mellanox simulator platform

- Why I did it
This is a fix for an issue on Mellanox simulator platforms: syseepromd failed in the pmon docker, and "decode-syseeprom" failed as well.

- How I did it
Before initializing thermal/components/fan/psu/leds, check whether we are running on simx.
Create the vpd_info file in the hw_management folder.

- How to verify it
Check that the syseepromd process was loaded properly in the pmon docker.
Check that decode-syseeprom works without errors or warnings.
2021-07-20 11:56:04 +03:00
tomer-israel
a328fd24c0
[WARM-REBOOT] fix issue of watchdog on simx when executing warm-reboot command (#8132)
- Why I did it
To prevent a Python exception when executing the warm-reboot command on the Mellanox simulator platform

- How I did it
Return None from the watchdog Python script when the watchdog file does not exist

- How to verify it
warm-reboot runs without the Python error. An error message will still appear in the log in these cases.
To avoid this error message, we can simulate the watchdog on the Mellanox simulator platform.
2021-07-19 22:08:44 +03:00
Vivek Reddy
447f09f2dd
Update SAI (#8143)
Fix saisdkdump + Fix port dropped pkts counters
Co-authored-by: Vivek Reddy Karri <vkarri@nvidia.com>
2021-07-09 15:27:36 -07:00
Dror Prital
0c9862d980
[Mellanox] Update FW version to 2008.3218 (#8079)
Update FW version to 2008.3218, fixing the following issues:
- 50G/100G links that are operationally down before warm-reboot are not coming up after warm-reboot
- 50G/100G links with admin shut / no shut commands are not coming up after warm-reboot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-07 08:35:34 +03:00
Dror Prital
edb48e3191
[Mellanox] Update SAI and SDK\FW ver. 4.4.3216\2008.3216 (#8055)
- Why I did it
* For SAI - Advance to adopt the following fixes:
1. Better handling of not-implemented object types for resource availability
2. Fix ext dump when saidump is triggered from 2nd process (saidump utility) other than main adapter host (syncd in SONiC)

* For SDK\FW:
- Changes and new features:
1. Added support in SN4600C systems for new module Finisar ET7402-CWDM4 (100G CWDM4 QSFP28 1310nm SM 2KM).
2. Added support for new module MMS1W50-HM (2km transceiver FR4) for 200GbE
3. Improved performance of "per-port-buffer" counters
4. Added support for Kernel 5.10

- Bug fixes:
On rare occasions (0.5%), in SN4600C systems, when using 100GbE NRZ mode and Fastboot flow, the link up time may take up to 10 seconds

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-06 07:40:29 +03:00
Junchao-Mellanox
147bf240f0
[Mellanox] Add bitmap support for SFP error event (#7605)
#### Why I did it

Currently, SONiC uses a single value to represent an SFP error; however, multiple SFP errors could exist at the same time. This PR aims to support that.

#### How I did it

Return a bitmap instead of a single value when an SFP event occurs
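
As a rough illustration of why a bitmap helps, the sketch below models SFP errors as bit flags so that several can be reported in one value. The flag names and bit positions are hypothetical; the actual error codes are defined by the platform API and are not shown in this commit message.

```python
from enum import IntFlag


class SfpError(IntFlag):
    # Hypothetical flags, one bit each.
    POWER_BUDGET_EXCEEDED = 1 << 0
    HIGH_TEMPERATURE = 1 << 1
    BAD_CABLE = 1 << 2


def decode(bitmap):
    """Return the list of individual errors set in `bitmap`."""
    return [flag for flag in SfpError if bitmap & flag]


# Two errors at once fit in a single integer event value.
event = SfpError.POWER_BUDGET_EXCEEDED | SfpError.BAD_CABLE
print(int(event), [flag.name for flag in decode(event)])
```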

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-25 10:56:47 -07:00
shlomibitton
9d18a35e35
[Mellanox] advance SAI submodule (#7952)
Split and bulk counter bug fixes:

- Init port auto neg to default on static (SAI XML) port split for 2nd+ port
- Fix port stats SAI_PORT_STAT_WRED_DROPPED_PACKETS, SAI_PORT_STAT_ECN_MARKED_PACKETS, SAI_PORT_STAT_ETHER_TX_OVERSIZE_PKTS
- Hide error message when reading not implemented port stat

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-06-24 09:21:45 -07:00
Stephen Sun
fc61ec9dbf
[Mellanox] Return N/A for PSU's model, serial and revision on platforms with fixed PSU (#7927)
- Why I did it
The methods get_model, get_serial, and get_revision have been implemented by reading relevant information from VPD and then recording the information into relevant fields.
However, there is no VPD data on platforms with fixed PSUs and the relevant fields haven't been initialized, which causes the methods to throw exceptions, which in turn prevents psud from inserting fields into the PSU table.
As a result, show platform psustatus doesn't output correct info.

- How I did it
Initialize those fields as N/A on systems with fixed PSUs.

- How to verify it
Manually test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-23 20:41:28 +03:00
Junchao-Mellanox
f294096eb6
[Mellanox] Read EEPROM data from DB if possible (#7808)
- Why I did it
Remove EEPROM cache file and use DB instead

- How I did it
Read EEPROM data from DB if possible
If data is not ready in DB, read from hardware using a visitor pattern
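
The DB-first lookup with a hardware fallback can be sketched as below. Both inputs are stand-ins: `db_entries` is a plain dictionary rather than a real DB client, `read_from_hardware` is any callable, and the visitor pattern mentioned above is not reproduced here.

```python
def read_eeprom(db_entries, read_from_hardware):
    """Prefer EEPROM data already present in the DB; otherwise read the hardware."""
    if db_entries:
        return dict(db_entries)
    # DB not populated yet: fall back to the slower hardware read.
    return read_from_hardware()


def fake_hardware_read():
    # Hypothetical stand-in for a real EEPROM read.
    return {"Serial Number": "from-hardware"}


print(read_eeprom({"Serial Number": "from-db"}, fake_hardware_read))
print(read_eeprom({}, fake_hardware_read))
```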

- How to verify it
Manual test and regression
2021-06-20 17:58:11 +03:00
Alexander Allen
29601366ee
[Mellanox] Implement auto_firmware_update platform API to support fwutil auto-update (#7721)
Why I did it
The Mellanox platform is required to support the fwutil auto-update feature defined here

This is to allow switches, when performing SONiC upgrades, to choose whether to perform firmware upgrades that may interrupt the data plane through a cold boot.

How I did it
Two methods were added to the component implementations for mellanox.

In the base Component class we add a default function that chooses to skip the installation of any firmware unless the cold boot option is provided. This is because the Mellanox platform, by default, does not support installing firmware on ONIE, the CPLD, or the BIOS "on-the-fly".

In the ComponentSSD class we add a function that behaves similarly but uses the Mellanox specific SSD firmware upgrade tool to check if the current SSD supports being upgraded on the fly in order to decide whether to skip or perform the installation.
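
The decision described in the two paragraphs above can be sketched as below. This is a simplified stand-alone function, not the platform API: the name `should_install`, the `boot_action` strings, and the `supports_live_upgrade` flag are hypothetical.

```python
def should_install(boot_action, supports_live_upgrade=False):
    """Decide whether to install firmware now (illustrative policy only)."""
    if boot_action == "cold":
        # A cold boot is allowed to interrupt the data plane, so install.
        return True
    # Otherwise install only if the component can be upgraded on the fly,
    # e.g. an SSD whose upgrade tool reports that capability.
    return supports_live_upgrade


print(should_install("cold"))                              # True
print(should_install("warm"))                              # False
print(should_install("warm", supports_live_upgrade=True))  # True
```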

How to verify it
Unit tests are included with this PR. These tests will run on build of target sonic-mellanox.bin

You may also perform fwutil auto-update ... commands after Azure/sonic-utilities#1242 is merged in.
2021-06-16 14:55:20 -07:00
Stephen Sun
80d01f2f9a
[Mellanox] Enhance Python3 support for platform API (#7410)
- Why I did it
Enhance the Python3 support for platform API. Originally, some platform APIs called SDK APIs that didn't support Python 3. Now that Python 3 APIs are supported in SDK 4.4.3XXX, Python3 is completely supported by the platform API

- How I did it
Start all platform daemons from python3
1. Remove #!/usr/bin/env python at the beginning of each platform API file, as the platform API won't be started as a daemon but will be imported by other daemons.
2. Adjust SDK API calls accordingly

- How to verify it
Manually test and run regression platform test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-15 17:57:48 +03:00
Stephen Sun
87bdc1a415
[Mellanox] Adjust Makefile for SDK/python-sdk-api to support both python2 and python3 (#7848)
- Why I did it
Adjust the Makefile for SDK/python-SDK-API to support both python2 and python3

- How to verify it
Build the image and check whether python2 and python3 are both supported by SDK API.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-15 17:54:14 +03:00
DavidZagury
cedc2113fb
[Mellanox] Install MFT packages on Syncd container (#7844)
To have access to MFT tools in the Syncd container on Mellanox switches due to SAI dump API implementation enhancements
2021-06-14 08:25:58 -07:00
Dror Prital
a3d90b9fbf
[Mellanox] Update SAI ver. 1.19.0, SDK\FW ver. 4.4.3106\2008.3110 (#7820)
Why I did it

* For SAI - Upgrade to Version 1.19.0

- Add support for VxLAN encap TTL uniform model on SPC2/3
- Add ACL entry actions set VRF, set do not learn, add VLAN ID, add VLAN priority
- Add ACL field has VLAN tag
- Bulk counters (improve port statistics performance)
- Create async dump extra as part of debug generate dump
- Create irisc dump on severe health event
- Support 0 port systems (modify get switch mac to work accordingly)
- Set interface vlan up state for ping tool in SONiC
- Support attributes SAI_PORT_ATTR_QOS_SCHEDULER_PROFILE_ID, SAI_PORT_ATTR_QOS_INGRESS_BUFFER_PROFILE_LIST,
SAI_PORT_ATTR_QOS_EGRESS_BUFFER_PROFILE_LIST, SAI_PORT_ATTR_POLICER_ID as part of port create Git stats

* For SDK\FW - Upgrade to Version SDK 4.4.3106, FW 2008_3110

Added Features:

- Increased ACL table
- Enhanced PSAMPLE support
- Added support for Finisar SR4 module in SN3700 systems
- Added support for Python 3.0 in examples.
Fix bugs:

- On LR4 transceivers 00YD278, the firmware incorrectly identified the transceiver
- Reduce memory consumption for virtual LAG
- Fixed PSAMPLE listeners cleanup on SDK drivers unloading.
- On Spectrum-2 and Spectrum-3 systems, slow reaction time to Rx pause packets may lead to buffer overflow on servers.
- BER may be experienced when using 5m DAC cables between SN4700 and SN2700 in 100GbE speed.
- On very rare occasion, when connecting DR4 PAM4 transceiver to 100GbE DR1 NRZ, low BER may be experienced.
- Unexpected packet drops on the port ingress buffer may be experienced when working in 400GbE mode.
Note: When performing ISSU from an older version, this fix won't be applied. For fix to apply, a non-ISSU reset is required.
- Fix SN3800 specific warm boot scenario: Disable interface, Warm Boot, Enable Interface --> link will remain down.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-06-10 14:15:55 +03:00
Stepan Blyshchak
5c81c499e4
[nvidia/mellanox] add MLNX_SDK_DEB_VERSION to SDK packages flags list. (#7747)
This is due to the fact that we use SONIC_OVERRIDE_BUILD_VARS internally
in our build jobs and this is not accounted in caching framework.
So we add MLNX_SDK_DEB_VERSION to force rebuild if we changed it via
SONIC_OVERRIDE_BUILD_VARS.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-06-08 09:38:00 -07:00
yozhao101
1a3cab43ac
[Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Currently we leverage Supervisor to monitor the running status of critical processes in each container, and it is more reliable and flexible than doing the monitoring with Monit. So we removed the functionality of monitoring the critical processes with Monit.

How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.

How to verify it
I verified this on the device str-7260cx3-acs-1.
2021-06-04 10:16:53 -07:00
Kebo Liu
93534ce0e1
[Mellanox] Align PSU name convention returned from psu.get_name platform API (#7783)
Make PSU name returned from platform API aligned with the convention "PSU {X}" instead of "PSU{X}".
2021-06-04 09:40:23 -07:00
Kebo Liu
bf21dbce87
[Mellanox] Add support for MSN4600 A1 system (#7732)
Add new sensor conf for MSN4600 A1 system
Add a Mellanox hw-management patch to support MSN4600 A1 system
2021-05-27 09:52:45 -07:00
Kebo Liu
237849a330
update hw-mgmt version to 2304 (#7725)
- Why I did it
Pick up fix from new hw-management package:
Fix gearbox thermal zone name, which lacked the thermal zone number suffix

- How I did it
Update the hw-management version number in the make file
Update hw-management submodule pointer

- How to verify it
Run platform related test cases on Mellanox platform
2021-05-27 14:17:23 +03:00
Junchao-Mellanox
3ea3a5c8c1
[Mellanox] clear fan from chassis._fan_list (#7682)
#### Why I did it

According to the thermalctld HLD, each fan must belong to a fan drawer; if the fan drawer does not physically exist, the fan is put into a virtual fan drawer. This PR clears fans from chassis._fan_list

#### How I did it

1. Don't put fans into chassis._fan_list
2. Always query fans from fan_drawer
2021-05-24 11:36:39 -07:00
Alexander Allen
6a9d1e584d
[Mellanox] Implement Hardware Revision Platform API Call for Mellanox Chassis and PSU (#7552)
#### Why I did it

This pull request allows calls to be made through the platform 2.0 API that retrieve the PSU and Chassis hardware revision on Mellanox platforms. Access to these values will aid customers in determining their hardware revisions for debugging and technical support. These values are intended to be eventually exposed through the CLI. 

#### How I did it

For the PSU hardware revision I used the existing VPD function calls implemented in https://github.com/Azure/sonic-buildimage/pull/7382

For the Chassis hardware revision I parsed the SMBIOS / DMI type 2 information to retrieve the information.
2021-05-24 09:37:59 -07:00
Junchao-Mellanox
bfae15fb83
[mellanox]: No need to enable thermal zones in thermal_manager.deinitialize since they are enabled by default (#7556)
No need to enable thermal zones in thermal_manager.deinitialize since they are enabled by default. Removing this also speeds up thermalctld exit
2021-05-08 10:33:37 -07:00
Stephen Sun
9f0dce0313
[Mellanox] Optimize SFP modules initialization (#7537)
Originally, SFP modules were always accessed from platform daemons, and arbitrary SFP modules could be accessed in the daemon. So all SFP modules were initialized in one shot once one of the following chassis APIs was called
- get_all_sfps
- get_num_sfps
- get_sfp

Recently, we noticed that SFP modules can also be accessed from CLI, e.g. the latest refactor of `sfputil`.

In this case, only one SFP module is accessed in the chassis object's life cycle.
Initializing all SFP modules in one shot is a waste of time and causes the CLI to take much more time to finish.
So we would like to optimize the initialization flow by introducing a two-phase initialization approach:
- Partial initialization, which means the `chassis._sfp_list` has been initialized with proper length and all elements being `None`
- Full initialization, which means all elements in `chassis._sfp_list` are created

If the relevant function is called,
- `get_sfp`, only partial initialization will be done, and then the specific SFP module is initialized.
- `get_all_sfps` or `get_num_sfps`, full initialization will be done, which means all SFP modules are initialized.
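
The two-phase approach can be sketched roughly as follows. The `Sfp` and `Chassis` classes here are stripped-down stand-ins, the zero-based index is an assumption, and the private helper names are hypothetical; only `_sfp_list`, `get_sfp`, `get_all_sfps`, and `get_num_sfps` come from the description above.

```python
class Sfp:
    def __init__(self, index):
        self.index = index


class Chassis:
    def __init__(self, port_count):
        self._port_count = port_count
        self._sfp_list = []
        self._fully_initialized = False

    def _initialize_partial(self):
        # Phase 1: give the list its final length, but create no SFP objects.
        if not self._sfp_list:
            self._sfp_list = [None] * self._port_count

    def _initialize_full(self):
        # Phase 2: create every SFP object that is still missing.
        if self._fully_initialized:
            return
        self._initialize_partial()
        for index in range(self._port_count):
            if self._sfp_list[index] is None:
                self._sfp_list[index] = Sfp(index)
        self._fully_initialized = True

    def get_sfp(self, index):
        # Only the requested module is created; the rest stay None.
        self._initialize_partial()
        if self._sfp_list[index] is None:
            self._sfp_list[index] = Sfp(index)
        return self._sfp_list[index]

    def get_all_sfps(self):
        self._initialize_full()
        return self._sfp_list

    def get_num_sfps(self):
        self._initialize_full()
        return len(self._sfp_list)
```

A CLI that touches one port only pays for one `Sfp` object, while a daemon calling `get_all_sfps` still ends up with the full list.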

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-05-06 10:14:48 -07:00
shlomibitton
2d3149d641
[Mellanox] Update FW to xx.2008.2526 (#7511)
- Why I did it
Updated FW to xx.2008.2526 version.

Fixed issues:
1. Spectrum-2, Spectrum-3 | sFlow | High CPU load and high on fully loaded switch.
2. Spectrum-2, Spectrum-3 | Fine grain LAG | in rare cases doesn’t update the right entry

- How I did it
Updated submodule pointer and version in a Makefile.

- How to verify it
Full regression and bugs validation

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-05-05 09:47:57 +03:00
Alexander Allen
0bc0f98d48
[platform] Add serial number and model number to Mellanox PSU platform implementation (#7382)
#### Why I did it

We want to add the ability for the command `show platform psustatus` to show the serial number and part number of the PSU devices on Mellanox platforms. This will be useful for data-center management of field replaceable units (FRUs) on switches.

#### How I did it

I implemented the platform 2.0 functions `get_model()` and `get_serial()` for the PSU in the mellanox platform API by referencing the sysfs nodes provided by the [hw-management](https://github.com/Azure/sonic-buildimage/tree/master/platform/mellanox/hw-management) module.
2021-05-04 13:07:00 -07:00
Stephen Sun
b2286a24dc
[Mellanox] Adopt single way to get fan direction for all ASIC types (#7386)
#### Why I did it
Adopt a single way to get fan direction for all ASIC types.
It depends on hw-mgmt V.7.0010.2000.2303. Depends on https://github.com/Azure/sonic-buildimage/pull/7419

#### How I did it
Originally, the get_direction was implemented by fetching and parsing `/var/run/hw-management/system/fan_dir` on the Spectrum-2 and the Spectrum-3 systems. It isn't supported on the Spectrum system.
Now, it is implemented by fetching `/var/run/hw-management/thermal/fanX_dir` for all the platforms.
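
As a rough sketch of the per-fan lookup described above: the helper below only builds the `fanX_dir` path and returns the raw file contents. The function name and the `base_dir` parameter are made up for illustration, and how the contents map to intake or exhaust is not stated here, so no mapping is attempted.

```python
import os

HW_MGMT_THERMAL_DIR = "/var/run/hw-management/thermal"


def read_fan_direction_raw(fan_index, base_dir=HW_MGMT_THERMAL_DIR):
    """Return the stripped contents of fan<fan_index>_dir, or None if unreadable.

    Hypothetical helper: interpreting the value is left to the caller.
    """
    path = os.path.join(base_dir, "fan{}_dir".format(fan_index))
    try:
        with open(path, "r") as f:
            return f.read().strip()
    except OSError:
        return None
```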

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-05-03 17:10:18 -07:00
Junchao-Mellanox
d9cdf9d14f
[Mellanox] Adjust PSU fan name to align with sysfs file name (#7490)
Change PSU fan name from psu_{psu_index}fan{fan_index} to psu{psu_index}_fan{fan_index}
2021-05-02 08:14:56 -07:00
Stephen Sun
b3a283366c
Fix issue: exception occurred during chassis object being destroyed (#7446)
The following error message is observed while the chassis object is being destroyed:

"Exception ignored in: <function Chassis.__del__ at 0x7fd22165cd08>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 83, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down"

The chassis tries to import deinitialize_sdk_handle while it is being destroyed, in order to release the sdk_handle.
However, importing another module during interpreter shutdown can fail because some of the fundamental infrastructure is no longer available.

This error occurs when a chassis object is created and then destroyed in the Python shell.

- How I did it
To fix it, record the deinitialize_sdk_handle in the chassis object when sdk_handle is being initialized and call the deinitialize handler when the chassis object is being destroyed
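
A minimal sketch of that pattern, assuming stand-in helpers: `_load_sdk_helpers` and the `released` list are hypothetical and only simulate importing the real SDK functions. The point is that `__del__` calls a recorded callable and performs no import.

```python
released = []  # records released handles, for illustration only


def _load_sdk_helpers():
    """Hypothetical stand-in for importing the real SDK helper functions."""
    def initialize_sdk_handle():
        return object()

    def deinitialize_sdk_handle(handle):
        released.append(handle)

    return initialize_sdk_handle, deinitialize_sdk_handle


class Chassis:
    def __init__(self):
        self.sdk_handle = None
        self._deinit_sdk_handle = None  # recorded when the handle is created

    def get_sdk_handle(self):
        if self.sdk_handle is None:
            # Resolve both helpers now, while the interpreter is fully usable.
            initialize, deinitialize = _load_sdk_helpers()
            self.sdk_handle = initialize()
            self._deinit_sdk_handle = deinitialize
        return self.sdk_handle

    def __del__(self):
        # No import here: only call what was recorded earlier.
        if self.sdk_handle is not None and self._deinit_sdk_handle is not None:
            self._deinit_sdk_handle(self.sdk_handle)
            self.sdk_handle = None
```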

- How to verify it
Manually test.
2021-04-29 11:33:39 +03:00
Junchao-Mellanox
ccc7bd1315
[Mellanox] Upgrade hw-mgmt to 7.0100.2303 (#7419)
- Why I did it
Upgrade hw-mgmt to 7.0100.2303

Bug fixes

1. Fan direction feature fix for fixed FAN system (using shell instead of binutils/strings)
2. Remove cpld 4th link on systems with only 3 CPLD's
3. hw-mgmt: thermal: Add hardcoded critical trip point. Follow-up after patch "Removing critical thermal zones to prevent unexpected software system shutdown".
4. Fix sensor attribute mapping to be label based instead of index based to allow common handling of voltage regulator names independently of hardware changes.
5. Update 'lm-sensors' custom configuration file. Relevant only for users utilizing sensors.conf files coming along with hw-management package.
6. For full feature list please follow https://github.com/Mellanox/hw-mgmt/blob/V.7.0010.2300_BR/debian/Release.txt

- How I did it
Update hw-mgmt pointer
Remove unused patches
Fix existing patch to make sure it applies successfully

- How to verify it
Full platform regression on all Mellanox platforms
2021-04-28 16:21:55 +03:00
Dror Prital
22abec3c5d
[mellanox]: Integrate SAI version 1.18.3.2 into Master branch (#7428)
Changes in the new release:

Fix 10G and 50G speeds in SAI XML to support all interface types
Enable SMAC=DMAC and SMAC MC in tunnel debug counter
Add tunnel statistics
Add isolation group API implementation
Fix ACL ANY debug counter to correctly track ACL drops
Add VXLAN source port hard coded range, controlled by K/V
FW dump me now feature
Add mlxtrace to saidump
Speed lane setting and AN control
Implement query stats API
VNI miss part of tunnel decap drop reason
Align with SAI API v1.8.1

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-04-27 16:24:59 -07:00
Stepan Blyshchak
cd2c86eab6
[dockers] label SONiC Docker with manifest (#5939)
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

This PR is part of SONiC Application Extension

Depends on #5938

- Why I did it
To provide an infrastructure change in order to support SONiC Application Extension feature.

- How I did it
Label every installable SONiC Docker with a minimal required manifest and auto-generate packages.json file based on
installed SONiC images.

- How to verify it
Build an image, execute the following command:

admin@sonic:~$ docker inspect docker-snmp:1.0.0 | jq '.[0].Config.Labels["com.azure.sonic.manifest"]' -r | jq
Cat /var/lib/sonic-package-manager/packages.json file to verify all dockers are listed there.
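
For scripted checks, the same lookup as the `jq` pipeline above could be done in Python. This is only a sketch: the sample `docker inspect` output and the manifest contents below are made up, and the function name is hypothetical.

```python
import json

MANIFEST_LABEL = "com.azure.sonic.manifest"


def extract_manifest(inspect_output):
    """Return the parsed manifest label from `docker inspect` JSON output."""
    containers = json.loads(inspect_output)
    labels = containers[0]["Config"]["Labels"]
    # The label value is itself a JSON string, hence the second parse.
    return json.loads(labels[MANIFEST_LABEL])


# Made-up sample mimicking the shape of `docker inspect` output.
sample_output = json.dumps(
    [{"Config": {"Labels": {MANIFEST_LABEL: json.dumps({"version": "1.0.0"})}}}]
)
manifest = extract_manifest(sample_output)
```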
2021-04-26 13:51:50 -07:00
Kebo Liu
b43c4001f8
[Mellanox] Update SDK to 4.4.2522 and FW to 2008.2520 (#7391)
New features and fixes in the new SDK/FW:

SN4600C | AN/LT support
SN2700 | AN/LT bugs fixes
WJH | FID_MISS support

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-04-21 10:50:14 -07:00
Stephen Sun
46a7fac1aa
Bug fix: Support dynamic buffer calculation on ACS-MSN3420 and ACS-MSN4410 (#7113)
- Why I did it
Add missing files for dynamic buffer calculation for ACS-MSN3420 and ACS-MSN4410

- How I did it
asic_table.j2: Add mapping from platform to ASIC
Add buffer_dynamic.json.j2 for ACS-MSN4410.

- How to verify it
Check whether the dynamic buffer calculation daemon starts successfully.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-04-07 20:33:15 +03:00
Stephen Sun
ecaf97d8a3
[mellanox]: Integrate hw-mgmt package V.7.0010.2002 (#7148)
Integrate hw-management package V.7.0010.2002

Bug fixes:
Removing critical thermal zones to prevent unexpected software system shutdown:
*Kernel 4.9 -0071-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
*Kernel 4.19 -076-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
Removing redundant link for cpld3 for fixed systems (SN2100, SN2010).
Fix an issue with missed attribute for cpld3 (port CPLD) for SN2700, SN2410.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-03-30 18:30:15 -07:00
Joe LeVeque
c651a9ade4
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083)
To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
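
As an illustration, the resulting settings can be checked with a small script. The configuration text below is a sketch, not a copy of a real container's supervisord.conf: the `command` values are assumptions, while `buffer_size` and `events` are standard supervisord event listener options.

```python
import configparser

# Sketch of the relevant sections; command lines are assumed for illustration.
SAMPLE_CONF = """
[eventlistener:dependent-startup]
command=python3 -m supervisord_dependent_startup
events=PROCESS_STATE
buffer_size=1024

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name swss
events=PROCESS_STATE_EXITED
buffer_size=1024
"""


def listener_buffer_sizes(conf_text):
    """Map each event listener name to its configured buffer size."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    sizes = {}
    for section in parser.sections():
        if section.startswith("eventlistener:"):
            name = section.split(":", 1)[1]
            # When buffer_size is absent, the default of 10 applies.
            sizes[name] = parser.getint(section, "buffer_size", fallback=10)
    return sizes
```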
2021-03-27 21:14:24 -07:00
Volodymyr Samotiy
b30595ac49
[Mellanox] Update SDK to 4.4.2508 and FW to xx.2008.2508 (#7141)
Fix the following issues:

Spectrum-2, Spectrum-3 | Port | Fix link issue when using 25 GbE rate between two ports while one is on Spectrum-2-based system and the other is on Spectrum-3-based system
All | warmboot | fail to upgrade from earlier SONiC versions with official SDK/FW 4.4.2306 (was on SONiC 201911)
All | What-Just-Happened | When enabling or disabling WJH under high traffic load to the host CPU, in very specific and low probability conditions, an error could occur, that may result in loss of data, channel failure or in extreme cases SW failure

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-03-27 11:51:49 -07:00
Junchao-Mellanox
93a54450d3
Fix issue: should not initialize led color in __init__ file as platform API will be called by multiple daemons (#7114)
- Why I did it
The existing fan LED and PSU LED objects initialize themselves to green in their init methods. However, multiple daemons call the SONiC platform API, and the following case can occur:

1. A PSU is removed from the system
2. The switch is rebooted
3. psud detects that one PSU is missing and sets the PSU LED to red
4. Another daemon starts up and calls the SONiC platform API, which sets the PSU LED back to green by calling PsuLed.init

This PR is a partial fix for the issue, as we also need to guarantee that the LED is initialized with a correct value. I checked the existing psud and thermalctld code: psud always initializes the PSU LED color on boot up, while thermalctld needs some changes to initialize the LED color on the first run.

- How I did it
Remove the led color initialization code from FanLed.init and PsuLed.init

- How to verify it
Manual test
2021-03-25 14:28:33 +02:00
Volodymyr Samotiy
c7cc4b465b
[Mellanox] Update FW to xx.2008.2424 (#7118)
Fixed issues:
* Mellanox SN-2700 breakout port not linking up with QSA

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-03-22 18:27:36 -07:00
Junchao-Mellanox
8504c72f14
[Mellanox] Initialize PSU API on both host and docker side (#7016)
There was a change to replace platform utils with sonic platform API in psuutil. However, psu API is not initialized on host side. The PR is to fix it.
2021-03-15 12:43:18 -07:00
Kebo Liu
c82aaaeb41
[Mellanox] Update SDK to 4.4.2418, FW to 2008.2416, SAI to new commit (#7041)
- Why I did it
To pick up new features and fixes from SDK/FW and SAI

SDK/FW new features:

All | Added support for multiple modules and cable types. For the full list, contact Nvidia networking support
Spectrum-3 | SN4600C | Added support for up to 5W on ports 49 to 64.
SDK/FW bug fixes:

All | fast reboot | fast boot failure from latest 201811 to 201911 and above
Spectrum | 10GbE/1GbE Transceiver (FTLX8574D3BCV) stopped working after firmware upgrade
Spectrum-2 | When device is rebooted with locked Optical Transceivers in split mode, the firmware may get stuck
Spectrum-2 | SN3700 | When connecting at 200GbE to Ixia K400, Ixia receives CRC errors
Spectrum-2 | SN3800 | On rare occasions packets loss may be experienced due to signal integrity issues
Spectrum-2 | When the port is a member of a LAG, after a warmboot and port toggle on the peer-side, the port remains down
Spectrum-3 | SN4700 | While using Optic cable in Split 4x1 mode in PAM4, when two first ports are toggled, the other 2 ports go down
Spectrum-3 | SN4700 | When working in 400GbE, deleting the headroom configuration (changing buffer size to zero) on the fly may cause continual packet drops
SAI

All | sFlow | Use hardcoded value 1 as netlink group number as expected by hsflowd
- How I did it
Update the related version number in the make files and update the submodule pointer accordingly.

- How to verify it
Run the regression tests and verify that everything works.
2021-03-13 21:19:40 +02:00
Junchao-Mellanox
7caa70d2d6
[Mellanox] Fixes issue: CLI sfputil does not work based on sonic platform API (#7018)
#### Why I did it

Recently, the sfputil CLI replaced the old SONiC platform utils with the SONiC platform API. However, the SONiC platform API does not support SFP low power mode and reset-related operations. This PR fixes that.

The change to replace platform utils with the SONiC platform API was reverted on 202012; once this PR is merged, we can cherry-pick these two PRs to 202012 together.

#### How I did it

For low power mode and reset-related operations, use "docker exec" if the command is running on the host side.
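
A rough sketch of that dispatch, under stated assumptions: the container name `pmon`, the host detection via the presence of the `docker` CLI, and the example command are all hypothetical and not taken from the actual change.

```python
import shutil


def wrap_for_container(cmd, container="pmon", on_host=None):
    """Prefix cmd with `docker exec <container>` when running on the host."""
    if on_host is None:
        # Assumption: the docker CLI is only present on the host side.
        on_host = shutil.which("docker") is not None
    if on_host:
        return ["docker", "exec", container] + list(cmd)
    return list(cmd)


# Hypothetical command name, used only to show the resulting argument list.
host_cmd = wrap_for_container(["sfp-lpmode-helper", "--port", "1"], on_host=True)
```

The returned list can be passed to `subprocess.run` by the caller; the sketch itself does not execute anything.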
2021-03-11 18:54:33 -08:00
DavidZagury
6779118d71
[Mellanox] Update MFT to 4.16.0-105 (#7007)
- Why I did it
Update MFT tool version to 4.16.0

Bugs fixes:
mlxlink: Fixed an issue that caused the margin scan to fail with the following message: Eye scan not completed.
mlxcable: Cable firmware burning capability is not supported.

New features:
mlxlink: Enabled margin scan on Network links.
mlxlink: Added PRBS TX/RX polarity inversion using the following flags: --invert_tx_polarity / --invert_rx_polarity

- How I did it
Update MFT make file with new version number.

- How to verify it
Build image and test related functions on Mellanox platform
2021-03-10 22:03:43 +02:00
Kebo Liu
0e71d82f72
[Mellanox] Update hw-management package to version 7.0010.2000 (#6692)
- Why I did it
   Bug fixes
   - In rare cases when thermal algorithm is reactivated after FAN/PSU insertion, FAN remains at high rpm
   - When stopping hw-management, an error was written to the log instead of returning exit code '0'.
   - On SPC1, I2C sometimes collides with the chip reset coming from SDK
   - Remove raw eeprom data link when working with PSUs which don't have eeprom on "msn274x", "msn24xx" and "msn27xx" systems
   - Fix memory leak on mlxsw_core_bus_device module removal

- How I did it
Update the hw-mgmt version number in the make file
Update the hw-mgmt repo pointer

- How to verify it
Run platform-related test cases on all Mellanox platforms

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-03-01 10:01:50 +02:00
Joe LeVeque
516ff8bfff
[Mellanox] Ensure concrete platform API classes call base class initializer (#6854)
In preparation for the merging of Azure/sonic-platform-common#173, which properly defines class and instance members in the Platform API base classes.

It is proper object-oriented methodology to call the base class initializer, even if it is only the default initializer. This also future-proofs the potential addition of custom initializers in the base classes down the road.
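
A minimal illustration of why the call matters, using a stand-in base class rather than the real sonic-platform-common code (the member and method names here are assumptions):

```python
class PsuBase:
    """Stand-in for a Platform API base class that defines instance members."""

    def __init__(self):
        self._fan_list = []  # instance member created by the base initializer

    def get_num_fans(self):
        return len(self._fan_list)


class Psu(PsuBase):
    def __init__(self, psu_index):
        # Without this call, self._fan_list would not exist and
        # get_num_fans() would raise AttributeError.
        super().__init__()
        self.index = psu_index
```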
2021-02-25 11:06:22 -08:00
shlomibitton
3de6a67353
[Mellanox] Add hw-mgmt patch for SimX platform adaptation (#6782)
- Why I did it
The system is stuck in the 'starting' state on the SimX platform because of an infinite loop in the 'hw-management-ready.sh' script.
The loop polls to check whether the hw-mgmt sysfs has been created before proceeding with the flow. On the SimX platform the sysfs is never created, so the system does not start properly.

- How I did it
Add a condition to poll on hw-mgmt sysfs only if the switch is real HW and not SimX platform.

- How to verify it
Check "systemctl status hw-management.service" output on a SimX switch with this patch, the state will be "active".

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-02-25 12:41:29 +02:00
DavidZagury
5aee92e56d
[Mellanox] Add support for SN4600 system (#6879)
- Why I did it
Add support for new 64x200G SN4600 systems

- How I did it
Add all relevant files (w/o platform.json and hwsku.json as they will come later) with default SKU.

- How to verify it
Install image on switch, verify all ports are up and configured properly, run full platform SONiC tests.
2021-02-25 09:30:43 +02:00
Eran Dahan
a472cabc0b
[Docker] Added support for python2 (#6753)
- Why I did it
Mellanox SDK APIs support python 2 at the moment.

- How I did it
Add python 2 to Mellanox syncd only.

- How to verify it
docker exec -t syncd /bin/bash -c "sx_api_dbg_generate_dump.py /home/sx_api_dbg_dump"
You can see that it will work and generate /home/sx_api_dbg_dump

Signed-off-by: allas <allas@nvidia.com>
2021-02-24 19:49:58 +02:00
Volodymyr Samotiy
ea100d2a19
[Mellanox][SAI] update submodule pointer (#6806)
Open ACL Outer VLAN ID for egress for ports part of VLAN RIF

- Why I did it
Open ACL Outer VLAN ID for egress for ports part of VLAN RIF

- How I did it
Updated SAI submodule pointer

- How to verify it
Build an image, deploy and check all is up and running.
Verify ACL sonic-mgmt test is passing

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-02-18 11:53:19 +02:00
Volodymyr Samotiy
6998aef114
[Mellanox] Update SDK to 4.4.2318, FW to *.2008.2314 (#6794)
To have the following fixes:
* All | Port status remains down after warm boot and flapping the port on peer side
* All | LAG HASH  | IPv6 SRC_IP is not accounted in LAG hashing
* All | ASIC driver | Kernel crash observed when driver reload is initiated before it fully loaded
* Spectrum-3 | Buffer | In lossless configuration, headroom is evicted only when the shared buffer is free
* All | prevent FW access during ISSU

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-02-16 18:07:11 -08:00
Stepan Blyshchak
0e17525937
[Mellanox][SAI] update submodule pointer (#6729)
Include SAI bug fixes:

- Apply device MAC on port host interface when port is removed from LAG.
- [Shared Headroom]: fixed watermark handling for SHP flow
- Decrease verbosity of policer unbind message when no policer is attached

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-02-10 23:13:46 -08:00
Joe LeVeque
7ea0d9e27a
[sonic-platform-common] Update submodule (#6742)
Submodule commits included:

* src/sonic-platform-common 6ad0004...bd4dc03 (1):
  > [sonic_sfp/qsfp_dd.py] Update DOM capability method name to align with other drivers (#163)

Also align all calling function names to match.
2021-02-10 06:12:49 -08:00
Tamer Ahmed
149a68b956
[syncd-rpc] Install Libboost Atomic 1.71, Libqtcore And Libqtnetwork (#6689)
When building syncd-rpc, libthrift has a dependency on libboost-atomic1.71.0,
however the debian packager installs version 1.67 instead. This PR
preinstalls libboost-atomic v 1.71 to avoid falling back to v 1.67.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2021-02-10 02:26:31 -08:00
Junchao-Mellanox
6d4c20efb1
Fix dynamic minimum fan table issue caused by python3 (#6690)
**- Why I did it**
After migrating to python3, the '/' operator always returns a float, whereas in python2 it returned an integer for integer operands. This needs to be fixed in thermal_conditions.

**- How I did it**
1. cast float value to int
2. change the unit test case to cover this situation
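
The behavior difference can be shown with a small example. The function names and the step value are made up for illustration and are not the actual thermal_conditions code.

```python
def level_true_division(temperature, step):
    # In Python 3, '/' is true division and returns a float even for ints.
    return temperature / step


def level_fixed(temperature, step):
    # Cast the float result back to int, as the fix describes.
    return int(temperature / step)


raw = level_true_division(35, 5)   # a float in Python 3
fixed = level_fixed(37, 5)         # truncated to an int
```

A float result is a problem wherever the value is later used as a list index or compared against integer table keys.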

**- How to verify it**
Manually test and regression test
2021-02-07 11:21:44 +02:00
Joe LeVeque
18f2c5cfdd
[platform] Update QSFP method name 'parse_qsfp_dom_capability' -> 'parse_dom_capability' (#6695)
**- Why I did it**
PR https://github.com/Azure/sonic-platform-common/pull/102 modified the name of the SFF-8436 (QSFP) method to align the method name between all drivers, renaming it from `parse_qsfp_dom_capability` to `parse_dom_capability`. Once the submodule was updated, the callers using the old nomenclature broke. This PR updates all callers to use the new naming convention.

**- How I did it**

Update the name of the function globally for all calls into the SFF-8436 driver.

Note that the QSFP-DD driver still uses the old nomenclature and should be modified similarly. I will open a PR to handle this separately.
2021-02-05 14:41:05 -08:00
Lior Avramov
f76926add3
[Mellanox] Update FW upgrade script to use 'mlxfwmanager -d' option for specifying MST device in FW burn operation (#6541)
**- Why I did it**
Reduce the time it takes for the ASIC FW burn as part of the automatic FW upgrade procedure.

**- How I did it**
Add the -d option to the mlxfwmanager call so that the faster MST device is used instead of the default one, which is slower.

**- How to verify it**
I manually changed ASIC FW followed by reboot command in order for FW upgrade to take place on deinit.
I manually changed ASIC FW followed by hard reset in order for FW upgrade to take place on init.

Signed-off-by: liora <liora@nvidia.com>
2021-02-04 19:44:16 +02:00
xumia
19ccba4d05
[build]: Fix syncd dpkg cache dependency issue (#6680)
* Fix syncd dpkg cache dependency issue
2021-02-04 09:03:14 -08:00
Eran Dahan
984c1cd209
[MLNX] update SAI submodule to include fix for debug dump (#6667)
**Why I did it**
Disable SDK extended dump due to issue found

**How I did it**
Update SAI submodule

**How to verify it**
Verify the SDK extended dump is not called.

Signed-off-by: Eran Dahan <erand@nvidia.com>
2021-02-04 09:12:28 +02:00
Stephen Sun
4f50658cfc
[syncd-rpc docker] Fix issue: ptf_nn_agent isn't able to start in syncd-rpc docker on buster (#6448)
- Why I did it
Fix issue: ptf_nn_agent isn't able to start in syncd-rpc docker on buster.

- How I did it
The issue is fixed by installing python-dev, cffi and nnpy for python 2 explicitly.

- How to verify it
Run copp test on RPC image.
2021-01-31 09:11:33 +02:00
Kebo Liu
7f222e7bc1
[mellanox]: Update SAI to sonic2012 1.18.1.0 (#6566)
Changes in the new release:

1. Policy based hashing optimization
2. New attribute support for Max port headroom
3. Tunnel ECN map fixes
4. Tunnel EVPN skeleton extensions (peer attrib, maps)
5. Bridge port admin not affecting port admin (optimize port down time)
6. CRM new API for neighbors and tunnel termination entries
7. Improve FDB event for flush by bridge port (before, null bridge was reported to SONiC, now the bridge will be extracted from bridge port)
8. DHCP L2 v4+v6 traps (for ZTP use case)
9. Generic counter implementation

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-01-27 12:29:28 -08:00
Guohan Lu
ca0e8cbe0e
[docker-ptf]: build docker ptf
- combine docker-ptf-saithrift into docker-ptf docker
- build docker-ptf under platform vs
- remove docker-ptf for other platforms

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-27 08:28:21 -08:00
Kebo Liu
9ff56445c9
Add hw-mgmt patch to support SDK OFFLINE event for handling flow within service firmware upgrade (#6550)
During ISSU, the "mlxsw_minimal" driver still tries to access firmware. In some cases FW could return a wrong critical threshold value, which causes a switch shutdown.

**- How I did it**
In order to prevent the "mlxsw_minimal" driver from accessing the ASIC during ISSU, SDK will raise an "OFFLINE" 'udev' event
at the very beginning of such a flow. When this event is received, hw-management will remove the "mlxsw_minimal" driver.
There is no need to implement the opposite "ONLINE" event since this flow ends with "kexec".

**- How to verify it**
Repeatedly perform warm reboot and make sure that no switch shutdown occurs.
2021-01-27 15:39:54 +02:00
Kebo Liu
84985e103d
[mellanox]: Update SDK to 4.4.2308, FW to *.2008.2308 (#6552)
Bugs fixes:
    All | Kernel | During system reload when CPU is loaded with heavy traffic, a Kernel Panic may occur.
    All | Modules, Port split | FW stuck when device rebooted with locked Optical Transceivers in split mode
    Spectrum-3 | PFC | On Spectrum-3 systems, slow reaction time to Rx pause packets on 40GbE ports may lead to buffer overflow on servers.
    Spectrum-3 | SN4700, Port Split | On rare occasion SN4700, conducting 100G split (4x25G) in NRZ when splitter port 1 or 2 are down, ports 3 and 4 will also go down.

Enhancements:
    All | Kernel | new notification on ISSU start, so other kernel drivers can disable any interface to ASIC

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-01-25 10:52:22 -08:00
yozhao101
be3c036794
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of the critical processes was not running
or crashed for some reason, Monit would periodically write an alerting message into syslog. If we added a new process
to a container, the corresponding Monit configuration file also needed to be updated, which made maintenance harder.

Currently we employ the event listener of Supervisord to do this monitoring. Since processes in each container are managed by
Supervisord, we only need to focus on the monitoring logic.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener takes the
following steps when it is notified that one of the critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If the auto-restart mechanism was not enabled for this container, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
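
The decision logic above can be sketched as a pure function over one supervisord event. This is an illustrative outline rather than the actual listener: the function names, the returned action strings, and the sample process name are assumptions; the header and payload token format follows the supervisord event listener protocol.

```python
def parse_tokens(line):
    """Parse a supervisord 'key:value key:value' line into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def decide_action(header_line, payload_line, critical_processes, autorestart_enabled):
    """Return 'ignore', 'kill-supervisord' or 'alert-loop' for one event."""
    header = parse_tokens(header_line)
    if header.get("eventname") != "PROCESS_STATE_EXITED":
        return "ignore"
    payload = parse_tokens(payload_line)
    # expected:0 marks an exit that supervisord did not expect.
    if payload.get("expected") != "0":
        return "ignore"
    if payload.get("processname") not in critical_processes:
        return "ignore"
    # Auto-restart enabled: kill supervisord so the container exits and restarts.
    # Otherwise: sleep, re-check the process, and write alerts to syslog.
    return "kill-supervisord" if autorestart_enabled else "alert-loop"


header = ("ver:3.0 server:supervisor serial:21 pool:listener "
          "poolserial:10 eventname:PROCESS_STATE_EXITED len:71")
payload = "processname:orchagent groupname:orchagent from_state:RUNNING expected:0 pid:2766"
action = decide_action(header, payload, {"orchagent"}, autorestart_enabled=True)
```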

- How to verify it
First, we need to check whether the auto-restart mechanism of a container is enabled by running the command show feature status. If it is enabled, one critical process should be selected and killed manually; then we need to check whether the container is restarted.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the command sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, the alerting message will appear in the syslog every minute.

- Which release branch to backport (provide reason below if selected)

[ ] 201811
[ ] 201911
[x] 202006
2021-01-21 12:57:49 -08:00
lguohan
755c73797c
[mellanox]: fix mellanox hw-management build (#6471)
use dpkg-buildpackage build with fakeroot

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-18 13:10:27 -08:00
Kebo Liu
4cf9316ec3
[Mellanox] Make determine-reboot-cause service start after hw-management service (#6465)
**- Why I did it**

On the Mellanox platform, the reboot cause is fetched from certain sysfs nodes that are created by the hw-management service. So the determine-reboot-cause service shall start after hw-management; otherwise it could fail because the related sysfs is not available yet.

**- How I did it**

Add a patch to the hw-management service to make sure the determine-reboot-cause service starts after it.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-01-15 11:38:31 -08:00
Kebo Liu
1b2980540d
[mellanox][platform api] fix a missing import time module (#6458)
The "time" module was not imported, which will cause an error when that branch is hit.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-01-15 08:01:11 -08:00
Junchao-Mellanox
0a49edb68e
[Mellanox] Fix issue: need import initialize_sdk_handle in get_sdk_handle (#6435)
Found that test_sfp.py failed because a method was used without being imported.
2021-01-13 09:42:04 -08:00
Kebo Liu
015b421e5e
[Mellanox] [platform API] Fix “local variable 'label_port' referenced before assignment” error (#6419)
In rare cases, xcvrd fails with "UnboundLocalError: local variable 'label_port' referenced before assignment".

Initialize "label_port" to None at the beginning of the function, to cover the case where "label_port" is never assigned.
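
The failure mode and the fix can be reproduced with a small stand-alone example. The function names and the event data shape are hypothetical and not the actual platform API code.

```python
def find_label_port_buggy(events, wanted):
    for name, port in events:
        if name == wanted:
            label_port = port
    # Raises UnboundLocalError when no event matched.
    return label_port


def find_label_port_fixed(events, wanted):
    label_port = None  # initialized at the beginning of the function
    for name, port in events:
        if name == wanted:
            label_port = port
    return label_port
```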
2021-01-12 10:43:57 -08:00