Commit Graph

482 Commits

Author SHA1 Message Date
Yakiv Huryk
0ced7081c7
[asan] add print_suppressions=0 to ASAN configs (#11252)
- Why I did it
To provide an ability to suppress ASAN false positives and have a clean ASAN report for docker-sonic-vs/mlnx-syncd/orchagent docker

- How I did it
Added the "print_suppressions=0" to ASAN configs.

- How to verify it
add a suppression to some ASAN-enabled component (the suppression should catch some leak)
build with ENABLE_ASAN=y
run a test and see that the ASAN report is empty instead of having the suppression summary

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-06-28 18:45:52 +03:00
Kebo Liu
7ac590b5c5
[Mellanox] Enhance Platform API to support SN2201 - RJ45 ports and new components mgmt. (#10377)
* Support new platform SN2201 and RJ45 port

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* remove unused import and redundant function

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* fix error introduced by rebase

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* Revert the special handling of RJ45 ports (#56)

* Revert the special handling of RJ45 ports

sfp.py
sfp_event.py
chassis.py

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove deadcode

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Support CPLD update for SN2201

A new class is introduced, deriving from ComponentCPLD and overloading _install_firmware
Change _install_firmware from private (starting with __) to protected, making it overloadable

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Initialize component BIOS/CPLD

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove swb_amb, which no longer exists on the DVT board

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove the nonexistent sensor - switch board ambient - from platform.json

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Do not report error on receiving unknown status on RJ45 ports

Translate it to disconnect for RJ45 ports
Report error for xSFP ports

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Add reinit for RJ45 to avoid exception

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>
Co-authored-by: Stephen Sun <stephens@nvidia.com>
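
The change of _install_firmware from a double-underscore name to a single-underscore name matters because of Python name mangling. The following is a minimal sketch of that language behavior, not the code from this commit: the subclass name, the helper method names, and the file name "cpld.bin" are hypothetical, and only ComponentCPLD and _install_firmware are taken from the message above.

```python
class ComponentCPLD:
    """Hypothetical stand-in for the base component class."""

    def update_firmware(self, image_path):
        # Dispatches through a single-underscore method, so a subclass
        # can substitute its own implementation.
        return self._install_firmware(image_path)

    def _install_firmware(self, image_path):
        return "generic CPLD install: " + image_path

    def __private_install(self, image_path):
        # Name mangling stores this as _ComponentCPLD__private_install.
        return "base private install: " + image_path

    def update_via_private(self, image_path):
        # Inside the class body this call is compiled as
        # self._ComponentCPLD__private_install(image_path).
        return self.__private_install(image_path)


class ComponentCPLDSN2201(ComponentCPLD):
    """Hypothetical subclass for illustration only."""

    def _install_firmware(self, image_path):
        # Overrides the base method as expected.
        return "SN2201 CPLD install: " + image_path

    def __private_install(self, image_path):
        # Mangled to _ComponentCPLDSN2201__private_install, so it does
        # NOT replace the base class method.
        return "SN2201 private install: " + image_path


component = ComponentCPLDSN2201()
protected_result = component.update_firmware("cpld.bin")
private_result = component.update_via_private("cpld.bin")
print(protected_result)
print(private_result)
```

The single-underscore call reaches the subclass implementation, while the double-underscore call still lands in the base class.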
2022-06-20 19:12:20 -07:00
Junchao-Mellanox
f135f37a50
[Mellanox] optimize platform API import time (#10815)
- Why I did it
"import sonic_platform" takes about 600ms ~ 1000ms, which is slow. After this optimization, it takes about 100ms. CLIs that do not need the slow imports are therefore faster than before.

- How I did it
Find the slow imports and perform them only when needed.

- How to verify it
Measure the import time.
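
One common way to defer a slow import until first use is a small lazy wrapper. This is a sketch of the general technique only, not the code from this PR: the LazyModule class is hypothetical, and the standard json module stands in for whichever platform modules were actually slow.

```python
import importlib


class LazyModule:
    """Hypothetical helper: import the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only called for attributes that are not found on the wrapper
        # itself, i.e. for names that belong to the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


lazy_json = LazyModule("json")           # nothing imported by the wrapper yet
loaded_before = lazy_json._module is not None
encoded = lazy_json.dumps({"a": 1})      # first use triggers the import
loaded_after = lazy_json._module is not None
print(loaded_before, encoded, loaded_after)
```

A plain function-level import statement achieves the same effect with less machinery when the module is used in only one or two places.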
2022-06-07 15:13:16 +03:00
Volodymyr Samotiy
5c869492bd
[Mellanox] Update SDK/FW to 4.5.2262/xx.2010.2262 (#10882)
- Why I did it
To include latest fixes:
1. Warmboot | When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch returned an error even if the configuration was identical to that done before performing the ISSU.
2. Link Up | When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
3. Shared buffer | While moving from lossless to lossy while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized.

- How I did it
Updated SDK submodule along with the relevant Makefiles

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-06-07 15:11:25 +03:00
vdahiya12
0b8e463db4
[sonic-platform-common][sonic-platform-daemons] submodule update; Remove python2 sonic-platform-common wheel (#10994)
* [sonic-platform-common][sonic-platform-daemons] submodule update

vdahiya@vdahiya-dev3:~/sonic-buildimage8/sonic-buildimage/src/sonic-platform-daemons$
git log --oneline 9ac12bf..master
0d90023 (HEAD -> master, origin/master, origin/HEAD, origin/202205) grpc
client implementation for active-active dualtor (#248)
6b8bf69 [ycabled] Fix some syntax warnings in ycabled (#263)
2bcf936 [ycabled] fix the posting for mux_cable_static_info per downlink
when ycabled is spawned; synchronizing executing Telemetry API (#257)
ce217c0 Include changes from xcvr_api in transceiver_info table (#253)
e0f8a35 Fix checkReplyType failed issue via recreating xcvr_table_helper
on forking subprocess (#255)

f575a40 (origin/master, origin/HEAD, origin/202205, master)
[Credo][Ycable] changes for synchronizing executing Telemetry API's when
mux toggle is inprogress (#280)
b043372 [sonic_ssd] Nokia-7215: "show platform ssdhealth" not showing
health percent (#279)
d62d3d6 [CMIS]Fix low-power to high power mode transition (#268)
f918125 [syseeprom] Enable display of vendor extension TLV content
(#270)
4e08440 [Credo][Ycable] improve logging for Server Powered off/Faulty
cables (#272)

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

* remove python2 wheel for sonic-platform-common

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

* remove python2 platform_common definitions

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
2022-06-04 07:41:15 -07:00
Yakiv Huryk
7306d68411
[build][asan] make dpkg cache asan-aware (#10750)
Currently, the build with ASAN_ENABLE=y reuses the packages built with
ASAN_ENABLE=n (and vice versa). To address this issue, ASAN_ENABLE is added to DEP_FLAGS for asan-enabled packages (docker-syncd-mlnx, syncd, docker-orchagent, swss).

- Why I did it
To make dpkg cache use/rebuild the packages for ASAN_ENABLE=y/n.

- How I did it
Added ASAN_ENABLE to the DEP_FLAGS for asan-enabled packages.

- How to verify it
Built with ASAN_ENABLE=y/n and checked the .flags .log files.

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-05-31 11:15:44 +03:00
Yakiv Huryk
bd91b2eef3
[asan] add debug package for asan-enabled containers (#10953)
This is to improve the readability of ASAN reports. The debug package adds function names and source code references to the backtrace (currently, there are only binary addresses of functions)

Another way to address this issue is to build the image with "INSTALL_DEBUG_TOOLS=y". The downside of this approach is that the image size and compilation time are unnecessarily big. Also, the idea is to make the "ENABLE_ASAN" self-sufficient, which would not be the case for this approach.

- Why I did it
To improve the readability of asan logs.

- How I did it
Added SYNCD_DBG and SWSS_DBG to corresponding docker images for ASAN_ENABLE=y build

- How to verify it
Add artificial memory leak
Build with ASAN_ENABLE=y
Test the image and check the ASAN report

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-05-31 09:24:18 +03:00
Andriy Yurkiv
29043ff026
[MFT] Update MFT version to MFT 4.20.0-34 (#10933)
- Why I did it
Update MFT to newer version

- How I did it
Update MFT_VERSION in platform/mellanox/mft.mk

- How to verify it
Check version via dpkg -l | grep mft

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2022-05-28 15:45:02 +03:00
Alexander Allen
71c868f56a
Upgrade libasan to version 6 in docker-syncd-mlnx to align with bullseye libasan (#10886)
2022-05-27 11:28:01 -07:00
Andriy Yurkiv
70d71f99f5
[Mellanox] Credo Y-cable | add more log info, checks, fix exception message (#10779)
- Why I did it
Script fails when there is an exception while reading

- How I did it
Add more logs and checks. Fix wrong variable naming and messages.

- How to verify it
Provoke exception while read_eeprom() and check that it is handled properly
2022-05-19 17:36:02 +03:00
Alexander Allen
d202bf26d7
Upgrade mellanox platform containers (syncd / saiserver / syncd-rpc) and pmon to bullseye (#10580)
Fixes #9279

- Why I did it
Part of larger effort to move all SONiC systems to bullseye

- How I did it
1. Update container makefiles with correct dependencies
2. Update container Dockerfile with correct base image
3. Update container Dockerfile with correct apt dependencies
4. Update any other makefiles with dependencies to remove python2 support
5. Minor changes to support bullseye / python3

- How to verify it
Run regression on the switch:
1. Verify PTF community tests work
2. Verify syncd runs and all ports come up / pass traffic
3. Verify all platform tests succeed
2022-05-10 12:45:28 +03:00
vmittal-msft
9ae17e66a3
[sonic-sairedis update] Support for SAI header v1.10.2 with BRCM SAI v7.1.0.0 and MLNX SAI v1.21.1.0 (#10583)
2022-05-05 20:27:29 -07:00
Alexander Allen
53e5fe6a93
[Mellanox] Upgrade mellanox SDK to 4.5.1500 and mlnx-sai to 1.21.1.1 (#10675)
Update SDK/FW to 4.5.1500/2010.1500 and SAI version to 1.21.1.1

SDK/FW features:
1. Added support for Finisar DR4 (FTCD4523E2PCM) on Spectrum-2 and Spectrum-3 systems.

SAI Features:
1. ECMP overlay support for IPv6
2. BFD offloading / 4K scale
3. Host interface user traps + improved trap registration (table entry)
4. gcc11 compilation fixes
5. Read support for ACL redirect action
6. Optimize ECMP DB size
7. Buffer descriptors new defaults
8. Updated port mapping for SN2201

SAI Fixes:
1. Debug counter removal when configured with all drop reasons

- Why I did it
Upgrade Mellanox SDK and SAI versions to latest

- How I did it
Updated submodule pointers

- How to verify it
Regression tested
2022-04-29 20:50:59 +03:00
Kalimuthu-Velappan
bc30528341
Parallel building of sonic dockers using native dockerd(dood). (#10352)
Currently, the build dockers are created as user dockers (docker-base-stretch-<user>, etc) that are
specific to each user. But the sonic dockers (docker-database, docker-swss, etc) are
created with a fixed docker name and common to all the users.

    docker-database:latest
    docker-swss:latest

When multiple builds are triggered on the same build server, this causes a parallel-build conflict because
all the build jobs try to create the same docker with the latest tag.
This happens only when sonic dockers are built using the native host dockerd for sonic docker image creation.

This patch creates all sonic dockers as user sonic dockers and then, while
saving and loading the user sonic dockers, it renames the user sonic
dockers to the correct sonic dockers with the tag latest.

	docker-database:latest <== SAVE/LOAD ==> docker-database-<user>:tag

The user sonic docker names are derived from the 'DOCKER_USERNAME' and 'DOCKER_USERTAG' make env
variables. Using a Jinja template, the FROM docker name is replaced with the correct user sonic docker name for
loading and saving the docker image.
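
The name mapping described above can be sketched as follows. This is an illustration only: both function names and the example user and tag values are hypothetical, and the real build performs this through make variables and Jinja templates rather than Python.

```python
def user_image_name(image, username, usertag):
    """docker-database -> docker-database-<user>:<tag> (per-user build name)."""
    return "{}-{}:{}".format(image, username, usertag)


def canonical_image_name(user_image, username):
    """docker-database-<user>:<tag> -> docker-database:latest (saved name)."""
    repository = user_image.rsplit(":", 1)[0]
    suffix = "-" + username
    if repository.endswith(suffix):
        repository = repository[: -len(suffix)]
    return repository + ":latest"


# "alice" and "a1b2c3" are made-up values for DOCKER_USERNAME / DOCKER_USERTAG.
build_name = user_image_name("docker-database", "alice", "a1b2c3")
final_name = canonical_image_name(build_name, "alice")
print(build_name, "<== SAVE/LOAD ==>", final_name)
```

Because each user builds under a distinct name, two jobs on the same dockerd no longer overwrite each other's image.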
2022-04-28 08:39:37 +08:00
Junchao-Mellanox
af5e5c4c94
[Mellanox] Adjust PSU voltage WA (#10619)
- Why I did it
InvalidPsuVolWA.run might raise an exception if the user powers off the PSU while it is running. This exception is not caught and propagates to psud, which causes psud to fail to update PSU data to DB.

- How I did it
1. Change the log level when the WA does not work. This can happen when the user powers off the PSU, so a warning is more appropriate than an error
2. Change the wait time from 5 to 1 to avoid introducing too much delay in psud. 1 second is usually enough per my test
3. Give a default return value for the functions get_voltage_low_threshold and get_voltage_high_threshold to keep exceptions from reaching psud

- How to verify it
Manual test.
Run sonic-mgmt regression
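
The pattern in items 1 and 3 above (log a warning and return a default instead of letting the exception reach the caller) can be sketched as below. The helper name, the logger name, and the choice of None as the default are assumptions for illustration; the actual default used by get_voltage_low_threshold and get_voltage_high_threshold is not stated in this message.

```python
import logging

logger = logging.getLogger("psu_sketch")  # hypothetical logger name


def read_with_default(reader, default=None):
    """Call reader(); on any failure, log a warning and return the default."""
    try:
        return reader()
    except Exception as error:
        # Warning rather than error: a powered-off PSU is an expected situation.
        logger.warning("PSU read failed, returning default: %s", error)
        return default


def read_removed_sysfs():
    # Simulates the sysfs node disappearing after the PSU is powered off.
    raise FileNotFoundError("psu voltage sysfs removed")


missing = read_with_default(read_removed_sysfs)
present = read_with_default(lambda: 11.5)
print(missing, present)
```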
2022-04-22 11:02:30 +03:00
Kebo Liu
a0c76b1bc9
[Mellanox] support newly added reboot cause (#10531)
- Why I did it
Implement newly added reboot causes in PR Azure/sonic-platform-common#277

- How I did it
Map the reboot cause sysfs to the newly added reboot causes.

- How to verify it
manual test, check whether the reboot cause is correct after rebooting the switch in various ways.
run the community reboot test to see whether the reboot cause checker is passing.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-04-18 10:55:56 +03:00
Yakiv Huryk
d9117d9411
[Mellanox][asan] add address sanitizer support for syncd (#10266)
Why I did it
To support address sanitizer for Mellanox syncd

How I did it
/var/log/asan is mapped for syncd container (the same as for swss)
container stop() has a timeout (60s) for syncd (the same as for swss)
This is so libasan has enough time to generate a report.
added ASAN's log path to Mellanox syncd supervisord.conf
added "asan: yes" to sonic_version.yml
How to verify it
Added artificial memory leaks
Compiled with ENABLE_ASAN=y
Installed the image on DUT
Rebooted the DUT
Verified that /var/log/asan/syncd-asan.log contains the leaks

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-04-14 15:00:32 -07:00
Junchao-Mellanox
0191300b96
[Mellanox] Auto correct PSU voltage threshold (WA) (#10394)
- Why I did it
There is a hardware bug that PSU voltage threshold sysfs returns incorrect value. The workaround is to call "sensor -s" to refresh it.

- How I did it
Call "sensor -s" when the threshold value is incorrect and the PSU is "DELTA 1100"

- How to verify it
Unit test and Manual test
2022-04-14 08:14:40 +03:00
Kebo Liu
85539e7e08
[Mellanox] Update hw-mgmt package to version V.7.0020.2004 (#10401)
- Why I did it
Take new hw-mgmt release to SONiC, including:

New features:
1. hw-mgmt: add to PSU FW upgrade tool command to show current FW version
2. hw-mgmt: add to PSU FW upgrade tool support for single-PSU-in-the-system FW upgrade
3. hw-mgmt: add attribute “/firmware” to show FW version of restricted upgradable PSUs only
4. hw-mgmt: Add NVME temperature reports attributes (_alarm/_crit/_min/_max)

Bug fix:
1. psu: redundant i2c_addr attributes being created for psu 3 & 4 in system having only 2 psus.
2. hw-mgmt: in SPC1/2 i2c driver removal is too slow vs. ASIC reset causing non-functional log errors
3. PSU thresholds sysfs changed in 5.10 to “read only” preventing modification (modification required due PSU HW bug)
4. CPLD3 sysfs attribute missing after chip down/up flow
5. sysfs attributes missing when hw-mgmt is restarted (stop/start) within systemd

Release notes can be found from link https://github.com/Mellanox/hw-mgmt/blob/V.7.0020.2004/debian/Release.txt

- How I did it
Update hw-mgmt make file with new version number
Update hw-mgmt submodule pointer

- How to verify it
Run platform regression on all Mellanox platform

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-03-31 08:21:09 +03:00
Andriy Yurkiv
1e2e493daa
[Mellanox] Credo Y-cable read_eeprom/write_eeprom API implementation (#10320)
- Why I did it
Implement read_eeprom/write_eeprom API for Credo Y-cable for Dual ToR Active-Standby

- How I did it
Use mlxreg utility for API implementation

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2022-03-30 20:41:31 +03:00
Dror Prital
8bc81206c5
[Mellanox] Update NVIDIA License header for files changed since 1.1.2022 (#10289)
- Why I did it
Update NVIDIA Copyright header to "mellanox" files which were changed since 1.1.2022

- How I did it
Update the copyright header

- How to verify it
Sanity tests and PR checkers.
2022-03-23 13:19:25 +02:00
Oleksandr Ivantsiv
9565ef7a9a
[Mellanox] Refactor SFP to use new APIs. (#10317)
- Why I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.

- How I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.

- How to verify it
Run sonic-mgmt/platform_tests/sfp tests
2022-03-23 09:16:03 +02:00
Kebo Liu
42ab5b8eaa
[Mellanox] update MFT version to 4.18.0-106 (#10304)
- Why I did it
With the previous MFT 4.18.1-16, there is a bug in the mstdump tool that makes it access a wrong address. This issue is confirmed not to exist in the official 4.18.0-106.

- How I did it
Update the MFT version to 4.18.0-106

- How to verify it
Run regression on Mellanox platforms
2022-03-21 19:38:37 +02:00
Junchao-Mellanox
f0ddd102d5
[Mellanox] Add CPU thermal control for Nvidia platforms (#10202)
Why I did it
Add CPU thermal control for Nvidia platforms which will be enabled for platforms that have heavy CPU load. Now it is only enabled on 4800, and it will be enabled on future platforms.

How I did it
Check CPU pack temperature and update cooling level accordingly

How to verify it
Manual test
Added sonic-mgmt test case, PR link will update later
2022-03-21 09:54:52 -07:00
Junchao-Mellanox
dc04d64219
[Mellanox] Fix issue: psu might use wrong voltage sysfs which causes invalid voltage value (#10231)
- Why I did it
Fix issue: psu might use wrong voltage sysfs which causes invalid voltage value. The flow is like:

1. User powers off a PSU
2. All sysfs files related to this PSU are removed
3. User does a reboot/config reload
4. PSU uses the wrong sysfs as the voltage node

- How I did it
Always try to find an existing sysfs.

- How to verify it
Manual test
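
The idea of always probing for a sysfs node that currently exists can be sketched as below. The helper name and the three candidate file names are hypothetical, and a temporary directory stands in for the real sysfs tree.

```python
import os
import tempfile


def first_existing(paths):
    """Return the first path that exists right now, or None if none do."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None


with tempfile.TemporaryDirectory() as sysfs_dir:
    # Hypothetical candidate voltage nodes, in order of preference.
    names = ("psu1_volt_out2", "psu1_volt_out", "psu1_volt")
    candidates = [os.path.join(sysfs_dir, name) for name in names]

    # Only the last candidate is present in this simulated tree.
    open(candidates[2], "w").close()

    chosen = os.path.basename(first_existing(candidates))
    none_found = first_existing(candidates[:2])

print(chosen, none_found)
```

Resolving the path at read time, rather than caching it once at start-up, avoids holding on to a node that was removed while the PSU was powered off.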
2022-03-20 10:34:04 +02:00
Yang Wang
c8db7a2d52
[Mellanox][SAISERVER] Support Mellanox saiserverv1 and saiserverv2 docker (#9686)
* support saiserverv1 and saiserverv2 docker

* add saiserver into buster and revert some changes

* update thrift version
2022-03-10 13:15:44 +08:00
Alexander Allen
7c4fbf0455
[Mellanox] Add patch to hw-mgmt to prevent loading of non-existent kernel modules (#10073)
- Why I did it
The latest upgrade of Mellanox hw-mgmt V7.0020.1300 introduced a couple new kernel modules for new Mellanox platforms that have yet to be upstreamed to the linux kernel.

As these new platforms do not have SONiC support we elected not to upstream these new drivers to sonic-linux-kernel but hw-mgmt expects them to exist which is causing a non-functional error on switch boot.

Feb 15 00:09:55.374130 r-leopard-simx-74 ERR systemd-modules-load[269]: Failed to find module 'emc2305'
Feb 15 00:09:55.374141 r-leopard-simx-74 ERR systemd-modules-load[269]: Failed to find module 'ads1015'
To resolve this we can patch hw-mgmt to no longer attempt to load these modules by default.

- How I did it
Added a SONiC patch to Mellanox hw-mgmt in order to remove the unused kernel modules which were not upstreamed to sonic-linux-kernel

- How to verify it
Boot switch and verify there are no error logs regarding kernel modules failing to load.
2022-02-28 08:08:19 +02:00
Junchao-Mellanox
fe59e0f2c0
[Mellanox] Fix issue: thermal zone threshold value 0 causes fan speed stuck at 100% (#10057)
- Why I did it
The SONiC thermal control algorithm compares the thermal zone temperature with the thermal zone threshold. Previously, a thermal zone with no thermal sensor could still get its threshold. However, a recent driver patch changes this behavior: a thermal zone with no thermal sensor returns 0 for the threshold. We need to ignore such thermal zones.

- How I did it
Ignore thermal zones whose temperature is 0.

- How to verify it
Added unit test case and Manual test
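
A sketch of the check described above, assuming that a sensor-less zone reads 0 for its temperature as well as for its threshold (which the fix implies). The function name, zone names, and values are made up for illustration.

```python
def zones_over_threshold(zones):
    """Return names of zones at or above threshold, skipping sensor-less zones."""
    over = []
    for name, temperature, threshold in zones:
        if temperature == 0:
            # No sensor behind this zone; its threshold of 0 is meaningless
            # and would otherwise always compare as exceeded.
            continue
        if temperature >= threshold:
            over.append(name)
    return over


# (name, temperature, threshold) in hypothetical millidegree units.
zones = [
    ("zone-asic", 45000, 105000),
    ("zone-module1", 0, 0),          # module slot without a thermal sensor
    ("zone-module2", 72000, 70000),
]
hot_zones = zones_over_threshold(zones)
print(hot_zones)
```

Without the skip, "zone-module1" would satisfy 0 >= 0 and be reported as over threshold, which is the kind of comparison that can pin the fan at full speed.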
2022-02-24 12:05:56 +02:00
Dror Prital
cf1bc8dc65
[Mellanox] Upgrade ASIC FW tool to 4.18.1-16 (#9981)
- Why I did it
Update MFT to version 4.18.1-16 for bugs fixes and new SN2201 support

- How I did it
Advance to MFT tool version to 4.18.1-16

- How to verify it
Manually tested on all Mellanox platforms (ASIC FW Upgrade, link debug tools, CPLD upgrade, etc.)
2022-02-14 10:39:47 +02:00
Stephen Sun
486e9b0c75
Fix issue: module id got from get_change_event is wrong (#9961)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-02-13 09:20:37 +05:30
Alexander Allen
0ae2906c06
[Mellanox] Update mellanox hw-mgmt submodule and versions to V.7.0020.1300 (#9860)
- Why I did it
New version of mellanox platform management code available adding support for new platforms and fixing bugs.

- How I did it
1. Updated the submodule
2. Updated makefile version references
3. Regenerated SONiC patches
2022-02-06 16:42:53 +02:00
Junchao-Mellanox
43e967d6a4
Fix issue: 'sx_port_mapping_t' object has no attribute 'slot_id' (#9835)
- Why I did it
Fix issue: 'sx_port_mapping_t' object has no attribute 'slot_id'. sx_port_mapping_t only has attribute slot.

- How I did it
Change slot_id to slot.

- How to verify it
Manual test
2022-01-27 07:38:32 +02:00
Junchao-Mellanox
8e590da563
[Mellanox] Fix select timeout in sfp event (#9795)
- Why I did it
Python select.select accepts an optional timeout value in seconds; however, the value passed to it is in milliseconds.

- How I did it
Convert the value from milliseconds to seconds.

- How to verify it
Manual test
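
The unit mismatch can be illustrated with a short sketch. The wait_readable helper is hypothetical and a pipe stands in for the SFP event file descriptor; select.select on pipes is POSIX-only.

```python
import os
import select
import time


def wait_readable(fd, timeout_ms):
    """Wait up to timeout_ms milliseconds for fd to become readable."""
    # select.select() expects seconds (float), so convert explicitly.
    timeout_s = timeout_ms / 1000.0
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return bool(readable)


read_fd, write_fd = os.pipe()

start = time.monotonic()
idle = wait_readable(read_fd, 200)     # nothing written yet: times out
elapsed = time.monotonic() - start

os.write(write_fd, b"x")
ready = wait_readable(read_fd, 200)    # data pending: returns immediately

os.close(read_fd)
os.close(write_fd)
print(idle, ready, round(elapsed, 2))
```

Passing 200 directly to select.select would have blocked for up to 200 seconds instead of 0.2 seconds.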
2022-01-27 07:36:56 +02:00
Dror Prital
f24f19391b
[Mellanox] Update SDK/FW to 4.5.1208/2010.1218 and SAI version to 1.20.2.5 (#9619)
- Why I did it
To include latest SDK fixes:
1.  On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting SN4600C, 100GbE port with CWDM4 module (Gen 3.0), link up time is 30 seconds.

and to include SAI fixes / changes:
1. Reduce verbosity for resource check vendor data not found
2. Fix metadata validation, check default value on conditions check
3. Add 100MB, 10MB to 2201 system
4. L3 VXLAN overlay ECMP
5. VXLAN srcport API implementation
6. Fix scheduler profile null (default values) when set on sub group scheduler group
7. Fix ACL binding restoration when port leaves a LAG
8. Fix route logic for set next hop/action and reference counter for ECMP overlay

- How I did it
1. Updated SDK/FW submodule and relevant makefiles with the required versions.
2. Update SAI submodule and relevant makefile with the required version.

- How to verify it
Build an image and run tests from "sonic-mgmt".
2022-01-26 11:01:55 +02:00
Alexander Allen
8a07af95e5
[Mellanox] Modified Platform API to support all firmware updates in single boot (#9608)
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.

How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmwares in the correct order prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
2022-01-24 00:56:38 -08:00
Junchao-Mellanox
4ae504a813
[Mellanox] Optimize thermal control policies (#9452)
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.

- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable

- How to verify it
1. Manual test
2. Regression
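
The rule that the minimum allowed fan speed is the maximum of the speeds expected by all policies can be sketched as below. The function name, the policy names, the percentages, and the fallback of 0 when no policy is active are assumptions for illustration.

```python
def minimum_allowed_fan_speed(expected_by_policy, fallback=0):
    """Lowest speed the fans may run at: the highest demand among all policies."""
    return max(expected_by_policy.values(), default=fallback)


# Hypothetical per-policy expected speeds, in percent.
expected = {
    "psu_absent": 60,
    "fan_fault": 70,
    "asic_temperature": 50,
}
floor_speed = minimum_allowed_fan_speed(expected)
empty_floor = minimum_allowed_fan_speed({})
print(floor_speed, empty_floor)
```

Taking the maximum means a policy can only raise the floor and never lower a floor set by another policy.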
2022-01-19 11:44:37 +02:00
Junchao-Mellanox
ca36b4a57b
Fix build issue: cannot import name FW_AUTO_ERR_UKNOWN- required module not found (#9764)
2022-01-14 20:12:53 +05:30
xumia
3e46582314
[Bug][Build]: Fix the mlnx-platform-api dpkg cache config error (#9705)
Building the Mellanox image fails when the dpkg cache is enabled, which impacts PR checks.
2022-01-09 09:21:59 +08:00
Lior Avramov
a9c9f56eeb
[Mellanox] Include CPU board and switch board sensors only on SN2201 system (#9644)
Why I did it
Additional sensors that are needed only for a specific system were recently added to all systems and caused errors.

How I did it
* Include CPU board and switch board sensors only on SN2201 system
* Fix issue in test_chassis_thermal, now it skips non existing thermals.

How to verify it
Run show platform temperature

Signed-off-by: liora <liora@nvidia.com>
2022-01-05 10:25:47 -08:00
Raphael Tryster
c2b0af89fb
Added platform.json in Mellanox Spectrum-4 and older simx platforms (#9621)
- Why I did it
Added missing functionality for dynamic buffer calculation in Spectrum-4.

- How I did it
Added a section of code in asic_table.j2 for Spectrum-4, and added the simx version of SN5600 to the supported list.

- How to verify it
Manually: buffershow -l should show all ingress/egress lossy/lossless pools, and all fields of profiles should show values.
Automatically: https://github.com/Azure/sonic-mgmt/blob/master/tests/qos/test_buffer.py

Signed-off-by: Raphael Tryster <raphaelt@nvidia.com>
2022-01-02 09:58:40 +02:00
Stepan Blyshchak
f3df6e2f1b
[Mellanox] fix hw-mgmt patches (#9539)
- Why I did it
To fix an issue that hw-mgmt patches were not applied. One patch was already in upstream hw-mgmt package thus applying it again caused an error and no other patches were applied. Also, I did it to improve the Makefile, so that the make will fail in case patches fail to apply.

- How I did it
Removed obsolete patch, made applying patches a hard failure in the build.

- How to verify it
Run the make and verify patches are applied.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-12-16 16:02:03 +02:00
Junchao-Mellanox
d05afb5dbd
[Mellanox] Rename platform x86_64-mlnx_msn4800 to x86_64-nvidia_sn4800 (#9512)
- Why I did it
Rename platform x86_64-mlnx_msn4800 to x86_64-nvidia_sn4800

- How I did it
Rename platform folder as well as all code that reference the platform name

- How to verify it
Manual test
2021-12-15 09:08:04 +02:00
Stepan Blyshchak
c3008c3ce7
[Mellanox][SDK] Build SDK with PRM sniffer support (#9500)
- Why I did it
To have an ability to use PRM sniffer.

- How I did it
Enabled the option in configure flags.

- How to verify it
Built and ran on switch. Enabled the feature in runtime and checked the sniffer recording.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-12-14 10:56:19 +02:00
Shilong Liu
6402a02261
[build] Remove duplicated DOCKER_SYNCD_BASE target. (#9378)
2021-12-13 11:06:10 +08:00
xumia
49dd5db94d
[Build]: Fix docker images built multiple times issue (#9253)
The same docker image is built multiple times after upgrading to bullseye, the build time is increased to about 15 hours from 6 hours.
See log: https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_apis/build/builds/50390/logs/9
Line 1437: 2021-11-11T11:15:02.7094923Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 1446: 2021-11-11T11:37:41.1073304Z [ finished ] [ target/docker-sonic-telemetry.gz ]
Line 1459: 2021-11-11T11:38:20.6293007Z [ building ] [ target/docker-sonic-telemetry.gz-load ]
Line 1462: 2021-11-11T11:38:28.1250201Z [ finished ] [ target/docker-sonic-telemetry.gz-load ]
Line 2906: 2021-11-11T18:57:42.8207365Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 2917: 2021-11-11T19:43:47.1860961Z [ finished ] [ target/docker-sonic-telemetry.gz ]
Line 3997: 2021-11-11T22:49:35.0196252Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 4002: 2021-11-11T23:14:00.4127728Z [ finished ] [ target/docker-sonic-telemetry.gz ]

How I did it
Place the python wheels in another folder relative to the build distribution.

Co-authored-by: Ubuntu <xumia@xumia-vm1.jqzc3g5pdlluxln0vevsg3s20h.xx.internal.cloudapp.net>
2021-12-09 15:42:44 -08:00
Raphael Tryster
e3c0a888c9
[Mellanox] Add support of SN5600 platform on top of Nvidia ASIC simulation (#9392)
- Why I did it
Add new Spectrum-4 system support SN5600 on top of Nvidia ASIC simulator.

- How I did it
Add all relevant system and simulator SKU.
Updated syseeprom.hex and related directories to reflect Nvidia SN5600 brand name.

- How to verify it
Tested init flow, basic show commands, up interfaces, traffic test.

Signed-off-by: Raphael Tryster <raphaelt@nvidia.com>
2021-12-09 17:46:24 +02:00
Volodymyr Samotiy
3f00b5df84
[Mellanox] Update SAI to v1.20.1.1 and SDK/FW to v4.5.1158/v2010.1154 (#9474)
- Why I did it
To include latest fixes.

SAI
1. Reclaim buffers for port which is admin down
2. Support for Spectrum-4 on Nvidia ASIC simulation
3. Support for SN2201
4. Fix host interface table entry, one channel per trap (fix sflow double registration)
5. 2 new queue counters - ecn marked packets + shared current occupancy
6. Fix storm policer unknown unicast
7. Add key/value for accuflow counters
8. Add MAC move
9. Add mirror congestion mode attribute

SDK
1. Under various circumstances, Ethernet ports falsely showed that InfiniBand cables were connected.
2. In SN4600C, at times, the link up time in both DAC and optics cables may, in the worst case, take up to 15 seconds.
3. Using SN4600C with copper or optics loopback cables in NRZ speeds, the link may take a long time to come up
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
6. Aggregation event is missing for WJH L2 drop reason 'Unicast egress port list is empty'.
7. Tying the SCL and SDA of the optical modules to 3.3V causes errors.
8. On SN4600, there was a delay of more than 10 seconds from the time a data packet is sent from CPU until it is transmitted through one of the switch ports.
9. While using SN4600C system with Finisar FTLC1157RGPL 100GbE CWDM4 modules, intermittent link flaps across multiple ports may be observed.
10. In Spectrum-2 and Spectrum-3 systems, link did not work in auto-negotiation when connected to Marvell PHY. KR mechanism has been enhanced to integrate with Marvell PHY.
11. The tunnel counter now counts dropped packets for Spectrum-2 and Spectrum-3, consistent with Spectrum behavior, and counts the ECN dropped packets as well.
12. When connecting SN3800 to Cisco-9000, the fast-linkup flow will fail and the link will rise in the normal flow.
13. Race condition in WJH library: when multiple threads load the LAG shared memory concurrently, the program may crash.
14. Add WJH L2 drop reason 'Unicast egress port list is empty' as a new drop reason.
15. Fixed a memory leak in sx_api_port_sflow_statistics_get API.
16. During initialization flow, the command interface that is used by the minimal driver and SDK caused the collision in the firmware since the same buffer is used in the firmware for the two interfaces.
17. Fix route issue on Kernel 5.10

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-12-09 14:52:18 +02:00
Aravind Mani
ac2885a988
[SFP-Refactor] Modify transceiver key name (#9447)
* Modify transceiver key name

* fix alignment
2021-12-09 12:38:45 +05:30
Junchao-Mellanox
ae0989784d
[Mellanox] Allow user to set LED to orange (#9259)
Why I did it
Nvidia platform API does not support setting the LED to orange

How I did it
Allow user to set LED to orange

How to verify it
Added unit test
Manual test
2021-12-08 13:05:10 -08:00
Lior Avramov
1fce3ebda3
[Mellanox] Add support for SN2201 platform (#9333)
- Why I did it
Add support for SN2201 platform

- How I did it
Add required content for SN2201 platform
Note: still missing kernel driver support for this system. Once all is upstream will be updated as well.

- How to verify it
Install and basic sanity tests including traffic.

Signed-off-by: liora <liora@nvidia.com>
2021-12-06 14:47:50 +02:00