Commit Graph

621 Commits

Author SHA1 Message Date
Kebo Liu
7873a9131d
[Mellanox] Skip the leftover hardware reboot cause in case of last boot is warm/fast reboot (#13246)
- Why I did it
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.

- How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.

- How to verify it
a. Manual test:
    - Perform a power loss
    - Perform a warm/fast reboot
    - Check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-01-11 16:50:46 +02:00
Junchao-Mellanox
2126def04e
[infra] Support syslog rate limit configuration (#12490)
- Why I did it
Support syslog rate limit configuration feature

- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration

- How to verify it
Manual test
New sonic-mgmt regression cases
2022-12-20 10:53:58 +02:00
Vadym Hlushko
1a5889ade7
[SFP] Change logging severity when failed to read EEPROM (#13011)
- Why I did it
In order to prevent the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py test failing on the log analyzer step.

The mentioned test is performing the sfputil reset EthernetX for every interface on the SONiC switch, this action will flap the SFP device status (INSTERTED -> REMOVED -> INSTERTED).

The SONiC XCVRD daemon will catch this SFP device status change (because it is monitoring the presence status of the cable).
To judge the cable presence status, currently, we are still leveraging to read the first bytes of the EEPROM, and the EEPROM could be not ready at some moment and the SONiC XCVRD daemon will print the error log to Syslog:

ERR pmon#xcvrd: Error! Unable to read data for 'xx' port, page 'xx' offset 128, rc = 1, err msg: Sending access register

- How I did it
Change logging severity from ERR to WARNING

- How to verify it
Run the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py

OR much faster way to run the next script on the switch:

#!/bin/bash

START=0
END=248

for (( intf=$START; intf<=$END; intf+=8))
do
    sfputil reset Ethernet"${intf}"
done

sfputil show presence
2022-12-20 10:05:45 +02:00
Kebo Liu
d6ee7f08c2
[Mellanox] change the implementation of is_host() to fix a stuck issue on simx platform (#13100)
- Why I did it
Following code to judge whether a process is running inside a docker could get stuck on the simx platform

subprocess.Popen(["docker", "--version"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
When it gets stuck, the config-chassisdb service can not be successfully started, thus the system can not be booted up.

root@sonic:/# service config-chassisdb status
     config-chassisdb.service - Config chassis_db
     Loaded: loaded (/lib/systemd/system/config-chassisdb.service; enabled; vendor preset: enabled)
     Active: activating (start) since Thu 2022-12-15 09:23:02 UTC; 29min ago
   Main PID: 571 (config-chassisd)
      Tasks: 14 (limit: 9501)
     Memory: 132.4M
     CGroup: /system.slice/config-chassisdb.service
                        ├─571 /bin/bash /usr/bin/config-chassisdb
			├─575 /usr/bin/python3 /usr/local/bin/sonic-cfggen -H -v DEVICE_METADATA.localhost.platform
			├─602 /bin/sh -c sudo decode-syseeprom -m
			├─603 sudo decode-syseeprom -m
			├─607 /usr/bin/python3 /usr/local/bin/decode-syseeprom -m
			├─616 /bin/sh -c docker --version 2>/dev/null
			└─617 docker --version

- How I did it
Use an alternative way to implement this function and issue can be avoided:

docker_env_file = '/.dockerenv'
return os.path.exists(docker_env_file) is False

- How to verify it
run regression on real hardware and simx platform.
2022-12-20 10:00:11 +02:00
Lior Avramov
bb2e7685c8
[Mellanox] Update ECMP calculator README (#13051)
Why I did it
Update ECMP calculator README file with new instructions how to run the calculator.

How I did it
Update README file.

How to verify it
Read README file.
2022-12-15 09:47:33 +02:00
Junchao-Mellanox
9590339d69
[Mellanox] Remove TODO comments which are no longer needed (#13023)
- Why I did it
Remove TODO comments which are no longer needed

- How I did it
Remove TODO comments which are no longer needed

- How to verify it
Only comment change
2022-12-14 09:57:48 +02:00
Lior Avramov
e5808020a7
Add ECMP calculator tool (#12482)
- Why I did it
Added ECMP calculator tool.

- How I did it
New files were added.

- How to verify it
Manual tests performed according to tests chapter in HLD
Automated tests will be added by verification.
2022-12-04 17:14:25 +02:00
Kebo Liu
36a100083f
[Mellanox] Add support to Mellanox Spectrum-4 ASIC Firmware compiling and upgrade (#12844)
- Why I did it
Add support for compiling Spectrum-4 ASIC firmware to the SONiC image
Add support for Spectrum-4 ASIC firmware upgrade

- How I did it
Update Mellanox fw make files to include Spectrum-4 ASIC firmware binaries.
Update firmware upgrade scripts to be able to detect Spectrum-4 ASIC.

- How to verify it
Run regression tests

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-11-29 16:38:41 +02:00
Lior Avramov
fb662442bc
[Mellanox] Add SDK hash calculator debian and update SDK makefile to compile it (#12840)
- Why I did it
Add SDK hash calculator Debian and update SDK makefile to compile it.

- How I did it
SDK hash calculator Debian will be used by ECMP calculator (PR #12482)

- How to verify it
Compile sonic-buildimage and verify SDK hash calculator Debian exist in target folder.
2022-11-28 13:30:40 +02:00
Richard.Yu
19e3d8ce98
[submodule]Advance sairdis with sai 1.11 and add brcm and mlnx sai sdk (#12471)
* rebase code

advance sairedis

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

* Update Mellanox SDK/FW to 4026

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* Update Mellanox SAI to 2211.23.1.0

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* update Switch-SDK-drivers pointer

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* git update sai header in saibcm

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

* mapping to sairedis 202211

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Co-authored-by: Kebo Liu <kebol@nvidia.com>
2022-11-23 09:02:36 -08:00
Stephen Sun
5d457596ba
[Mellanox] Support PSU power threshold checking (#11863)
* Support power threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* get_psu_power_warning_threshold => get_psu_power_warning_suppress_threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Fix comments

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-11-21 14:47:43 -08:00
Junchao-Mellanox
20d885dbc2
[Mellanox] Add new thermal sensors for SN5600 (#12671)
- Why I did it
Add new thermal sensors for SN5600

- How I did it
Add new thermal sensors for SN5600: PCH and SODIMM

- How to verify it
Manual test
2022-11-14 11:10:33 -08:00
Kebo Liu
c8c2b7fc45
[Mellanox] [Platform API] Update SN2201 dynamic minimum fan speed table (#12602)
- Why I did it
Update SN2201 dynamic minimum fan speed table according to data provided by the thermal team.

- How I did it
Update the thermal table in device_data.py

- How to verify it
Run platform related regression

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-11-08 13:37:10 +02:00
Junchao-Mellanox
830b7d8cb4
[Mellanox] Use sdk sysfs instead of ethtool (#12480) 2022-11-03 11:17:44 -07:00
Vivek
5d83d424b1
Added BUILD flags to provision for building the kernel with non-upstream patches (#12428)
* Added ENV vars for non-upstream patches

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Made MLNX_PATCH_LOC an absolute path

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Added non-upstream-patches dir

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Update README.md

* Addressed comments

* Env vars updated

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* Readme updated

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
2022-10-31 12:16:05 -07:00
Dror Prital
917ad1ffe0
[Mellanox] Update SDK/FW to version 4.5.3186/2010.3186 (#12542)
- Why I did it
Update SDK/FW version - 4.5.3186/2010_3186 in order to have the following changes:

New functionality:
1. Added support for 6.5W (Class 8) in ports 49-50, 53-54, 57-58, and 61-62 on SN4600 system

Fix the following issues:
1. On very rare occasion (~1/100K), during I2C transaction with MMS1V50-WM and MMS1V90-WR modules on SN4700 system, the module may send unexpected stop which violate the I2C specification, possibly affecting the link up flow
2. When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed
3. While toggling the cable with ‘sfputil lpmode on/off’, error msg like “ERR pmon#xcvrd: Receive PMPE error event on module 1: status {X} error type {y}” could be received
4. When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted
5. When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU
6. While moving from lossless to lossy mode while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized
7. SLL configuration is missing in SDK dump
8. If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero)
9. PCI calibration changes from a static to a dynamic mechanism
10. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event
11. SDK returned error when FEC mode is set on twisted pair, when FEC was set to None

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests

Signed-off-by: dprital <drorp@nvidia.com>
2022-10-30 09:31:09 +02:00
Stephen Sun
8c73e68468
Remove \n from the end of fs_path in ONIEUpdater (#12465)
This fixes the following error

```
admin@sonic:~$ sudo fwutil show status
mount: /mnt/onie-fs: special device /dev/sda2
 does not exist.
Error: Command '['mount', '-n', '-r', '-t', 'ext4', '/dev/sda2\n', '/mnt/onie-fs']' returned non-zero exit status 32.. Aborting...
Aborted!
admin@sonic:~$ sudo vi /usr/local/lib/python3.9/dist-packages/sonic_platform/

```
Seems like #11877 the rstrip('\n') was removed. Probably by mistake.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-10-23 09:59:20 +03:00
Mai Bui
648ca075c7
[device/mellanox] Mitigation for security vulnerability (#11877)
Signed-off-by: maipbui <maibui@microsoft.com>
Dependency: [PR (#12065)](https://github.com/sonic-net/sonic-buildimage/pull/12065) needs to merge first.
#### Why I did it
`subprocess.Popen()` and `subprocess.check_output()` is used with `shell=True`, which is very dangerous for shell injection.
#### How I did it
Disable `shell=True`, enable `shell=False`
#### How to verify it
Tested on DUT, compare and verify the output between the original behavior and the new changes' behavior.
[testresults.zip](https://github.com/sonic-net/sonic-buildimage/files/9550867/testresults.zip)
2022-10-06 17:51:31 -04:00
Dror Prital
44356fa8d7
[Mellanox] Add NVIDIA copyright header for NVIDIA added files (#12130)
- Why I did it
Add NVIDIA Copyright header for new "NVIDIA" files

- How I did it
Add the copyright header as remark at the head of the file
2022-10-02 11:34:24 +03:00
Volodymyr Samotiy
eea8ebd0a9
[Mellanox] Update MFT to v4.21.0-100 (#11758)
- Why I did it
To update MFT package to the latest version.

- How I did it
Updated MFT_VERSION & MFT_REVISION in platform/mellanox/mft.mk.

- How to verify it
Build an image and deploy to the switch
Check MFT version by dpkg -l | grep mft
Verify that all the SONiC services up and running
Run regression testing using tests from sonic-mgmt

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-09-30 09:48:40 +03:00
Volodymyr Samotiy
92bd6dae28
[Mellanox] Update SAI to v2205.22.1.19 and SDK/FW to v4.5.3168/v2010.3170 (#12205)
- Why I did it
To include latest fixes and new functionality

SAI fixes and new features
fix #3205239, incorrect object type returned for SG child list
Fix VRF-VNI map entries remove issue
ECC health event and logging
[Port Buffers] restore default queue and pg configuration when all user pools are deleted
Fix EVPN type3 error on removal of uc/bc flood group
Fix EVPN type2 MAC move from local to remote results in SAI failure
Fix Disable learning on VXLAN tunnel
Fix error on VXLAN v6 tunnel removal
Fix port cannot apply schedule group when it is a lag member
Fix BFD add more detailed message on BFD packet not related to any existing session
gcc10 compilation fixes
Disable learning on VXLAN tunnel
Support BFD remote-disc exchange in negotiation stage
Tunnel Loopback packet action attribute implementation (for Dual TOR)
Add KVD resources MIN/MAX functionality (pending CRM issue with MIN only)
Support for CRC2 hash algorithm
Bulk counter support for PGs, queues
Support mirror sample rate attribute (SPC2+)
[Functional] [QoS] | Unable to remove SCHEDULE profile table even if there is no object referencing it
Next hop group optimized bulk API
Reduce verbosity of shared database already exists print
Span mirror policer (SPC2+), optimize pipeline for acl mirror action with policer on SPC2+
use same size descriptor pool for rx/tx
fix bfd - notify Sonic for admin-down event
2201 - empty list for supported fec for RJ45 ports
Fix don't disable used tunnel underlay interfaces

SDK fixes
100GbE FCI DAC (10137628-4050LF/HPE PN: 845408-B21) was recognized by mistake as supporting "cable burning' which caused the switch firmware to read page 0x9f (which unsupported in the cable) and to report this cable as having "bad eeprom".
Added remote peer UDP port information in BFD packet event.
After editing an ECMP, the resilient ECMP next-hop counter may not count correctly.
Fixed potential memory leaks in some APIs related to LPM
If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero).
In SN2201: When configuring Force mode, user should configure Speed and FEC on both sides
In Flex Tunnel encapsulation flow, if the encapsulation is with an IPv6 header, the flow label field may not be updated as expected.
In some cases, when changing speed to 400GbE over 8 lanes, the first few packets would be dropped.
In some traffic patterns involving small packets, the PortRcvErrors counter may mistakenly count events of local physical errors due to an internal flow in the hardware that involves link packets.
On Spectrum systems, sometimes during link failure, not all previous firmware indications cleared properly, potentially affecting the next link up attempt.
On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-thought mode, a pipeline might get stuck.
PCI calibration changes from a static to a dynamic mechanism.
SDK debug dump shows "Unknown" Counter in RFC3635 Counter Group.
SDK debug dump shows "Unknown" Counter in the PPCNT Traffic Class Counter Group.
SDK Dump missing column headers in some GC tables may result in difficulty understanding the dump.
SLL configuration is missing in SDK dump.
Spectrum-2 systems, do no support 1GbE on supported 40GbE modules.
When binding a UDP port which is already in use for BFD TX session, the error message appears incorrectly.
When Flex Tunnel was used, Flex Modifier sometimes experienced a brief mis-configuration during ISSU.
When many ports are active (e.g. 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed.
When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU.
While toggling the cable, and the low power mode is set to ON, an unexpected PMPE event error is received.

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-09-30 09:40:12 +03:00
Junchao-Mellanox
1d69f0916e
[Mellanox] Provide dummy implementation for get_rx_los and get_tx_fault (#12231)
- Why I did it
get_rx_los and get_tx_fault is not supported via the exisitng interface used, need provide dummy implementation for them.
NOTE: in later releases we will get them back via different interface.

- How I did it
Return False * lane_num for get_rx_los and get_tx_fault

- How to verify it
Added unit test
2022-09-30 09:38:05 +03:00
Stephen Sun
4d317aff94
[Mellanox] Fix typo in platform API (#12136)
- Why I did it
Fix a typo in chassis platform API which causes the following error

>>> import sonic_platform as P
>>> c = P.platform.Platform().get_chassis()
>>> sl = c.get_all_sfps()
>>> sl[0].get_lpmode()
Sep 28 07:48:33 INFO    LOG: Initializing SX log with STDOUT as output file.
False
>>> del c
Exception ignored in: <function Chassis.__del__ at 0x7f1d166ef8b0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sonic_platform/chassis.py", line 126, in __del__
    self.sfp_module.deinitialize_sdk_handle(sfp_module.SFP.shared_sdk_handle)
NameError: name 'sfp_module' is not defined

- How I did it
Use self while using the SDK handle

- How to verify it
Manual test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-09-28 11:09:18 +03:00
Junchao-Mellanox
f890606d82
Revert "[Mellanox] Redirect ethtool stderr to subprocess for better error log (#12038)" (#12183)
This reverts commit 9750cb4.

There is a PR to handle 202205 branch revert: #12184

- Why I did it
The PR to be reverted introduced many notice logs every 1 minute if SFP is not plugged:

Cannot get module EEPROM information: Input/output error
Before the "bad" PR, the message format is like this:

INFO pmon#supervisord: xcvrd Cannot get module EEPROM information: Input/output error
It was truncated by rsyslog because every message is the same. However, the "bad" PR introduces SFP index to the message:

NOTICE pmon#xcvrd: Failed to get EEPROM data for sfp 39: Cannot get module EEPROM information: Input/output error
Rsyslog no longer truncate such log and many such messages are flooded to syslog.

- How I did it
Revert the PR

- How to verify it
Manual test
2022-09-28 10:15:26 +03:00
Dror Prital
54b146f56c
[Mellanox] Update SDK/FW to version 4.5.2320/2010.2320 (#11990)
- Why I did it
Update SDK/FW version - 4.5.2320/2010_2320 in order to have the following fixes:
• Spectrum-3 | PCI calibration changes from a static to a dynamic mechanism.
• [VxLAN] TTL was set to 0 for non IP traffic (such as ARP)

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests
2022-09-14 20:43:38 +03:00
Junchao-Mellanox
9750cb48c6
[Mellanox] Redirect ethtool stderr to subprocess for better error log (#12038)
- Why I did it
ethtool print error logs when EEPROM of a SFP is not available. It prints error like this:

INFO pmon#/supervisord: xcvrd Cannot get module EEPROM information: Input/output error
INFO pmon#/supervisord: xcvrd Cannot get Module EEPROM data: Invalid argument
However, this log does not contain the relevant SFP index which is hard for developer/qa to find the exactly SFP.

- How I did it
Redirect ethtool stderr to subprocess and log it better

- How to verify it
Manual test
2022-09-14 20:41:43 +03:00
Richard.Yu
a1eae940d5
[SAIServer] support saiserver v2 in bullseye (#11849)
Upgrade libboost-atomic1.71 to libboost-atomic1.74

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2022-08-25 22:51:53 -07:00
Kebo Liu
d9f3e6ce4f
update Mellanox SDK/FW to 4.5.2318 2010.2318 (#11773)
- Why I did it
Update SDK/FW version - 4.5.2318/2010_2318 to pick up new fixes:
1. Cr space timeout on Hold and Release GW - at warm boot
2. Spectrum Port in stuck PHY_UP after peer side rebooted
3. Memory leak in sx_api_router_ecmp_update_set

- How I did it
Update the make file with the new version number
Update submodule Switch-SDK-drivers pointer

- How to verify it
Run sonic regression

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-08-24 11:45:16 +03:00
Junchao-Mellanox
46ebd06403
[Mellanox] Fix issue: set lpmode by platform API does not work (#11732)
- Why I did it
Fix issue: set lpmode by platform API does not work

- How I did it
Fix miss return value in code

- How to verify it
Manual test
2022-08-18 13:07:38 +03:00
Kebo Liu
0031cc6c64
[Mellanox] Update hw-mgmt package to V.7.0020.3006 (#11538)
- Why I did it
Update HW-MGMT to V.7.0020.3006
1. Support new system SN2201
2. Add COMEX BRDWL respin support

- How I did it
Update the version number of the makefile
Advance the hw-mgmt submodule pointer

- How to verify it
Run full regression on Nvidia platforms

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-08-11 08:59:26 +03:00
orfarfara
aec1248258
[Mellanox] add PSU input voltage and current (#11510)
- Why I did it
Add PSU input voltage and input current to mlnx platform api.

- How I did it
Implement 2 function of getting the psu voltage and psu current input:
Get the values from "power/psu{}_curr_in" , "power/psu{}_volt_in"

- How to verify it
Manual test.
Run sonic-mgmt regression

Signed-off-by: orfar1994 <orfar1994@gmail.com>
2022-08-10 18:10:55 +03:00
Junchao-Mellanox
4c1c0c1852
[Mellanox] add more log while doing sysfs reading (#11556)
- Why I did it
Add more log while doing sysfs reading to increase the debug capability

- How I did it
Log the relevant file path and error number while sysfs reading return None

- How to verify it
Manual test
2022-08-08 15:06:52 +03:00
Stephen Sun
8282d427e4
Fix chassis test issue (#11460)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-07-16 19:34:45 -07:00
Stephen Sun
81600fafe9
[Mellanox] Support new platform API get_port_or_cage_type for RJ45 ports (#11336)
- Why I did it
Support get_port_or_cage_type for RJ45 ports

- How I did it
Implement the new platform API get_port_or_cage_type
Fix the issue: unable to import SFP when chassis object is destructed

- How to verify it
Manually test and regression test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-07-14 12:20:16 +03:00
Junchao-Mellanox
2863945f7c
[Mellanox] Fix issue: failed to decode Json while there is no hwsku.json (#11436)
- Why I did it
Fix bug: pmon report error on start up because some SKUs do not have hwsku.json

- How I did it
If hwsku.json, do not extract RJ45 port information

- How to verify it
Manual test.
Unit test.
2022-07-14 09:24:39 +03:00
Alexander Allen
bec35df04a
[Mellanox] Add arm64 architecture support to mellanox platform (#11342)
- Why I did it
Add support for mellanox platform building for target architecture arm64.

- How I did it
Contains the following changes:
1. Change instances of hard-coded amd64 to $(CONFIGURED_ARCH)
2. Add logic to download correct binary for MFT package
3. Add TARGET_BOOTLOADER=grub definition to rules.mk to override default arm64 bootloader

- How to verify it
Build mellanox platform with TARGET_ARCH set as arm64
2022-07-13 16:21:33 +03:00
Nazarii Hnydyn
40cbd3fe87
[Mellanox] Update SAI version 1.21.2.0 (#11326)
- Why I did it
Advance to new SAI version for bugs fixes as well as new features/enhacements:

New:
- ARM64 support
- FG ECMP performance optimization
- Support setting empty list for port ingress/egress buffer profile list
- Add service port for SN5600
- Add CR8/SR8/LR8/KR8 interface type
- Disable mlxtrace during debug dump

Fixes:
- Fix SAI_ACL_ENTRY_ATTR_FIELD_TC
- Fix Packets loop back if no member in portchannel
- Fix optimize descriptors apply time (and fast boot time)
- Add flush fdb entries for vxlan tunnel bridge port
- Don't disable used tunnel underlay interfaces

- How I did it
Advanced SAI submodule

- How to verify it
make configure PLATFORM=mellanox
make target/sonic-mellanox.bin

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2022-07-06 09:58:15 +03:00
Sudharsan Dhamal Gopalarathnam
23d68883f5
[Mellanox]Check dmi file permission before access (#11309)
Signed-off-by: Sudharsan Dhamal Gopalarathnam sudharsand@nvidia.com

Why I did it
During the system boot up when 'show platform status' or 'show version' command is executed before STATE_DB CHASSIS_INFO table is populated, the show will try to fallback to use the platform API. The DMI file in mellanox platforms require root permission for access. So if the show commands are executed as admin or any other user, the following error log will appear in the syslog

Jun 28 17:21:25.612123 sonic ERR show: Fail to decode DMI /sys/firmware/dmi/entries/2-0/raw due to PermissionError(13, 'Permission denied')

How I did it
Check the file permission before accessing it.

How to verify it
Added UT to verify. Manually verified if the error log is not thrown.
2022-07-01 17:29:07 -07:00
Alexander Allen
3ba9ae2405
[Mellanox] Add arch folder to SDK binary location (#11278)
- Why I did it
This is for the eventual support of multiple architectures for the mellanox platform.

- How I did it
Change the location of the binaries in Switch-SDK-drivers so that the path specifies the target architecture in addition to the target distribution that the debians are built for.

This is the most straightforward way to separate binaries built against different architectures and selectively target them for installation in the mellanox SONiC image.

- How to verify it
Build SONiC for mellanox and verify it compiles successfully.
2022-06-29 07:15:05 +03:00
Yakiv Huryk
0ced7081c7
[asan] add print_suppressions=0 to ASAN configs (#11252)
- Why I did it
To provide an ability to suppress ASAN false positives and have a clean ASAN report for docker-sonic-vs/mlnx-syncd/orchagent docker

- How I did it
Added the "print_suppressions=0" to ASAN configs.

- How to verify it
add a suppression to some ASAN-enabled component (the suppression should catch some leak)
build with ENABLE_ASAN=y
run a test and see that the ASAN report is empty instead of having the suppression summary

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-06-28 18:45:52 +03:00
Kebo Liu
7ac590b5c5
[Mellanox] Enhance Platform API to support SN2201 - RJ45 ports and new components mgmt. (#10377)
* Support new platform SN2201 and RJ45 port

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* remove unused import and redundant function

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* fix error introduced by rebase

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* Revert the special handling of RJ45 ports (#56)

* Revert the special handling of RJ45 ports

sfp.py
sfp_event.py
chassis.py

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove deadcode

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Support CPLD update for SN2201

A new class is introduced, deriving from ComponentCPLD and overloading _install_firmware
Change _install_firmware from private (starting with __) to protected, making it overloadable

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Initialize component BIOS/CPLD

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove swb_amb which doesn't on DVT board any more

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove the unexisted sensor - switch board ambient - from platform.json

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Do not report error on receiving unknown status on RJ45 ports

Translate it to disconnect for RJ45 ports
Report error for xSFP ports

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Add reinit for RJ45 to avoid exception

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>
Co-authored-by: Stephen Sun <stephens@nvidia.com>
2022-06-20 19:12:20 -07:00
Junchao-Mellanox
f135f37a50
[Mellanox] optimize platform API import time (#10815)
- Why I did it
"import sonic_platform" takes about 600ms ~ 1000ms, it is kind of slow. After this optimization, the time is about 100ms. The benefit is that those CLIs which does not need the slow import sentence would be faster than before.

- How I did it
Find slow import and call them when need.

- How to verify it
Measure the import time.
2022-06-07 15:13:16 +03:00
Volodymyr Samotiy
5c869492bd
[Mellanox] Update SDK/FW to 4.5.2262/xx.2010.2262 (#10882)
- Why I did it
To include latest fixes:
1. Warmboot | When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU.
2. Link Up | When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
3. Shared buffer | While moving from lossless to lossy while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized.

- How I did it
Updated SDK submodule along with the relevant Makefiles

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-06-07 15:11:25 +03:00
vdahiya12
0b8e463db4
[sonic-platform-common][sonic-platform-daemons] submodule update; Remove python2 sonic-platform-common wheel (#10994)
* [sonic-platform-common][sonic-platform-daemons] submodule update

vdahiya@vdahiya-dev3:~/sonic-buildimage8/sonic-buildimage/src/sonic-platform-daemons$
git log --oneline 9ac12bf..master
0d90023 (HEAD -> master, origin/master, origin/HEAD, origin/202205) grpc
client implementation for active-active dualtor (#248)
6b8bf69 [ycabled] Fix some syntax warnings in ycabled (#263)
2bcf936 [ycabled] fix the posting for mux_cable_static_info per downlink
when ycabled is spawned; synchronizing executing Telemetry API (#257)
ce217c0 Include changes from xcvr_api in transceiver_info table (#253)
e0f8a35 Fix checkReplyType failed issue via recreating xcvr_table_helper
on forking subprocess (#255)

f575a40 (origin/master, origin/HEAD, origin/202205, master)
[Credo][Ycable] changes for synchronizing executing Telemetry API's when
mux toggle is inprogress (#280)
b043372 [sonic_ssd] Nokia-7215: "show platform ssdhealth" not showing
health percent (#279)
d62d3d6 [CMIS]Fix low-power to high power mode transition (#268)
f918125 [syseeprom] Enable display of vendor extension TLV content
(#270)
4e08440 [Credo][Ycable] improve logging for Server Powered off/Faulty
cables (#272)

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

* remove python2 wheel for sonic-platform-common

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

* remove python2 platform_common definitions

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
2022-06-04 07:41:15 -07:00
Yakiv Huryk
7306d68411
[build][asan] make dpkg cache asan-aware (#10750)
Currently, the build with ASAN_ENABLE=y reuses the packages built with
ASAN_ENABLE=n (and vice versa). To address this issue, ASAN_ENABLE is added to DEP_FLAGS for asan-enabled packages (docker-syncd-mlnx, syncd, docker-orchagent, swss).

- Why I did it
To make dpkg cache use/rebuild the packages for ASAN_ENABLE=y/n.

- How I did it
Added ASAN_ENABLE to the DEP_FLAGS for asan-enabled packages.

- How to verify it
Built with ASAN_ENABLE=y/n and checked the .flags .log files.

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-05-31 11:15:44 +03:00
Yakiv Huryk
bd91b2eef3
[asan] add debug package for asan-enabled containers (#10953)
This is to improve the readability of ASAN reports. The debug package adds function names and source code references to the backtrace (currently, there are only binary addresses of functions)

Another way to address this issue is to build the image with "INSTALL_DEBUG_TOOLS=y". The downside of this approach is that the image size and compilation time are unnecessarily big. Also, the idea is to make the "ENABLE_ASAN" self-sufficient, which would not be the case for this approach.

- Why I did it
To improve the readability of asan logs.

- How I did it
Added SYNCD_DBG and SWSS_DBG to corresponding docker images for ASAN_ENABLE=y build

- How to verify it
Add artificial memory leak
Build with ASAN_ENABLE=y
Test the image and check the ASAN report

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-05-31 09:24:18 +03:00
Andriy Yurkiv
29043ff026
[MFT] Update MFT version to MFT 4.20.0-34 (#10933)
- Why I did it
Update MFT to newer version

- How I did it
Update MFT_VERSION in platform/mellanox/mft.mk

- How to verify it
Check version via dpkg -l | grep mft

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2022-05-28 15:45:02 +03:00
Alexander Allen
71c868f56a
Upgrade libasan to version 6 in docker-syncd-mlnx to align with bullseye libasan (#10886) 2022-05-27 11:28:01 -07:00
Andriy Yurkiv
70d71f99f5
[Mellanox] Credo Y-cable | add more log info, checks, fix exception message (#10779)
- Why I did it
Script fails when there is an exception while reading

- How I did it
Add more logs and checks. Fix wrong variable naming and messages.

- How to verify it
Provoke exception while read_eeprom() and check that it is handled properly
2022-05-19 17:36:02 +03:00
Alexander Allen
d202bf26d7
Upgrade mellanox platform containers (syncd / saiserver / syncd-rpc) and pmon to bullseye (#10580)
Fixes #9279

- Why I did it
Part of larger effort to move all SONiC systems to bullseye

- How I did it
1. Update container makefiles with correct dependencies
2. Update container Dockerfile with correct base image
3. Update container Dockerfile with correct apt dependencies
4. Update any other makefiles with dependencies to remove python2 support
5. Minor changes to support bullseye / python3

- How to verify it
Run regression on the switch:
1. Verify PTF community tests work
2. Verify syncd runs and all ports come up / pass traffic
3. Verify all platform tests succeed
2022-05-10 12:45:28 +03:00
vmittal-msft
9ae17e66a3
[sonic-sairedis update] Support for SAI header v1.10.2 with BRCM SAI v7.1.0.0 and MLNX SAI v1.21.1.0 (#10583) 2022-05-05 20:27:29 -07:00
Alexander Allen
53e5fe6a93
[Mellanox] Upgrade mellanox SDK to 4.5.1500 and mlnx-sai to 1.21.1.1 (#10675)
Update SDK/FW to 4.5.1500/2010.1500 and SAI version to 1.21.1.1

SDK/FW features:
1. Added support for Finisar DR4 (FTCD4523E2PCM) on Spectrum-2 and Spectrum-3 systems.

SAI Features:
1. ECMP overlay support for IPv6
2. BFD offloading / 4K scale
3. Host interface user traps + improved trap registration (table entry)
4. gcc11 compilation fixes
5. Read support for ACL redirect action
6. Optimize ECMP DB size
7. Buffer descriptors new defaults
8. Updated port mapping for SN2201

SAI Fixes:
1. Debug counter removal when configured with all drop reasons

- Why I did it
Upgrade Mellanox SDK and SAI versions to latest

- How I did it
Updated submodule pointers

- How to verify it
Regression tested
2022-04-29 20:50:59 +03:00
Kalimuthu-Velappan
bc30528341
Parallel building of sonic dockers using native dockerd(dood). (#10352)
Currently, the build dockers are created as a user dockers(docker-base-stretch-<user>, etc) that are
specific to each user. But the sonic dockers (docker-database, docker-swss, etc) are
created with a fixed docker name and common to all the users.

    docker-database:latest
    docker-swss:latest

When multiple builds are triggered on the same build server that creates parallel building issue because
all the build jobs are trying to create the same docker with latest tag.
This happens only when sonic dockers are built using native host dockerd for sonic docker image creation.

This patch creates all sonic dockers as user sonic dockers and then, while
saving and loading the user sonic dockers, it rename the user sonic
dockers into correct sonic dockers with tag as latest.

	docker-database:latest <== SAVE/LOAD ==> docker-database-<user>:tag

The user sonic docker names are derived from 'DOCKER_USERNAME and DOCKER_USERTAG' make env
variable and using Jinja template, it replaces the FROM docker name with correct user sonic docker name for
loading and saving the docker image.
2022-04-28 08:39:37 +08:00
Junchao-Mellanox
af5e5c4c94
[Mellanox] Adjust PSU voltage WA (#10619)
- Why I did it
InvalidPsuVolWA.run might raise exception if user power off PSU when it is running. This exception is not caught and will be raised to psud which causes psud failed to update PSU data to DB.

- How I did it
1. Change the log level when WA does not work. This could happen when user power off PSU, hence changing the log level from error to warning is better
2. Change the wait time from 5 to 1 to avoid introduce too much delay in psud. 1 second is usually enough per my test
3. Give a default return value for function get_voltage_low_threshold and get_voltage_high_threshold to avoid exception reach to psud

- How to verify it
Manual test.
Run sonic-mgmt regression
2022-04-22 11:02:30 +03:00
Kebo Liu
a0c76b1bc9
[Mellanox] support newly added reboot cause (#10531)
- Why I did it
Implement newly added reboot causes in PR Azure/sonic-platform-common#277

- How I did it
Map the reboot cause sysfs to the newly added reboot causes.

- How to verify it
manual test, check whether the reboot cause is correct after rebooting the switch in various ways.
run the community reboot test to see whether the reboot cause checker is passing.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-04-18 10:55:56 +03:00
Yakiv Huryk
d9117d9411
[Mellanox][asan] add address sanitizer support for syncd (#10266)
Why I did it
To support address sanitizer for Mellanox syncd

How I did it
/var/log/asan is mapped for syncd container (the same as for swss)
container stop() has a timeout (60s) for syncd (the same as for swss)
This is so libasan has enough time to generate a report.
added ASAN's log path to Mellanox syncd supervisord.conf
added "asan: yes" to sonic_version.yml
How to verify it
Added artificial memory leaks
Compiled with ENABLE_ASAN=y
Installed the image on DUT
Rebooted the DUT
Verified that /var/log/asan/syncd-asan.log contains the leaks

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-04-14 15:00:32 -07:00
Junchao-Mellanox
0191300b96
[Mellanox] Auto correct PSU voltage threshold (WA) (#10394)
- Why I did it
There is a hardware bug that PSU voltage threshold sysfs returns incorrect value. The workaround is to call "sensor -s" to refresh it.

- How I did it
Call "sensor -s" when the threshold value is not incorrect and PSU is "DELTA 1100"

- How to verify it
Unit test and Manual test
2022-04-14 08:14:40 +03:00
Kebo Liu
85539e7e08
[Mellanox] Update hw-mgmt package to version V.7.0020.2004 (#10401)
- Why I did it
Take new hw-mgmt release to SONiC, including:

New features:
1. hw-mgmt: add to PSU FW upgrade tool command to show current FW version
2. hw-mgmt: add to PSU FW upgrade tool support for single-PSU-in-the-system FW upgrade
3. hw-mgmt: add attribute “/firmware” to show FW version of restricted upgradable PSUs only
4. hw-mgmt: Add NVME temperature reports attributes (_alarm/_crit/_min/_max)

Bug fix:
1. psu: redundant i2c_addr attributes being created for psu 3 & 4 in system having only 2 psus.
2. hw-mgmt: in SPC1/2 i2c driver removal is too slow vs. ASIC reset causing non-functional log errors
3. PSU thresholds sysfs changed in 5.10 to “read only” preventing modification (modification required due PSU HW bug)
4. CPLD3 sysfs attribute missing after chip down/up flow
5. sysfs attributes missing when hw-mgmt is restarted (stop/start) within systemd

Release notes can be found from link https://github.com/Mellanox/hw-mgmt/blob/V.7.0020.2004/debian/Release.txt

- How I did it
Update hw-mgmt make file with new version number
Update hw-mgmt submodule pointer

- How to verify it
Run platform regression on all Mellanox platform

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-03-31 08:21:09 +03:00
Andriy Yurkiv
1e2e493daa
[Mellanox] Credo Y-cable read_eeprom/write_eeprom API implementation (#10320)
- Why I did it
Implement read_eeprom/write_eeprom API for Credo Y-cable for Dual ToR Active-Standby

- How I did it
Use mlxreg utility for API implementation

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2022-03-30 20:41:31 +03:00
Dror Prital
8bc81206c5
[Mellanox] Update NVIDIA License header for files changed since 1.1.2022 (#10289)
- Why I did it
Update NVIDIA Copyright header to "mellanox" files which were changed since 1.1.2022

- How I did it
Update the copyright header

- How to verify it
Sanity tests and PR checkers.
2022-03-23 13:19:25 +02:00
Oleksandr Ivantsiv
9565ef7a9a
[Mellanox] Refactor SFP to use new APIs. (#10317)
- Why I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.

- How I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.

- How to verify it
Run sonic-mgmt/platform_tests/sfp tests
2022-03-23 09:16:03 +02:00
Kebo Liu
42ab5b8eaa
[Mellanox] update MFT version to 4.18.0-106 (#10304)
- Why I did it
With the previous MFT 4.18.1-16 there is a bug in mstdump tool accessing wrong address. it is confirmed this issue does not exist in official 4.18.0-106.

- How I did it
Update the MFT version to 4.18.0-106

- How to verify it
Run regression on Mellanox platforms
2022-03-21 19:38:37 +02:00
Junchao-Mellanox
f0ddd102d5
[Mellanox] Add CPU thermal control for Nvidia platforms (#10202)
Why I did it
Add CPU thermal control for Nvidia platforms which will be enabled for platforms that have heavy CPU load. Now it is only enabled on 4800, and it will be enabled on future platforms.

How I did it
Check CPU pack temperature and update cooling level accordingly

How to verify it
Manual test
Added sonic-mgmt test case, PR link will update later
2022-03-21 09:54:52 -07:00
Junchao-Mellanox
dc04d64219
[Mellanox] Fix issue: psu might use wrong voltage sysfs which causes invalid voltage value (#10231)
- Why I did it
Fix issue: psu might use wrong voltage sysfs which causes invalid voltage value. The flow is like:

1. User power off a PSU
2. All sysfs files related to this PSU are removed
3. User did a reboot/config reload
4. PSU will use wrong sysfs as voltage node

- How I did it
Always try find an existing sysfs.

- How to verify it
Manual test
2022-03-20 10:34:04 +02:00
Yang Wang
c8db7a2d52
[Mellanox][SAISERVER] Support Mellanox saiserverv1 and saiserverv2 docker (#9686)
* support saiserverv1 and saiserverv2 docker

* add saiserver into buster and revert some changes

* update thrift version
2022-03-10 13:15:44 +08:00
Alexander Allen
7c4fbf0455
[Mellanox] Add patch to hw-mgmt to prevent loading of non-existent kernel modules (#10073)
- Why I did it
The latest upgrade of Mellanox hw-mgmt V7.0020.1300 introduced a couple new kernel modules for new Mellanox platforms that have yet to be upstreamed to the linux kernel.

As these new platforms do not have SONiC support we elected not to upstream these new drivers to sonic-linux-kernel but hw-mgmt expects them to exist which is causing a non-functional error on switch boot.

Feb 15 00:09:55.374130 r-leopard-simx-74 ERR systemd-modules-load[269]: Failed to find module 'emc2305'
Feb 15 00:09:55.374141 r-leopard-simx-74 ERR systemd-modules-load[269]: Failed to find module 'ads1015'
To resolve this we can patch hw-mgmt to no longer attempt to load these modules by default.

- How I did it
Added a SONiC patch to Mellanox hw-mgmt in order to remove the unused kernel modules which were not upstreamed to sonic-linux-kernel

- How to verify it
Boot switch and verify there are no error logs regarding kernel modules failing to load.
2022-02-28 08:08:19 +02:00
Junchao-Mellanox
fe59e0f2c0
[Mellanox] Fix issue: thermal zone threshold value 0 causes fan speed stuck at 100% (#10057)
- Why I did it
In SONiC thermal control algorithm, it compares thermal zone temperature with thermal zone threshold. Previously, a thermal zone with no thermal sensor can still get its threshold. However, a recently driver patch changes this behavior: a thermal zone with no thermal sensor will return 0 for threshold. We need to ignore such thermal zone.

- How I did it
Ignore thermal zones whose temperature is 0.

- How to verify it
Added unit test case and Manual test
2022-02-24 12:05:56 +02:00
Dror Prital
cf1bc8dc65
[Mellanox] Upgrade ASIC FW tool to 4.18.1-16 (#9981)
- Why I did it
Update MFT to version 4.18.1-16 for bugs fixes and new SN2201 support

- How I did it
Advance to MFT tool version to 4.18.1-16

- How to verify it
Manually tested on all Mellanox platforms (ASIC FW Upgrade, link debug tools, CPLD upgrade, etc.)
2022-02-14 10:39:47 +02:00
Stephen Sun
486e9b0c75
Fix issue: module id got from get_change_event is wrong (#9961)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-02-13 09:20:37 +05:30
Alexander Allen
0ae2906c06
[Mellanox] Update mellanox hw-mgmt submodule and versions to V.7.0020.1300 (#9860)
- Why I did it
New version of mellanox platform management code available adding support for new platforms and fixing bugs.

- How I did it
1. Updated the submodule
2. Updated makefile version references
3. Regenerated SONiC patches
2022-02-06 16:42:53 +02:00
Junchao-Mellanox
43e967d6a4
Fix issue: 'sx_port_mapping_t' object has no attribute 'slot_id' (#9835)
- Why I did it
Fix issue: 'sx_port_mapping_t' object has no attribute 'slot_id'. sx_port_mapping_t only has attribute slot.

- How I did it
Change slot_id to slot.

- How to verify it
Manual test
2022-01-27 07:38:32 +02:00
Junchao-Mellanox
8e590da563
[Mellanox] Fix select timeout in sfp event (#9795)
- Why I did it
Python select.select accept a optional timeout value in seconds, however, the value passes to it is a value in millisecond.

- How I did it
Transfer the value to millisecond.

- How to verify it
Manual test
2022-01-27 07:36:56 +02:00
Dror Prital
f24f19391b
[Mellanox] Update SDK/FW to 4.5.1208/2010.1218 and SAI version to 1.20.2.5 (#9619)
- Why I did it
To include latest SDK fixes:
1.  On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting SN4600C, 100GbE port with CWDM4 module (Gen 3.0), link up time is 30 seconds.

and to include SAI fixes \ changes:
1. Reduce verbosity for resource check vendor data not found
2. Fix metadata validation, check default value on conditions check
3. Add 100MB, 10MB to 2201 system
4. L3 VXLAN overlay ECMP
5. VXLAN srcport API implementation
6. Fix scheduler profile null (default values) when set on sub group scheduler group
7. Fix ACL binding restoration when port leaves a LAG
8. Fix route logic for set next hop/action and reference counter for ECMP overlay

- How I did it
1. Updated SDK/FW submodule and relevant makefiles with the required versions.
2. Update SAI submodule and relevant makefile with the required version.

- How to verify it
Build an image and run tests from "sonic-mgmt".
2022-01-26 11:01:55 +02:00
Alexander Allen
8a07af95e5
[Mellanox] Modified Platform API to support all firmware updates in single boot (#9608)
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.

How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmwares in the correct order with prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
2022-01-24 00:56:38 -08:00
Junchao-Mellanox
4ae504a813
[Mellanox] Optimize thermal control policies (#9452)
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.

- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable

- How to verify it
1. Manual test
2. Regression
2022-01-19 11:44:37 +02:00
Junchao-Mellanox
ca36b4a57b
Fix build issue: cannot import name FW_AUTO_ERR_UKNOWN- required module not found (#9764) 2022-01-14 20:12:53 +05:30
xumia
3e46582314
[Bug][Build]: Fix the mlnx-platform-api dpkg cache config error (#9705)
Failed to build the mellanox image when dpkg enabled, it has impact on PR checks.
2022-01-09 09:21:59 +08:00
Lior Avramov
a9c9f56eeb
[Mellanox] Include CPU board and switch board sensors only on SN2201 system (#9644)
Why I did it
Recently additional sensors that were needed only for specific system added to all systems and caused errors.

How I did it
* Include CPU board and switch board sensors only on SN2201 system
* Fix issue in test_chassis_thermal, now it skips non existing thermals.

How to verify it
Run show platform temperature

Signed-off-by: liora <liora@nvidia.com>
2022-01-05 10:25:47 -08:00
Raphael Tryster
c2b0af89fb
Added platform.json in Mellanox Spectrum-4 and older simx platforms (#9621)
- Why I did it
Added missing functionality for dynamic buffer calculation in Spectrum-4.

- How I did it
Added a section of code in asic_table.j2 for Spectrum-4, and added the simx version of SN5600 to the supported list.

- How to verify it
Manually: buffershow -l should show all ingress/egress lossy/lossless pools, and all fields of profiles should show values.
Automatically: https://github.com/Azure/sonic-mgmt/blob/master/tests/qos/test_buffer.py

Signed-off-by: Raphael Tryster <raphaelt@nvidia.com>
2022-01-02 09:58:40 +02:00
Stepan Blyshchak
f3df6e2f1b
[Mellanox] fix hw-mgmt patches (#9539)
- Why I did it
To fix an issue that hw-mgmt patches were not applied. One patch was already in upstream hw-mgmt package thus applying it again caused an error and no other patches were applied. Also, I did it to improve the Makefile, so that the make will fail in case patches fail to apply.

- How I did it
Removed obsolete patch, made applying patches a hard failure in the build.

- How to verify it
Run the make and verify patches are applied.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-12-16 16:02:03 +02:00
Junchao-Mellanox
d05afb5dbd
[Mellanox] Rename platform x86_64-mlnx_msn4800 to x86_64-nvidia_sn4800 (#9512)
- Why I did it
Rename platform x86_64-mlnx_msn4800 to x86_64-nvidia_msn4800

- How I did it
Rename platform folder as well as all code that reference the platform name

- How to verify it
Manual test
2021-12-15 09:08:04 +02:00
Stepan Blyshchak
c3008c3ce7
[Mellanox][SDK] Build SDK with PRM sniffer support (#9500)
- Why I did it
To have an ability to use PRM sniffer.

- How I did it
Enabled the option in configure flags.

- How to verify it
Built and ran on switch. Enabled the feature in runtime and checked the sniffer recording.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-12-14 10:56:19 +02:00
Shilong Liu
6402a02261
[build] Remove dulplicated DOCKER_SYNCD_BASE target. (#9378) 2021-12-13 11:06:10 +08:00
xumia
49dd5db94d
[Build]: Fix docker images built multiple times issue (#9253)
The same docker image is built multiple times after upgrading to bullseye, the build time is increased to about 15 hours from 6 hours.
See log: https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_apis/build/builds/50390/logs/9
Line 1437: 2021-11-11T11:15:02.7094923Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 1446: 2021-11-11T11:37:41.1073304Z [ finished ] [ target/docker-sonic-telemetry.gz ]
Line 1459: 2021-11-11T11:38:20.6293007Z [ building ] [ target/docker-sonic-telemetry.gz-load ]
Line 1462: 2021-11-11T11:38:28.1250201Z [ finished ] [ target/docker-sonic-telemetry.gz-load ]
Line 2906: 2021-11-11T18:57:42.8207365Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 2917: 2021-11-11T19:43:47.1860961Z [ finished ] [ target/docker-sonic-telemetry.gz ]
Line 3997: 2021-11-11T22:49:35.0196252Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 4002: 2021-11-11T23:14:00.4127728Z [ finished ] [ target/docker-sonic-telemetry.gz ]

How I did it
Place the python wheels in another folder relative to the build distribution.

Co-authored-by: Ubuntu <xumia@xumia-vm1.jqzc3g5pdlluxln0vevsg3s20h.xx.internal.cloudapp.net>
2021-12-09 15:42:44 -08:00
Raphael Tryster
e3c0a888c9
[Mellanox] Add support of SN5600 platform on top of Nvidia ASIC simulation (#9392)
- Why I did it
Add new Spectrum-4 system support SN5600 on top of Nvidia ASIC simulator.

- How I did it
Add all relevant system and simulator SKU.
Updated syseeprom.hex and related directories to reflect Nvidia SN5600 brand name.

- How to verify it
Tested init flow, basic show commands, up interfaces, traffic test.

Signed-off-by: Raphael Tryster <raphaelt@nvidia.com>
2021-12-09 17:46:24 +02:00
Volodymyr Samotiy
3f00b5df84
[Mellanox] Update SAI to v1.20.1.1 and SDK/FW to v4.5.1158/v2010.1154 (#9474)
- Why I did it
To include latest fixes.

SAI
1. Reclaim buffers for port which is admin down
2. Support for Spectrum-4 os Nvidia ASIC simulation
3. Support for SN2201
4. Fix host interface table entry, one channel per trap (fix sflow double registration)
5. 2 new queue counters - ecn marked packets + shared current occupancy
6. Fix storm policer unknown unicast
7. Add key/value for accuflow counters
8. Add MAC move
9. Add mirror congestion mode attribute

SDK
1. Under various circumstances, Ethernet ports falsely showed that InfiniBand cables were connected.
2. In SN4600C, at times, the link up time in both DAC and optics cables may, in the worst case, take up to 15 seconds.
3. Using SN4600C with copper or optics loopback cables in NRZ speeds, link may raise in long link up times
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
6. Aggregation event is missing for WJH L2 drop reason 'Unicast egress port list is empty'.
7. Tying the SCL and SDA of the optical modules to 3.3V causes errors.
8. On SN4600, there was a delay of more than 10 seconds from the time a data packet is sent from CPU until it is transmitted through one of the switch ports.
9. While using SN4600C system with Finisar FTLC1157RGPL 100GbE CWDM4 modules, intermittent link flaps across multiple ports may be observed.
10. In Spectrum-2 and Spectrum-3 systems, link did not work in auto-negotiation when connected to Marvell PHY. KR mechanism has been enhanced to integrate with Marvell PHY.
11. The tunnel counter counts the drop packets now for Spectrum-2 and Spectrum-3 and consistent with Spectrum behavior and count the ECN dropped packets as well.
12. When connecting SN3800 to Cisco-9000, fast-linkup flow will fail and will rise in the normal flow.
13. Race condition in WJH library: when multiple threads load the LAG shared memory concurrently, the program may crash.
14. Add WJH L2 drop reason 'Unicast egress port list is empty' as a new drop reason.
15. Fixed a memory leak in sx_api_port_sflow_statistics_get API.
16. During initialization flow, the command interface that is used by the minimal driver and SDK caused the collision in the firmware since the same buffer is used in the firmware for the two interfaces.
17. Fix route issue on Kernel 5.10

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-12-09 14:52:18 +02:00
Aravind Mani
ac2885a988
[SFP-Refactor] Modify transceiver key name (#9447)
* Modify transceiever key name

* fix alignment
2021-12-09 12:38:45 +05:30
Junchao-Mellanox
ae0989784d
[Mellanox] Allow user to set LED to orange (#9259)
Why I did it
Nvidia platform API does not support set LED to orange

How I did it
Allow user to set LED to orange

How to verify it
Added unit test
Manual test
2021-12-08 13:05:10 -08:00
Lior Avramov
1fce3ebda3
[Mellanox] Add support for SN2201 platform (#9333)
- Why I did it
Add support for SN2201 platform

- How I did it
Add required content for SN2201 platform
Note: still missing kernel driver support for this system. Once all is upstream will be updated as well.

- How to verify it
Install and basic sanity tests including traffic.

Signed-off-by: liora liora@nvidia.com
2021-12-06 14:47:50 +02:00
Junchao-Mellanox
9a31a1e689
[Mellanox] Add a trigger to set LED to blink (#8764)
Depends on #9358

Why I did it
Adjust LED logical according to hw-mgmt change.

How I did it
Add a trigger to set LED to blink.

How to verify it
Manual test
2021-12-03 14:13:20 -08:00
Alexander Allen
467ae5beca
[mellanox] update hw-mgmt pointer (#9358)
#### Why I did it

Updated hw-mgmt pointer to updated branch and to include new bugfixes. The hw-mgmt submodule was previously pointing to an orphaned commit which could not be fetched from github, this has now been resolved. 

#### How I did it

Updated submodule pointer.

#### How to verify it

Clone down repository and update all submodules.
2021-12-01 14:02:38 -08:00
Stephen Sun
ba853348d5
[Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles (#8768)
Signed-off-by: Stephen Sun stephens@nvidia.com

Why I did it
Support zero buffer profiles

Add buffer profiles and pool definition for zero buffer profiles
Support applying zero profiles on INACTIVE PORTS
Enable dynamic buffer manager to load zero pools and profiles from a JSON file
Dependency: It depends on Azure/sonic-swss#1910 and submodule advancing PR once the former merged.

How I did it
Add buffer profiles and pool definition for zero buffer profiles

If the buffer model is static:
Apply normal buffer profiles to admin-up ports
Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
Apply normal buffer profiles to all ports
buffer manager will take care when a port is shut down
Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists

Originally, all the macros to generate the above buffer objects took active ports only as an argument
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports need to be added
To be backward compatible, a new series of macros are introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:

The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/temlates folder in compiling time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:

One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbol link to the above files

Update sonic-cfggen test accordingly:

Adjust example output file of JSON template for unit test
Add unit test in for Mellanox's new buffer templates.

How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.
2021-11-29 08:04:01 -08:00
Junchao-Mellanox
e79c7155b7
[Mellanox] Fan speed should not be 100% when PSU is powered off (#9258)
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.

- How I did it
When PSU is powered of, don't treat it as absent.

- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
2021-11-24 14:56:00 +02:00
Stephen Sun
f9d0921d09
Support PSU voltage high/low thresholds and power max threshold (#8983)
- Why I did it
Support PSU voltage high/low thresholds and power max threshold
1. Add thresholds support for voltage and power.
2. As thresholds are not supported on all platforms, we need to check the capability first and fetch thresholds only if it is supported.

- How I did it

- How to verify it
Run regression test and manual test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-11-22 12:41:16 +02:00
Alexander Allen
53891c5602
Fix cache related mellanox bullseye build failures (#9234)
#### Why I did it

Mellanox builds were failing intermittently due to the `issue_version` file and MFT package not building correctly in the Azure pipeline environment (both of these packages were patched to build correctly with bullseye running on the host and buster running on the dockers)

#### How I did it

Fixed two problems:

1. BLDENV is not passed to the Makefiles so the references to this were replaced with correct logic
2. `issue_version` was not defined as a target for bullseye and as such was not cached. Altered the build such that it is defined as a target for bullseye (in the case of buster it builds the file, in the case of bullseye it copies from buster) 

The previous PR fixing this was reverted as it is no longer necessary for a passing build and was not a long-term fix. https://github.com/Azure/sonic-buildimage/pull/9235

#### How to verify it

Build on AZP and verify success.
2021-11-16 14:49:47 -08:00
Saikrishna Arcot
a980e836c1
[mellanox]: Disable caching for the issu-version file (#9235)
The issu-version file for Mellanox is generated from the Mellanox SDK
libraries. The SDK is installed into a Buster docker container, but the
issu-version file goes onto the base OS, which is Bullseye. To work
around this, the issu-version build rules explicitly copies the
issu-version file to target/files/bullseye/ during the Buster build.

Because of our build infra, if caching is enabled and a cache is being
used, then for issu-version, since it is technically built as part of
Buster, then only target/files/buster/issu-version is saved into the
cache, and target/files/bullseye/issu-version isn't cached. If this
cache gets used, then target/files/bullseye/issu-version is missing, and
the final image build fails.

This is to work around the current build issue where Mellanox builds are
failing. This is so that issu-version is always "built", so that copy is
made into the bullseye directory.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-11 19:39:25 -08:00
Saikrishna Arcot
d4bc0095a5 Fix Mellanox hw-mgmt package version
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Alexander Allen
cc5a2f3d54 Update pointer (#12)
Updated the hw-mgmt pointer to include some bugfixes related to power supply voltages.
2021-11-10 15:27:22 -08:00
Alexander Allen
857937d592 Mellanox bullseye merge (#1)
* Make neccesary changed to mellanox platform code to build on Debian 11

* Revert use of backported kernel to build mft and elect to only build kernel module under bullseye
2021-11-10 15:27:22 -08:00
Alexander Allen
2847265bfd Mellanox bullseye merge (#1)
Allow mellanox platform to build and successfully switch packets in
Debian 11

Upgraded

* Mellanox SDK
* Mellanox Hardware Management
* Mellanox Firmware
* Mellanox Kernel Patches

Adjusted build system to support host system running bullseye and
dockers running buster.
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
1d00613305 Add support for building Mellanox image
ISSU will likely be broken. As of right now, the issu-version file is
not being generated during build.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Stepan Blyshchak
2ef97bb5df
[dockers] change RPC, DBG dockers version: put RPG, DBG sign in build metadata part of the version (#8920)
- Why I did it
In case an app.ext requires a dependency syncd^1.0.0, the RPC version of syncd will not satisfy this constraint, since 1.0.0-rpc < 1.0.0. This is not correct to put 'rpc' as a prerelease identifier. Instead put 'rpc' as build metadata in the version: 1.0.0+rpc which satisfies the constraint ^1.0.0.

- How I did it
Changed the way how to version in RPC and DBG images are constructed.

- How to verify it
Install app.ext with syncd^1.0.0 dependency on a switch with RPC syncd docker.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-11-01 19:02:57 +02:00
Junchao-Mellanox
e8b4c2a1f4
[Mellanox] Refactor Mellanox platform API to support dynamic port configuration (#8422)
- Why I did it
* To support systems with dynamic port configuration
* Apply lazy initialization to faster the speed of loading platform API

- How I did it
* Add module.py to implement dynamic port configuration (aka line card model)
* Adjust chassis.py, platform.py, thermal.py, sfp.py to support dynamic port configuration
* Optimize existing code

- How to verify it
Platform regression on MSN4700, MSN3800 and MSN2700, 100% pass
Unit test covers all new changes.
2021-10-25 07:59:06 +03:00
vmittal-msft
8b5f33dbb7
[sonic-sairedis submodule] Update SAI header to ver 1.9.1 for MLNX SDK/SAI (#9012)
* Updated sonic-sairedis to point to SAI 1.9.1 and MLNX SAI to 1.19.5(API v1.9.1)
2021-10-22 13:49:55 -07:00
Vadym Hlushko
cd0d407690
[fan] Fixed dynamic minimum fan speed table for SN4410 (#8960)
Put new values to the dynamic minimum fan speed table for SN4410.
2021-10-20 18:12:58 -07:00
DavidZagury
9527cbe53b
[Mellanox] Upgrade Mellanox firmware tools to 4.17.2-12 (#8978)
- Why I did it
Bug fix:
bad_param request due to missing parser rest command while running mlxlink

- How I did it
Advance to MFT tool version to 4.17.2-12.

- How to verify it
Manually tested on all mellanox platforms.
2021-10-20 09:24:25 +03:00
Dror Prital
5356244e53
[Mellanox] Add NVIDIA Copyright header to "mellanox" files (#8799)
- Why I did it
Add NVIDIA Copyright header to "mellanox" files

- How I did it
Add NVIDIA Copyright header as a comment for Mellanox files

- How to verify it
Sanity tests and PR checkers.
2021-10-17 19:03:02 +03:00
Alexander Allen
e6e6f414cd
[mellanox] Remove validation for fw filenames with no extension (#8956)
Why I did it
Currently the mellanox platform API is validating the file extensions of firmware packages to be installed for basic sanity checking. However, ONIE packages do not have an extension and as such if there is a "." in the name it is taken to be an extension and then fails the sanity check.

How I did it
I removed the check which ensures that ONIE images don't have a file extension.

How to verify it
Name the ONIE updater file 2021.onie and attempt to install it via fwutil install fw 2021.onie --yes
2021-10-15 13:29:24 -07:00
Alexander Allen
cefb9c1946
[platform] [mellanox] Use correct API call to update firmware in auto_update_firmware (#8961)
Why I did it
The fwutil update all utility expects the auto_update_firmware method in the Platform API to execute the update_firmware() call and not the install_firmware() call.

How I did it
Changed the method in the mellanox platform API component implementation.

How to verify it
Run fwutil update all with a CPLD update on a Mellanox platform and verify that it properly updates the firmware using the MPFA file.
2021-10-15 13:28:52 -07:00
Volodymyr Samotiy
ce7abad3ba
[Mellanox] Update SAI to v1.19.4 (#8929)
- Why I did it
Advance to new Nvidia SAI release with the following changes:
New features:
- Align with new SDK/FW version 4.5.1006 and above and in parallel to existing used SDK/FW bundle
- Implement timestamp and egress_queue_index hostif packet attributes.

Bugs fixes:
- Fix compilation issues with gcc10
- Fix return code for buffer overflow for query enum values and query statistics capabilities
- Reduce verbosity of print in case packet ingress on invalid port
- Fix mirror Qos settings

- How I did it
Updated SAI version and submodule pointer

- How to verify it
Run regression tests from sonic-mgmt

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-10-12 10:33:35 +03:00
Junchao-Mellanox
552963ab0e
[Mellanox] Change thermal recover threshold from temp_trip_norm to temp_trip_high (#8792)
- Why I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high, so that thermal algorithm would set fan speed to minimum allowed earlier and save power.

- How I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high

- How to verify it
Manual test
2021-10-04 20:20:33 +03:00
Nazarii Hnydyn
f36952fea3
[Mellanox]: Update SAI to v1.19.2 (#8618)
- Why I did it
Advance to Mellanox SAI ver 1.19.2 to pick up dynamic Policy Based Hashing support.
For this version above the static Policy Based Hashing is no longer supported.
For detailed release notes check https://github.com/Mellanox/SAI-Implementation/blob/sonic2111/release%20notes.txt

- How I did it
Updated SAI-Implementation submodule

- How to verify it
1. make configure PLATFORM=mellanox
2. make target/sonic-mellanox.bin
Run full regression as well as new dynamic PBH tests

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-09-10 17:30:00 +03:00
Nazarii Hnydyn
63ba489c6b
[Mellanox] Advance hw-mgmt to V.7.0010.2346. (#8667)
Commits on Sep 01, 2021
hw-mgmt: attributes: Add PSU power sensor attributes d8fce39

Commits on Sep 02, 2021
Remove MFT package flint tool from hw-management dump generation. 53d06b2
hw-mgmt: debug: Add timeout to generate-dump.sh b661fa3 

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-09-08 09:59:50 -07:00
Junchao-Mellanox
ed64eb94d9
[Mellanox] Read PSU fan max/min speed per PSU (#8563)
#### Why I did it
New PSU could install different type of fan, so fan max/min speed should be read per PSU

#### How I did it
The existing implementation read PSU max/min fan speed from a common file, change it to read from per PSU file

#### How to verify it
Manual test
2021-08-26 01:03:55 -07:00
shlomibitton
5f04146a10
Upstream new FW/SDK (#8567)
- Why I did it
Update SDK\FW version to 4.4.3326\2008.3326. This version contains:

New Features:
1. Add support for Fast Boot for SN3800

Bug Fixing:
1. In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 &L3)
2. On Spectrum systems, when using Async Router API with IPV6, an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.
3. On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.
4. Some Molex cables do not support speed after reboot

- How I did it

- How to verify it
Was verified by running regression tests that includes complete sonic-mgmt tests supported
2021-08-25 16:46:56 +03:00
Alexander Allen
d0c73b050d
[Mellanox] Upgrade Mellanox firmware tools to 4.17.0 (#8299)
- Why I did it
New release of MFT has the following changelog / RN
 Fixed an issue that resulted in getting MVPD read errors from the mlxfwmanager during fast reboot.
 Fixed mlxuptime sometimes generating a time less than previous due the wrong frequency calculation

- How I did it
Update makefile pointer to new version.

- How to verify it
Manually tested on all Mellanox platforms.
2021-08-22 15:54:43 +03:00
Dror Prital
ca713e2fad
[Mellanox] Update SDK\FW to version 4.4.3320\2008.3324 (#8487)
Update SDK\FW version to 4.4.3320\2008.3324. This version contains: 

New Features:
* Add support for Fast Boot for SN3800

Bug Fixing:
* In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 &L3)
* On Spectrum systems, when using Async Router API with IPV6,  an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.
* On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.
* Some Molex cables do not support speed after reboot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-08-17 01:57:24 -07:00
Kebo Liu
59c13cb406
[Mellanox] Upgrade hw-mgmt to 7.0100.2344 (#8463)
To pick up new PSU fan support from new hw-mgmt release

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-08-13 03:25:53 -07:00
Saikrishna Arcot
c38b95c899 Remove --net=host from run options for containers
The start template script that this value is used in will determine what
network namespace to use, and will add --net=host if it needs to run in
the host namespace.

With Docker 20.10, if --net=host is specified twice in docker run, then
it errors out. Therefore, remove the explicit --net=host in the run
options setting and let the start template script specify it.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-08-12 23:18:01 -07:00
DavidZagury
d26307d80f
[Mellanox][Pcie] Fix issue on pcied with an id that contains only decimal digits was treated as a decimal number (#8309)
A device that contains only decimal digits was mistreated as a decimal integer resulting in failure to find it in the id to bus map.
2021-08-03 15:25:28 -07:00
DavidZagury
67781abb97
[Mellanox][pcied] Ignore bus on pcie.yaml for Mellanox switches (#8063)
Why I did it
BIOS upgrade on rare cases cannot guarantee bus value remain the same on every BIOS release. Ignoring this field in order for pcied not to fail but still verify device id in a different way. The solution is future proof and will not require changes in code when new BIOS version is available

How I did it
Since bus is not a fixed value (it is determined by the bios version) we are ignoring this field, and instead checking if there is a device that match on all other fields that and in addition has a matching device id.

How to verify it
Verify no errors or failures in pcied on different BIOS version with the same code base.
2021-07-26 08:43:42 -07:00
Dror Prital
7698747028
Update SDK\FW to version 4.4.3222\2008.3224 (#8247)
*Update SDK\FW Version to 4.4.3222\2008.3224.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-22 17:00:16 -07:00
Dror Prital
644875712e
[Mellanox] Update SAI to version 1.19.1 (#8245)
- Why I did it
Update SAI version to 1.19.1. The following was changed:
1. Update license
2. Do not remove and re-apply the same SDK mirror session on LAG
3. FEC fix to support all speeds
4. Improve PG counters performance
5. Fix number of switch priorities for port mirroring

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-22 18:13:49 +03:00
tomer-israel
950c24c5ae
[PMON] [Mellanox] fix syseepromd issue on simx (#8131)
Avoid initializing sfp/thermal/components/fan/psu/leds on simx and create vpd_info file on hw_management when we use mellanox simulator platform

- Why I did it
this is a fix for issue in mellanox simulator platforms. the syseepromd failed on the pmon docker. also "decode-syseeprom" failed also

- How I did it
before initializing thermal/components/fan/psu/leds --> check if we are running on simx
creating the vpd_info on the hw_management folder.

- How to verify it
check if syseepromd process was loaded properly on the pmon docker.
decode-syseeprom is working well without errors/warnings
2021-07-20 11:56:04 +03:00
tomer-israel
a328fd24c0
[WARM-REBOOT] fix issue of watchdog on simx when executing warm-reboot command (#8132)
- Why I did it
to prevent python exception error when executing warm-reboot command on mellanox simulator platform

- How I did it
return None on the watchdog python script on cases that watchdog file is not exist

- How to verify it
warm-reboot is running well without the python error. error message will appear on log on these cases.
in order to avoid this error message we can simulate the watchdog on mellanox simulator platform
2021-07-19 22:08:44 +03:00
Vivek Reddy
447f09f2dd
Update SAI (#8143)
Fix saisdkdump + Fix port dropped pkts counters
Co-authored-by: Vivek Reddy Karri <vkarri@nvidia.com>
2021-07-09 15:27:36 -07:00
Dror Prital
0c9862d980
[Mellanox] Update FW version to 2008.3218 (#8079)
Update FW version to 2008.3218, fixing the following issues:
- 50G/100G links that are operationally down before warm-reboot are not coming up after warm-reboot
- 50G/100G links with admin shut / no shut commands are not coming up after warm-reboot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-07 08:35:34 +03:00
Dror Prital
edb48e3191
[Mellanox] Update SAI and SDK\FW ver. 4.4.3216\2008.3216(#8055)
- Why I did it
* For SAI - Advance to adopt the following fixes:
1. Better handle not implement object type for resource availability
2. Fix ext dump when saidump is triggered from 2nd process (saidump utility) other than main adapter host (syncd in SONiC)

* For SDK\FW:
- Changes and new features:
1. Added support in SN4600C systems for new module Finisar ET7402-CWDM4 (100G CWDM4 QSFP28 1310nm SM 2KM).
2. Added support for new module MMS1W50-HM (2km transceiver FR4) for 200GbE
3. Improved performance of "per-port-buffer" counters
4. Added support for Kernel 5.10

- Bugs fixes:
On rare occasions (0.5%), in SN4600C systems, when using 100GbE NRZ mode and Fastboot flow, the link up time may take up to 10 seconds

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-06 07:40:29 +03:00
Junchao-Mellanox
147bf240f0
[Mellanox] Add bitmap support for SFP error event (#7605)
#### Why I did it

Currently, SONiC use a single value to represent SFP error, however, multiple SFP errors could exist at the same time. This PR is aimed to support it

#### How I did it

Return bitmap instead of single value when a SFP event occurs

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-25 10:56:47 -07:00
shlomibitton
9d18a35e35
[Mellanox] advance SAI submodule (#7952)
Split and bulk counter bug fixes:

- Init port auto neg to default on static (SAI XML) port split for 2nd+ port
- Fix port stats SAI_PORT_STAT_WRED_DROPPED_PACKETS, SAI_PORT_STAT_ECN_MARKED_PACKETS, SAI_PORT_STAT_ETHER_TX_OVERSIZE_PKTS
- Hide error message when reading not implemented port stat

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-06-24 09:21:45 -07:00
Stephen Sun
fc61ec9dbf
[Mellanox] Return N/A for PSU's model, serial and revision on platforms with fixed PSU (#7927)
- Why I did it
The methods get_model, get_serial, and get_revision have been implemented by reading relevant information from VPD and then recording the information into relevant fields.
However, there is no VPD data on platforms with fixed PSUs and relevant fields haven't been initialized, which causes the methods to throw exceptions. which in turn prevents psud from inserting fields into PSU table.
Eventually, this causes show platform psustatus doesn't output correct info.

- How I did it
Initialize those fields as N/A on systems with fixed PSUs.

- How to verify it
Manually test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-23 20:41:28 +03:00
Junchao-Mellanox
f294096eb6
[Mellanox] Read EEPROM data from DB if possible (#7808)
- Why I did it
Remove EEPROM cache file and use DB instead

- How I did it
Read EEPROM data from DB if possible
If data is not ready in DB, read from hardware using a visitor pattern

- How to verify it
Manual test and regression
2021-06-20 17:58:11 +03:00
Alexander Allen
29601366ee
[Mellanox] Implement auto_firmware_update platform API for to support fwutil auto-update (#7721)
Why I did it
The Mellanox platform is required to support the fwutil auto-update feature defined here

This is to allow switches, when performing SONiC upgrades to choose whether to perform firmware upgrades that may interrupt the data plane through a cold boot.

How I did it
Two methods were added to the component implementations for mellanox.

In the base Component class we add a default function that chooses to skip the installation of any firmware unless the cold boot option is provided. This is because the Mellanox platform, by default, does not support installing firmware on ONIE, the CPLD, or the BIOS "on-the-fly".

In the ComponentSSD class we add a function that behaves similarly but uses the Mellanox specific SSD firmware upgrade tool to check if the current SSD supports being upgraded on the fly in order to decide whether to skip or perform the installation.

How to verify it
Unit tests are included with this PR. These test will run on build of target sonic-mellanox.bin

You may also perform fwutil auto-update ... commands after Azure/sonic-utilities#1242 is merged in.
2021-06-16 14:55:20 -07:00
Stephen Sun
80d01f2f9a
[Mellanox] Enhance Python3 support for platform API (#7410)
- Why I did it
Enhance the Python3 support for platform API. Originally, some platform APIs call SDK API which didn't support Python 3. Now the Python 3 APIs have been supported in SDK 4.4.3XXX, Python3 is completely supported by platform API

- How I did it
Start all platform daemons from python3
1. Remove #/usr/bin/env python at the beginning of each platform API file as the platform API won't be started as daemons but be imported from other daemons.
2. Adjust SDK API calls accordingly

- How to verify it
Manually test and run regression platform test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-15 17:57:48 +03:00
Stephen Sun
87bdc1a415
[Mellanox] Adjust Makefile for SDK/python-sdk-api to support both python2 and python3 (#7848)
- Why I did it
Adjust the Makefile for SDK/python-SDK-API to support both python2 and python3

- How to verify it
Build the image and check whether python2 and python3 are both supported by SDK API.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-15 17:54:14 +03:00
DavidZagury
cedc2113fb
[Mellanox] Install MFT packages on Syncd container (#7844)
To have access to MFT tools in the Syncd container on Mellanox switches due to SAI dump API implementation enhancements
2021-06-14 08:25:58 -07:00
Dror Prital
a3d90b9fbf
[Mellanox] Update SAI ver. 1.19.0, SDK\FW ver. 4.4.3106\2008.3110 (#7820)
Why I did it

* For SAI - Upgrade to Version 1.19.0

- Add support for VxLAN encap TTL uniform model on SPC2/3
- Add ACL entry actions set VRF, set do no learn, add VLAN ID, add VLAN priority
- Add ACL field has VLAN tag
- Bulk counters (improve port statistics performance)
- Create async dump extra as part of debug generate dump
- Create irisc dump on severe health event
- Support 0 port systems (modify get switch mac to work accordingly)
- Set interface vlan up state for ping tool in SONiC
- Support attributes SAI_PORT_ATTR_QOS_SCHEDULER_PROFILE_ID, SAI_PORT_ATTR_QOS_INGRESS_BUFFER_PROFILE_LIST,
SAI_PORT_ATTR_QOS_EGRESS_BUFFER_PROFILE_LIST, SAI_PORT_ATTR_POLICER_ID as part of port create Git stats

* For SDK\FW - Upgrade to Version SDK 4.4.3106, FW 2008_3110

Added Features:

- Increased ACL table
- Enhanced PSAMPLE support
- Added support for Finisar SR4 module in SN3700 systems
- Added support for Python 3.0 in examples.
Fix bugs:

- On LR4 transceivers 00YD278, the firmware incorrectly identified the transceiver
- Reduce memory consumption for virtual LAG
- Fixed PSAMPLE listeners cleanup on SDK drivers unloading.
- On Spectrum-2 and Spectrum-3 systems, slow reaction time to Rx pause packets may lead to buffer overflow on servers.
- BER may be experienced when using 5m DAC cables between SN4700 and SN2700 in 100GbE speed.
- On very rare occasion, when connecting DR4 PAM4 transceiver to 100GbE DR1 NRZ, low BER may be experienced.
- Unexpected packet drops on the port ingress buffer may be experienced when working in 400GbE mode.
Note: When performing ISSU from an older version, this fix won't be applied. For fix to apply, a non-ISSU reset is required.
- Fix SN3800 specific warm boot scenario: Disable interface, Warm Boot, Enable Interface --> link will remain down.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-06-10 14:15:55 +03:00
Stepan Blyshchak
5c81c499e4
[nvidia/mellanox] add MLNX_SDK_DEB_VERSION to SDK packages flags list. (#7747)
This is due to the fact that we use SONIC_OVERRIDE_BUILD_VARS internally
in our build jobs and this is not accounted in caching framework.
So we add MLNX_SDK_DEB_VERSION to force rebuild if we changed it via
SONIC_OVERRIDE_BUILD_VARS.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-06-08 09:38:00 -07:00
yozhao101
1a3cab43ac
[Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.

How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.

How to verify it
I verified this on the device str-7260cx3-acs-1.
2021-06-04 10:16:53 -07:00
Kebo Liu
93534ce0e1
[Mellanox] Align PSU name convention returned from psu.get_name platform API (#7783)
Make PSU name returned from platform API aligned with the convention "PSU {X}" instead of "PSU{X}".
2021-06-04 09:40:23 -07:00
Kebo Liu
bf21dbce87
[Mellanox] Add support for MSN4600 A1 system (#7732)
Add new sensor conf for MSN4600 A1 system
Add a Mellanox hw-management patch to support MSN4600 A1 system
2021-05-27 09:52:45 -07:00
Kebo Liu
237849a330
update hw-mgmt version to 2304 (#7725)
- Why I did it
Pick up fix from new hw-management package:
Fix gearbox thermal zone name, which was lack suffix thermal zone number

- How I did it
Update the hw-management version number in the make file
Update hw-management submodule pointer

- How to verify it
Run platform related test cases on Mellanox platform
2021-05-27 14:17:23 +03:00
Junchao-Mellanox
3ea3a5c8c1
[Mellanox] clear fan from chassis._fan_list (#7682)
#### Why I did it

According to thermalctld hld, each fan must belong to a fan drawer, if the fan drawer does not physically exist, put fan into a virtual fan drawer. This PR is to clear fan from chassis._fan_list

#### How I did it

1. Don't put fan to chassis._fan_list
2. Always query fan from fan_drawer
2021-05-24 11:36:39 -07:00
Alexander Allen
6a9d1e584d
[Mellanox] Implement Hardware Revision Platform API Call for Mellanox Chassis and PSU (#7552)
#### Why I did it

This pull request allows calls to be made through the platform 2.0 API that retrieve the PSU and Chassis hardware revision on Mellanox platforms. Access to these values will aid customers in determining their hardware revisions for debugging and technical support. These values are intended to be eventually exposed through the CLI. 

#### How I did it

For the PSU hardware revision I used the existing VPD function calls implemented in https://github.com/Azure/sonic-buildimage/pull/7382

For the Chassis hardware revision I parsed the SMBIOS / DMI type 2 information to retrieve the information.
2021-05-24 09:37:59 -07:00
Junchao-Mellanox
bfae15fb83
[mellaonox]: No need enable thermal zones in thermal_manager.deinitialize since they are enabled by default (#7556)
No need enable thermal zones in thermal_manager.deinitialize since they are enabled by default. And removing this will faster thermalctld exit speed
2021-05-08 10:33:37 -07:00
Stephen Sun
9f0dce0313
[Mellanox] Optimize SFP modules initialization (#7537)
Originally, SFP modules were always accessed from platform daemons, and arbitrary SFP modules can be accessed in the daemon. So all SFP modules were initialized in one shot once one of the following chassis APIs called
- get_all_sfps
- get_sfp_numbers
- get_sfp

Recently, we noticed that SFP modules can also be accessed from CLI, eg. the latest refactor of `sfputil`.

In this case, only one SFP module is accessed in the chassis object's life cycle.
To initialize all SFP modules in one shot is waste of time and causes the CLI to take much more time to finish.
So we would like to optimize the initialization flow by introducing a two-phase initialization approach:
- Partial initialization, which means the `chassis._sfp_list` has been initialized with proper length and all elements being `None`
- Full initialization, which means all elements in `chassis._sfp_list` are created

If the relevant function is called,
- `get_sfp`, only partial initialization will be done, and then the specific SFP module is initialized.
- `get_all_sfps` or `get_num_sfps`, full initialization will be done, which means all SFP modules are initialized.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-05-06 10:14:48 -07:00
shlomibitton
2d3149d641
[Mellanox] Update FW to xx.2008.2526 (#7511)
- Why I did it
Updated FW to xx.2008.2526 version.

Fixed issues:
1. Spectrum-2, Spectrum-3 | sFlow | High CPU load and high on fully loaded switch.
2. Spectrum-2, Spectrum-3 | Fine grain LAG | in rare cases doesn’t update the right entry

- How I did it
Updated submodule pointer and version in a Makefile.

- How to verify it
Full regression and bugs validation

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-05-05 09:47:57 +03:00
Alexander Allen
0bc0f98d48
[platform] Add serial number and model number to Mellanox PSU platform implementation (#7382)
#### Why I did it

We want to add the ability for the command `show platform psustatus` to show the serial number and part number of the PSU devices on Mellanox platforms. This will be useful for data-center management of field replaceable units (FRUs) on switches.

#### How I did it

I implemented the platform 2.0 functions `get_model()` and `get_serial()` for the PSU in the mellanox platform API by referencing the sysfs nodes provided by the [hw-management](https://github.com/Azure/sonic-buildimage/tree/master/platform/mellanox/hw-management) module.
2021-05-04 13:07:00 -07:00
Stephen Sun
b2286a24dc
[Mellanox] Adopt single way to get fan direction for all ASIC types (#7386)
#### Why I did it
Adopt a single way to get fan direction for all ASIC types.
It depends on hw-mgmt V.7.0010.2000.2303. Depends on https://github.com/Azure/sonic-buildimage/pull/7419

#### How I did it
Originally, the get_direction was implemented by fetching and parsing `/var/run/hw-management/system/fan_dir` on the Spectrum-2 and the Spectrum-3 systems. It isn't supported on the Spectrum system.
Now, it is implemented by fetching `/var/run/hw-management/thermal/fanX_dir` for all the platforms.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-05-03 17:10:18 -07:00
Junchao-Mellanox
d9cdf9d14f
[Mellanox] Adjust PSU fan name to align with sysfs file name (#7490)
Change PSU fan name from psu_{psu_index}fan{fan_index} to psu{psu_index}_fan{fan_index}
2021-05-02 08:14:56 -07:00