Commit Graph

1870 Commits

Author SHA1 Message Date
Sudharsan Dhamal Gopalarathnam
156189dbad [Mellanox]Fix lpmode set when logical port is larger than 64 (#14138)
- Why I did it
In the sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more ports than this, the SDK APIs fail with syslog errors like the following:

Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1

- How I did it
Removed the hardcoded value of 64 and obtained the number of logical ports from the SDK.

- How to verify it
Manual testing
2023-03-19 20:50:58 +08:00
Junhua Zhai
29f3c4944a [gearbox] use credo sai v0.9.0 (#14149)
Update credo sai package to the latest v0.9.0.
2023-03-19 20:50:53 +08:00
Dror Prital
ba14f728de Update SDK/FW to version 4.5.4206/4.5.4204 (#14164)
- Why I did it
To include latest fixes:

Fix: traffic loss on all routed traffic when moving from 4.4.3372/XX_2008_3388 to 4.5.4118-012/XX_2010_4120-010. The issue occurred after the ISSU process, on Spectrum 1 only, when upgrading from an older version to a new one: neighbor entries were overwritten.
Fix: when using a mirror session policer on SPC2/3, the actual CIR was 1.28 times more than the configured CIR value.
Fix: creation of a router interface of type bridge may occasionally fail if the create is performed immediately after a delete.
Fix: false errors may be seen in the syslog during SDK deinitialization.

- How I did it
Updated SDK submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".
2023-03-19 20:50:49 +08:00
dbarashinvd
d7ba89a95b [Mellanox] fix for watchdog device not found, adding dependency on hw-management (#14182)
- Why I did it
Sometimes the Nvidia watchdog device isn't ready when the watchdog-control service comes up after the first installation from ONIE.
The watchdog-control service needs to be delayed until after hw-mgmt, which gets the devices up and ready.

- How I did it
Delay the Nvidia watchdog-control service until hw-mgmt has started on the Mellanox platform, in order to avoid a missing or not-ready watchdog device.

- How to verify it
Verification test of ONIE image installation in a loop,
making sure the watchdog service is always up (not failed) after the first installation from ONIE.
2023-03-19 20:50:44 +08:00
Volodymyr Samotiy
cc5ed4b632 [Mellanox] Update MFT to 4.22.1-15 (#14133)
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2023-03-19 18:33:57 +08:00
zitingguo-ms
3c312dec1c Upgrade SAI xgs version to 8.4.0.2 and migrate to DMZ (#14119)
Why I did it
Update SAI xgs version to 8.4.0.2 and migrate xgs to DMZ repo.

How I did it
Update SAI xgs version in sai.mk.

How to verify it
Run the SONiC and SAI tests with the 8.4 SAI release pipeline.
2023-03-09 14:52:08 +08:00
Stepan Blyshchak
969166d769 [Mellanox] Place FW binaries under platform directory instead of squashfs (#13837)
Fixes #13568

Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:

admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb  8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa

- Why I did it
202211 and above use a different squashfs compression type that the 201911 kernel cannot handle. Therefore, we avoid mounting squashfs altogether with this change.

- How I did it
Place the FW binary under /host/image-/platform/mlnx/; soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link that always points to the FW that should be used in the current image.
mlnx-fw-upgrade.sh is updated to prefer the /host/image-/platform/mlnx location and fall back to /etc/mlnx in squashfs in case the new location does not exist. This is necessary for image downgrade.
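The prefer-then-fall-back lookup described here can be sketched roughly as follows. This is an illustration in Python only, not the logic of mlnx-fw-upgrade.sh itself; the function name pick_fw_path and its parameters are hypothetical, and the image directory is passed in by the caller rather than spelled out.

```python
import os


def pick_fw_path(image_dir, fw_name, squashfs_root):
    """Return the FW binary path, preferring the platform directory.

    image_dir, fw_name and squashfs_root are hypothetical parameters
    used for illustration; they are not taken from the real script.
    """
    # Preferred location: <image_dir>/platform/mlnx/<fw_name>
    preferred = os.path.join(image_dir, "platform", "mlnx", fw_name)
    if os.path.exists(preferred):
        return preferred
    # Fallback: the old location inside the mounted squashfs (/etc/mlnx),
    # which is still needed when the target image predates the new layout.
    return os.path.join(squashfs_root, "etc", "mlnx", fw_name)
```

A caller would mount squashfs only when the returned path lies under squashfs_root, which is the case the change tries to avoid on upgrade.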

- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
2023-03-08 13:50:18 +08:00
mssonicbld
aea96da04d [Mellanox] Fix issue: cannot find label port for logical port when logical port number is larger than 64 (#13710) (#13962)
2023-03-06 16:47:31 +08:00
mssonicbld
1757f53290 [Mellanox] update sdk/fw build procedure (#14025) (#14059)
2023-03-03 02:43:19 +08:00
mssonicbld
72f9f51287 [Seastone] fix dx010 qsfp eeprom data write issue (#13930) (#14032)
2023-03-01 19:28:38 +08:00
mssonicbld
18bc044179 Remove support to Mellanox SPC4 ASIC (#13932) (#13957)
2023-02-23 22:22:35 +08:00
mssonicbld
310827c26c Add PYTHON3_SWSSCOMMON as build time dependency to Mellanox platform API (#13847) (#13959)
2023-02-23 20:32:15 +08:00
mssonicbld
50aaf92590 [Mellanox] Non upstream patches for hw-mgmt V.4.0020.4104 (#13792) (#13960)
2023-02-23 20:32:09 +08:00
Junchao-Mellanox
e8789a2e11 [Mellanox] Check system eeprom existence in a retry manner (#13884)
- Why I did it
On the Mellanox platform, the system EEPROM is a soft link provided by hw-management. There is a chance that the config-setup service accesses the EEPROM before hw-management creates it, which causes errors. This PR aims to fix that.

- How I did it
Wait up to 10 seconds in the platform API for the EEPROM to be created.
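A retry loop of this kind might look like the sketch below. It is not the platform API code from the PR; the helper name wait_for_path and the polling interval are assumptions, and only the 10-second limit comes from the description above.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.5):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    Hypothetical helper for illustration. os.path.exists() follows
    symlinks, so a soft link whose target is not created yet still
    counts as "not there".
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using time.monotonic() keeps the deadline unaffected by wall-clock adjustments, which can happen early in boot.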

- How to verify it
Manual test
2023-02-23 20:31:29 +08:00
mssonicbld
6a12ca9332 [Mellanox] [ECMP calculator] Add support for 4600/4600C/2201 platforms with different interface naming method (#13814) (#13931)
2023-02-22 22:14:09 +08:00
Pavan-Nokia
d7815f3229 add sfp get error description (#13275)
Why I did it
Command "sudo sfputil show error-status -hw" shows "OK (Not implemented)" in the output.

How I did it
Add support for a new SFP API, get_error_description, in the Nokia sonic-platform sfp.py module.

How to verify it
Run the new image and execute command "sudo sfputil show error-status -hw"
2023-02-22 18:36:56 +08:00
Stephen Sun
b0416a5c2c [Mellanox] Advance hw-mgmt to v.7.0020.4104 (#13372)
- Why I did it
Advance hw-mgmt service to V.7.0020.4100
Add missing thermal sensors that are supported by hw-mgmt package
Delay the system health service until hw-mgmt has started on the Mellanox platform, in order to avoid reading some sensors before they are ready.
Depends on sonic-net/sonic-linux-kernel#305

- How I did it
1. Update hw mgmt version
2. Add missing sensors
3. Delay service 

- How to verify it
Regression test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-20 14:38:53 +08:00
Stephen Sun
4f3b649f8e [Mellanox] Support per PSU slope value for PSU power threshold (#13757)
- Why I did it
Support per PSU slope value for PSU power threshold according to hardware team requirement

- How I did it
Pass the PSU number as a parameter when fetching the slope value of PSU.

- How to verify it
Running regression and manual test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-20 12:38:20 +08:00
Sudharsan Dhamal Gopalarathnam
a993fc205f [Mellanox][sai_failure_dump]Added platform specific script to be invoked during SAI failure dump (#13533)
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker

- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin

- How to verify it
Manual UT and new sonic-mgmt tests
2023-02-18 06:34:29 +08:00
Samuel Angebault
ef02c73a03 [202211][Arista] Update platform library submodules (#13872)
add SEU reporting on chassis
fix fallback logic for Clearlake eeprom identification
fix fan speed reporting for a specific model
move pcie timeout configuration for Upperlake in platform code (deprecates hwsku-init)
2023-02-17 13:51:42 -08:00
Pavan-Nokia
979e9a7d9d [armhf][Nokia-7215]High CPU caused by entropy.py (#13694)
Why I did it
High CPU utilization by entropy.py

How I did it
Remove the entropy script, as it no longer works and is no longer needed for bullseye (202205).
In buster (202012), the maximum available pool size (entropy_avail) for entropy is 4096, and our entropy.py script was based on this value. With the kernel change to bullseye in 202205, the entropy pool size changed to 256, which also causes our script to fail.

This script was initially added to provide SW assistance to improve the system entropy value available early in the SONiC boot sequence on buster.
On bullseye (Linux kernel 5.10) this is no longer needed, as this feature has been improved.

How to verify it
run "top" command to check CPU usage.
2023-02-18 04:32:35 +08:00
mssonicbld
94e59a841e [Mellanox] Enhance MFT make file to download source code from any valid URL (#13801) (#13868)
2023-02-18 02:14:00 +08:00
Volodymyr Samotiy
e849455742 [Mellanox] Update SDK/FW to 4.5.4150/2010.4150 (#13480)
- Why I did it
To include latest fixes and new functionality

SDK/FW
1. Fixed bug in recovery mechanism in case of I2C error when trying to access the XSFP module.
2. On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-through mode, a pipeline might get stuck.
3. On the Spectrum-2 and Spectrum-3 switch, if you enable ECN marking and the port is in split mode, traffic sent to the port under congestion (for example, when connecting two ports with a total speed of 50GbE to a single 25GbE port) is not marked.
4. Modifying an existing entry or adding a new one when the switch is at its maximum capacity (full by maximum allowed entries of any type, such as routes, FDB, and so forth) will fail with an error.
5. When many ports are active (e.g., 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
6. When a system has more than 256 ACL rules, on rare occasion, removing/adding rules may cause some ACL rules not to work.
7. On the SN2201 system, on an RJ45 port, the link might appear in 'down' state even if it operates properly.
8. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event.
9. When setting LAG as a SPAN analyzer, the distributor mode of the LAG members was not taken into account. It may happen that the LAG member with distributor mode disabled will be set as a SPAN analyzer port.

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2023-02-16 18:36:43 +08:00
Lior Avramov
e6b1ed366b [Mellanox] [ECMP calculator] Add script usage and more information to script description in help option (#13493)
Add script usage and more information to script description being printed in help option.

- Why I did it
Missing information in script description in help option.

- How I did it
Expand script description and add script usage.

- How to verify it
Run the script with -h option.
2023-02-16 18:36:36 +08:00
mssonicbld
8832ddd60b [Mellanox] Improve FW upgrade logging (#13465) (#13681)
2023-02-12 23:53:33 +08:00
mssonicbld
956173856c [sflow]: Unblocked psample_*() function calls in BRCM ESW platforms for proper functionality of sflow feature (#12918) (#13691)
2023-02-11 12:35:41 +08:00
Junhua Zhai
200342261a [gearbox] use credo sai v0.8.2 (#13565)
Update credo sai package to the latest v0.8.2, which also has the fix for aristanetworks/sonic#52.
2023-02-07 04:32:28 +08:00
mssonicbld
d9b15aea0d [Seastone] Enhancement fix for PR12200 syseeprom issue (#13344) (#13664)
2023-02-05 01:22:04 +08:00
Ikki Zhu
62fb0726ee [Platform/Seastone]: fix syseeprom tlv read issue (#12200)
Why I did it
Fix incorrect reads of the Seastone syseeprom TLV header.

How I did it
Set mux idle_state

How to verify it
i2cdump -y -f 12 0x50 i
2023-02-04 04:32:29 +08:00
Vadym Hlushko
3530fdbea1 [SFP] Change logging severity when failed to read EEPROM (#13011)
- Why I did it
In order to prevent the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py test failing on the log analyzer step.

The mentioned test performs sfputil reset EthernetX for every interface on the SONiC switch; this action flaps the SFP device status (INSERTED -> REMOVED -> INSERTED).

The SONiC XCVRD daemon will catch this SFP device status change (because it is monitoring the presence status of the cable).
To judge the cable presence status, we currently still rely on reading the first bytes of the EEPROM. The EEPROM may not be ready at that moment, in which case the SONiC XCVRD daemon prints an error log to syslog:

ERR pmon#xcvrd: Error! Unable to read data for 'xx' port, page 'xx' offset 128, rc = 1, err msg: Sending access register

- How I did it
Change logging severity from ERR to WARNING

- How to verify it
Run the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py

Or, as a much faster alternative, run the following script on the switch:

#!/bin/bash

START=0
END=248

for (( intf=$START; intf<=$END; intf+=8))
do
    sfputil reset Ethernet"${intf}"
done

sfputil show presence
2023-02-04 02:36:51 +08:00
Junchao-Mellanox
cf6f31b215 [Mellanox] Remove TODO comments which are no longer needed (#13023)
- Why I did it
Remove TODO comments which are no longer needed

- How I did it
Remove TODO comments which are no longer needed

- How to verify it
Only comment change
2023-02-04 02:36:47 +08:00
Kebo Liu
9680479661 [Mellanox] change the implementation of is_host() to fix a stuck issue on simx platform (#13100)
- Why I did it
The following code, which judges whether a process is running inside a docker, can get stuck on the simx platform:

subprocess.Popen(["docker", "--version"],
                 stdout=subprocess.PIPE,
                 stderr=subprocess.STDOUT,
                 universal_newlines=True)
When it gets stuck, the config-chassisdb service cannot start successfully, so the system cannot boot up.

root@sonic:/# service config-chassisdb status
     config-chassisdb.service - Config chassis_db
     Loaded: loaded (/lib/systemd/system/config-chassisdb.service; enabled; vendor preset: enabled)
     Active: activating (start) since Thu 2022-12-15 09:23:02 UTC; 29min ago
   Main PID: 571 (config-chassisd)
      Tasks: 14 (limit: 9501)
     Memory: 132.4M
     CGroup: /system.slice/config-chassisdb.service
             ├─571 /bin/bash /usr/bin/config-chassisdb
             ├─575 /usr/bin/python3 /usr/local/bin/sonic-cfggen -H -v DEVICE_METADATA.localhost.platform
             ├─602 /bin/sh -c sudo decode-syseeprom -m
             ├─603 sudo decode-syseeprom -m
             ├─607 /usr/bin/python3 /usr/local/bin/decode-syseeprom -m
             ├─616 /bin/sh -c docker --version 2>/dev/null
             └─617 docker --version

- How I did it
Use an alternative implementation of this function that avoids the issue:

docker_env_file = '/.dockerenv'
return os.path.exists(docker_env_file) is False

- How to verify it
run regression on real hardware and simx platform.
2023-02-04 02:36:43 +08:00
Yoush
d59b43566f [centec]: reference to v1.11.0-1 sai debian package for master (#13206)
2023-02-04 02:36:38 +08:00
Kebo Liu
ab54549d53 [Mellanox] Skip the leftover hardware reboot cause in case of last boot is warm/fast reboot (#13246)
- Why I did it
In case of warm/fast reboot, the hardware reboot cause is NOT cleared, because the CPLD is not touched in this flow. To avoid confusing the reboot cause determination logic, the platform API skips the leftover hardware reboot cause and returns 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.

- How I did it
Check the proc cmdline to see whether the last reboot was a warm or fast reboot; if so, skip checking the leftover hardware reboot cause.
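The cmdline check could be sketched along these lines. The commit text does not say how the boot type is encoded, so the SONIC_BOOT_TYPE=<type> token and the set of accepted values below are assumptions, as is the function name; the real platform code may differ.

```python
import re

# Assumption: the boot type appears on the kernel command line as a
# "SONIC_BOOT_TYPE=<type>" token. The token name and the values below
# are not given in the commit text.
BOOT_TYPE_RE = re.compile(r"\bSONIC_BOOT_TYPE=(\S+)")
WARM_OR_FAST = {"warm", "fastfast", "fast", "fast-reboot"}


def last_boot_was_warm_or_fast(cmdline):
    """Return True if the given kernel cmdline indicates a warm/fast reboot."""
    match = BOOT_TYPE_RE.search(cmdline)
    return match is not None and match.group(1) in WARM_OR_FAST
```

In use, the argument would be the contents of /proc/cmdline, and a True result would lead the platform API to report REBOOT_CAUSE_NON_HARDWARE without consulting the leftover hardware cause.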

- How to verify it
a. Manual test:
    - Perform a power loss
    - Perform a warm/fast reboot
    - Check that the reboot cause is "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-01-31 18:34:36 +08:00
Junchao-Mellanox
e631f426f4 [infra] Support syslog rate limit configuration (#12490) (#13535)
Backport of https://github.com/sonic-net/sonic-buildimage/pull/12490 into 202211

- Why I did it
Support syslog rate limit configuration feature

- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration

- How to verify it
Manual test
New sonic-mgmt regression cases
2023-01-30 20:11:44 +02:00
Dror Prital
d12c3b79bc [202211][Mellanox] Add ASIC simulation version tag to fw.mk (#13473)
Signed-off-by: dprital <drorp@nvidia.com>
2023-01-23 13:28:19 +02:00
mssonicbld
1dc71aa4ff [Mellanox] Update ECMP calculator README (#13051) (#13362)
2023-01-14 11:46:42 +08:00
mssonicbld
7524e91aa1 The FAN driver framework module complies with s3ip sysfs specification (#12888) (#13212)
Why I did it
Provide a fan driver framework that complies with the s3ip sysfs specification.

How I did it
1. The framework module provides the register and unregister interfaces and their implementation.
2. The framework helps create the sysfs nodes.

How to verify it
A demo driver based on this framework will display the sysfs nodes, which conform to the s3ip sysfs specification.

Co-authored-by: tianshangfei <31125751+tianshangfei@users.noreply.github.com>
2023-01-09 14:24:41 +08:00
mssonicbld
ab0533e646 two platforms supporting S3IP SYSFS (TCS8400, TCS9400) (#12386) (#13210)
Why I did it
Add two platforms that support the S3IP framework.

How I did it
Add two platforms supporting S3IP SYSFS (TCS8400, TCS9400)

How to verify it
Manual test

Co-authored-by: tianshangfei <31125751+tianshangfei@users.noreply.github.com>
2023-01-09 11:40:35 +08:00
mssonicbld
1e522ff3a9 Add ECMP calculator tool (#12482) (#13301)
2023-01-09 00:48:56 +08:00
Richard.Yu
fb6f0b53ba [SAIServer]Upgrade SAI server init script (#13175) (#13227)
Why I did it
To apply different configurations across platforms and keep the code in a unified format, reuse the syncd init script to initialize saiserver.

How I did it
Reuse syncd init script

How to verify it
Tested on s6000 and dx010 DUTs with SONiC 202205.
2023-01-03 16:03:05 +08:00
mssonicbld
79b0890c53 The user framework module complies with s3ip sysfs specification (#12894) (#13215)
2023-01-01 12:35:32 +08:00
mssonicbld
684b07f172 The demo driver complies with s3ip sysfs specification,which use the s3ip kernel framework (#12895) (#13214)
2023-01-01 12:35:11 +08:00
mssonicbld
4ac8359854 The CPLD and FPGA driver framework module complies with s3ip sysfs specification (#12891) (#13218)
2023-01-01 12:34:50 +08:00
mssonicbld
313406a290 The build project of s3ip frameworkk (#12896) (#13213)
2023-01-01 12:32:42 +08:00
mssonicbld
967cc38356 The PSU driver module complies with s3ip sysfs specification (#12887) (#13211)
2023-01-01 12:32:36 +08:00
mssonicbld
fe5732a4cc The slot and switch_rootsysfs driver framework module complies with s3ip sysfs specification (#12893) (#13216)
2023-01-01 12:28:41 +08:00
mssonicbld
5489913baf The Sensor driver framework module complies with s3ip sysfs specification (#12890) (#13219)
2023-01-01 12:27:55 +08:00
mssonicbld
29e7348c7b The Transceiver driver framework module complies with s3ip sysfs specification (#12889) (#13220)
2023-01-01 12:26:52 +08:00
mssonicbld
8552b92b98 The LED and watchdog driver framework module complies with s3ip sysfs specification (#12892) (#13217)
2023-01-01 12:24:31 +08:00