1023 Commits

Author SHA1 Message Date
Nazarii Hnydyn
530125311e
[Mellanox]: Advance SAI submodule. (#11149)
- To fix tunnel underlay configuration

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2022-06-15 12:30:14 -07:00
shlomibitton
5c5c13a536
Add a new patch to set PSU led to green on init by Nvidia hw-mgmt package (#10912)
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2022-05-24 19:19:00 -07:00
abdosi
cd28f30969
Updated Broadcom SAI version to 3.7.6.1-1 (#10859)
Updated BRCM SAI Version to 3.7.6.1-1
2022-05-17 22:39:10 -07:00
Volodymyr Samotiy
ce7bf08144
[Mellanox] [201911] Update FW to v2008.3382 (#10798)
- Why I did it
To include the fix for an issue where modifying shared headroom on the fly can result in negative occupancy, which leads to PFC being sent from the switch continuously.

- How I did it
Updated submodule pointer and version in relevant Makefile.

- How to verify it
Build an image and run tests from sonic-mgmt.

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-05-11 08:39:01 +03:00
Santhosh Kumar T
9093feb113
[DellEMC][201911] S6100 CPLD upgrade support in 201911 branch porting changes (#10686)
Why I did it
Porting changes from DellEMC: S6100 CPLD upgrade #4299 and DellEMC S6100 CPLD upgrade support #3834 to 201911 branch
Added CPLD upgrade support for DellEMC S6100 platform.
2022-04-28 09:23:38 -07:00
Santhosh Kumar T
ac35a62747
[DellEMC][201911] S6100 S6000 - Show techsupport enhancement (#10690)
2022-04-27 09:17:35 -07:00
Junchao-Mellanox
2ffc9d572f
[Mellanox] [201911] Optimize thermal policies (#9664)
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.

- How I did it
Reduce unused thermal policies
Add a periodic ASIC temperature check in the thermal policy to make sure ASIC temperature and fan speed are coordinated
The minimum allowed fan speed is now calculated as the maximum of the expected fan speeds among all policies
Move some logic from fan.py to thermal.py to make it more readable
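
The minimum-speed rule listed above can be sketched as follows. This is an illustration only, not the code from thermal.py; the policy names, the values, and the `minimum_allowed_fan_speed` helper are hypothetical.

```python
def minimum_allowed_fan_speed(expected_speeds):
    """Return the fan speed floor (percent) across all policies.

    expected_speeds maps a (hypothetical) policy name to the fan speed
    in percent that the policy expects.
    """
    if not expected_speeds:
        raise ValueError("at least one policy must report an expected speed")
    # The most demanding policy decides the floor.
    return max(expected_speeds.values())


# Hypothetical policies and values, for illustration only.
speeds = {"asic_temperature": 70, "psu_absent": 100, "default": 60}
floor = minimum_allowed_fan_speed(speeds)
```

Taking the maximum means no single policy can pull the fan speed below what another policy requires.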

- How to verify it
1. Manual test
2. Regression
2022-01-19 11:42:09 +02:00
Arun Saravanan Balachandran
33ef26d97b
[201911] DellEMC: S6000, S6100 - Enable thermalctld, Platform API changes (#9384)
Why I did it
To incorporate the below changes in DellEMC S6100, S6000 platforms.

Enable thermalctld
Backport Platform API changes from master branch.
How I did it
Remove 'skip_thermalctld:true' in pmon_daemon_control.json
Implement the platform API methods in the respective device files
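
A minimal sketch of the first step, assuming pmon_daemon_control.json is a flat JSON object of skip flags. The `enable_thermalctld` helper and the sample content are hypothetical; the actual change simply removed the entry from the file.

```python
import json


def enable_thermalctld(config_text):
    """Return the config text with the 'skip_thermalctld' flag removed."""
    config = json.loads(config_text)
    # Dropping the flag lets pmon start thermalctld; other flags are untouched.
    config.pop("skip_thermalctld", None)
    return json.dumps(config, indent=4)


# Sample content for illustration only, not the real platform file.
original = '{"skip_thermalctld": true, "skip_ledd": true}'
updated = enable_thermalctld(original)
```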
How to verify it
Verified that platform data is displayed by show platform fan and show platform temperature commands.
2021-12-10 12:23:22 -08:00
Samuel Angebault
dfa77a54d5
[201911][Arista] Backport logrotate configuration (#9455)
Backport logrotate configuration for arista*.log files
2021-12-08 19:11:04 -08:00
Elvis Tsai
a8fed0a85e
[201911][Innovium] Update Wistron platform definition
Why I did it
Cannot retrieve and display the reboot-cause.

How I did it
Correct the platform initialization definition.

How to verify it
Manual reboot and then 'show reboot-cause'
2021-12-08 19:09:53 -08:00
Junchao-Mellanox
2b4c8ee330
[Mellanox] Fan speed should not be 100% when PSU is powered off (#9258) (#9380)
Backport #9258 to 201911

Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.

How I did it
When a PSU is powered off, don't treat it as absent.
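
The distinction can be sketched as below, assuming (as the commit title implies) that only an absent PSU should force the fans to 100%. The data shape, the `target_fan_speed` helper, and the 60% default are hypothetical and are not taken from thermal_infos.py.

```python
def target_fan_speed(psus, normal_speed=60):
    """Pick a fan speed (percent) from PSU state.

    psus is a list of dicts with boolean 'present' and 'powered' keys
    (a hypothetical shape chosen for this sketch).
    """
    # An absent PSU changes the air flow, so run the fans at full speed.
    if any(not psu["present"] for psu in psus):
        return 100
    # A PSU that is present but powered off leaves the air flow unchanged.
    return normal_speed


powered_off = [{"present": True, "powered": True}, {"present": True, "powered": False}]
removed = [{"present": True, "powered": True}, {"present": False, "powered": False}]
```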

How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
Conflicts:
platform/mellanox/mlnx-platform-api/sonic_platform/thermal_infos.py
2021-12-07 18:22:51 -08:00
Volodymyr Samotiy
690f8e6919
[Mellanox] Update SDK to v4.4.3360 and FW to v2008.3358 (#9402)
- Why I did it
To include latest fixes.

1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
3. On rare occasions, when working with port rates of 1GbE or 10GbE and congestion occurs, packets may get stuck in the chip and may cause switch to hang.
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. Using SN4600C with copper or optics loopback cables in NRZ speeds, the link may take a long time to come up (up to 70 seconds).
6. When connecting SN4600C to SN4600C after Fastboot in 50GbE No_FEC mode with a copper cable, the link up time may take ~20 seconds.

- How I did it
Updated SDK submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-12-05 09:17:17 +02:00
Santhosh Kumar T
ddf40cb729
[201911] Dell S6000 I2C not responding to certain optics - porting (#8855)
2021-10-25 15:25:12 +05:30
Stephen Sun
1b168c36c4
[Mellanox][201911] Upgrade Mellanox-SAI to 1.19.3 to support reclaiming reserved buffer on admin down ports (#8735)
#### Why I did it

Upgrade Mellanox-SAI to 1.19.3 to support reclaiming reserved buffer on admin down ports

#### How I did it

To support reclaiming reserved buffer on admin down ports.

#### How to verify it

Regression test and manual test.
2021-09-19 20:16:10 -07:00
Junchao-Mellanox
30f2503ab3
[Mellanox] Read PSU fan max/min speed per PSU (#8563) (#8728)
A new PSU may have a different type of fan installed, so the fan max/min speed should be read per PSU
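
A sketch of reading the range per PSU instead of using one shared constant. The attribute names and the injected `read_attr` callable are hypothetical stand-ins for however the platform exposes these values.

```python
class PsuFan:
    """Fan that looks up its own speed range through its PSU index."""

    def __init__(self, psu_index, read_attr):
        self.psu_index = psu_index
        # read_attr(name) -> int is injected so the sketch runs without hardware.
        self._read_attr = read_attr

    def speed_percent(self, rpm):
        # Hypothetical attribute name, looked up separately for each PSU.
        max_rpm = self._read_attr("psu{}_fan_max".format(self.psu_index))
        return min(100, round(rpm * 100 / max_rpm))


# Two PSUs with different fan models (made-up values).
fake_attrs = {"psu1_fan_max": 20000, "psu2_fan_max": 25000}
fan1 = PsuFan(1, fake_attrs.__getitem__)
fan2 = PsuFan(2, fake_attrs.__getitem__)
```

The same RPM reading then maps to a different percentage on each PSU, which is the behaviour a single global maximum cannot express.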
2021-09-13 08:32:28 -07:00
haowei1122
acb9bbafcc
Update sonic-fanthrml-monitor (#8636)
* Thermal mapping is wrong with BMC return value
2021-09-09 09:44:56 -07:00
shlomibitton
55f86768a6
[Mellanox] Update SDK\FW to version 4.4.3326\2008.3326 (#8568)
- Why I did it
Update SDK\FW version to 4.4.3326\2008.3326. This version contains:

New Features:
1. Add support for Fast Boot for SN3800

Bug Fixing:
1. In some cases, when the total number of allocations exceeds the resource limit, an error can occur due to incorrect resource release procedure. This issue is most likely to affect the following resources: flow counters, ACL actions, PBS, WJH filter, Tunnels, ECMP containers, MC (L2 & L3)
2. On Spectrum systems, when using Async Router API with IPV6, an error message in the log regarding failing to remove ECMP container may show up. This error is not functional and can be safely ignored.
3. On Spectrum-2 systems and above, when using warm boot, setting max_bridge_num to a value greater than 1968 will cause an error and potential crash.
4. Some Molex cables do not support speed after reboot

- How I did it

- How to verify it
Verified by running regression tests that include the complete set of supported sonic-mgmt tests
2021-08-25 16:34:42 +03:00
Junchao-Mellanox
c647c7ce2b
[Mellanox] Upgrade hw-mgmt to 7.0100.2344 (#8378)
Why I did it
To support new PSU fan on mellanox platforms

How I did it
Upgrade hw-mgmt to 7.0100.2344
2021-08-19 18:07:58 -07:00
Aravind Mani
c53822c9e8
[201911] Dell S6100:Add serial-getty service to monit (#8409)
Why I did it
serial-getty service exited in Dell S6100 device randomly.

How I did it
Added serial-getty to monit services.

How to verify it
Stop serial-getty in ssh session and check whether the service restarts or not
2021-08-19 10:13:34 -07:00
abdosi
de3d30f36d
Updated Broadcom SAI Debian package to 3.7.6.1 (#8365)
Updated Broadcom SAI Debian package to 3.7.6.1 Following are the major changes here:

- CS00011651922/CS00012192502 SID:Parity error in TDM Calendar memories causes traffic drop after SER correction
- CS00011222060 soc_mem_alpm_delete: unit 0: ALPM delete operation[L3_DEFIP_ALPM_IPV6_128] encountered parity error
- Cesto Phy Recovery enhancement.
- SDK compile with flag -DBCM_MONOTONIC_TIME and -DBCM_MONOTONIC_MUTEXES
2021-08-06 17:55:41 -07:00
Arun Saravanan Balachandran
d573cd141d
[201911] DellEMC S6100: Update SSD upgrade status checker (#8225)
Why I did it
To handle newer SSD firmware version in DellEMC S6100 platform (S210506G - 3IE devices).

How I did it
Update s6100_ssd_upgrade_status.sh to handle newer SSD firmware version.

How to verify it
Logs: UT_logs.txt
2021-08-05 22:43:53 -07:00
Dror Prital
2a34e8aca5
[mellanox]: Update SDK\FW to version 4.4.3228\2008_3224 (#8352)
Fix the following issue: Resource KVD hash Table tries to deallocate more resources than allocated.

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-08-05 19:05:26 -07:00
Dror Prital
949fcd21a8
Update SDK\FW to version 4.4.3222\2008.3224 (#8248)
* Update SDK\FW Version to 4.4.3222\2008.3224.
Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-07-22 18:17:05 -07:00
shihjeff
940aaa0cbe
[201911] [Innovium] Update Cameo & Wistron Drivers (#7855)
Fix #8068

Update Innovium configs on Cameo and Wistron platforms
2021-07-21 09:09:36 -07:00
Vivek Reddy
fcc7d3102a
[201911][Mellanox] Update SDK\FW ver. 4.4.3216\2008.3218 (#8145)
Signed-off-by: Dror Prital <drorp@nvidia.com>
* [Mellanox] Update FW version to 2008.3218 (#8079)
Update FW version to 2008.3218, fixing the following issues:
- 50G/100G links that are operationally down before warm-reboot are not coming up after warm-reboot
- 50G/100G links with admin shut / no shut commands are not coming up after warm-reboot
2021-07-09 17:54:25 -07:00
Vivek Reddy
d958b6c664
Update SAI Commit (#8141)
[ba669c3] Fix saisdkdump
Co-authored-by: Vivek Reddy Karri <vkarri@r-build-sonic06.mtr.labs.mlnx>
2021-07-09 15:28:00 -07:00
abdosi
0f56f8b4f4
[201911] Updated to Broadcom SAI debian package to 3.7.5.2-3 (#7887)
Updated to Broadcom SAI debian package to 3.7.5.2-3
2021-06-15 16:03:23 -07:00
Dror Prital
e2eb4e49ab
[Mellanox][201911] Update FW version to 2008_3110 (#7806)
- Why I did it
Update FW version to 2008_3110 fixing SN3800 specific warm boot scenario:

1. Disable interface
2. Warm Boot
3. Enable Interface --> link will remain down.

- How I did it
Use new FW that contains the fix for the problem mentioned above

- How to verify it
Run the scenario mentioned above and make sure that the link is up after warm boot

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-06-08 14:06:43 +03:00
Kebo Liu
7fe493aca1
[201911][Mellanox] Align PSU name convention returned from psu.get_name platform API (#7793)
Make PSU name returned from platform API aligned with the convention "PSU {X}" instead of "PSU{X}".

This PR is to backport https://github.com/Azure/sonic-buildimage/pull/7783
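
The naming convention amounts to a space between the prefix and the index. A one-line sketch, where `psu_name` is a hypothetical helper rather than the platform API method itself:

```python
def psu_name(index):
    """Format a PSU name using the "PSU {X}" convention."""
    return "PSU {}".format(index)


names = [psu_name(i) for i in (1, 2)]
```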
2021-06-04 10:38:16 -07:00
Volodymyr Samotiy
8405d2deef
[Mellanox] Update SDK to 4.4.3106 and FW to xx.2008.3106 (#7787)
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-06-03 10:02:23 -07:00
Joe LeVeque
b6acac4e6a
[brcm] Fix and simplify start_led.sh (#7548)
LED_PROC_INIT_SOC variable was incorrectly referenced as LED_SOC_INIT_SOC. Introduced in #5483

Rather than fixing the typo, I decided to simplify the script, removing the need for the conditional altogether by moving the bcmcmd call inside the conditional which checks for the presence of LED_SOC_INIT_SOC.
2021-05-31 08:09:19 -07:00
zzhiyuan
f2afdf666e
[201911][Arista] Update Arista submodule to include pmbus fix (#7723)
#### Why I did it
Microsoft reported occasional daemon crashes on devices running 201911. On close inspection it was due to PMBus reads failing on IOError on very rare occasions.

#### How I did it
Add try/except block on performing reads on PMBus GPIOs.
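
A generic sketch of this kind of guard, assuming the reads go through ordinary file I/O. It is not the code from the Arista submodule; `read_gpio` and the path passed to it are hypothetical.

```python
def read_gpio(path, default=None):
    """Read a GPIO value from a file, tolerating transient I/O failures."""
    try:
        with open(path) as gpio_file:
            return gpio_file.read().strip()
    except IOError:
        # A rare failed read should not crash the daemon; report a default.
        return default


# Hypothetical path that does not exist, to exercise the failure branch.
value = read_gpio("/nonexistent/pmbus/gpio")
```

Returning a default keeps the daemon alive; whether a caller should retry or log the failure is a separate design choice not covered here.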

Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
2021-05-27 12:30:38 -07:00
Kebo Liu
77ba74be04
update hw-mgmt package to version 2304 (#7726)
2021-05-27 10:39:16 -07:00
Santhosh Kumar T
04b6112132
[DellEMC] Recovering the SSD upgrade status post reload in S6100 (#7688)
Why I did it
To recover the SSD upgrade state in case of an ONIE uninstall or if the ssd_fw_upgrade folder got deleted.
To handle newer SSD version(S21506G - 3IE GPIO7 low devices).
Also correcting the error messages for non-upgraded S6100s.
2021-05-25 15:24:09 -07:00
roman_savchuk
efd4b93ac9
[BFN] Updated SAI/SDK packages to 20210519 (#7660)
Includes fix for ACL counter (changed to 64 bit)

Signed-off-by: Roman Savchuk <romanx.savchuk@intel.com>
2021-05-20 10:09:51 -07:00
Tony Titus
fbd4e452c7
[201911] [Innovium] Add new platforms and config updates (#7545)
Update Innovium configs + Add new platforms supporting Innovium chips
2021-05-17 12:30:20 -07:00
shlomibitton
b869ad1122
[Mellanox] Update FW to xx.2008.2526 (#7511)
- Why I did it
Updated FW to xx.2008.2526 version.

Fixed issues:
1. Spectrum-2, Spectrum-3 | sFlow | High CPU load and high on fully loaded switch.
2. Spectrum-2, Spectrum-3 | Fine grain LAG | in rare cases doesn’t update the right entry

- How I did it
Updated submodule pointer and version in a Makefile.

- How to verify it
Full regression and bugs validation

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-05-14 11:42:22 -07:00
Santhosh Kumar T
6204a1d809
[201911] DellEMC S6100 SSD Monitor additional changes (#7291)
Why I did it
Added soft-reboot plugin support.
Added SSD version s16425cq check
Added error message to display in console/SSH in case reboot is called in faulty/non-upgraded devices.
2021-05-04 09:48:04 -07:00
Junchao-Mellanox
6128ff6def
[Mellanox] [201911] Upgrade hw-mgmt to 7.0100.2303 (#7418)
- Update hw-mgmt pointer
- Remove unused patches
- Fix existing patch to make sure it applies successfully
2021-05-01 10:30:51 -07:00
Volodymyr Samotiy
efe0515519
[Mellanox] [201911] Update SAI submodule pointer (#7498)
* Set monitoring VLAN hostif up by default (for VNET ping tool)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-05-01 10:23:57 -07:00
Dror Prital
b773412a96
[mellanox]: Integrate SAI version 1.18.3.1 into 201911 branch (#7426)
- Fix ACL ANY debug counter to correctly track ACL drops
- Add VXLAN source port hard coded range, controlled by K/V

Signed-off-by: Dror Prital <drorp@nvidia.com>
2021-04-28 23:01:58 -07:00
Kebo Liu
80f0836643
[Mellanox] Update SDK to 4.4.2522 and FW to 2008.2520 (#7391)
New features and fixes in the new SDK/FW:

SN4600C | AN/LT support
SN2700 | AN/LT bugs fixes
WJH | FID_MISS support

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2021-04-28 16:06:37 -07:00
Volodymyr Samotiy
d35c31b8fc
[Mellanox] Update SDK to 4.4.2508 and FW to xx.2008.2508 (#7141)
Fix the following issues:

Spectrum-2, Spectrum-3 | Port | Fix link issue when using 25 GbE rate between two ports while one is on Spectrum-2-based system and the other is on Spectrum-3-based system
All | warmboot | fail to upgrade from earlier SONiC versions with official SDK/FW 4.4.2306 (was on SONiC 201911)
All | What-Just-Happened | When enabling or disabling WJH under high traffic load to the host CPU, in very specific and low probability conditions, an error could occur, that may result in loss of data, channel failure or in extreme cases SW failure

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-04-07 18:06:46 -07:00
roman_savchuk
840f19af18
[BFN] Updated SAI/SDK packages to 20210405 (#7229)
Updated SDE due to issue in driver part
2021-04-07 14:18:49 -07:00
rkdevi27
47011a8e2c
[201911][DellEMC] Fix abrupt reboot in S6000 (#6909)
On S6000 devices, the cold reboot is abrupt and is likely to cause issues that land the device in the EFI shell. Hence the platform reboot now happens after a graceful unmount of all the filesystems, as on the S6100.
2021-03-30 16:14:45 -07:00
Joe LeVeque
72b32a96fc
[201911][dockers][supervisor] Increase event buffer size for process exit listener (#7106)
Backport of https://github.com/Azure/sonic-buildimage/pull/7083 to the 201911 branch.

#### Why I did it

To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
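
The setting involved is supervisord's `buffer_size` option on `[eventlistener:x]` sections, which defaults to 10. Below is a sketch of a check that every listener meets the new value. The sample configuration (including the `command` lines) is illustrative rather than copied from the repository, and `undersized_listeners` is a hypothetical helper.

```python
import configparser

# Illustrative supervisord fragment, not the real container configuration.
EXAMPLE_CONF = """
[eventlistener:dependent-startup]
command=python3 -m supervisord_dependent_startup
events=PROCESS_STATE
buffer_size=1024

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name swss
events=PROCESS_STATE_EXITED
buffer_size=1024
"""


def undersized_listeners(conf_text, minimum=1024):
    """Return event listener sections whose buffer_size is below minimum."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return [
        section
        for section in parser.sections()
        if section.startswith("eventlistener:")
        # supervisord falls back to a buffer of 10 events when unset.
        and parser.getint(section, "buffer_size", fallback=10) < minimum
    ]


problems = undersized_listeners(EXAMPLE_CONF)
```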
2021-03-29 10:07:43 -07:00
Stephen Sun
746a64e483
[mellanox]: Integrate hw-mgmt V.7.0010.1002 (#7149)
Bug fixes

-Removing critical thermal zones to prevent unexpected software system shutdown:
   Kernel 4.9 -0071-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
   Kernel 4.19 -076-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch

- hw-mgmt: thermal: Add hardcoded critical trip point

- Removing redundant link for cpld3 for fixed systems (SN2100, SN2010).

- Fix an issue with a missed attribute for cpld3 (port CPLD) for SN2700, SN2410.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-03-28 19:23:02 -07:00
roman_savchuk
e8965e3584
[BFN] Updated SAI/SDK packages to 20210317 (#7082)
Fix for vlan-id ACL filter introduced in SONiC 201911 #234

Signed-off-by: Roman Savchuk <romanx.savchuk@intel.com>
2021-03-24 20:03:03 -07:00
Volodymyr Samotiy
88de361f96
[Mellanox] Update FW to xx.2008.2424 (#7118)
Fixed issues:
* Mellanox SN-2700 breakout port not linking up with QSA

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-03-22 19:35:32 -07:00
Kebo Liu
f2cd1ee2db
update SDK/FW and SAI to new version (#7040)
- Why I did it
To pick up new features and fixes from SDK/FW and SAI

SDK/FW new Feature:

All | Added support for multiple modules and cable types. For full list contact Nvidia networking support
Spectrum-3 | SN4600C | Added support for up to 5W on ports 49 to 64.
SDK/FW bugs' fix:

All | fast reboot | fast boot failure from latest 201811 to 201911 and above
Spectrum | 10GbE/1GbE Transceiver (FTLX8574D3BCV) stopped working after firmware upgrade
Spectrum-2 | When device is rebooted with locked Optical Transceivers in split mode, the firmware may get stuck
Spectrum-2 | SN3700 | When connecting at 200GbE to Ixia K400, Ixia receives CRC errors
Spectrum-2 | SN3800 | On rare occasions packets loss may be experienced due to signal integrity issues
Spectrum-2 | When the port is a member of a LAG, after a warmboot and port toggle on the peer-side, the port remains down
Spectrum-3 | SN4700 | While using Optic cable in Split 4x1 mode in PAM4, when two first ports are toggled, the other 2 ports go down
Spectrum-3 | SN4700 | When working in 400GbE, deleting the headroom configuration (changing buffer size to zero) on the fly may cause continual packet drops
SAI

All | Counters | Update tunnel decap counter to capture VNI miss
- How I did it
Update the related version number in the make files and update the submodule pointer accordingly.

- How to verify it
Run the regression tests and confirm that everything works.
2021-03-14 08:36:03 +02:00