Commit Graph

1518 Commits

Author SHA1 Message Date
Junchao-Mellanox
7543993af3
[202012] [Mellanox] Fix issue: SFP eeprom corrupted after replacing cable with different sfp type (#13543)
- Why I did it
There are 3 tasks in xcvrd:

1. Main task: runs a loop that recovers missing SFP static information to the DB every minute
2. SFP state task: a process that listens for cable plug in/out events and inserts SFP static information into the DB when a cable is inserted
3. SFP DOM update task: a thread that updates cable DOM information every minute

Assume a user replaces a QSFP with a QSFP-DD. There are two issues:

1. Only the SFP state task listens for cable plug in/out events. The main task and the SFP DOM update task do not know that the SFP type has changed; they still “think” the SFP type is QSFP. So they parse the QSFP-DD EEPROM using the QSFP standard, which produces corrupted data.
2. There is a race condition between the main task and the SFP state task. Both insert SFP static information into the DB. Depending on timing, the main task may overwrite the SFP static information using the wrong SFP type.

This PR fixes these two issues.

There is no such issue on 202205 and above because xcvrd was refactored:

1. The SFP state task was changed from a process to a thread, so all 3 tasks share the same memory space and always have the correct SFP type.
2. The logic that recovers missing SFP information was moved from the main task to the SFP state task, so there is no race condition anymore.

- How I did it
It is difficult to backport the latest xcvrd because it has received many refactors and new features since the 202012 release; doing so would be a huge effort. We therefore decided to fix the issue on the Nvidia platform API side. The fix refreshes the SFP type before any SFP API that accesses the SFP EEPROM. This causes a small performance degradation: in my test on the 202012 branch, accessing transceiver INFO and DOM INFO for 32 ports took 1.7 seconds before the change and 2.4 seconds after it. I consider this degradation acceptable.
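
The idea of refreshing the SFP type before every EEPROM-accessing API can be sketched as a decorator. This is a minimal illustration, not the actual platform API code: the `Sfp` class, the `refresh_sfp_type` decorator, the `read_identifier` callable and the small identifier table are all hypothetical names introduced here.

```python
import functools

# Identifier byte -> module type. Only a few values are listed, and the table
# is an assumption made for illustration, not the platform's real mapping.
ID_TO_TYPE = {0x0D: "QSFP", 0x11: "QSFP", 0x18: "QSFP-DD"}


def refresh_sfp_type(method):
    """Re-detect the module type right before an EEPROM-accessing API runs."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        self.sfp_type = self.detect_sfp_type()
        return method(self, *args, **kwargs)
    return wrapper


class Sfp:
    def __init__(self, read_identifier):
        # read_identifier: hypothetical callable returning EEPROM byte 0
        self._read_identifier = read_identifier
        self.sfp_type = None

    def detect_sfp_type(self):
        return ID_TO_TYPE.get(self._read_identifier(), "UNKNOWN")

    @refresh_sfp_type
    def get_transceiver_info(self):
        # A real implementation would pick the parser based on self.sfp_type.
        return {"type": self.sfp_type}
```

Because the type is re-read on each call, a cable swap between two calls is picked up without any plug in/out event. The cost is one extra read per API call, which matches the kind of slowdown measured above.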

- How to verify it
Manual test
Regression
2023-02-19 09:47:32 +02:00
jhli-cisco
592ce16d05
Update cisco-8000.ini (#13793)
#### Why I did it
1.57.x SDK based incremental drop that addresses:
Fix for MIGSMSFT-158
Support for VxLAN and BFD Serviceability CLI
sfputil reset platform fix to handle 100G optics
Added thermal management feature for ZR optics sensors

#### How I did it
Update cisco-8000 submodule to v0.2.5
2023-02-14 11:05:09 -08:00
Nazarii Hnydyn
83b6518ae2
[202012][mellanox]: Add BIOS upgrade infra (#13571)
- Why I did it
Added BIOS upgrade infra

- How I did it
Added new make target

- How to verify it
Copy msn3800_bios.tar.gz to platform/mellanox/bios
make configure PLATFORM=mellanox
make target/files/buster/msn3800_bios.tar.gz

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2023-02-02 10:07:03 +02:00
Richard.Yu
025e77bb5d
[202012] Update SAI version to 4.3.7.1-7 (#13431)
CS00012254651 (SONIC-66820) Fix missing break stmt

Verify

run case test_forward_ip_packet_with_0xffff_chksum_tolerant

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2023-01-19 11:30:40 -08:00
jhli-cisco
2357eeef46
[cisco]: Update cisco-8000.ini (#13425)
1.57.x SDK based incremental drop that addresses: 
1) orchagent crash
2) Port LED issue
3) Tunnel endpoint stats
4) test_warm_reboot issue
5) nhop test failure
6) "show platform versions" CLI
2023-01-19 09:13:51 -08:00
Junchao-Mellanox
46a774e294
[202012] [Mellanox] Fix select timeout in sfp event (#13347)
- Why I did it
Backport #9795
Python select.select accepts an optional timeout value in seconds; however, the value passed to it is in milliseconds.

- How I did it
Convert the value from milliseconds to seconds before passing it to select.select.
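
A minimal sketch of the unit mismatch described above, assuming the caller keeps its timeout in milliseconds. The `wait_readable` helper is a hypothetical name, not the function used in the SFP event code.

```python
import select


def wait_readable(fds, timeout_ms):
    """Wait until any fd in fds is readable; timeout_ms is in milliseconds."""
    # select.select expects seconds (a float is allowed), so convert first.
    timeout_s = timeout_ms / 1000.0
    readable, _, _ = select.select(fds, [], [], timeout_s)
    return readable
```

Without the conversion, a caller asking for a 1000 ms wait would block for up to 1000 seconds.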

- How to verify it
Manual test
2023-01-19 17:29:41 +02:00
Kebo Liu
a569bfc9eb
skip hw reboot cause if warm/fast reboot found from the proc cmdline (#13378)
#### Why I did it
Backport https://github.com/sonic-net/sonic-buildimage/pull/13246 to 202012 branch.

In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because the CPLD is not touched in this flow. To avoid confusing the reboot cause determination logic, the platform API skips the leftover hardware reboot cause and returns 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.

#### How I did it

Check the proc cmdline to see whether the last reboot was a warm or fast reboot; if so, skip checking the leftover hardware reboot cause.
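
A minimal sketch of this check. It assumes the boot type appears on the kernel command line as a `SONIC_BOOT_TYPE=<type>` token; that token name, the set of accepted values, and the helper names below are assumptions made for illustration, not taken from this PR.

```python
# Assumed values that mark a warm or fast reboot on the kernel command line.
WARM_FAST_TYPES = {"warm", "fast", "fastfast", "fast-reboot"}


def is_warm_or_fast_reboot(cmdline):
    """Return True if the command line marks the last boot as warm/fast."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "SONIC_BOOT_TYPE":
            return value in WARM_FAST_TYPES
    return False


def get_hw_reboot_cause(cmdline, read_hw_cause):
    """Skip the leftover hardware cause after a warm/fast reboot."""
    if is_warm_or_fast_reboot(cmdline):
        return "REBOOT_CAUSE_NON_HARDWARE"
    # read_hw_cause: hypothetical callable that reads the cause from hardware
    return read_hw_cause()
```

On a device, `cmdline` would be the content of `/proc/cmdline`; passing it in as a string keeps the logic testable without hardware.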

#### How to verify it

a. Manual test:
> 1. Perform a power loss
> 2. Perform a warm/fast reboot
> 3. Check that the reboot cause is "warm-reboot" or "fast-reboot" instead of "power loss"

b. Run reboot cause related regression test.
2023-01-17 13:21:31 -08:00
Nazarii Hnydyn
5193a96895
[202012][Mellanox]: Update ONiE FW tool: manual reboot control. (#13359)
Partial cherry-pick of: [Mellanox] Modified Platform API to support all firmware updates in single boot #9608

- Why I did it
To allow user manual reboot control over ONiE FW upgrade

- How I did it
Added handling for a dedicated script argument

- How to verify it
mlnx-onie-fw-update.sh update --no-reboot

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2023-01-16 15:27:48 +02:00
Marty Y. Lok
27d798a2b7
[armhf][sonic-installer] Fix issue of the sonic-installer install a image after sonic-installer clean (#12609)
Signed-off-by: mlok <marty.lok@nokia.com>
2023-01-12 23:30:07 +00:00
Santhosh Kumar T
2081e6f45d
[DellEMC] Master: S6100: SSD upgrade status: Moving from smartctl to iSMART (#12784)
Why I did it
The smartctl tool is available only in the PMON docker, so it may not be accessible if the PMON docker goes down.
Use the iSMART_64 tool instead to fetch the SSD firmware version and device model information.

How I did it
Replacing smartctl with iSMART_64.
2023-01-12 23:30:02 +00:00
Richard.Yu
33bf592f09
[Cherry-pick][SAIServer]Upgrade SAI server init script (#13175) (#13226)
Why I did it
To apply different configs across different platforms and use the code in a unified format, reuse the syncd init script to init saiserver.

How I did it
Reuse the syncd init script

How to verify it
Tested on DUT s6000 and dx010 with SONiC 202205
2023-01-03 13:22:32 +08:00
jhli-cisco
26709ffb86
[cisco]: Update cisco-8000.ini (#13182)
Why I did it
1.50.x SDK based drop to fix MIGSMSFT-120 ("[8102] Orchagent crash as addRoutePost failed at SAI")

How I did it
Update cisco-8000 submodule to v0.121
2023-01-01 11:58:59 -08:00
Richard.Yu
830102a353
[202012][Submodule][SAI-Redis]Advance SAI Redis head pointer (#13157)
Why I did it
[202012][Submodule][SAI-Redis]Advance SAI Redis head pointer

How I did it
include changes:

sonic-net/sonic-sairedis@dcea4cd
sonic-net/sonic-sairedis@5e9bcb1
sonic-net/sonic-sairedis@8f2a53f
sonic-net/sonic-sairedis@c1d7938 [202012][Submodule][SAI]Advance SAI head pointer sonic-sairedis#1184
Remove the skip-error parameter, which was removed in [202205][Submodule][SAI]Advance SAI head pointer sonic-sairedis#1185
How to verify it
local build
2022-12-27 08:10:42 +08:00
Dror Prital
e8c7a7c61e
[202012][Mellanox] Update SDK/FW to version 4.5.3196/2010_3196 (#12989)
- Why I did it
Update SDK/FW version - 4.5.3196/2010_3196 in order to have the following fixes:

1. On SPC2/3, in some cases, many ACL region resizes corrupt the internal DB, which in turn causes future ACL configuration to fail
2. LAG port as analyzer port | when removing a port from the distributor list, the SDK does not reselect another port for mirroring
3. Due to a critical race at initial configuration, the SDK RDQ test may test an RDQ configured for WJH and fail the test

Add support for new HW SKU of SN4700

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests
2022-12-08 12:16:54 +02:00
Richard.Yu
e15acb59ff
enable sai-ptf logger in sai_adapter to log all the sai api invocations (#12922)
Why I did it
Enable the sai-ptf logger in sai_adapter to log all SAI API invocations

How I did it
Add a build parameter that enables the sai-ptf logger when building sai RPC

How to verify it
local build test
test the generated sai_adapter
test with pipeline
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2022-12-04 22:10:28 -08:00
Richard.Yu
acd24d9804
[submodule]Update SAI SDK URL from package storage to public (#12835)
To make SAI updates easier, change the URL pattern to a more unified format that can be updated automatically later.

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2022-12-03 22:37:18 -08:00
jhli-cisco
0569771757
Update cisco-8000.ini (#12808)
Why I did it
1.57.x SDK based incremental drop that addresses a few egress ACL and drop counter failures. Hostname, vtysh, and incorrect queue watermark issues are addressed too.

How I did it
Update cisco-8000 submodule to v0.2.3

2022-11-23 15:26:40 +08:00
Richard.Yu
5d7a345c09
[SAI-PTF][202012]Fix sai ptf 202012 (#12724)
* fix sai-ptf docker build error

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

* correct the docker image version

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

* update thrift package

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

* fix version upgrade issue in 202012

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

* remove useless file

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2022-11-16 20:32:24 -08:00
zitingguo-ms
c10aa3b826
Add a parameter for libsaithrift to skip error on errno -2 (#12581) (#12617)
Signed-off-by: zitingguo-ms <zitingguo@microsoft.com>
2022-11-07 00:07:37 -08:00
Kebo Liu
db03698ba5
fix DOM support capability issues on QSFP and CMIS cables (#12500)
Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-10-30 23:20:57 -07:00
Kebo Liu
78043b828c
[202012] [Mellanox] Read transceiver EEPROM via sdk sysfs (#12399)
- Why I did it
ethtool is not able to read certain pages (e.g. page 11h) of CMIS cables.
The SDK provides a set of sysfs files that expose the transceiver EEPROM, so we migrate from ethtool to reading these sysfs files for transceiver EEPROM access.

- How I did it
Replace ethtool with reads from the SDK sysfs for cable EEPROM access.
Adjust the offset according to the SDK sysfs memory map.
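
A minimal sketch of reading EEPROM bytes from a binary file at an adjusted offset. The SDK sysfs paths and the actual memory map are not given here, so `path` and `base` are placeholders that the caller supplies, and `read_eeprom` is a hypothetical helper name.

```python
def read_eeprom(path, offset, num_bytes, base=0):
    """Read num_bytes starting at base + offset from a binary EEPROM file.

    base stands in for whatever shift the memory map requires; its value is
    an assumption left to the caller.
    """
    with open(path, "rb") as f:
        f.seek(base + offset)
        data = f.read(num_bytes)
    if len(data) != num_bytes:
        raise IOError("short read: wanted %d bytes, got %d" % (num_bytes, len(data)))
    return bytearray(data)
```

Raising on a short read keeps a truncated buffer from being parsed as if it were a complete EEPROM field.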

- How to verify it
run sonic-mgmt sfp-related regression test case.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-10-30 09:34:39 +02:00
Dror Prital
5de7ae449a
Update SDK/FW to version 4.5.3186/2010_3186 (#12531)
- Why I did it
Update SDK/FW version - 4.5.3186/2010_3186 in order to have the following changes:

New functionality:
1. Added support for 6.5W (Class 8) in ports 49-50, 53-54, 57-58, and 61-62 on SN4600 system

Fix the following issues:
1. On very rare occasions (~1/100K), during an I2C transaction with MMS1V50-WM and MMS1V90-WR modules on the SN4700 system, the module may send an unexpected stop that violates the I2C specification, possibly affecting the link up flow
2. When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed
3. While toggling the cable with ‘sfputil lpmode on/off’, error msg like “ERR pmon#xcvrd: Receive PMPE error event on module 1: status {X} error type {y}” could be received
4. When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted
5. When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch returned an error even if the configuration was identical to the one applied before performing the ISSU
6. While moving from lossless to lossy mode while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized
7. SLL configuration is missing in SDK dump
8. If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero)
9. PCI calibration changes from a static to a dynamic mechanism
10. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event
11. SDK returned error when FEC mode is set on twisted pair, when FEC was set to None

- How I did it
Update pointer for the SDK/FW

- How to verify it
Run regression tests

Signed-off-by: dprital <drorp@nvidia.com>
2022-10-30 09:29:45 +02:00
Marty Y. Lok
80870439af
[armhf][sonic-installer] Fix the sonic-installer install images on armhf platform issue (#12284)
Signed-off-by: mlok <marty.lok@nokia.com>
2022-10-26 05:47:51 +00:00
zitingguo-ms
bafbfb5a26
Pickup fix and make up BRCM SAI version to 4.3.7.1-6 (#12486)
Signed-off-by: zitingguo-ms <zitingguo@microsoft.com>
2022-10-26 09:52:48 +08:00
jhli-cisco
23c274a225
Update cisco-8000 submodule to v0.120 (#12470)
2022-10-25 18:09:16 +08:00
zitingguo-ms
08d1d60ccb
Pick up fixes and make up BRCM SAI version to 4.3.7.1-3 (#12439)
Signed-off-by: zitingguo-ms <zitingguo@microsoft.com>
2022-10-19 12:18:48 +08:00
xumia
2955a8dc72
[202012] Change submodule path from Azure to sonic-net (#12312)
Why I did it
Change the path of sonic submodules that point to "Azure" to point to "sonic-net"

How I did it
Replace "Azure" with "sonic-net" on all relevant paths of sonic submodules
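
A minimal sketch of such a replacement, assuming the submodule paths are GitHub URLs on `url = ...` lines in `.gitmodules`. The `retarget_submodule_urls` helper is a hypothetical name and not the tool used for this change.

```python
import re


def retarget_submodule_urls(gitmodules_text, old_org="Azure", new_org="sonic-net"):
    """Rewrite github.com/<old_org>/ to github.com/<new_org>/ on url lines only."""
    pattern = re.compile(
        r"^(\s*url\s*=\s*https://github\.com/)" + re.escape(old_org) + r"/",
        re.MULTILINE,
    )
    # A function replacement avoids treating new_org as a regex template.
    return pattern.sub(lambda m: m.group(1) + new_org + "/", gitmodules_text)
```

Restricting the match to `url` lines leaves submodule names and local paths untouched even if they happen to contain the old organization name.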
2022-10-13 23:30:37 +08:00
gechiang
1e6d63a412
[202012][BRCMSAI] 4.3.7.1-2 to back out a change that broke 4.3.7.1-1 (#12298)
This is basically the same as the previous PR (#12275), but it backs out a change that was breaking the build. The same info from that PR is copied here.
2022-10-06 21:25:34 -07:00
gechiang
9c9d902ede
[202012]BRCM SAI 4.3.7.1-1 pick up fix CS00012263713 (mirrored packet with extra VLAN Tag) (#12275)
Pick up fix for CS00012263713 (mirrored packet with extra VLAN Tag) BRCM SAI 4.3.7.1-1

Preliminary tests look fine. BGP neighbors were all up with proper routes programmed, and interfaces were all up.
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:

     fib/test_fib.py
     acl/test_acl.py
     arp/test_neighbor_mac_noptf.py
     fdb/test_fdb.py
     decap/test_decap.py
     pc/test_lag_2.py
     pc/test_po_cleanup.py
     pc/test_po_update.py
     everflow/test_everflow_ipv6.py
     everflow/test_everflow_testbed.py
     route/test_default_route.py
     ipfwd/test_dip_sip.py
     copp/test_copp.py
     crm/test_crm.py
2022-10-05 09:40:55 -07:00
Xichen96
a16843a67c
Enable swap for haliburton device. (#11746)
Signed-off-by: Xichen Lin <lukelin0907@gmail.com>
2022-10-03 22:31:00 +00:00
jhli-cisco
109d0e9d3a
Update cisco-8000 submodule to v0.117 (#12211)
2022-09-28 14:56:22 -07:00
zitingguo-ms
95b19bbb46
Pick up fixes and make up BRCM SAI version to 4.3.7.1 (#12069)
Signed-off-by: zitingguo-ms <zitingguo@microsoft.com>
2022-09-22 22:59:51 -07:00
jhli-cisco
daff6dbaaa
Update cisco-8000 submodule to v0.116 (#12135)
2022-09-22 13:38:13 +08:00
jhli-cisco
d8c405bf89
Update cisco-8000 submodule to v0.112 (#11983)
2022-09-08 09:11:29 +08:00
Aravind Mani
37d34ddd34
DellEMC Z9332f: Fix SFP issue (#11819)
* Update sfp

* Update sfp

* Update sfp.py
2022-09-07 09:35:59 -07:00
Dror Prital
edc4485d30
[202012][Mellanox] Update SDK/FW to version 4.5.2320/2010_2320 (#11975)
Update SDK/FW version - 4.5.2320/2010_2320 in order to have the following fixes:
• Spectrum-3 | PCI calibration changes from a static to a dynamic mechanism.
• [VxLAN] TTL was set to 0 for non IP traffic (such as ARP)
2022-09-07 08:33:18 +03:00
Arun Saravanan Balachandran
c1712b8c9a
[202012] DellEMC: S6000, S6100, Z9332f - Add capabilities fields in platform.json (#11772)
2022-08-31 09:06:47 -07:00
jhli-cisco
62c6fb2eab
Update cisco-8000 submodule to v0.111 (#11835)
Update cisco-8000 submodule to v0.111 drop
2022-08-26 08:14:54 +08:00
zitingguo-ms
5b5bd5e818
[202012 BRCM SAI 4.3.7.0] Pick up fixes and make up BRCM SAI version to 4.3.7.0 (#11681)
Pick up the following fixes and update BRCM SAI to 4.3.7.0:

CS00012208537: Add back previous commit 54c5bc4848eb748
CS00012253061,SONIC-63280: WB from 3.5 to 4.3, followed by WB to 4.3
CS00012207978: SDK-296517, time spent for SAI operations
CS00012245601,SONIC-62898: Egress ACL Counted as Interface TX drops
Update pcbb with Fixes for CS00012243699
Upgrade on pcbb with Fixes for KB0025353, CS00012221689, CS00012221688, KB0025391, CS00012230519
commit of "CS00012221688:PFC frames egressing, PFC storm happens simultaneously on 2 ports" is purposely skipped to be picked up later due to SWSS dependency not ready.
How to verify it
Tested build target, successful

Manually run these tests after installing sai binary within image 20201231.73 on 7050CX3 (TD3) T0 DUT, all passed.

vxlan/test_vxlan_decap.py
fdb/test_fdb.py
pfcwd/test_pfcwd_all_port_storm.py
acl/null_route/test_null_route_helper.py
acl/test_acl.py
vlan/test_vlan.py
platform_tests/test_reboot.py


Signed-off-by: zitingguo-ms <zitingguo@microsoft.com>
2022-08-10 15:02:47 -07:00
Dror Prital
db37325f76
[202012][Mellanox] Update SAI version to 1.22.0.0 and SDK/FW to version 4.5.2318/2010_2318 (#11534)
- Why I did it
Update SAI version - 1.22.0.0
Update SDK/FW version - 4.5.2318/2010_2318

SAI Changes:
1. Port FEC fix for multiple speeds
2. Next hop group optimized bulk API
3. Support BFD remote-disc exchange in negotiation stage
4. Reduce verbosity of shared database already exists print

SDK/FW Fixes:
1. Cr space timeout on Hold and Release GW - at warmboot
2. SPC-1 Port in stuck PHY_UP after peer side rebooted
3. memory leak in sx_api_router_ecmp_update_set

- How I did it
Update pointer for the new SAI and SDK/FW

- How to verify it
Run regression tests
2022-07-26 21:01:36 +03:00
jhli-cisco
66d49231cf
Update cisco-8000.ini (#11522)
update cisco-8000 platform version to 202012-v0.107
2022-07-24 11:43:07 +08:00
VenkatCisco
e2042e2ad6
update cisco-8000 platform version to v106 (#11504)
2022-07-21 08:31:50 -07:00
Kebo Liu
c60bf90590
[202012] [Mellanox] Update hw-mgmt package to V.7.0010.2349 (#11421)
- Why I did it
New changes in this new HW-MGMT package:

1. hw-mgmt: chassis events: Fix voltmon address conflict on connecting
2. hw-mgmt: topology: Add COMEX BRDWL respin support
  a. Removed A2D sensor from all COMEX BRDWL boards
  b. Add COMEX BRDWL boards with register defined (config3)

- How I did it
Advance the hw-mgmt repo pointer and update the hw-mgmt version number

- How to verify it
Run platform-related regression test cases on the new testbed.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-07-20 09:00:17 +03:00
Zhijian Li
24b90d7556
[cherry-pick][202012] Fix issue where HLX module failed to do postinit (#11351)
* [HLX] Fix issue where HLX module failed to do postinit (#7274)

Signed-off-by: Jing Kan jika@microsoft.com
2022-07-06 17:27:29 +08:00
Alexander Allen
851bd9bff8
[Mellanox] Add arch folder to SDK binary location (#11278)
- Why I did it
This is for the eventual support of multiple architectures for the mellanox platform.

- How I did it
Change the location of the binaries in Switch-SDK-drivers so that the path specifies the target architecture in addition to the target distribution that the debians are built for.

This is the most straightforward way to separate binaries built against different architectures and selectively target them for installation in the mellanox SONiC image.

- How to verify it
Build SONiC for mellanox and verify it compiles successfully.
2022-07-05 20:58:01 +00:00
Santhosh Kumar T
7a7c363548
[DellEMC] S6100 Platform Service optimization (#10989)
Why I did it
- To reduce rc.local script execution time.
- Time consumption of rc.local script is around 22 seconds in S6100.
How I did it
- Run platform-modules-s6100.service and s6100-lpc-monitor.service asynchronously to the rc.local script.
How to verify it
- Load the image with the changes and verify that the rc.local script execution time is reduced from approx. 22 seconds to approx. 14 seconds during warm-/fast-reboot upgrades.
- sonic-mgmt test results.
2022-06-23 12:58:11 -07:00
Nazarii Hnydyn
05ff95fdfc
[Mellanox]: Advance SAI submodule. (#11164)
Fix #3074227 - don't disable used tunnel underlay interfaces
fix bfd - notify Sonic for admin-down event
2022-06-16 18:09:59 -07:00
Jon Goldberg
efdb507795
[installer]: fix armhf for installer.conf usage (#11121)
This fixes the build for armhf to be able to use '/device///installer.conf' files. Specifically, armhf needs support to be able to change the size of /var/log/ directory. It is hardcoded to 512 bytes on all armhf platforms currently. This change will allow any armhf platform to be able to use an installer.conf file to customize the installed image.
2022-06-14 09:02:01 -07:00
Eric Zhu
27cd735082
[SONiC-CEL]: fix platform fancontrol testcase failure issue (#10934)
2022-06-08 01:21:53 +00:00
Kevin Wang
a442391c7d
Update cisco-8000 ref to release: 202012-v0.97 (#11038)
Important fixes since 202012-v0.97:
V0.102:
Hwsku changes to Cisco-8102-C64
Fix for watermark clear issue
V0.101:
Fix for dhcp_relay test issue
V0.100:
Fix for container_autorestart test issue
V0.99:
Fix for everflow test issue
Fix for pfcwd test issue
Fix for copp test issue
V0.98:
Fix for qos_sai test issue
RDMA enhancements dev complete and content included in this drop (flow based VoQ, ECN, Alpha)

Signed-off-by: Kevin Wang <shengkaiwang@microsoft.com>
2022-06-07 08:26:46 +08:00