- Why I did it
There are 3 tasks in xcvrd:
main task, run a loop to recover missing SFP static information to DB every 1 minute
SFP state task, a process which listens cable plug in/out event, insert SFP static information to DB while a cable is inserted
SFP DOM update task, a thread which handles cable DOM information update every 1 minute
Let assume user replaces QSFP with QSFP-DD. There are two issues:
Only SFP state task listens cable plug in/out event, main task and SFP DOM update task does not know SFP type has changed, they still “think” the SFP type is QSFP. So, main task and SFP DOM update task uses QSFP standard to parse QSFP-DD EEPROM which causes corrupted data.
There is a race condition between main task and SFP state task. They both insert SFP static information to DB. Depends on timing, it is possible that main task using wrong SFP type to override SFP static information.
The PR is to fix these two issues.
There is no such issue on 202205 and above because there is a refactor for xcvrd:
SFP state task was changed from process to thread, so that all 3 tasks share the same memory space, they always have correct SFP type.
Recover missing SFP information logical was moved from main task to SFP state task. There is no race condition anymore.
- How I did it
It is difficult to back port latest xcvrd because there are many refactor/new features in xcvrd after 202012 release. It will be huge effort to do so. Based on that, we decided to fix the issue on Nvidia platform API side. The fix is that: refreshing SFP type before any SFP API which accessing SFP EEPROM. Refreshing SFP type before any SFP API would cause a small performance down: Due to my test on 202012 branch, accessing transceiver INFO and DOM INFO for 32 ports takes 1.7 seconds before the change. The number changes to 2.4 seconds after the change. I suppose the performance down is acceptable.
- How to verify it
Manual test
Regression
#### Why I did it
1.57.x SDK based incremental drop that addresses:
Fix for MIGSMSFT-158
Support for VxLAN and BFD Serviceability CLI
sfputil reset platform fix to handle 100G optics
Added thermal management feature for ZR optics sensors
#### How I did it
Update cisco-8000 submodule to v0.2.5
- Why I did it
Added BIOS upgrade infra
- How I did it
Added new make target
- How to verify it
Copy msn3800_bios.tar.gz to platform/mellanox/bios
make configure PLATFORM=mellanox
make target/files/buster/msn3800_bios.tar.gz
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
1.57.x SDK based incremental drop that addresses:
1) orchagent crash
2) Port LED issue
3) Tunnel endpoint stats
4) test_warm_reboot issue
5) nhop test failure
6) "show platform versions' CLI
- Why I did it
Backport #9795
Python select.select accept a optional timeout value in seconds, however, the value passes to it is a value in millisecond.
- How I did it
Transfer the value to millisecond.
- How to verify it
Manual test
#### Why I did it
Backport https://github.com/sonic-net/sonic-buildimage/pull/13246 to 202012 branch.
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.
#### How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.
#### How to verify it
a. Manual test:
> 1. Perform a power loss
> 2. Perform a warm/fast reboot
> 3. check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.
Partial cherry-pick of: [Mellanox] Modified Platform API to support all firmware updates in single boot #9608
- Why I did it
To allow user manual reboot control over ONiE FW upgrade
- How I did it
Added a dedicated script argument handling
- How to verify it
mlnx-onie-fw-update.sh update --no-reboot
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
Why I did it
smartctl tool is available only in PMON docker. Hence, the tool may be not accessible incase PMON docker goes down.
Using iSMART_64 tool to fetch the SSD firmware version and device model information.
How I did it
Replacing smartctl with iSMART_64.
Why I did it
why
In order to apply different config across different platform, and use the code with a unified format, reuse syncd init script to init saiserver.
How I did it
how
Reuse syncd init script
How to verify it
Test
Test in DUT s6000 and dx010 with sonic 202205
Why I did it
1.50.x SDK based drop to fix MIGSMSFT-120 ([8102] Orchagent crash as addRoutePost failed at SAI")
How I did it
Update cisco-8000 submodule to v0.121
- Why I did it
Update SDK/FW version - 4.5.3196/2010_3196 in order to have the following fixes:
1. ON SPC2/3 in some cases, after many ACL region resize will corrupt internal DB that in return will fail future ACLs configuration
2.. Lag Port as Analyzer Port | when removing port from distributer list SDK does not reselect another port for mirroring
3. Due to critical race at initial configuration, SDK RDQ test may test RDQ configured for WJH and fail the test
Add support for new HW SKU of SN4700
- How I did it
Update pointer for the SDK/FW
- How to verify it
Run regression tests
Why I did it
enable sai-ptf logger in sai_adapter to log all the sai api invcations
How I did it
add build parameter to enable the sai-ptf logger when build sai PRC
How to verify it
local build test
test the generated sai_adapter
test with pipeline
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
In order to make the sai update easier, change the URL pattern to a more unified format, which can be update automated latter.
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
Why I did it
1.57.x SDK based incremental drop that addresses a few egress ACL and drop counter failures. Hostname, vtysh, and incorrect queue watermark issue are addressed too.
How I did it
Update cisco-8000 submodule to v0.2.3
How to verify it
Which release branch to backport (provide reason below if selected)
- Why I did it
ethtool is not able to read certain pages(eg. page 11h) of CMIS cables.
SDK provides a set of sysfs to expose the transceiver EEPROM, now we migrate from using ethtool to read these sysfs for transceiver EEPROM reading.
- How I did it
replace ethtool with accessing the SDK sysfs for cable EEPROM reading.
Adjust the offset according to the SDK sysfs memory map.
- How to verify it
run sonic-mgmt sfp-related regression test case.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
Update SDK/FW version - 4.5.3186/2010_3186 in order to have the following changes:
New functionality:
1. Added support for 6.5W (Class 8) in ports 49-50, 53-54, 57-58, and 61-62 on SN4600 system
Fix the following issues:
1. On very rare occasion (~1/100K), during I2C transaction with MMS1V50-WM and MMS1V90-WR modules on SN4700 system, the module may send unexpected stop which violate the I2C specification, possibly affecting the link up flow
2. When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed
3. While toggling the cable with ‘sfputil lpmode on/off’, error msg like “ERR pmon#xcvrd: Receive PMPE error event on module 1: status {X} error type {y}” could be received
4. When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted
5. When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU
6. While moving from lossless to lossy mode while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized
7. SLL configuration is missing in SDK dump
8. If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero)
9. PCI calibration changes from a static to a dynamic mechanism
10. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event
11. SDK returned error when FEC mode is set on twisted pair, when FEC was set to None
- How I did it
Update pointer for the SDK/FW
- How to verify it
Run regression tests
Signed-off-by: dprital <drorp@nvidia.com>
Why I did it
Change the path of sonic submodules that point to "Azure" to point to "sonic-net"
How I did it
Replace "Azure" with "sonic-net" on all relevant paths of sonic submodules
Pick up fix for CS00012263713 (mirrored packet with extra VLAN Tag) BRCM SAI 4.3.7.1-1
Preliminary tests look fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:
fib/test_fib.py
acl/test_acl.py
arp/test_neighbor_mac_noptf.py
fdb/test_fdb.py
decap/test_decap.py
pc/test_lag_2.py
pc/test_po_cleanup.py
pc/test_po_update.py
everflow/test_everflow_ipv6.py
everflow/test_everflow_testbed.py
route/test_default_route.py
ipfwd/test_dip_sip.py
copp/test_copp.py
crm/test_crm.py
Update SDK/FW version - 4.5.2320/2010_2320 in order to have the following fixes:
• Spectrum-3 | PCI calibration changes from a static to a dynamic mechanism.
• [VxLAN] TTL was set to 0 for non IP traffic (such as ARP)
Pick upfollowing fixes and update BRCM SAI to 4.3.7.0:
CS00012208537: Add back previous commit 54c5bc4848eb748
CS00012253061,SONIC-63280: WB from 3.5 to 4.3, followed by WB to 4.3
CS00012207978: SDK-296517, time spent for SAI operations
CS00012245601,SONIC-62898: Egress ACL Counted ad Interface TX drops
Update pcbb with Fixes for CS00012243699
Upgrade on pcbb with Fixes for KB0025353, CS00012221689, CS00012221688, KB0025391, CS00012230519
commit of "CS00012221688:PFC frames egressing, PFC storm happens simultaneously on 2 ports" is purposely skipped to be picked up later due to SWSS dependency not ready.
Why I did it
How I did it
How to verify it
Tested build target, successful
Manually run these tests after installing sai binary within image 20201231.73 on 7050CX3 (TD3) T0 DUT, all passed.
vxlan/test_vxlan_decap.py
fdb/test_fdb.py
pfcwd/test_pfcwd_all_port_storm.py
acl/null_route/test_null_route_helper.py
acl/test_acl.py
vlan/test_vlan.py
platform_tests/test_reboot.py
Signed-off-by: zitingguo-ms <zitingguo@microsoft.com>
- Why I did it
Update SAI version - 1.22.0.0
Update SDK/FW version - 4.5.2318/2010_2318
SAI Changes:
1. Port FEC fix for multiple speeds
2. Next hop group optimized bulk API
3. Support BFD remote-disc exchange in negotiation stage
4. Reduce verbosity of shared database already exists print
SDK/FW Fixes:
1. Cr space timeout on Hold and Release GW - at warmboot
2. SPC-1 Port in stuck PHY_UP after peer side rebooted
3. memory leak in sx_api_router_ecmp_update_set
- How I did it
Update pointer for the new SAI and SDK/FW
- How to verify it
Run regression tests
- Why I did it
New changes in this new HW-MGMT package:
1. hw-mgmt: chassis events: Fix voltmon address conflict on connecting
2. hw-mgmt: topology: Add COMEX BRDWL respin support
a. Removed A2D sensor from all COMEX BRDWL boards
b. Add COMEX BRDWL boards with register defined (config3)
- How I did it
Advance the hw-mgmt repo pointer and update the hw-mgmt version number
- How to verify it
Run platform-related regression test cases on the new testbed.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
This is for the eventual support of multiple architectures for the mellanox platform.
- How I did it
Change the location of the binaries in Switch-SDK-drivers so that the path specifies the target architecture in addition to the target distribution that the debians are built for.
This is the most straightforward way to separate binaries built against different architectures and selectively target them for installation in the mellanox SONiC image.
- How to verify it
Build SONiC for mellanox and verify it compiles successfully.
Why I did it
- To reduce rc.local script execution time.
- Time consumption of rc.local script is around 22 seconds in S6100.
How I did it
- Moving platform-modules-s6100.service and s6100-lpc-monitor.service asynchronous to rc.local script.
How to verify it
- Load the image with the changes and the time consumption of rc.local script reduced from 22 seconds(approx.) to 14 seconds(approx.) during warm-/fast-reboot upgrades.
- sonic-mgmt test results.
This fixes the build for armhf to be able to use '/device///installer.conf' files. Specifically, armhf needs support to be able to change the size of /var/log/ directory. It is hardcoded to 512 bytes on all armhf platforms currently. This change will allow any armhf platform to be able to use an installer.conf file to customize the installed image.
Important fixes since 202012-v0.97:
V0.102:
Hwsku changes to Cisco-8102-C64
Fix for watermark clear issue
V0.101:
Fix for dhcp_relay test issue
V0.100:
Fix for container_autorestart test issue
V0.99:
Fix for everflow test issue
Fix for pfcwd test issue
Fix for copp test issue
V0.98:
Fix for qos_sai test issue
RDMA enhancements dev complete and content included in this drop (flow based VoQ, ECN, Alpha)
Signed-off-by: Kevin Wang <shengkaiwang@microsoft.com>