- Why I did it
In SONiC thermal control algorithm, it compares thermal zone temperature with thermal zone threshold. Previously, a thermal zone with no thermal sensor can still get its threshold. However, a recently driver patch changes this behavior: a thermal zone with no thermal sensor will return 0 for threshold. We need to ignore such thermal zone.
- How I did it
Ignore thermal zones whose temperature is 0.
- How to verify it
Added unit test case and Manual test
- Why I did it
Update MFT to version 4.18.1-16 for bugs fixes and new SN2201 support
- How I did it
Advance to MFT tool version to 4.18.1-16
- How to verify it
Manually tested on all Mellanox platforms (ASIC FW Upgrade, link debug tools, CPLD upgrade, etc.)
ded0344 Return both 'vendor_rev' and 'hardware_rev' keys from get_transceiver_info to support both earlier and later versions of xcvrd
b3442cc Change log_error to log_info when transceiver module is transitioned
3d3a73c Fix problem introduced with new SFP caching paradigm
2915746 Add script to reboot all IMMs
17ad221 Fix the position_in_parent for the psu entity
207e731 No longer force read pages at SFP init time
b387921 Fixed the voltage and power with rounding to 2 digit decimals
Signed-off-by: mlok <marty.lok@nokia.com>
- Why I did it
New version of mellanox platform management code available adding support for new platforms and fixing bugs.
- How I did it
1. Updated the submodule
2. Updated makefile version references
3. Regenerated SONiC patches
- Why I did it
Fix issue: 'sx_port_mapping_t' object has no attribute 'slot_id'. sx_port_mapping_t only has attribute slot.
- How I did it
Change slot_id to slot.
- How to verify it
Manual test
- Why I did it
Python select.select accept a optional timeout value in seconds, however, the value passes to it is a value in millisecond.
- How I did it
Transfer the value to millisecond.
- How to verify it
Manual test
- Why I did it
To include latest SDK fixes:
1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting SN4600C, 100GbE port with CWDM4 module (Gen 3.0), link up time is 30 seconds.
and to include SAI fixes \ changes:
1. Reduce verbosity for resource check vendor data not found
2. Fix metadata validation, check default value on conditions check
3. Add 100MB, 10MB to 2201 system
4. L3 VXLAN overlay ECMP
5. VXLAN srcport API implementation
6. Fix scheduler profile null (default values) when set on sub group scheduler group
7. Fix ACL binding restoration when port leaves a LAG
8. Fix route logic for set next hop/action and reference counter for ECMP overlay
- How I did it
1. Updated SDK/FW submodule and relevant makefiles with the required versions.
2. Update SAI submodule and relevant makefile with the required version.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.
How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmwares in the correct order with prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
Why I did it
Update Broadcom SAI to version 6.0.0.13, SDK 6.5.24, saibcm-modules to 6.5.24.gpl
How I did it
Brcm SAI 6.0 EA with fixes for CS00012203367, CS00012219613, CS00012213974, CS00012218290, CS00012217169, CS00012211718, CS00012213944, CS00012215529, CS00012218100, CS00012214196, CS00012212681, CS00012205138, CS00012208537, CS00012185316, CS00012208524, CS00012203367, CS00012197364.
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.
- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable
- How to verify it
1. Manual test
2. Regression
* fix workdir for seastone2
Signed-off-by: Viktor Ekmark <viktor@ekmark.se>
* seastone2: Add I2C SFP definition for SFP1
Signed-off-by: Christian Svensson <blue@cmd.nu>
* [device/cel_seastone_2] sfputil logic for SFP1
Earlier logic resulted in the name of SFP1 being SFP33 which is not
correct. The cannonical source is seastone2_fpga module and it calls it
SFP1, so ensure the logic does as well.
Signed-off-by: Christian Svensson <blue@cmd.nu>
* [device/cel_seastone_2] sysfs paths for SFP1
Various changes that plumbs the correct port presence and DOM decoding
for the SFP1 port.
Signed-off-by: Christian Svensson <blue@cmd.nu>
Co-authored-by: Christian Svensson <blue@cmd.nu>
Why I did it
Recently additional sensors that were needed only for specific system added to all systems and caused errors.
How I did it
* Include CPU board and switch board sensors only on SN2201 system
* Fix issue in test_chassis_thermal, now it skips non existing thermals.
How to verify it
Run show platform temperature
Signed-off-by: liora <liora@nvidia.com>
- Why I did it
Add new Spectrum-4 system support SN5600 on top of Nvidia ASIC simulator.
- How I did it
Add all relevant system and simulator SKU.
Updated syseeprom.hex and related directories to reflect Nvidia SN5600 brand name.
- How to verify it
Tested init flow, basic show commands, up interfaces, traffic test.
Signed-off-by: Raphael Tryster <raphaelt@nvidia.com>
- Why I did it
Rename platform x86_64-mlnx_msn4800 to x86_64-nvidia_msn4800
- How I did it
Rename platform folder as well as all code that reference the platform name
- How to verify it
Manual test
For broadcom sai, we only need to upgrade the version, not necessary the token part in the url.
Co-authored-by: Ubuntu <xumia@xumia-vm1.jqzc3g5pdlluxln0vevsg3s20h.xx.internal.cloudapp.net>
Depends on #9358
Why I did it
Adjust LED logical according to hw-mgmt change.
How I did it
Add a trigger to set LED to blink.
How to verify it
Manual test
Why I did it
To support iTCO watchdog using watchdog APIs.
How I did it
Implemented a new watchdog class WatchdogTCO for interfacing with iTCO watchdog.
Updated reboot cause determination logic.
How to verify it
Verified that the watchdog APIs' return values are as expected.
Logs: UT_logs.txt
Update Nokia platform sonic-pmon submoduel to the latest with the following commits
c41c823 Fix transceiver module dynamic insertion/removal operations
21a1df6 Fixed pcied process FATAl issue
7fc1fd4 Fix midplane status nokia_cmd
a14ee1c Override get_module() api in chassis
8a457fc SON-326: Watchdog logger changes and file scrubbing
7250eb1 SON-410: Fix missing eeprom access routine
7a70c42 Allow only reboot of self card for OC API test
6ab5d96 Fixed the flake8 compliant issues
807de95 APIs to set thermal threshold to return false
9b38265 SON-382: platform-dump with common techsupport
3f83a67 Add model, base_mac, system_eeprom and serial number support in moduel.py
848d311 SFP: Add get_error_description and fix return status for set_lpmode
1fcb5de PSU check presence of psu instance for APIs
7c68da3 Fixed the eagle and hornet card description
0c01d07 Module support for reboot API
The same docker image is built multiple times after upgrading to bullseye, the build time is increased to about 15 hours from 6 hours.
See log: https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_apis/build/builds/50390/logs/9
Line 1437: 2021-11-11T11:15:02.7094923Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 1446: 2021-11-11T11:37:41.1073304Z [ finished ] [ target/docker-sonic-telemetry.gz ]
Line 1459: 2021-11-11T11:38:20.6293007Z [ building ] [ target/docker-sonic-telemetry.gz-load ]
Line 1462: 2021-11-11T11:38:28.1250201Z [ finished ] [ target/docker-sonic-telemetry.gz-load ]
Line 2906: 2021-11-11T18:57:42.8207365Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 2917: 2021-11-11T19:43:47.1860961Z [ finished ] [ target/docker-sonic-telemetry.gz ]
Line 3997: 2021-11-11T22:49:35.0196252Z [ building ] [ target/docker-sonic-telemetry.gz ]
Line 4002: 2021-11-11T23:14:00.4127728Z [ finished ] [ target/docker-sonic-telemetry.gz ]
How I did it
Place the python wheels in another folder relative to the build distribution.
Co-authored-by: Ubuntu <xumia@xumia-vm1.jqzc3g5pdlluxln0vevsg3s20h.xx.internal.cloudapp.net>
Why I did it
Nvidia platform API does not support set LED to orange
How I did it
Allow user to set LED to orange
How to verify it
Added unit test
Manual test
- Why I did it
Add support for SN2201 platform
- How I did it
Add required content for SN2201 platform
Note: still missing kernel driver support for this system. Once all is upstream will be updated as well.
- How to verify it
Install and basic sanity tests including traffic.
Signed-off-by: liora liora@nvidia.com
Why I did it
Adding platform support for centec v682-48y8c and v682-48x8c.
V682-48y8c switch has 48 SFP+ (1G/10G/25G) ports, 8 QSFP28 (40G/100G) ports on CENTEC TsingMa.MX.
V682-48y8c is different from V682-48y8c_d in that:
transceiver is managed by cpu smbus rather than TsingMa.MX i2c bus.
port led is managed by mcu inside TsingMa.MX.
fan, psu, sensors, leds are managed by cpu smbus other than the cpu board vendor's close sourse driver.
V682-48x8c switch has 48 SFP+ (1G/10G) ports, 8 QSFP28 (40G/100G) ports on CENTEC TsingMa.MX.
CPU used in v682-48y8c and v682-48x8c is Intel(R) Xeon(R) CPU D-1527.
How I did it
Modify related code in platform and device directory.
Upgrade centec sai to v1.9.
upgrade python to python3 and kernel version to 5.0 for V682-48y8c_d.
How to verify it
Build centec amd64 sonic image, verify platform functions (port, sfp, led etc) on centec v682-48y8c and v682-48x8c board.
Co-authored-by: shil <shil@centecnetworks.com>
- Why I did it
To include latest fixes.
SAI
1. Reclaim buffers for port which is admin down
2. Support for Spectrum-4 os Nvidia ASIC simulation
3. Support for SN2201
4. Fix host interface table entry, one channel per trap (fix sflow double registration)
5. 2 new queue counters - ecn marked packets + shared current occupancy
6. Fix storm policer unknown unicast
7. Add key/value for accuflow counters
8. Add MAC move
9. Add mirror congestion mode attribute
SDK
1. Under various circumstances, Ethernet ports falsely showed that InfiniBand cables were connected.
2. In SN4600C, at times, the link up time in both DAC and optics cables may, in the worst case, take up to 15 seconds.
3. Using SN4600C with copper or optics loopback cables in NRZ speeds, link may raise in long link up times
4. When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
5. When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
6. Aggregation event is missing for WJH L2 drop reason 'Unicast egress port list is empty'.
7. Tying the SCL and SDA of the optical modules to 3.3V causes errors.
8. On SN4600, there was a delay of more than 10 seconds from the time a data packet is sent from CPU until it is transmitted through one of the switch ports.
9. While using SN4600C system with Finisar FTLC1157RGPL 100GbE CWDM4 modules, intermittent link flaps across multiple ports may be observed.
10. In Spectrum-2 and Spectrum-3 systems, link did not work in auto-negotiation when connected to Marvell PHY. KR mechanism has been enhanced to integrate with Marvell PHY.
11. The tunnel counter counts the drop packets now for Spectrum-2 and Spectrum-3 and consistent with Spectrum behavior and count the ECN dropped packets as well.
12. When connecting SN3800 to Cisco-9000, fast-linkup flow will fail and will rise in the normal flow.
13. Race condition in WJH library: when multiple threads load the LAG shared memory concurrently, the program may crash.
14. Add WJH L2 drop reason 'Unicast egress port list is empty' as a new drop reason.
15. Fixed a memory leak in sx_api_port_sflow_statistics_get API.
16. During initialization flow, the command interface that is used by the minimal driver and SDK caused the collision in the firmware since the same buffer is used in the firmware for the two interfaces.
17. Fix route issue on Kernel 5.10
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
To have an ability to use PRM sniffer.
- How I did it
Enabled the option in configure flags.
- How to verify it
Built and ran on switch. Enabled the feature in runtime and checked the sniffer recording.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- Why I did it
To fix an issue that hw-mgmt patches were not applied. One patch was already in upstream hw-mgmt package thus applying it again caused an error and no other patches were applied. Also, I did it to improve the Makefile, so that the make will fail in case patches fail to apply.
- How I did it
Removed obsolete patch, made applying patches a hard failure in the build.
- How to verify it
Run the make and verify patches are applied.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
#### Why I did it
Updated hw-mgmt pointer to updated branch and to include new bugfixes. The hw-mgmt submodule was previously pointing to an orphaned commit which could not be fetched from github, this has now been resolved.
#### How I did it
Updated submodule pointer.
#### How to verify it
Clone down repository and update all submodules.
Signed-off-by: Stephen Sun stephens@nvidia.com
Why I did it
Support zero buffer profiles
Add buffer profiles and pool definition for zero buffer profiles
Support applying zero profiles on INACTIVE PORTS
Enable dynamic buffer manager to load zero pools and profiles from a JSON file
Dependency: It depends on Azure/sonic-swss#1910 and submodule advancing PR once the former merged.
How I did it
Add buffer profiles and pool definition for zero buffer profiles
If the buffer model is static:
Apply normal buffer profiles to admin-up ports
Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
Apply normal buffer profiles to all ports
buffer manager will take care when a port is shut down
Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists
Originally, all the macros to generate the above buffer objects took active ports only as an argument
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports need to be added
To be backward compatible, a new series of macros are introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:
The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/temlates folder in compiling time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:
One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbol link to the above files
Update sonic-cfggen test accordingly:
Adjust example output file of JSON template for unit test
Add unit test in for Mellanox's new buffer templates.
How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.
* Add macsec-xpn-support iproute2 in syncd
Signed-off-by: Ze Gan <ganze718@gmail.com>
* Polish code
Signed-off-by: Ze Gan <ganze718@gmail.com>
* Remove useless files
Signed-off-by: Ze Gan <ganze718@gmail.com>
* Add self-compiled iproute2 to docker sonic vs
Signed-off-by: Ze Gan <ganze718@gmail.com>
* Enhance apt install for iproute2 dependencies
Signed-off-by: Ze Gan <ganze718@gmail.com>
Sonic master has moved to use SAI v1.9.1. For consistency, use the latest credo sai v0.7.2 package, built with
SAI header v1.9.1 too.
How I did it
Update credo sai url for v0.7.2
Add a debug tool crshell
- Why I did it
This is to update the common sonic-buildimage infra for reclaiming buffer.
- How I did it
Render zero_profiles.j2 to zero_profiles.json for vendors that support reclaiming buffer
The zero profiles will be referenced in PR [Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles #8768 on Mellanox platforms and there will be test cases to verify the behavior there.
Rendering is done here for passing azure pipeline.
Load zero_profiles.json when the dynamic buffer manager starts
Generate inactive port list to reclaim buffer
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.
- How I did it
When PSU is powered of, don't treat it as absent.
- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
Why I did it
Fix#9059. It provides common gbsyncd.service.j2 to start for platform specific gbsyncd docker, which must be named 'gbsyncd'.
How I did it
All of platform specific gbsyncd dockers use a common name 'gbsyncd'
Use a unique systemd service template gbsyncd.service.j2 for gbsyncd docker
- add `set_write_max` and `get_write_max` apis to Sfp API
- add `get_revision` platform API
- implement `is_midplane_reacheable` API
- add software fallback for setting Xcvr low power mode
- add `get_error_description` API
- add support for Woodleaf SKU
- various fixes/refactors/cleanups
upgrade centec arm64 sai to v1.9.1 and fix syncd compile error:
```
2021-11-21T19:58:09.0294367Z /usr/bin/ld: libSyncd.a(libSyncd_a-VendorSai.o): in function `syncd::VendorSai::queryStatsCapability(unsigned long, _sai_object_type_t, _sai_stat_capability_list_t*)':
2021-11-21T19:58:09.0297367Z ./syncd/VendorSai.cpp:439: undefined reference to `sai_query_stats_capability'
2021-11-21T19:58:09.0298900Z collect2: error: ld returned 1 exit status
```
Co-authored-by: shil <shil@centecnetworks.com>
- Why I did it
Support PSU voltage high/low thresholds and power max threshold
1. Add thresholds support for voltage and power.
2. As thresholds are not supported on all platforms, we need to check the capability first and fetch thresholds only if it is supported.
- How I did it
- How to verify it
Run regression test and manual test.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Why I did it
Fix#9059. It provides common gbsyncd.service.j2 to start for platform specific gbsyncd docker, which must be named 'gbsyncd'.
How I did it
All of platform specific gbsyncd dockers use a common name 'gbsyncd'
Use a unique systemd service template gbsyncd.service.j2 for gbsyncd docker
Signed-off-by: pettershao-ragilenetworks pettershao@ragilenetworks.com
What I did it
Add new platform x86_64-ragile_ra-b6510-32c-r0 (Trident 3)
ASIC Vendor: Broadcom
Switch ASIC: Trident 3
Port Config: 32x100G
Add new platform x86_64-ragile_ra-b6920-4s-r0 (Tomahawk 3)
ASIC Vendor: Broadcom
Switch ASIC: Tomahawk 3
Port Config: 128x100G
-How I did it
Provide device and platform related files.
-How to verify it
show platform fan
show platform ssdhealth
show platform psustatus
show platform summary
show platform syseeprom
show platform temperature
show interface status