Why I did it
sonic-mgmt test failure is seen for update_firmware component API
Microsoft ADO: 25208748
How I did it
Edited API 2.0 to fix this issue.
How to verify it
Run sonic-mgmt test after the fix and verify it passes.
This feature was meant to be enabled but was accidentally left disabled.
Also downgrades the select/deselect messages to KERN_INFO to reduce log
spam.
Fixes#14546.
Signed-off-by: Christian Svensson <blue@cmd.nu>
- Why I did it
To support the building of ARM-based docker-sonic-vs.gz
- How I did it
Fixed SYNCD_VS build rule to be architecture-specific.
- How to verify it
make configure PLATFORM=vs PLATFORM_ARCH=arm64
make target/docker-sonic-vs.gz
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
- Why I did it
SAI bug Fixes
1. When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 will remain programmed for the rule Issue resolved after the fix
2. Allow the max scale of virtual routers to be configure for SPC-1, SPC-2, SPC-3 which is 255 when fastboot enable and 511 when fastboot disable
3. Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE
SAI features
1. Port init profile
2. Dual ToR Active-Standby | Additional MAC support
SDK/FW bug fixes
1. When preforming fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.
- How I did it
Update SAI version to SAIBuild2211.25.1.4
Update SDK/FW version to 4.6.1062/2012.1062
- Why I did it
1. Update Mellanox HW-MGMT package to newer version V.7.0030.1011
2. Replace the SONiC PMON Thermal control algorithm with the one inside the HW-MGMT package on all Nvidia platforms
3. Support Spectrum-4 systems
- How I did it
1. Update the HW-MGMT package version number and submodule pointer
2. Remove the thermal control algorithm implementation from Mellanox platform API
3. Revise the patch to HW-MGMT package which will disable HW-MGMT from running on SIMX
4. Update the downstream kernel patch list
Signed-off-by: Kebo Liu <kebol@nvidia.com>
### Why I did it
1. Enhance the diagnosis information collecting mechanism
- If the option `-v` is fed, it will pass additional diagnosis flags to mlxfwmanager
- Collect all the output from mlxfwmanager and print them to syslog if it fails
2. Abort syncd in case waiting for device or upgrading firmware fails
Signed-off-by: Stephen Sun <stephens@nvidia.com>
### How I did it
#### How to verify it
Regression and manual test
- Why I did it
Because the Spectrum4 devices don't support mlxtrace utility.
- How I did it
Edit sai.profile and remove mlxtrace_spectrum4_itrace_*.cfg.ext files
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
Why I did it
To avoid errors when the sfputil show error-status -hw is called from the host OS (not from the pmon docker).
How I did it
Remove the self.sdk_handle parameter from the _get_module_info() function.
How to verify it
Execute the sfputil show error-status -hw
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
Why I did it
Support Intel Tofino based platforms Netberg Aurora 750
ASIC: Intel Tofino BFN-T10-064Q
Pors: 64x 100G
How I did it
Added specification to device/netberg directory
Added platform/barefoot/sonic-platform-modules-netberg contains kernel modules, scripts and sonic_platform packages.
Modified the platform/barefoot/platform-modules-netberg.mk to include Aurora 750 related ID.
Signed-off-by: Andrew Sapronov <andrew.sapronov@gmail.com>
What I did it
Add new platform arm64-ragile_ra-b6010-48gt4x-r0 (Centec)
ASIC Vendor: Centec
Switch ASIC: Centec
Port Config: 48x1G+4x10G
Why I did it
Add new platform RA-B6010-48GT4X
How I did it
Add new platform RA-B6010-48GT4X
Signed-off-by: pettershao-ragilenetworks <pettershao@ragilenetworks.com>
Ignore intermittent IO errors during get_change_event in the Platform API
Fix tunings for some ports on CatalinaDD
Fix kernel module build for 6.1 kernel in preparation of bookworm upgrade
Why I did it
System health config is missing in few Dell platforms.
How I did it
Added system health monitoring config and its related API's
How to verify it
show system-health summary/detail commands.
- Why I did it
watchdogutil uses platform API watchdog instance to control/query watchdog status. In Nvidia watchdog status, it caches "armed" status in a object member "WatchdogImplBase.armed". This is not working for CLI infrastructure because each CLI will create a new watchdog instance, the status cached in previous instance will totally lose. Consider following commands:
admin@sonic:~$ sudo watchdogutil arm -s 100 =====> watchdog instance1, armed=True
Watchdog armed for 100 seconds
admin@sonic:~$ sudo watchdogutil status ======> watchdog instance2, armed=False
Status: Unarmed
admin@sonic:~$ sudo watchdogutil disarm =======> watchdog instance3, armed=False
Failed to disarm Watchdog
- How I did it
Use sysfs to query watchdog status
- How to verify it
Manual test
Unit test
- Why I did it
Change Mellanox platform API implementation to use ASIC driver sysfs for the module operational state and status error fields.
- How I did it
Modify the platform/mellanox/mlnx-platform-api/sonic_platform/sfp.py file by change the call of sx_mgmt_phy_module_info_get() SDK API to sysfs
- How to verify it
Simulate the unplug cable event
Check the CLI output
sfputil show presence
sfputil show error-status -hw
Simulate the plug cable event
Repeat 2 step
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
SONiC changes:
1. Support Spectrum4 ASIC FW binary building.
2. Support new SDK sx-obj-desc lib building since new SAI need it.
3. Remove SX_SCEW debian package from Mellanox SDK build since we are no longer using it (we use libxml2 instead).
4. Update SAI, SDK, FW to version 4.6.1020/2012.1020/SAIBuild2305.25.0.3
SDK/FW bug fixes
1. In SPC-1 platforms: Fastboot mode is not operational for Split port with Force mode in 50G speed
SFP modules are kept in disabled state after set LPM (low power mode) on/off for at least 3 minutes.
2. When preforming fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.
SDK/FW Features
1. On SN2700 all ports can support y cable by credo
SAI bug Fixes
1. When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 will remain programmed for the rule Issue resolved after the fix
2. Allow the max scale of virtual routers to be configure for SPC-1, SPC-2, SPC-3 when fastboot enable
3. Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE
SAI features
1. Port init profile
- How I did it
Update SDK/FW/SAI make files
- How to verify it
Run full sonic-mgmt regression on Mellanox platform
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
Update Mellanox MFT tool to version 4.25.0-62
- How I did it
Update the MFT tool make file
- How to verify it
Run full sonic-mgmt regression.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
Add PDDF support on following Ufispace platforms with Broadcom ASIC
S9110-32X
S8901-54XC
S7801-54XS
S6301-56ST
How I did it
Add PDDF configuration files, scripts and python files
How to verify it
Run pddf commands and show commands.
Signed-off-by: nonodark <ef67891@yahoo.com.tw>
* Update sairedis submodule
This submodule update needs to be manually done due to build changes
done in the sairedis submodule. Specifically, Debian build profiles are
now being used instead of dpkg build targets, and dbgsym packages are
being used instead of dbg packages. Because of this, there needs to be
changes on the sonic-buildimage side for this.
This is a reland of #15720, which was reverted in #15995 due to the RPC
package build failing. That failure has since been fixed, and the
PR pipeline has been updated to build the RPC package so that this is
checked at the PR stage.
This submodule update brings in the following changes:
```
4dbdb21 Fix RPC package build failure due to shell syntax issue (#1268)
588d596 Make sure new binaries replace existing binaries in docker-sonic-vs (#1269)
ce8f642 [vs] Use boost join to concatenate switch types in config (#1266)
d6055a2 [vslib]: Temporaily map DPU switch type to NVDA_MBF2H536C (#1259)
e1cdb4d [CodeQL]: Use dependencies with relevant versions in azp template. (#1262)
c08f9a2 [CI]: Fix collect log error in azp template. (#1260)
eed856c [CodeQL]: Fix syncd compilation in azp template. (#1261)
a3f1f1a Reland 'Make changes to building and packaging sairedis (#1116)' (#1194)
```
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Update sairedis submodule with the fix for the RPC package build
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
---------
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
Increase UT coverage for Nvidia platform API code
Work item tracking
Microsoft ADO (number only):
- How I did it
Focus on low coverage file:
1. component.py
2. watchdog.py
3. pcie.py
- How to verify it
Run the unit test, the coverage has been changed from 70% to 90%
- Why I did it
Added the fwtrace config files in order to be able to call the mlxstrace utility during the show techsupport dump.
Work item tracking
Microsoft ADO (number only):
- How I did it
Added fwtrace config files. Added path to these files to sai.profile for each mlnx device.
- How to verify it
Execute the show techsupport command and check if mlxstrace output is in system dump.
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
This reverts commit e0927e28af.
Why I did it
Reverts #15720
It breaks build for target/debs/bullseye/syncd_1.0.0_amd64.deb
make[2]: Entering directory '/sonic/src/sonic-sairedis'
dh_install
# Note: escape with an extra symbol
if [ -f debian/syncd-rpc/usr/bin/syncd_init_common.sh ] ; then
/bin/sh: 1: Syntax error: end of file unexpected (expecting "fi")
make[2]: *** [debian/rules:65: override_dh_install] Error 2
make[2]: Leaving directory '/sonic/src/sonic-sairedis'
make[1]: *** [debian/rules:51: binary] Error 2
make[1]: Leaving directory '/sonic/src/sonic-sairedis'
dpkg-buildpackage: error: fakeroot debian/rules binary subprocess returned exit status 2
Work item tracking
Microsoft ADO (number only): 24691535
How I did it
How to verify it
Why I did it
Updating the iSMART_64 tool for supporting latest debian releases.
How I did it
On branch new_ismart
Changes to be committed:
(use "git restore --staged ..." to unstage)
modified: platform/broadcom/sonic-platform-modules-dell/s6100/scripts/iSMART_64
How to verify it
In s6100, run the iSMART_64 tool.
md5sum - 24725730d7649769c7ba50971c1f2955
Midstone platform has compilation error in master branch, fixed the same.
How I did it
Due to bullseye migration i2c_new_dummy API is deprecated modified with i2c_new_dummy_device.
How to verify it
Verified target/debs/bullseye/platform-modules-midstone-200i_0.2.2_amd64.deb is generated
Co-authored-by: Kannan Selvaraj <skannan@celestica.com>
Why I did it
[E1031] fix pca9548 initializes failed occasionally in stress test.
When failure happened, ismt i2c bus hang up and need power cycle to
recover it.
How I did it
Add 0.5s delay between setuping and configuring pca9548 i2c mux.
How to verify it
Reboot stress test at least 100 times without failure.
This submodule update needs to be manually done due to build changes
done in the sairedis submodule. Specifically, Debian build profiles are
now being used instead of dpkg build targets, and dbgsym packages are
being used instead of dbg packages. Because of this, there needs to be
changes on the sonic-buildimage side for this.
This submodule update brings in the following changes:
ce8f642 [vs] Use boost join to concatenate switch types in config (#1266)
d6055a2 [vslib]: Temporaily map DPU switch type to NVDA_MBF2H536C (#1259)
e1cdb4d [CodeQL]: Use dependencies with relevant versions in azp template. (#1262)
c08f9a2 [CI]: Fix collect log error in azp template. (#1260)
eed856c [CodeQL]: Fix syncd compilation in azp template. (#1261)
a3f1f1a Reland 'Make changes to building and packaging sairedis (#1116)' (#1194)
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
To fixsonic-net/sonic-mgmt#8786
How I did it
Modified Fan API to check whether the data retrieved is valid or not and return accordingly
How to verify it
Verify whether API 2.0 is loaded properly or not.
Execute CLI's like "show version", "show interface status", "show platform psustatus" etc..
= Why I did it
To optimize Mellanox platform SAI build
- How I did it
SAI debs are now downloaded as Spectrum-SDK-Drivers-SONiC-Bins release.
- How to verify it
Configure/build for Mellanox platform, check the image and ensure that correct SAI debs are included.
- Why I did it
The reset cause "reset_from_comex" has been removed by hw-management, hence removing it from platform API code
- How I did it
Remove reset_from_comex from reboot cause mapping
- How to verify it
Manual test
- Why I did it
BIOS on new generation switch can come with a file type of cap or cab. Needs to add support to these file type.
Also ONIE version on new devices can have a suffix of 'dev'.
- How I did it
Added cap & cab as possible component extensions for ComponentBIOS.
Update the ONIE version regex to include dev signed versions.
- How to verify it
Update BIOS.
Why I did it
port_config.ini and hwsku.json are needed to generate the default config
switch_type needs to be "dpu" to spawn the right set of processes during dvs initialization and to make sure that DASH APIs can be handled properly
Work item tracking
Microsoft ADO 24375371:
How I did it
Use the same hwsku.json and port_config.ini for DPU-2P as the ones used for Nvidia-MBF2H536C SKU in nvidia-sonic sonic-buildimage repo.
Set switch_type to "dpu" in DEVICE_METADATA configuration to make sure DASH specific APIs are handled properly
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
* [202012][platform/barefoot] (#8543)
Why I did it
Pcied running by python 2.
How I did it
dropped python2 support and add python3 support for pcied in file docker-pmon.supervisord.conf.j2
How to verify it
docker exec pmon supervisorctl status
* [Netberg][nba710] Added initial support for Aurora 710
Signed-off-by: Andrew Sapronov <andrew.sapronov@gmail.com>
---------
Signed-off-by: Andrew Sapronov <andrew.sapronov@gmail.com>
Co-authored-by: Kostiantyn Yarovyi <kostiantynx.yarovyi@intel.com>
Why I did it
To support dynamic swapping of module types/speeds (400G/100G/40G)
To optimize CMIS ZR optics operation
How I did it
Reinitialize xcvr_api at module removal/insertion time, and also optimize cache for ZR optics.
How to verify it
Verify that different (supported) module types can be dynamically swapped (removed/inserted) and that each is properly provisioned by Xcvrd and has its EEPROM information accurately reported in Redis DB (using "show transceiver eeprom") as well as "sfputil show eeprom" direct access.
Also verify that Xcvrd initialization and operation with 400G CMIS ZR optics is both efficient and functional.
** edit 6/14/23: pushed enhanced caching (full memory map) support and elimination of base class APIs override.
Why I did it
Sharing the storage of syncd with other proprietary application extensions allows them to communicate with syncd in differnt ways.
If one container wants to pass some information to syncd then shared storage can be used. However, today the shared storage isn't cleaned on restarts making it possible for syncd to read out-of-date information generated in the past.
NOTE: No plans to use it for standard SONIC dockers and we are working on removing the SDK dependency from PMON docker
How I did it
Implemented new service to clean the shared storage.
How to verify it
Do reboot/fast-reboot/warm-reboot/config-reload/systemctl restart swss and verify /tmp/ is cleaned after each restart in syncd container.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Define a generic 2-port NPU SKU for docker-sonic-vs to
enable DASH vstests to pass on azure pipelines
Work item tracking
Microsoft ADO 24375371:
How I did it
Define a generic 2-port NPU hwsku that is used only for DASH-specific vstests.
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
- Why I did it
SDK patches for iproute2 were added to SONiC tree as a temporary solution.
Now that SDK with the patches is available, I have removed the patches from SONiC tree and we consume them from SDK github during compilation.
- How I did it
During build we download SDK iproute2 patches from SDK github (or from the URL provided by user if compiling SDK from sources) and apply them before compilation.
- How to verify it
Compile and load on switch, verify interfaces network devices created successfully.
Verify LLDP shows connections to neighbors.
Verify ping between 2 hosts over 2 router ports is successful.
- Why I did it
Adjust the warning threshold implementation according to the latest algorithm update
- How I did it
Modify power warning and critical thresholds methods
- How to verify it
Unit test updated to cover the change
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Add the patchwork link to the commit description for non-upstream patches if present
- How I did it
Parse the patchwork/<patch_name>.txt file from hw-mgmt
Why I did it
FPGA driver crash was observed in Dell FPGA based platforms.
How I did it
Fixed FPGA crash
How to verify it
Load FPGA driver and check whether the kernel crashes.
Why I did it
When git clone -b xxx command is used the versions-git will reset the HEAD of the git to the commit ID in the versions-git file. Which causes incorrect commit to be checked out causing build errors.
Work item tracking
Microsoft ADO (number only):
How I did it
Split ‘git clone -b’ into two steps to avoid owerwrite
Git clone
cd mrvl-prestera; git checkout ; cd ..
How to verify it
Build marvell-arm64 target using below instructions
make init
make configure PLATFORM=marvell-arm64 PLATFORM_ARCH=arm64
make target/sonic-marvell-arm64.bin SONIC_BUILD_JOBS=2
- Why I did it
Bug fix:
- * I2C bus is stuck - Unable to probe I2C bus 2-0048, which causes /var/run/hw-management/config/sfp_counter, module_counter to be zero and pmon docker unable to start.
- How I did it
Update HW-MGMT package version in the make file
Update HW-MGMT submodule pointer
-How to verify it
Run full sonic-mgmt regression
Signed-off-by: Kebo Liu <kebol@nvidia.com>
#### Why I did it
Facilitate Automatic integration of sdk kernel patches into SONiC.
**Inputs to the Script:**
1) `MLNX_SDK_VERSION` Eg: `4.5.4206`
2) `MLNX_SDK_ISSU_VERSION` Eg: `101`
**Note: If nothing is provided the one already present in the sdk.mk file is used**
3) `MLNX_SDK_SOURCE_BASE_URL:`
**Note: If nothing is provided the upstream sdk drivers url is used**
4) `CREATE_BRANCH: (y|n)` Creates a branch instead of a commit (optional, default: n)
5) `BRANCH_SONIC`: Only relevant when CREATE_BRANCH is y. `Default: master`.
Note: These should be provided through `SONIC_OVERRIDE_BUILD_VARS ` parameter
**Output:**
1) Script creates a commit in sonic-linux-kernel with any updates to sdk-kernel patches in sonic in accordance with the version provided by `MLNX_SDK_VERSION`
**Note: Script Doesn't commit anything to linux-kernel when there aren't any changes required..**
#### How I did it
1) Added a new make target which can be invoked by calling `make integrate-mlnx-sdk`
```
user@server:/sonic-buildimage/src/sonic-linux-kernel$ git rev-parse --abbrev-ref HEAD
master_6f38dca_integrate_4.5.4206
user@server:/sonic-buildimage/src/sonic-linux-kernel$ git log --oneline -n 1
d64d1e7 (HEAD -> master_6f38dca_integrate_4.5.4206) Intgerate MLNX SDK 4.5.4206 Kernel Patches
```
Changes made will be summarized under `sonic-buildimage/integrate-mlnx-sdk_user.out` file. Debugging and troubleshooting output is written to `sonic-buildimage/integrate-mlnx-sdk.log` files
[log_files.zip](https://github.com/sonic-net/sonic-buildimage/files/11226441/log_files.zip)
#### Limitations:
1) Assumes that the sdk kernel patches are always upstreamed
#### How to verify it
Build the Kernel and test
Why I did it
Align with SAI headers v1.12.0
Work item tracking
Microsoft ADO (number only):
How I did it
Update Mellanox SAI submodule
How to verify it
Compile SONiC image
- Why I did it
The current implementation of SFP reset, LPM, present relies on SDK API. This PR moves the implementation to SDK sysfs. By this PR, it gains following benefit:
1. SDK sysfs provides better performance.
2. Host side and container side share the same code.
3. Code is much cleaner.
- How I did it
Use SDK sysfs to implement SFP reset, LPM, present.
- How to verify it
1. Manual test.
2. Unit test.
SDK/FW Fixed Issues:
• When a system has more than 256 ACL entries, on rare occasion, removing/adding entries may cause some ACL entries not to work.
• When using mirror session policer on spectrum-2, spectrum-3, the actual CIR was 1.28 times more than the configured CIR value
• After warm boot process, when enabling ECN marking and the port is in split mode, traffic sent to the port under congestion (for example, when connecting two ports with a total speed of 50GbE to a single 25GbE port) is not marked.
• Warm boot might fail if the key value SAI_KEY_ACCUMULATED_FLOW_COUNTER_UNITS_IN_KB is set
• If counters are bound to an next hop group, there is a probability the next API calls that modify the next-hop group members will fail.
• In Spectrum platforms Fastboot mode is not operational for Split port with Force mode in 50G speed
• When fine grain next hop group has a size of 2K or 4K members, and group is removed, FW will remove only (size % 2048) members, resulting in leakage of KVD resources
• When reading some port statistics, or bulk reading some Queue or PG statistics, and in parallel reading or writing other counters, FW may, in rare cases, get stuck
• SN2201 Module 1 is considered to be present/linked while no cable/module is plugged
• On Spectrum-3 when port configure to 400G FW might stuck after running mlxlink while 400G interface connected and swap between upper and lower 4 lanes
SAI New features:
• ACL: Added support for an ACL match on the AETH field (SAI_ACL_TABLE_ATTR_FIELD_AETH_SYNDROME, SAI_ACL_ENTRY_ATTR_FIELD_AETH_SYNDROME) to count RoCE NAK and CNP packets.
• PLL Status: Added a new logging entry that alerts the user upon a PLL lock loss event.
• Dual ToR - Additional MAC Address: Added support for setting a MAC address for the router interface which is not part of the 10 bit MAC address available for RIFs on Spectrum-1, as part of the Dual ToR scenario.
• Dual ToR: DSCP Remapping Added support for tunnel QoS maps as part of the Dual TOR scenario.
SAI Fixed issues:
• When setting a WRED profile attribute for a color that was not enabled during the profile create time, an error would be returned. After the fix, a default profile is create on such scenario and the set attribute is applied on top of it
• When calling the flush FDB by using the SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID attribute, the bridge bv_id value was filled on the notification callback where it should have been left empty.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Add new Nokia build target and establish an arm64 build:
Platform: arm64-nokia_ixs7215_52xb-r0
HwSKU: Nokia-7215-A1
ASIC: marvell
Port Config: 48x1G + 4x10G
How I did it
- Change make files for saiserver and syncd to use Bulleseye kernel
- Change Marvell SAI version to 1.11.0-1
- Add Prestera make files to build kernel, Flattened Device Tree blob and ramdisk for arm64 platforms
- Provide device and platform related files for new platform support (arm64-nokia_ixs7215_52xb-r0).
The S3IP (Simplified Switch System INtegration Program) sysfs specification defines a unified interface to access peripheral hardware on devices from different vendors, making it easier for SONiC to support different devices and platforms.
PDDF is a framework to simplify the driver and SONiC platform APIs development for new platforms. This effort is first step in combining the two frameworks.
This specific PR adds S3IP supported sysfs attribute in common FAN driver of PDDF.
The S3IP (Simplified Switch System INtegration Program) sysfs specification defines a unified interface to access peripheral hardware on devices from different vendors, making it easier for SONiC to support different devices and platforms.
PDDF is a framework to simplify the driver and SONiC platform APIs development for new platforms. This effort is first step in combining the two frameworks.
This specific PR adds the S3IP supported sysfs attributes in PDDF common LED driver.
Why I did it
The S3IP (Simplified Switch System INtegration Program) sysfs specification defines a unified interface to access peripheral hardware on devices from different vendors, making it easier for SONiC to support different devices and platforms.
PDDF is a framework to simplify the driver and SONiC platform APIs development for new platforms. This effort is first step in combining the two frameworks.
This specific PR adds support for pddf-s3ip-init.service and enables it in PDDF.
Why I did it
ptf_nn_agent failed to start in dnx rpc syncd because module afpacket was not installed.
Please see issue sonic-net/sonic-mgmt#7822
How I did it
Add downloading ptf afpacket module in docker file.
How to verify it
Verified that ptf_nn_agent was started successfully in dnx rpc syncd with the change.
- SAI-1.11.0 support
- SONIC 20220531.25 OC Failure: Everflow testcases failing due to SAI orchagent crash
- SONIC 20220531.25 OC Failure: ACL IPv6 testcases.
- TPID support
Signed-off-by: rajkumar38 <rpennadamram@marvell.com>
- Why I did it
As a LED indicator to help user to find switch location in the lab, UID LED is a useful LED in Mellanox switch.
- How I did it
I add a new member _led_uid in Mellanox/Chassis.py, and extend Mellanox/led.py to support blue color.
Relevant platform-common PR sonic-net/sonic-platform-common#369
- How to verify it
Add unit test cases in test.py, and do manual test including turn-on/off/show uid led.
Signed-off-by: David Xia <daxia@nvidia.com>
Fix lpmode on 7060DX5-32
Fix psu led issue on 7060DX5-64
Use sonic_xcvr lpmode if platform does not support hw lpmode
Add chassis cooling algorithm
Change cooling algorithm default interval to 10s
Force filesystem sync on linecard reboot
- Why I did it
Add the commit-id patch map in the commit message.
- How I did it
By parsing the patch DB from hw-mgmt
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
- Why I did it
Currently, LED sysfs path is hardcoded. We will need change LED code if new LED color is supported for new platforms. This PR is aimed to improve this. By this PR, LED sysfs path is deduced from LED capability file.
- How I did it
Improve LED management on Nvidia platform:
get LED capability from capability file and deduce sysfs name according to the capability
- How to verify it
Unit test
Manual test
- Why I did it
To improve ASAN backtrace output when the call stack contains a code that is not compiled with -fno-omit-frame-pointer.
- How I did it
Added fast_unwind_on_malloc=0 to the ASAN_OPTIONS
- How to verify it
Build and test docker-syncd-mlnx.gz with ENABLE_ASAN=y
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Using timer-override.conf, we modify the fstrim.timer service.
For armhf, Nokia-7215 platform, we modify fstrim.timer to run daily
instead of weekly. This is required because the size of the SSD on
this platform is 16GB, which on average is nearly 10 times smaller than
most other sonic platforms. With smaller disk and the ever increasing
level of logging done by sonic, this change is required to prevent
the SSD from entering a read-only state due to inadequate free blocks.
- Fix watchdog reboot cause for wolverine linecard
- Fix PSU fan speed of 0% by adding max RPM to most psu descriptions
- Add product DCS-7060DX5-64
- Add product DCS-7060DX5-32
- Why I did it
Add NVIDIA Copyright header to NVIDIA files added lately
- How I did it
Add NVIDIA Copyright header for the relevant files
- How to verify it
N/A (only commented text was added).
Normally doesn't need to measure i2c calls.
Also switched to use timespec64_sub() to ensure time delta normalized
Co-authored-by: Kostiantyn Yarovyi <kostiantynx.yarovyi@intel.com>
- Why I did it
Mellanox syncd container will be based on Debian iproute2 plus patches instead of Nvidia internal version of iproute2
- How I did it
Download iproute2 from Debian repository, apply patches and compile to create a new target.
The target is then deployed in syncd container of Mellanox switches only.
The new target is called IPROUTE2_MLNX.
- How to verify it
Compile and load on switch, verify interfaces network devices created successfully.
Verify LLDP shows connections to neighbors.
Verify ping between 2 hosts over 2 router ports is successful.
Why I did it
Update sonic-platform submodule for Nokia-7250IXRE Platform. This requires the new NDK 22.9.8 and above
How I did it
Update submodule sonic-platform for Nokia-7250IXRE platform.
c9f316e Disparate process and thread-safe protection for MDIPC transport, and refactored presence logic to better align with SfpStateUpdateTask operation
a3486cc Added _get_module_bulk_info() and cache the info for 5 seconds to optimize the chassisd update.
4b2e729 Fixed the nokia_cmd show qfpga help display
7b87049 Fixed the nokia_cmd show midplane helper dispaly.
83eabea Add "nokia_cmd set ndk-monitor-action" and "nokia_cmd set ndk-log-level" commands
8aad7de Add nokia_cmd show ndk-version
d2c55e3 Modify the psu.py and module.py to optimize the psud running time
Signed-off-by: mlok <marty.lok@nokia.com>
Signed-off-by: Stepan Blyschak stepanb@nvidia.com
DEPENDS: #12852
Why I did it
To support BGP pending FIB suppression.
How I did it
I backported patches from FRR 8.4 feature that allows communicating ASIC route status back to FRR.
Also, added a new field in DEVICE_METADATA YANG model table. Added UT for YANG model changes.
How to verify it
Run on the switch.
- Why I did it
Package Marvell/Innovium CLI shell.
- How I did it
Include shell packages.
- How to verify it
Platform specific shell commands.
Signed-off-by: rck-innovium rck@innovium.com
- Why I did it
Facilitate Automatic integration of new hw-mgmt version into SONiC.
Inputs to the Script:
MLNX_HW_MANAGEMENT_VERSION Eg: 7.0040.5202
CREATE_BRANCH: (y|n) Creates a branch instead of a commit (optional, default: n)
BRANCH_SONIC: Only relevant when CREATE_BRANCH is y. Default: master.
Note: These should be provided through SONIC_OVERRIDE_BUILD_VARS parameter
Output:
Script creates a commit (in each of sonic-buildimage, sonic-linux-kernel) with all the changes required for upgrading the hw-management version to a version provided by MLNX_HW_MANAGEMENT_VERSION
Brief Summary of the changes made:
MLNX_HW_MANAGEMENT_VERSION flag in the hw-management.mk file
hw-mgmt submodule is updated to the corresponding version
Updates are made to non-upstream-patches/patches and series.patch file
series, kconfig-inclusion and kconfig-exclusion files can be updated in the sonic-linux-kernel repo
sonic-linux-kernel/patches folder is updated with the corresponding upstream patches
Based on the inputs, there could be a branch seen in the local for each of the repo's. Branch is named as <branch>_<parent_commit>_integrate_<hw_mgmt_version>
- How I did it
Added a new make target which can be invoked by calling make integrate-mlnx-hw-mgmt
user@server:/sonic-buildimage$ git rev-parse --abbrev-ref HEAD
master_23193446a_integrate_7.0020.5052
user@server:/sonic-buildimage$ git log --oneline -n 2
f66e01867 (HEAD -> master_23193446a_integrate_V.7.0020.5052, show) Intgerate HW-MGMT V.7.0020.5052 Changes
23193446a (master_intg_hw_mgmt) Update logic
user@server:/sonic-buildimage/src/sonic-linux-kernel$ git rev-parse --abbrev-ref HEAD
master_6847319_integrate_7.0020.4104
user@server:/sonic-buildimage/src/sonic-linux-kernel$ git log --oneline -n 2
6094f71 (HEAD -> master_6847319_integrate_V.7.0020.5052) Intgerate HW-MGMT V.7.0020.5052 Changes
6847319 (origin/master, origin/HEAD) Read ID register for optoe1 to find pageable bit in optoe driver (#308)
Changes made will be summarized under sonic-buildimage/integrate-mlnx-hw-mgmt_user.out file. Debugging and troubleshooting output is written to sonic-buildimage/integrate-mlnx-hw-mgmt.log files
User output file & stdout file:
log_files.tar.gz
Limitations:
Assumes the changes would only work for amd64
Assumes the non-upstream patches in mellanox only belong to hw-mgmt
- How to verify it
Build the Kernel
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Why I did it
Support to add SONiC OS Version in device info.
It will be used to display the version info in the SONiC command "show version". The version is used to do the FIPS certification. We do not do the FIPS certification on a specific release, but on the SONiC OS Version.
SONiC Software Version: SONiC.master-13812.218661-7d94c0c28
SONiC OS Version: 11
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
How I did it
- Why I did it
Currently, non upstream patches are applied only after upstream patches.
Depends on sonic-net/sonic-linux-kernel#313. Can be merged in any order, preferably together
- What I did it
Non upstream Patches that reside in the sonic repo will not be saved in a tar file bur rather in a folder pointed out by EXTERNAL_KERNEL_PATCH_LOC. This is to make changes to the non upstream patches easily traceable.
The build variable name is also updated to INCLUDE_EXTERNAL_PATCHES
Files/folders expected under EXTERNAL_KERNEL_PATCH_LOC
EXTERNAL_KERNEL_PATCH_LOC/
├──── patches/
├── 0001-xxxxx.patch
├── 0001-yyyyyyyy.patch
├── .............
├──── series.patch
series.patch should contain a diff that is applied on the sonic-linux-kernel/patch/series file. The diff should include all the non-upstream patches.
How to verify it
Build the Kernel and verified if all the patches are applied properly
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
[S6100] Improve S6100 serial-getty monitor, wait and re-check when getty not running to avoid false alert.
#### Why I did it
On S6100, the serial-getty service some time can't auto-restart by systemd. So there is a monit unit to check serial-getty service status and restart it.
However, this monit will report false alert, because in most case when serial-getty not running, systemd can restart it successfully.
To avoid the false alert, improve the monitor to wait and re-check.
Steps to reproduce this issue:
1. User login to device via console, and keep the connection.
2. User login to device via SSH, check the serial-getty@ttyS1.service service, it's running.
3. Run 'monit reload' from SSH connection.
4. Check syslog 1 minutes later, there will be false alert: ' 'serial-getty' process is not running'
#### How I did it
Add check-getty.sh script to recheck again later when getty service not running.
And update monit unit to check serial-getty service status with this script to avoid false alert.
#### How to verify it
Pass all UT.
Manually check fixed code work correctly:
```
admin@***:~$ sudo systemctl stop serial-getty@ttyS1.service
admin@***:~$ sudo /usr/local/bin/check-getty.sh
admin@***:~$ echo $?
1
admin@***:~$ sudo systemctl status serial-getty@ttyS1.service
● serial-getty@ttyS1.service - Serial Getty on ttyS1
Loaded: loaded (/lib/systemd/system/serial-getty@.service; enabled-runtime; vendor preset: enabled)
Active: inactive (dead) since Tue 2023-03-28 07:15:21 UTC; 1min 13s ago
admin@***:~$ sudo /usr/local/bin/check-getty.sh
admin@***:~$ echo $?
0
admin@***:~$ sudo systemctl status serial-getty@ttyS1.service
● serial-getty@ttyS1.service - Serial Getty on ttyS1
Loaded: loaded (/lib/systemd/system/serial-getty@.service; enabled-runtime; vendor preset: enabled)
```
syslog:
```
Mar 28 07:10:37.597458 *** INFO systemd[1]: serial-getty@ttyS1.service: Succeeded.
Mar 28 07:12:43.010550 *** ERR monit[593]: 'serial-getty' status failed (1) -- no output
Mar 28 07:12:43.010744 *** INFO monit[593]: 'serial-getty' trying to restart
Mar 28 07:12:43.010846 *** INFO monit[593]: 'serial-getty' stop: '/bin/systemctl stop serial-getty@ttyS1.service'
Mar 28 07:12:43.132172 *** INFO monit[593]: 'serial-getty' start: '/bin/systemctl start serial-getty@ttyS1.service'
Mar 28 07:13:43.286276 *** INFO monit[593]: 'serial-getty' status succeeded (0) -- no output
```
#### Description for the changelog
[S6100] Improve S6100 serial-getty monitor.
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
Why I did it
After sonic-install install a new image, print_menu is set echo without any data. No image info between Hit any key to stop autoboot: 0 and Start USB
Board configuration detected:
Net:
| port | Interface | PHY address |
|--------|-----------|--------------|
No ethernet found.
Hit any key to stop autoboot: 0
(Re)start USB...
USB0: Port (usbActive) : 0 Interface (usbType = 2) : USB EHCI 1.00
scanning bus 0 for devices... 3 USB Device(s) found
scanning usb for storage devices... 0 Storage Device(s) found
How I did it
The fw_setenv print_menu is missing the double quotes. That causes the value is truncated. Using double quotes to in the environment setting.
How to verify it
Install new image with this fix. And reboot the system. The following section should be shown:
Signed-off-by: mlok <marty.lok@nokia.com>
Why I did it
Add platform files for critical processes and default qos config for Innovium platforms
How I did it
Added default files for critical processes and qos config
How to verify it
Tested with autorestart/test_container_autorestart.py::test_containers_autorestart
Signed-off-by: rck-innovium rck@innovium.com
Why I did it
Fix the installation candidate not found issue when building docker-sonic-vs
How I did it
Need to run the command "apt-get update" to update the mirror indexes before installing the package gnupg
How to verify it
Why I did it
There is rare condition, emc2305 hold SMBus and cause SMBus completion wait timed out.
How I did it
Enable EMC2305 SMBus timeout feature, 30ms period of inactivity will reset the interface.
How to verify it
Use 'i2cget -y -f 23 0x4d 0x20 b' to read EMC2305 configuration register and check DIS_TO bit not set.
Signed-off-by: Eric Zhu <erzhu@celestica.com>
* Upgrade docker-sonic-vs and docker-syncd-vs to Bullseye
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* iproute2: Force a new version and timestamp to be used for the package
There is an issue with Docker's overlay2 storage driver when not using
native diffs (and thus falling back to naive diff mode), which is the
case in the CI builds. The way the naive diff mode detects changes is by
comparing the file size and comparing the timestamps (specifically, I
believe it's the modification timestamp), and if there's a change there,
then it's considered a change that needs to be recorded as part of that
layer.
The problem is that with the code being added in the patch, the file
size remains the same, and the timestamp of binary files appear to be
the same timestamp as the changelog entry (likely for reproducible build
purposes). The file size remains the same likely due to extra padding
within the file introduced by relro. Because of this, Docker doesn't
detect this file has changed, and doesn't save the new file as part of
this layer.
To work around this, create a new changelog entry (with a new version as
well) with a new timestamp. This will result in the binary files having
a different timestamp, and thus will get saved by Docker as part of that
layer.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
---------
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
Sometimes Nvidia watchdog device isn't ready when watchdog-control service is up after first installation from ONIE
need to delay watchdog control service to go up after hw-mgmt which gets devices up and ready
- How I did it
Delay Nvidia watchdog-control service before hw-mgmt has started on Mellanox platform in order to avoid missing or not ready watchdog device.
- How to verify it
verification test of ONIE installation of image in a loop
making sure watchdog service is always up (not failed) after first installation from ONIE
Why I did it
To enhance pddf_eeprom.py to use caching and fix#13835
How I did it
Utilising the in-built caching mechanism in the base class eeprom_base.py.
Adding a cache file to store the eeprom data.
How to verify it
By running 'decode-syseeprom' or 'show platform syseeprom' commands.
Why I did it
To enable FPGA support in PDDF.
How I did it
Added FPGAI2C and FPGAPCI in the build path for the PDDF debian package
Added the support for FPGA access APIs in the drivers of fan, xcvr, led etc.
Added the FPGA device creation support in PDDF utils and parsers
How to verify it
These changes can be verified on some platform using such FPGAs. For testing purpose, we took Dell S5232f platform and brought it up using PDDF. In doing so, FPGA devices are created using PDDF and optics eeproms were accessed using common FPGA drivers. Below are some of the logs.
- Why I did it
To include latest fixes:
Fix traffic loss on all routed traffic when moving from 4.4.3372/XX_2008_3388 to 4.5.4118-012/XX_2010_4120-010. Issue occurred after ISSU process in Spectrum 1 only, When upgrading from older version to a new one. Neighbor entries are overwritten.
Fix When using mirror session policer on SPC2/3, the actual CIR was 1.28 times more than the configured CIR value.
Fix Creation of router interface of type bridge may occasionally fail if create is performed immediately after delete.
Fix False errors during SDK deinitialization may be seen in the syslog
- How I did it
Updated SDK submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Why I did it
Upgrade SAI XGS version to 8.4.0.2 and migrate to DMZ repo.
How I did it
Update SAI XGS version in sai.mk.
How to verify it
Run the SONiC and SAI test with the SAI pipeline.
Signed-off-by: zitingguo-ms zitingguo@microsoft.com
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1
- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK
- How to verify it
Manual testing
Fixes#13568
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:
admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.
- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.
- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
Which release branch to backport (provide reason below if selected)
- Why I did it
To optimize Mellanox platform build
- How I did it
sdk debs are now downloaded as Spectrum-SDK-Drivers-SONiC-Bins release
sx kernel is downloaded as zip from Spectrum-SDK-Drivers
- How to verify it
configure/build for Mellanox platform
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Why I did it
Platform cases test_tx_disable, test_tx_disable_channel, test_power_override failed in dx010.
How I did it
Add i2c access algorithm for CPLD i2c adapters.
How to verify it
Verify it with platform_tests/api/test_sfp.py::TestSfpApi test cases.
- Why I did it
FW for Spectrum-4 ASIC not yet available
- How I did it
Remove in Mellanox fw make files to Spectrum-4 ASIC firmware binaries.
Remove from firmware upgrade scripts to be able Spectrum-4 ASIC.
- How to verify it
Run regression test
d768d19 Remove warning msg when a transceiver op takes > 200ms
7451689 Support the module.py in IMM to query the Supervisor card eeprom info
Signed-off-by: mlok <marty.lok@nokia.com>
- Why I did it
On Mellanox platform, system EEPROM is a soft link provided by hw-management. There is chance that config-setup service accessing the EEPROM before hw-management creating it. It causes errors. The PR is aim to fix it.
- How I did it
Waiting EEPROM creation in platform API up to 10 seconds.
- How to verify it
Manual test
- Why I did it
Add support for systems 4600/4600C/2201 that are using sonic interface names aligned to 4 instead of 8 (which is the max number of lanes per port).
Improve DB access calls, now we use Python library functions.
- How I did it
Use addition information taken from Config DB in order to create map from SDK logical index to sonic interface name.
- How to verify it
Run ECMP calculator on 4600, 4600C and 2201 platforms.
- Why I did it
sfp_event.py gets a PMPE message when a cable event is available. In PMPE message, there is no label port available. Current sfp_event.py is using sx_api_port_device_get to get 64 logical ports attributes, and find the label port from those 64 attributes. However, if there are more than 64 ports, sfp_event.py might not be able to find the label port and drop the PMPE message.
- How I did it
Don't use hardcoded 64, get logical port number instead.
- How to verify it
Manual test
- Why I did it
Add PYTHON3_SWSSCOMMON as build time dependency to Mellanox platform API to avoid issue like:
19:34:11 ImportError while loading conftest '/sonic/platform/mellanox/mlnx-platform-api/tests/conftest.py'.
19:34:11 tests/conftest.py:28: in <module>
19:34:11 from sonic_platform import utils
19:34:11 sonic_platform/__init__.py:18: in <module>
19:34:11 from sonic_platform import *
19:34:11 sonic_platform/platform.py:28: in <module>
19:34:11 raise ImportError(str(e) + "- required module not found")
19:34:11 E ImportError: No module named 'swsscommon'- required module not found- required module not found
19:34:11 [ FAIL LOG END ] [ target/python-wheels/bullseye/mlnx_platform_api-1.0-py3-none-any.whl ]
The issue only happens when calling below command:
make target/python-wheels/bullseye/mlnx_platform_api-1.0-py3-none-any.whl
- How I did it
Add PYTHON3_SWSSCOMMON as build time dependency to Mellanox platform API
- How to verify it
Run build
- Why I did it
Add non-upstream kernel patches for the Nvidia platforms
These patches are not yet upstream but needed for new technology.
A flow to upstream them is in progress and once they will be approved they will be moved officially to sonic-linux-kernel.
Till then to include them in the build (not must) the build option INCLUDE_EXTERNAL_PATCH_TAR=y should be included
- How I did it
Zip all the patches in to a tar.gz tarball.
- How to verify it
Manually test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
add SEU reporting on chassis
fix fallback logic for Clearlake eeprom identification
fix fan speed reporting for a specific model
move pcie timeout configuration for Upperlake in platform code (deprecates hwsku-init)
- Why I did it
Currently, when building MFT, it can only download the source code from the official download site: http://www.mellanox.com/downloads/MFT/, it's not possible to integrate an internal version that has not been officially released yet.
The intention of this PR is to make it possible to download the source code from any valid link.
- How I did it
Add a new parameter "MLNX_MFT_INTERNAL_SOURCE_BASE_URL", if an URL is given, it will download the source code from the given URL, otherwise, it downloads from the default official site.
- How to verify it
Specify a valid URL in the make file, the MFT debs should be built successfully.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
Support per PSU slope value for PSU power threshold according to hardware team requirement
- How I did it
Pass the PSU number as a parameter when fetching the slope value of PSU.
- How to verify it
Running regression and manual test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.
To fix issue #13591.
- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected
- How to verify it
Manual test
- Why I did it
Advance hw-mgmt service to V.7.0020.4100
Add missing thermal sensors that are supported by hw-mgmt package
Delay system health service before hw-mgmt has started on Mellanox platform in order to avoid reading some sensors before ready.
Depends on sonic-net/sonic-linux-kernel#305
- How I did it
1. Update hw mgmt version
2. Add missing sensors
3. Delay service
- How to verify it
Regression test.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Why I did it
High CPU utilization by entropy.py
How I did it
Remove entropy script as it does not work anymore and is no longer needed for bullseye(202205).
In Buster(202012) the max available poolsize (entropy_avail) for entropy is 4096 and our entropy.py script was based on this value. With the change in kernel to bullseye on 202205 this entropy poolsize was changed to 256 which also causes our script to fail.
This script was initially added to provide SW assistance to improve the system entropy value available early on in the Sonic boot sequence on buster.
On bullseye (Linux kernel 5.10) this is no longer needed as this feature has been improved.
How to verify it
run "top" command to check CPU usage.
Why I did it
Command "sudo sfputil show error-status -hw" shows "OK (Not implemented)" in the output.
How I did it
Add a new SFP API get_error_description support in Nokia sonic-platform sfp.py module.
How to verify it
Run the new image and execute command "sudo sfputil show error-status -hw"
Why I did it
Some of the platform vendors use FPGA in the HW design. This FPGA is connected to the CPU via PCIe interface. This FPGA also works as an I2C controller having other devices attached to the I2C channels emanating from it. Adding a common module, a driver and a platform specific algorithm module to be used for such FPGA in PDDF.
How I did it
Added 'pddf_fpgapci_module', 'pddf_fpgapci_driver' and a sample algorithm module for Xilinx device 7021. Kernel modules which takes the platform dependent data from PDDF JSON files and initialises the PCIe FPGA. The sample algorithm module can be used by the ODMs in case the communication algorithms are same for their device. Else, they need to come up with similar algo module.
How to verify it
Any platform having such an FPGA and brought up using PDDF would use these kernel modules. The detail representation of such a device in PDDF JSON file is covered in the HLD.
Why I did it
Sometime, SIGTERM processing by psud takes more then default 10sec (please see stopwaitsecs in http://supervisord.org/configuration.html).
Due to this, the following two testcases may fail:
test_pmon_psud_stop_and_start_status
test_pmon_psud_term_and_start_status
How I did it
Update PSU plugin to process sigterm signal so that psud runs faster to end last cycle in time
How to verify it
Run SONiC CTs:
test_pmon_psud_stop_and_start_status
test_pmon_psud_term_and_start_status
Why I did it
dplane_fpm_nl is a new FPM implementation in FRR. The old plugin fpm will not have any new features implemented. Usage of the new plugin gives us ability to use BGP suppression feature and next hop groups in the future.
How I did it
Switch to dplane_fpm_nl zebra plugin from old fpm plugin which is not supported anymore
Remove stale patches for old fpm plugin and add similar patches for dplane_fpm_nl
How to verify it
Build and run on the switch.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
Upgrade both Centec X86 and ARM64 platform containers(syncd/saiserver/syncd-rpc) to bullseye
Optimize Centec X86 platform makefile, change sdk.mk to sai.mk
How I did it
Modify Makefile and Dockerfile to use bullseye
Change filename form sdk.mk to sai.mk, optimize and modify related files
How to verify it
For Centec X86 platform, compile the code with : a) make configure PLATFORM=centec; b) make all
For Centec ARM64 platform, cmpile the code with: a) make configure PLATFORM=centec-arm64 PLATFORM_ARCH=arm64; b) make all
Verifiy the sonic-centec.bin and sonic-centec-arm64.bin on Centec chip based board.
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker
- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin
- How to verify it
Manual UT and new sonic-mgmt tests
Why I did it
LED driver changed due to introduction of FPGA support. The PDDF parser and APIs need to be updated. In turn the common platform APIs also require changes.
How I did it
Changed the get/set status LED APIs for PSU, fan and fan_drawer.
Changed the color strings to plain color name. e.g. 'STATUS_LED_COLOR_GREEN' has been changed to 'green'
Added support for LED color get operation via BMC
How to verify it
Verified the new changes on Accton AS7816-64X platform.
root@sonic:/home/admin#
root@sonic:/home/admin# show platform summary
Platform: x86_64-accton_as7816_64x-r0
HwSKU: Accton-AS7816-64X
ASIC: broadcom
ASIC Count: 1
Serial Number: AAA1903AAEV
Model Number: FP3AT7664000A
Hardware Revision: N/A
root@sonic:/home/admin#
root@sonic:/home/admin# show ver |more
SONiC Software Version: SONiC.master.0-dirty-20230111.010655
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
Build commit: 3176b15ae
Build date: Wed Jan 11 09:12:54 UTC 2023
Built by: fk410167@sonic-lvn-csg-006
Platform: x86_64-accton_as7816_64x-r0
HwSKU: Accton-AS7816-64X
ASIC: broadcom
ASIC Count: 1
Serial Number: AAA1903AAEV
Model Number: FP3AT7664000A
Hardware Revision: N/A
Uptime: 09:24:42 up 4 days, 22:45, 1 user, load average: 1.97, 1.80, 1.51
Date: Mon 23 Jan 2023 09:24:42
Docker images:
REPOSITORY TAG IMAGE ID SI
ZE
docker-orchagent latest 63262c7468d7 38
5MB
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin# pddf_ledutil getstatusled LOC_LED
off
root@sonic:/home/admin# pddf_ledutil getstatusled DIAG_LED
green
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin# pddf_ledutil setstatusled DIAG_LED red
True
root@sonic:/home/admin# pddf_ledutil getstatusled DIAG_LED
red
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin# pddf_ledutil setstatusled DIAG_LED amber
Invalid color
False
root@sonic:/home/admin# pddf_ledutil getstatusled DIAG_LED
red
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin# pddf_ledutil setstatusled DIAG_LED green
True
root@sonic:/home/admin# pddf_ledutil getstatusled DIAG_LED
green
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin#
root@sonic:/home/admin# pddf_ledutil getstatusled LOC_LED
off
root@sonic:/home/admin# pddf_ledutil setstatusled LOC_LED amber
True
root@sonic:/home/admin# pddf_ledutil getstatusled LOC_LED
amber
root@sonic:/home/admin# pddf_ledutil setstatusled LOC_LED off
True
root@sonic:/home/admin# pddf_ledutil getstatusled LOC_LED
off
root@sonic:/home/admin#
Why I did it
Some of the platform vendors use FPGA in the HW design. This FPGA is connected to the CPU via I2C bus. Adding a common module and a driver to be used for such FPGA in PDDF.
How I did it
Added 'pddf_fpgai2c_module' and 'pddf_fpgai2c_driver' kernel modules which takes the platform dependent data from PDDF JSON files and creates an I2C client for the FPGA.
How to verify it
Any platform having such an FPGA and brought up using PDDF would use these kernel modules. The detail representation of such a device in PDDF JSON file is covered in the HLD.
- Why I did it
To include latest fixes and new functionality
SDK/FW
1. Fixed bug in recovery mechanism in case of I2C error when trying to access the XSFP module.
2. On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-thought mode, a pipeline might get stuck.
3. On the Spectrum-2 and Spectrum-3 switch, if you enable ECN marking and the port is in split mode, traffic sent to the port under congestion (for example, when connecting two ports with a total speed of 50GbE to a single 25GbE port) is not marked.
4. Modifying existing entry/Adding new one when switch is at its maximum capacity (full by maximum allowed entries from any type such as routes, FDB, and so forth), will fail with an error.
5. When many ports are active (e.g., 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
6. When a system has more than 256 ACL rules, on rare occasion, removing/adding rules may cause some ACL rules not to work.
7. On SN2201 system, on RJ45 port, the link might appear in 'down' state even if it operations properly.
8. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event.
9. When setting LAG as a SPAN analyzer, the distributor mode of the LAG members was not taken into account. It may happen that the LAG member with distributor mode disabled will be set as a SPAN analyzer port.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
To improve ASIC FW upgrade logging and have information about the cause of FW update failure in the log.
- How I did it
Added syslog logger support
In case the FW update has failed the update tool will give the cause of the failure in the output in the last line, starting with "Fail".
When running the tool, in case of a failed update, we will parse the output to retrieve the cause and log it.
Device #1:
----------
Device Type: ConnectX6DX
Part Number: MCX623106AN-CDA_Ax
Description: ConnectX-6 Dx EN adapter card; 100GbE; Dual-port QSFP56; PCIe 4.0/3.0 x16;
PSID: MT_0000000359
PCI Device Name: /dev/mst/mt4125_pciconf0
Base GUID: 0c42a103007d22d4
Base MAC: 0c42a17d22d4
Versions: Current Available
FW 22.32.0498 22.32.0498
PXE 3.6.0500 3.6.0500
UEFI 14.25.0015 14.25.0015
Status: Forced update required
---------
Found 1 device(s) requiring firmware update...
Device #1: Updating FW ...
FSMST_INITIALIZE - OK
Writing Boot image component - OK
Fail : The Digest in the signature is wrong
- How to verify it
mlnx-fw-upgrade.sh --upgrade
Add script usage and more information to script description being printed in help option.
- Why I did it
Missing information in script description in help option.
- How I did it
Expand script description and add script usage.
- How to verify it
Run the script with -h option.
Why I did it
Update Nokia sonic-platform submodule
81a9c77 [Supervisor] Modifed the get_description to fix the name for Nokia-IXR7250E-SUP-10 card.
e49ddfb Fix the LedContorlCommon to get the physical index from port mapping
dd143f1 [module] modify the chassis.py and module.py to allow supervisor to retrieve the line card eemprom info
How I did it
Update Nokia sonic-platform submodule
81a9c77 [Supervisor] Modifed the get_description to fix the name for Nokia-IXR7250E-SUP-10 card.
e49ddfb Fix the LedContorlCommon to get the physical index from port mapping
dd143f1 [module] modify the chassis.py and module.py to allow supervisor to retrieve the line card eemprom info
How to verify it
On supervisor, "show chassis module status" should show Nokia-IXR7250E-SUP-10 instead of Nokia-IXR7250-SUP-10
Signed-off-by: mlok <marty.lok@nokia.com>