- Why I did it
Changing LPMODE timing is different between cables.
We want to add functionality to make sure LPMODE has changed.
For that, the wait_until utility is used and every 1 second (until timeout), it will check with lower-layers what is the current Lpmode.
Once it is the expected mode, set_lpmode() functino will return True.
If after seconds, Lpmode is still not in the expected mode, set_lpmode() function will return False.
- How I did it
Add use of wait_until function to make sure lpmode was changed.
- How to verify it
sfputil lpmode on
sfputil lpmode off
- Why I did it
The creation of system EEPROM VPD file "/var/run/hw-management/eeprom/vpd_info" is triggered by the udev event during the system boot up, in case the CPU is busy during the bootup, the udev event handling can be delayed, and need to wait for some more time for the file creation.
- How I did it
Extend the waiting time from 10s to 20s to overcome some extreme case.
- How to verify it
continuously run reboot case and verify whether still can see error msg "ERR decode-syseeprom: Nowhere to read syseeprom from! No symlink found"
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
Enable build cache for marvell-arm64 build to decrease PR check duration.
Work item tracking
Microsoft ADO (number only): 26340500
How I did it
How to verify it
- Why I did it
when reading sysfs fd upon python poller events, there's end of line garbage like "# 012" (without space between the 2 parts) trailing the real value of 1 or 0
- How I did it
using python strip() to remove end of line
- How to verify it
run the CMIS host management feature on a switch
wait few minutes until switch completes boot up sequence including CMIS host manager
then disconnect or reconnect a port to create a poller event
- Why I did it
Fix the code to work also after warm reboot to work with FW controlled ports.
In warm reboot the control state sysfs of each port does not change unlike reboot or fast boot.
- How I did it
1. Check procfs cmdline if warm reboot done this is due to the fact pmon don't recognize warm reboot when it's taking place since pmon is loaded after warm reboot is finished.
2. If warm reboot done, check in static detection part for each port if it's FW controlled. If so, leave it this way and stop the state machine flow (set it to final state).
- How to verify it
1. Boot a switch with CMIS host management with at least one FW controlled port (non active cables or non cmis cables) then run warm reboot.
2. Verify no errors of sysfs reading appears for control sysfs
- Why I did it
If a PSU is not present, there could be error log while restarting psud or thermalctld:
Jan 8 17:15:52.689616 sonic ERR pmon#psud: Thermal sysfs /run/hw-management/thermal/psu2_temp1_max does not exist
Jan 8 17:15:57.747723 sonic ERR pmon#thermalctld: Thermal sysfs /run/hw-management/thermal/psu2_temp1 does not exist
- How I did it
if a PSU is not present, we should not check the PSU temperature sysfs.
* fix hw_reset low polarity (reverse values)
* move seek to beginning of sysfs fd before reading to resolve power_good
sysfs returns empty upon plug out cable
- Why I did it
1. Thermal updater should wait more time for module to be initialized
2. sfp should get temperature threshold from EEPROM because SDK sysfs is not yet supported
3. Rename sfp function to fix typo
4. sfp.get_presence should return False if module is under initialization
- How I did it
1. Thermal updater should wait more time for module to be initialized
2. sfp should get temperature threshold from EEPROM because SDK sysfs is not yet supported
3. Rename sfp function to fix typo
4. sfp.get_presence should return False if module is under initialization
- How to verify it
Manual test
Unit test
- Why I did it
watchdog-control service always disarm watchdog during system startup stage. It could be the case that watchdog is not fully initialized while the watchdog-control service is accessing it. This PR adds a wait to make sure watchdog has been fully initialized.
- How I did it
adds a wait to make sure watchdog has been fully initialized.
- How to verify it
Manual test
sonic regression
Signed-off-by: Nazarii Hnydyn nazariig@nvidia.comCloses#17345
This W/A was proposed by Nvidia FRR team before the long term solution is ready.
Why I did it
A W/A to fix default route installation during LAG member flap
Work item tracking
N/A
How I did it
Disabled FRR next hop group support
How to verify it
Do LAG member flap
- Why I did it
For CMIS host management module, we need a different implementation for sfp.reset. This PR is to implement it
- How I did it
For SW control modules, do reset from hw_reset
For FW control modules, do reset as the original way
- How to verify it
Manual test
sonic-mgmt platform test
- Why I did it
New implementation of Nvidia platform_wait due to:
1. sysfs deprecated by hw-mgmt
2. new dependencies to SDK
3. For CMIS host management mode
- How I did it
wait hw-management ready
wait SDK sysfs nodes ready
- How to verify it
manual test
unit test
sonic-mgmt regression
These changes, in conjunction with NDK version >= 22.9.17 address the thermal logging issues discussed at Nokia-ION/ndk#27. While the changes contained at this PR do not require coupling to NDK version >= 22.9.17, thermal logging enhancements will not be available without updated NDK >= 22.9.17. Thus, coupling with NDK >=22.9.17 is preferred and recommended.
Why I did it
To address thermal logging deficiencies.
Work item tracking
Microsoft ADO (number only): 26365734
How I did it
The following changes are included:
Threshold configuration values are provided in the associated device data .json files. There is also a change included to better handle the condition where an SFP module read fails.
Modify the module.py reboot to support reboot linecard from Supervisor
- Modify reboot to call _reboot_imm for single IMM card reboot
- Add log to the ndk_cmd to log the operation of "reboot-linecard" and "shutdown/satrtup the sfm"
Add new nokia_cmd set command and modify show ndk-status output
- Add a new function reboot_imm() to nokia_common.py to support reboot a single IMM slot from CPM
- Added new command: nokia_cmd set reboot-linecard <slot> [forece] for CPM
- Append a new column "RebootStatus" at the end of output of "nokia_cmd show ndk-status"
- Provide ability for IMM to disable all transceiver module TX at reboot time
- Remove defunct xcvr-resync service
- Why I did it
Fix issue xcvrd crashes due to cannot import name 'initialize_sfp_thermal':
Nov 27 09:47:16.388639 sonic ERR pmon#xcvrd: Exception occured at CmisManagerTask thread due to ImportError("cannot import name 'initialize_sfp_thermal' from partially initialized module 'sonic_platform.thermal' (most likely due to a circular import) (/usr/local/lib/python3.9/dist-packages/sonic_platform/thermal.py)")
- How I did it
Add lock for creating SFP object
- How to verify it
Unit test
Manual Test
- Why I did it
When module is totally under software control, driver cannot get module temperature/temperature threshold from firmware. In this case, sonic needs to get temperature/temperature threshold from EEPROM. In this PR, a thread thermal updater is created to update module temperature/temperature threshold while software control is enabled.
- How I did it
Query ASIC temperature from SDK sysfs and update hw-management-tc periodically
Query Module temperature from EEPROM and update hw-management-tc periodically
- How to verify it
Manual test
New Unit tests
- Why I did it
Enable CMIS host management for Mellanox devices which are expected to support the feature
- How I did it
new thread in a new file and changing logic in platform code in chassis.py which is calling this thread from get_change_event()
this thread in the new file handles the state machine per port.
first the static detection takes place once the thread is up (during switch bootup sequence), until final decision if it's FW control or SW control module.
After it ends, the dynamic detection takes place, listening to changes in the sysfs fds, per port,
so it will be able to detect plug in or out events of a cable.
- How to verify it
Enhanced unit tests
run sonic mgmt on Nvidia SN4700 with CMIS host management enabled
Co-authored-by: dbarashinvd <105214075+dbarashinvd@users.noreply.github.com>
- Why I did it
Provide a dummy implementation for SFP error description when CMIS host management is enabled. A future feature shall be raised to implement SFP error description for such mode.
- How I did it
if SFP is under software control, provide "Not supported" as error description
if SFP is under initialization, provide "Initializing" as error description
- How to verify it
unit test
To modify EEPROM API serial_number_str to return service tag instead of serial number in Dell S6100.
Ref PR: #1239
How I did it
Update EEPROM API serial_number_str to return service tag instead of serial number.
How to verify it
Verify decode-syseeprom -s returns service tag in Dell S6100.
FCS/CRC Errors will only be reported as RX_ERR.
Fix to avoid the mac port related errors.
Fix for sharedResSize testcase failure in QoS-SAI
Fix the issue related to voltage in 'show platform psustatus'.
Support WRED drop for lossy queues.
Fixed an issue where lossy traffic was getting dropped.
Enhancement of SAI logging for errors and interrupts
- Why I did it
The current low power mode setting implementation requests the user to set the port to admin down first before toggling LP mode, this is not backward compatible, now revert it to the old way so that the user can toggle the LP mode regardless of the port admin status.
- How I did it
Revert the recent changes related to LPM in PR #14130 and #16545
- How to verify it
Run all sfputil and SFP platform API related tests on all the Mellanox platforms.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
* [Marvell-arm64] Add platform support for rd98DX35xx
This change adds following two variants of rd98DX35xx board to arm64
build.
Board with CPU integrated into the 98DX35xx switching chip:
Platform: arm64-marvell_rd98DX35xx-r0
HwSKU: rd98DX35xx
ASIC: marvell
Port Config: 32x1G + 16x2.5G + 6x25G
Board with external CN9131 CPU connected over PCI to 98DX35xx
switching chip:
Platform: arm64-marvell_rd98DX35xx_cn9131-r0
HwSKU: rd98DX35xx_cn9131
ASIC: marvell
Port Config: 32x1G + 16x2.5G + 6x25G
Change-Id: I21dc9fe972417daaabb20a5bddf7779d72b7972e
Signed-off-by: Pavan Naregundi <pnaregundi@marvell.com>
* Add HWSKU for rd98DX35xx and rd98DX35xx_cn9131
This patch adds new HWSKU's for Marvell arm64 platforms rd98DX35xx
and rd98DX35xx_cn9131.
Change-Id: Id7c14f49f0e304335cc4ca73dcae52362c49d231
Signed-off-by: Pavan Naregundi <pnaregundi@marvell.com>
---------
Signed-off-by: Pavan Naregundi <pnaregundi@marvell.com>
- Why I did it
Support running hw-management service on MSN4700 emulation platform.
- How I did it
Use physical EEPROM instead of the fake one
Do not skip PSUd, PCId, thermal control daemon
Adjust PCIe and thermal configuration files
Adjust platform.json for different chassis names and thermals
Remove a patch to hw-management in order to enable it
- How to verify it
Run Nvidia simulation on SN4700 (ASIC and Platform)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
In order to activate FW after it was upgraded need to perform reboot.
If reboot wasn't performed and user need to upgrade to another SONiC image then it will fail.
The reason for that is that during SONiC upgrade new FW should be installed but it will fail because previously installed FW wasn't activated.
In order to allow 2nd FW upgrade without reboot in-between need to reactivate FW image.
This change handles such flow.
Example of issue scenario:
User installed SONiC image on the switch
Then for some reason FW was upgraded by user or script but reboot was not performed to activate it.
After that upgrade to new SONiC image will fail because new image need to install FW but it fails due to previous one wasn't activated.
- How I did it
In "mlnx-fw-upgrade" script check if FW upgrade failed with the error that FW was already installed but reboot was not performed.
If so then perform FW image reactivation and try to upgrade FW again.
- How to verify it
Install SONiC image on the switch
Then upgrade FW but don't perform reboot.
After that upgrade to new SONiC image and check that upgrade was successfull.
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
Why I did it
- Convert hw-dump into generate-dump plugins
- Enable DRAM scrubber on some products
- Fix xcvr driver active low register bit logic
- Improve cooling algorithm (now considers xcvrs and modules)
- Add linecard graceful shutdown (disabled by default)
The scrubber was enabled for the following products:
- DCS-7050QX-32S
- DCS-7050CX3-32S
- DCS-7060CX-32S
This is CSP CS00012280996.
The issue to fix is that the checksum was incorrect for all TCP packets leaving the system so that the BGP connection cannot be established. We found the issue on BCM56993, and it is possible to affect all platforms using linux_ngknet.
Why I did it
XGS saibcm-modules 8.4 is needed. #14471
Work item tracking
Microsoft ADO (number only): 24917414
How I did it
Copy files from xgs SDK 8.4 repo and modify makefiles to build the image.
Upgrade version to 8.4.0.2 in saibcm-modules.mk.
How to verify it
Build a private image and run full qualification with it: https://elastictest.org/scheduler/testplan/650419cb71f60aa92c456a2b