Commit Graph

195 Commits

Author SHA1 Message Date
Sasha
26bc08850b
Check if PSU files exists when getting psu_voltage properties (#17042)
- Why I did it
Error messages occured when trying to read PSU files on init:
ERR pmon#psud: Failed to read from file /var/run/hw-management/power/psu1_volt_out2_capability - FileNotFoundError(2, 'No such file or directory')

This can happen when the power cord is disconnected from the PSU, so some PSU files may be absent, e.g.:
/var/run/hw-management/power/psu2_volt_out2
/var/run/hw-management/power/psu2_volt_out2_capability

- How I did it
Check if a file exists for a specific PSU parameter If not, return None so we can't read the PSU file any further

- How to verify it
Disconnect power cord from PSU and power supply from system
Wait few minutes and then connect power supply to system without power cord
Check logs for errors

Signed-off-by: Oleksandra Bella <oleksandrab@nvidia.com>
2024-03-04 19:35:06 +02:00
noaOrMlnx
c2aa77c438
[Mellanox] Fix timing issue in lpmode change (#18223)
- Why I did it
Changing LPMODE timing is different between cables.
We want to add functionality to make sure LPMODE has changed.
For that, the wait_until utility is used and every 1 second (until timeout), it will check with lower-layers what is the current Lpmode.
Once it is the expected mode, set_lpmode() functino will return True.
If after seconds, Lpmode is still not in the expected mode, set_lpmode() function will return False.

- How I did it
Add use of wait_until function to make sure lpmode was changed.

- How to verify it
sfputil lpmode on
sfputil lpmode off
2024-03-04 16:38:51 +02:00
Kebo Liu
881ceb7034
[Mellanox] Extend the time to wait for EEPROM VPD file creation (#18146)
- Why I did it
The creation of system EEPROM VPD file "/var/run/hw-management/eeprom/vpd_info" is triggered by the udev event during the system boot up, in case the CPU is busy during the bootup, the udev event handling can be delayed, and need to wait for some more time for the file creation.

- How I did it
Extend the waiting time from 10s to 20s to overcome some extreme case.

- How to verify it
continuously run reboot case and verify whether still can see error msg "ERR decode-syseeprom: Nowhere to read syseeprom from! No symlink found"

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2024-02-29 15:46:49 +02:00
Lahav-Nvidia
8a7e38b3a3
[Mellanox] Add N/A as a valid fan direction for Nvidia platforms (#17930)
- Why I did it
On some Nvidia platforms, fan direction could not be determined. Therefore 'N/A' becomes a valid value for those cases.

- How I did it
Add 'N/A' to the valid fan direction mapping, to avoid an error in the log.

- How to verify it
Check fan direction on Nvidia platforms, and make sure there aren't errors in the log.
2024-02-22 11:35:10 +02:00
dbarashinvd
7a34d4a275
[Mellanox] fix code for warm reboot to work with FW controlled ports (#18065)
- Why I did it
Fix the code to work also after warm reboot to work with FW controlled ports.
In warm reboot the control state sysfs of each port does not change unlike reboot or fast boot.

- How I did it
1. Check procfs cmdline if warm reboot done this is due to the fact pmon don't recognize warm reboot when it's taking place since pmon is loaded after warm reboot is finished.
2. If warm reboot done, check in static detection part for each port if it's FW controlled. If so, leave it this way and stop the state machine flow (set it to final state).

- How to verify it
1. Boot a switch with CMIS host management with at least one FW controlled port (non active cables or non cmis cables) then run warm reboot.
2. Verify no errors of sysfs reading appears for control sysfs
2024-02-08 14:49:56 +02:00
Yevhen Fastiuk
2f35079979
[Mellanox] Fix uninitialized variable on module plug event (#17011)
- Why I did it
To fix uninitialized variable

- How I did it
Add initial value

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
2024-02-05 19:41:16 +02:00
dbarashinvd
0aacc1f28e
[Mellanox] fix sysfs reading that gets garbage end of line using strip (#17830)
- Why I did it
when reading sysfs fd upon python poller events, there's end of line garbage like "# 012" (without space between the 2 parts) trailing the real value of 1 or 0

- How I did it
using python strip() to remove end of line

- How to verify it
run the CMIS host management feature on a switch
wait few minutes until switch completes boot up sequence including CMIS host manager
then disconnect or reconnect a port to create a poller event
2024-02-05 19:39:55 +02:00
dbarashinvd
927dde73f1
fix low polarity wrong value for hw_reset deassert and seek(0) before reading sysfs upon poll event (#17627)
* fix hw_reset low polarity (reverse values)

* move seek to beginning of sysfs fd before reading to resolve power_good
sysfs returns empty upon plug out cable
2024-01-22 10:53:55 -08:00
Junchao-Mellanox
91d77fe7ae
Fix error log while creating PSU thermal object (#17789)
- Why I did it
If a PSU is not present, there could be error log while restarting psud or thermalctld:

Jan  8 17:15:52.689616 sonic ERR pmon#psud: Thermal sysfs /run/hw-management/thermal/psu2_temp1_max does not exist

Jan  8 17:15:57.747723 sonic ERR pmon#thermalctld: Thermal sysfs /run/hw-management/thermal/psu2_temp1 does not exist

- How I did it
if a PSU is not present, we should not check the PSU temperature sysfs.
2024-01-22 16:22:07 +02:00
Junchao-Mellanox
ee49d0dfec
[Mellanox] Fix issues found for CMIS host management (#17637)
- Why I did it
1. Thermal updater should wait more time for module to be initialized
2. sfp should get temperature threshold from EEPROM because SDK sysfs is not yet supported
3. Rename sfp function to fix typo
4. sfp.get_presence should return False if module is under initialization

- How I did it
1. Thermal updater should wait more time for module to be initialized
2. sfp should get temperature threshold from EEPROM because SDK sysfs is not yet supported
3. Rename sfp function to fix typo
4. sfp.get_presence should return False if module is under initialization

- How to verify it
Manual test
Unit test
2024-01-04 09:42:33 +02:00
Junchao-Mellanox
7d388cd0e6
[Mellanox] wait until hw-management watchdog files ready (#17618)
- Why I did it
watchdog-control service always disarm watchdog during system startup stage. It could be the case that watchdog is not fully initialized while the watchdog-control service is accessing it. This PR adds a wait to make sure watchdog has been fully initialized.

- How I did it
adds a wait to make sure watchdog has been fully initialized.

- How to verify it
Manual test
sonic regression
2023-12-26 18:27:18 +02:00
Junchao-Mellanox
d8a1ffbace
[Mellanox] implement sfp.reset for CMIS management (#16862)
- Why I did it
For CMIS host management module, we need a different implementation for sfp.reset. This PR is to implement it

- How I did it
For SW control modules, do reset from hw_reset
For FW control modules, do reset as the original way

- How to verify it
Manual test
sonic-mgmt platform test
2023-12-17 08:02:47 +02:00
Junchao-Mellanox
c1cb292310
[Mellanox] implement platform wait in python code (#17398)
- Why I did it
New implementation of Nvidia platform_wait due to:
1. sysfs deprecated by hw-mgmt
2. new dependencies to SDK
3. For CMIS host management mode

- How I did it
wait hw-management ready
wait SDK sysfs nodes ready

- How to verify it
manual test
unit test
sonic-mgmt regression
2023-12-14 12:04:24 +02:00
Junchao-Mellanox
f373a16e95
[Mellanox] Fix race condition while creating SFP (#17441)
- Why I did it
Fix issue xcvrd crashes due to cannot import name 'initialize_sfp_thermal':

Nov 27 09:47:16.388639 sonic ERR pmon#xcvrd: Exception occured at CmisManagerTask thread due to ImportError("cannot import name 'initialize_sfp_thermal' from partially initialized module 'sonic_platform.thermal' (most likely due to a circular import) (/usr/local/lib/python3.9/dist-packages/sonic_platform/thermal.py)")

- How I did it
Add lock for creating SFP object

- How to verify it
Unit test
Manual Test
2023-12-14 12:01:11 +02:00
Junchao-Mellanox
1b84f3daa5
[Mellanox] update asic and module temperature in a thread for CMIS management (#16955)
- Why I did it
When module is totally under software control, driver cannot get module temperature/temperature threshold from firmware. In this case, sonic needs to get temperature/temperature threshold from EEPROM. In this PR, a thread thermal updater is created to update module temperature/temperature threshold while software control is enabled.

- How I did it
Query ASIC temperature from SDK sysfs and update hw-management-tc periodically
Query Module temperature from EEPROM and update hw-management-tc periodically

- How to verify it
Manual test
New Unit tests
2023-12-13 14:19:44 +02:00
Junchao-Mellanox
0d62cf0e92
[Mellanox] Remove EEPROM write limitation if it is software control (#17030)
- Why I did it
When module is under software control (CMIS host management enabled), EEPROM should be controlled by software and there should be no limitation for any write operation.

- How I did it
Remove EEPROM write limitation if a module is under software control

- How to verify it
Manual test
UT
2023-12-13 14:16:40 +02:00
Junchao-Mellanox
b0bb3d40d3
[Mellanox] Implement low power mode for cmis host management (#17159)
- Why I did it
For cmis host management mode, the prevous sysfs cannot be used for low power mode setting. This PR reuses existing low power mode implementation in sonic_xcvr package when CMIS host management mode is enabled

- How I did it
Use sonic_xcvr low power mode implementation when CMIS host management mode is enabled.

- How to verify it
Manual test for CMIS host management mode
Regression test for old mode and backward compatible test
2023-12-11 10:42:01 +02:00
dbarashinvd
000a2ef818
[Mellanox] Enable CMIS host management (#16846)
- Why I did it
Enable CMIS host management for Mellanox devices which are expected to support the feature

- How I did it
new thread in a new file and changing logic in platform code in chassis.py which is calling this thread from get_change_event()
this thread in the new file handles the state machine per port.
first the static detection takes place once the thread is up (during switch bootup sequence), until final decision if it's FW control or SW control module.
After it ends, the dynamic detection takes place, listening to changes in the sysfs fds, per port,
so it will be able to detect plug in or out events of a cable.

- How to verify it
Enhanced unit tests
run sonic mgmt on Nvidia SN4700 with CMIS host management enabled
2023-12-07 14:54:56 +02:00
Junchao-Mellanox
26bf38b610
[Mellanox] Provide default implementation for sfp error description when CMIS host management is enabled (#17294)
- Why I did it
Provide a dummy implementation for SFP error description when CMIS host management is enabled. A future feature shall be raised to implement SFP error description for such mode.

- How I did it
if SFP is under software control, provide "Not supported" as error description
if SFP is under initialization, provide "Initializing" as error description

- How to verify it
unit test
2023-12-05 20:27:29 +02:00
Kebo Liu
bf4a2e3002
[Mellanox] Revert LPM implementation to the old way (#17096)
- Why I did it
The current low power mode setting implementation requests the user to set the port to admin down first before toggling LP mode, this is not backward compatible, now revert it to the old way so that the user can toggle the LP mode regardless of the port admin status.

- How I did it
Revert the recent changes related to LPM in PR #14130 and #16545

- How to verify it
Run all sfputil and SFP platform API related tests on all the Mellanox platforms.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-11-27 13:23:04 +02:00
Vivek
787dd7221d [Mellanox] Upgrade HW-MGMT to 7.0030.2008 and update platform-api (#17134)
Why I did it
Add platform support for Debian 12 (Bookworm) on Mellanox Platform

How I did it
Update hw-management to v7.0030.2008
Deprecate the sfp_count == module_count approach in favour of asic init completion
Ref: Mellanox/hw-mgmt@bf4f593
Add xxd package to base image which is required by hw-management scripts
Add the non-upstream flag into linux kernel cache options
Update the thermalctl logic based on new sysfs attributes
Fix the integrate-mlnx-hw-mgmt script to not populate the arm64 Kconfig
How to verify it
Build kernel and run platform tests

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
Co-authored-by: Junchao-Mellanox <junchao@nvidia.com>
Co-authored-by: Junchao-Mellanox <57339448+Junchao-Mellanox@users.noreply.github.com>
2023-11-21 18:53:15 -08:00
Stephen Sun
b93852d53d
[Mellanox] Support running hw-management service on MSN4700 emulation platform (#16584)
- Why I did it
Support running hw-management service on MSN4700 emulation platform.

- How I did it
Use physical EEPROM instead of the fake one
Do not skip PSUd, PCId, thermal control daemon
Adjust PCIe and thermal configuration files
Adjust platform.json for different chassis names and thermals
Remove a patch to hw-management in order to enable it

- How to verify it
Run Nvidia simulation on SN4700 (ASIC and Platform)

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-11-19 11:03:46 +02:00
Junchao-Mellanox
c2edc6f9d5
Revert "[Mellanox] Align PSU temperature sysfs node name with hw-management change (#16820)" (#16956)
This reverts commit 0846322e9a.
2023-10-23 11:55:27 +03:00
Junchao-Mellanox
0846322e9a
[Mellanox] Align PSU temperature sysfs node name with hw-management change (#16820)
- Why I did it
hw-management renamed PSU temperature related sysfs:

psu1_temp -> psu1_temp1
psu2_temp -> psu2_temp1
psu1_temp_max -> psu1_temp1_max
psu2_temp_max -> psu2_temp1_max
This PR is to align the change in SONiC.

- How I did it
Use new sysfs node for PSU temperature and PSU temperature threshold

- How to verify it
Manual test
sonic-mgmt Regression test
2023-10-10 19:21:27 +03:00
Junchao-Mellanox
aedffd333b
[Mellanox] wait reset cause ready (#16722)
Why I did it
SONiC service determine-reboot-cause might run before driver creating reset cause files. In that case, the reset cause will be "Unknown". This PR introduces a wait mechanism to wait for reset cause sysfs files ready.

How I did it
/run/hw-management/config/reset_attr_ready is the file to indicate all reset cause files are ready. In chassis.get_reboot_cause function, it waits /run/hw-management/config/reset_attr_ready for up to 45 seconds.

How to verify it
Manual test on master/202211/202205
2023-10-03 18:58:31 -07:00
Vivek
456a90e1ab
[Nvidia] Remove the dependency on python_sdk_api for sfp api (#16545)
Sfp api can now be called from the host which doesn't have the python_sdk_api installed. Also, sfp api has been migrated to use sysfs instead of sdk handle.

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2023-09-23 00:19:27 -07:00
Junchao-Mellanox
5138afe4e7
[Mellanox] add new platform 2700 a1 (#16515)
- new pcie.yaml
- new sensors.conf
- new thermal support
- new platform.json file
- adjust test code
2023-09-23 00:15:17 -07:00
Kebo Liu
e286869b24
[Mellanox] Update HW-MGMT package to new version V.7.0030.1011 (#16239)
- Why I did it
1. Update Mellanox HW-MGMT package to newer version V.7.0030.1011
2. Replace the SONiC PMON Thermal control algorithm with the one inside the HW-MGMT package on all Nvidia platforms
3. Support Spectrum-4 systems

- How I did it
1. Update the HW-MGMT package version number and submodule pointer
2. Remove the thermal control algorithm implementation from Mellanox platform API
3. Revise the patch to HW-MGMT package which will disable HW-MGMT from running on SIMX
4. Update the downstream kernel patch list

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-09-06 11:32:08 +03:00
Vadym Hlushko
9e3fdded69
[Mellanox][SFP] Remove unused function parameter (#16318)
Why I did it
To avoid errors when the sfputil show error-status -hw is called from the host OS (not from the pmon docker).

How I did it
Remove the self.sdk_handle parameter from the _get_module_info() function.

How to verify it
Execute the sfputil show error-status -hw

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
2023-09-01 23:06:04 -07:00
Junchao-Mellanox
95f317a5e2
[Mellanox] Fix issue: watchdogutil command does not work (#16091)
- Why I did it
watchdogutil uses platform API watchdog instance to control/query watchdog status. In Nvidia watchdog status, it caches "armed" status in a object member "WatchdogImplBase.armed". This is not working for CLI infrastructure because each CLI will create a new watchdog instance, the status cached in previous instance will totally lose. Consider following commands:

admin@sonic:~$ sudo watchdogutil arm -s 100      =====> watchdog instance1, armed=True
Watchdog armed for 100 seconds
admin@sonic:~$ sudo watchdogutil status             ======> watchdog instance2, armed=False
Status: Unarmed
admin@sonic:~$ sudo watchdogutil disarm            =======> watchdog instance3, armed=False
Failed to disarm Watchdog

- How I did it
Use sysfs to query watchdog status

- How to verify it
Manual test
Unit test
2023-08-23 09:30:58 +03:00
Vadym Hlushko
b214a8a8b6
[Mellanox] Change SDK API sx_mgmt_phy_module_info_get() to sysfs (#15963)
- Why I did it
Change Mellanox platform API implementation to use ASIC driver sysfs for the module operational state and status error fields.

- How I did it
Modify the platform/mellanox/mlnx-platform-api/sonic_platform/sfp.py file by change the call of sx_mgmt_phy_module_info_get() SDK API to sysfs

- How to verify it
Simulate the unplug cable event
Check the CLI output
sfputil show presence
sfputil show error-status -hw
Simulate the plug cable event
Repeat 2 step

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
2023-08-21 20:54:13 +03:00
Junchao-Mellanox
91f3da018e
[Mellanox] Add more unit test coverage for platform API (#15842)
- Why I did it
Increase UT coverage for Nvidia platform API code

Work item tracking
Microsoft ADO (number only):

- How I did it
Focus on low coverage file:
1. component.py
2. watchdog.py
3. pcie.py

- How to verify it
Run the unit test, the coverage has been changed from 70% to 90%
2023-08-03 13:54:31 +03:00
Junchao-Mellanox
ed21266ff4
[Mellanox] Remove reset_from_comex from reboot cause mapping (#15793)
- Why I did it
The reset cause "reset_from_comex" has been removed by hw-management, hence removing it from platform API code

- How I did it
Remove reset_from_comex from reboot cause mapping

- How to verify it
Manual test
2023-07-15 01:02:46 +03:00
DavidZagury
b06a856fba
[Mellanox] Add support for BIOS update on Spectrum-4 (#15795)
- Why I did it
BIOS on new generation switch can come with a file type of cap or cab. Needs to add support to these file type.
Also ONIE version on new devices can have a suffix of 'dev'.

- How I did it
Added cap & cab as possible component extensions for ComponentBIOS.
Update the ONIE version regex to include dev signed versions.

- How to verify it
Update BIOS.
2023-07-15 00:59:55 +03:00
Stephen Sun
238e6ffcc1
[Mellanox] Adjust warning threshold implementation according to the latest algorithm update (#15092)
- Why I did it
Adjust the warning threshold implementation according to the latest algorithm update

- How I did it
Modify power warning and critical thresholds methods

- How to verify it
Unit test updated to cover the change

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-06-13 15:14:10 +03:00
Junchao-Mellanox
18cf719d6a
[Mellanox] Use sysfs for sfp reset/LPM/presence (#14130)
- Why I did it
The current implementation of SFP reset, LPM, present relies on SDK API. This PR moves the implementation to SDK sysfs. By this PR, it gains following benefit:
1. SDK sysfs provides better performance.
2. Host side and container side share the same code.
3. Code is much cleaner.

- How I did it
Use SDK sysfs to implement SFP reset, LPM, present.

- How to verify it
1. Manual test.
2. Unit test.
2023-05-24 17:24:34 +03:00
daxia16
1175143af1
[Mellanox] Support UID LED in platform API (#11592)
- Why I did it
As a LED indicator to help user to find switch location in the lab, UID LED is a useful LED in Mellanox switch.

- How I did it
I add a new member _led_uid in Mellanox/Chassis.py, and extend Mellanox/led.py to support blue color.
Relevant platform-common PR sonic-net/sonic-platform-common#369

- How to verify it
Add unit test cases in test.py, and do manual test including turn-on/off/show uid led.

Signed-off-by: David Xia <daxia@nvidia.com>
2023-05-16 08:24:39 +03:00
Junchao-Mellanox
7962a5c0fa
[Mellanox] add PSU fan direction support (#14508)
- Why I did it
Add PSU fan direction support

- How I did it
Implement fan.get_direction for PSU fan

- How to verify it
Manual test
Unit test
2023-05-15 21:34:54 +03:00
Junchao-Mellanox
9deca05f9d
[Mellanox] get LED capability from capability file (#14584)
- Why I did it
Currently, LED sysfs path is hardcoded. We will need change LED code if new LED color is supported for new platforms. This PR is aimed to improve this. By this PR, LED sysfs path is deduced from LED capability file.

- How I did it
Improve LED management on Nvidia platform:
get LED capability from capability file and deduce sysfs name according to the capability

- How to verify it
Unit test
Manual test
2023-05-10 20:53:50 +03:00
Sudharsan Dhamal Gopalarathnam
8d82a86134
[Mellanox]Fix lpmode set when logical port is larger than 64 (#14138)
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below

Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1

- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK

- How to verify it
Manual testing
2023-03-09 00:02:55 +02:00
Junchao-Mellanox
f6d3615bb9
[Mellanox] Check system eeprom existence in a retry manner (#13884)
- Why I did it
On Mellanox platform, system EEPROM is a soft link provided by hw-management. There is chance that config-setup service accessing the EEPROM before hw-management creating it. It causes errors. The PR is aim to fix it.

- How I did it
Waiting EEPROM creation in platform API up to 10 seconds.

- How to verify it
Manual test
2023-02-21 19:40:16 +02:00
Junchao-Mellanox
331b97e2aa
[Mellanox] Fix issue: cannot find label port for logical port when logical port number is larger than 64 (#13710)
- Why I did it
sfp_event.py gets a PMPE message when a cable event is available. In PMPE message, there is no label port available. Current sfp_event.py is using sx_api_port_device_get to get 64 logical ports attributes, and find the label port from those 64 attributes. However, if there are more than 64 ports, sfp_event.py might not be able to find the label port and drop the PMPE message.

- How I did it
Don't use hardcoded 64, get logical port number instead.

- How to verify it
Manual test
2023-02-21 08:14:29 +02:00
Stephen Sun
71b5bb6f37
[Mellanox] Support per PSU slope value for PSU power threshold (#13757)
- Why I did it
Support per PSU slope value for PSU power threshold according to hardware team requirement

- How I did it
Pass the PSU number as a parameter when fetching the slope value of PSU.

- How to verify it
Running regression and manual test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-14 08:55:28 +02:00
Junchao-Mellanox
389b279ba9
[Mellanox] set select timeout to no more than 1 sec to make sure fast shutdown (#13611)
- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.

To fix issue #13591.

- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected

- How to verify it
Manual test
2023-02-14 08:26:25 +02:00
Kebo Liu
7873a9131d
[Mellanox] Skip the leftover hardware reboot cause in case of last boot is warm/fast reboot (#13246)
- Why I did it
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.

- How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.

- How to verify it
a. Manual test:
    - Perform a power loss
    - Perform a warm/fast reboot
    - Check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-01-11 16:50:46 +02:00
Vadym Hlushko
1a5889ade7
[SFP] Change logging severity when failed to read EEPROM (#13011)
- Why I did it
In order to prevent the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py test failing on the log analyzer step.

The mentioned test is performing the sfputil reset EthernetX for every interface on the SONiC switch, this action will flap the SFP device status (INSTERTED -> REMOVED -> INSTERTED).

The SONiC XCVRD daemon will catch this SFP device status change (because it is monitoring the presence status of the cable).
To judge the cable presence status, currently, we are still leveraging to read the first bytes of the EEPROM, and the EEPROM could be not ready at some moment and the SONiC XCVRD daemon will print the error log to Syslog:

ERR pmon#xcvrd: Error! Unable to read data for 'xx' port, page 'xx' offset 128, rc = 1, err msg: Sending access register

- How I did it
Change logging severity from ERR to WARNING

- How to verify it
Run the sonic-mgmt/tests/platform_tests/sfp/test_sfputil.py

OR much faster way to run the next script on the switch:

#!/bin/bash

START=0
END=248

for (( intf=$START; intf<=$END; intf+=8))
do
    sfputil reset Ethernet"${intf}"
done

sfputil show presence
2022-12-20 10:05:45 +02:00
Kebo Liu
d6ee7f08c2
[Mellanox] change the implementation of is_host() to fix a stuck issue on simx platform (#13100)
- Why I did it
Following code to judge whether a process is running inside a docker could get stuck on the simx platform

subprocess.Popen(["docker", "--version"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
When it gets stuck, the config-chassisdb service can not be successfully started, thus the system can not be booted up.

root@sonic:/# service config-chassisdb status
     config-chassisdb.service - Config chassis_db
     Loaded: loaded (/lib/systemd/system/config-chassisdb.service; enabled; vendor preset: enabled)
     Active: activating (start) since Thu 2022-12-15 09:23:02 UTC; 29min ago
   Main PID: 571 (config-chassisd)
      Tasks: 14 (limit: 9501)
     Memory: 132.4M
     CGroup: /system.slice/config-chassisdb.service
                        ├─571 /bin/bash /usr/bin/config-chassisdb
			├─575 /usr/bin/python3 /usr/local/bin/sonic-cfggen -H -v DEVICE_METADATA.localhost.platform
			├─602 /bin/sh -c sudo decode-syseeprom -m
			├─603 sudo decode-syseeprom -m
			├─607 /usr/bin/python3 /usr/local/bin/decode-syseeprom -m
			├─616 /bin/sh -c docker --version 2>/dev/null
			└─617 docker --version

- How I did it
Use an alternative way to implement this function and issue can be avoided:

docker_env_file = '/.dockerenv'
return os.path.exists(docker_env_file) is False

- How to verify it
run regression on real hardware and simx platform.
2022-12-20 10:00:11 +02:00
Junchao-Mellanox
9590339d69
[Mellanox] Remove TODO comments which are no longer needed (#13023)
- Why I did it
Remove TODO comments which are no longer needed

- How I did it
Remove TODO comments which are no longer needed

- How to verify it
Only comment change
2022-12-14 09:57:48 +02:00
Stephen Sun
5d457596ba
[Mellanox] Support PSU power threshold checking (#11863)
* Support power threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* get_psu_power_warning_threshold => get_psu_power_warning_suppress_threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Fix comments

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-11-21 14:47:43 -08:00
Junchao-Mellanox
20d885dbc2
[Mellanox] Add new thermal sensors for SN5600 (#12671)
- Why I did it
Add new thermal sensors for SN5600

- How I did it
Add new thermal sensors for SN5600: PCH and SODIMM

- How to verify it
Manual test
2022-11-14 11:10:33 -08:00