Commit Graph

69 Commits

Author SHA1 Message Date
dbarashinvd
ac35a0fafb add unit tests for CMIS host management feature (#18211)
* add unit tests for CMIS host management feature
2024-03-21 07:00:43 +08:00
Junchao-Mellanox
8d65e2c517 [Mellanox] Fix issues found for CMIS host management (#17637)
- Why I did it
1. Thermal updater should wait more time for module to be initialized
2. sfp should get temperature threshold from EEPROM because SDK sysfs is not yet supported
3. Rename sfp function to fix typo
4. sfp.get_presence should return False if module is under initialization

- How I did it
1. Thermal updater should wait more time for module to be initialized
2. sfp should get temperature threshold from EEPROM because SDK sysfs is not yet supported
3. Rename sfp function to fix typo
4. sfp.get_presence should return False if module is under initialization

- How to verify it
Manual test
Unit test
2024-01-20 06:32:58 +08:00
Junchao-Mellanox
0fbdc2b8ed [Mellanox] wait until hw-management watchdog files ready (#17618)
- Why I did it
The watchdog-control service always disarms the watchdog during the system startup stage. The watchdog may not be fully initialized when the watchdog-control service accesses it. This PR adds a wait to make sure the watchdog has been fully initialized.

- How I did it
Add a wait to make sure the watchdog has been fully initialized.

- How to verify it
Manual test
sonic regression
2024-01-19 04:32:53 +08:00
Junchao-Mellanox
56ba5b10b4 [Mellanox] implement sfp.reset for CMIS management (#16862)
- Why I did it
For CMIS host management modules, we need a different implementation of sfp.reset. This PR implements it.

- How I did it
For SW control modules, do reset from hw_reset
For FW control modules, do reset as the original way

- How to verify it
Manual test
sonic-mgmt platform test
2024-01-17 06:33:11 +08:00
Kebo Liu
cacf46ff86
[202311][Mellanox] Integrate HW-MGMT Version 7.0030.2008 (#17659)
* Integrate HW-MGMT 7.0030.2008 Changes

 ## Patch List
* 0285-UBUNTU-SAUCE-mlxbf-gige-Fix-intermittent-no-ip-issue.patch :
* 0286-pinctrl-Introduce-struct-pinfunction-and-PINCTRL_PIN.patch :
* 0287-pinctrl-mlxbf3-Add-pinctrl-driver-support.patch :
* 0288-UBUNTU-SAUCE-gpio-mmio-handle-ngpios-properly-in-bgp.patch :
* 0289-UBUNTU-SAUCE-gpio-mlxbf3-Add-gpio-driver-support.patch :
* 0291-mlxsw-core_hwmon-Align-modules-label-name-assignment.patch :
* 0292-mlxsw-i2c-Limit-single-transaction-buffer-size.patch :
* 0293-mlxsw-reg-Limit-MTBR-register-records-buffer-by-one-.patch :
* 0296-UBUNTU-SAUCE-mmc-sdhci-of-dwcmshc-Add-runtime-PM-ope.patch :
* 0298-UBUNTU-SAUCE-mlxbf-ptm-use-0444-instead-of-S_IRUGO.patch :
* 0299-UBUNTU-SAUCE-mlxbf-ptm-add-atx-debugfs-nodes.patch :
* 0300-UBUNTU-SAUCE-mlxbf-ptm-update-module-version.patch :
* 0301-UBUNTU-SAUCE-mlxbf-gige-Fix-kernel-panic-at-shutdown.patch :
* 0302-UBUNTU-SAUCE-mlxbf-bootctl-support-SMC-call-for-sett.patch :
* 0303-UBUNTU-SAUCE-Add-BF3-related-ACPI-config-and-Ring-de.patch :
* 0306-dt-bindings-trivial-devices-Add-infineon-xdpe1a2g7.patch :
* 0307-leds-mlxreg-Add-support-for-new-flavour-of-capabilit.patch :
* 0308-leds-mlxreg-Remove-code-for-amber-LED-colour.patch :
* 0308-platform_data-mlxreg-Add-capability-bit-and-mask-fie.patch :
* 0309-hwmon-mlxreg-fan-Add-support-for-new-flavour-of-capa.patch :
* 0310-hwmon-mlxreg-fan-Extend-number-of-supporetd-fans.patch :
* 0317-platform-mellanox-Introduce-support-for-switches-equ.patch :
* 0318-mellanox-Relocate-mlx-platform-driver.patch :
* 0319-UBUNTU-SAUCE-mlxbf-tmfifo-fix-potential-race.patch :
* 0320-UBUNTU-SAUCE-mlxbf-tmfifo-Drop-the-Rx-packet-if-no-m.patch :
* 0321-UBUNTU-SAUCE-mlxbf-tmfifo-Drop-jumbo-frames.patch :
* 0322-UBUNTU-SAUCE-mlxbf-tmfifo.c-Amend-previous-tmfifo-pa.patch :
* 0323-mlxbf_gige-add-set_link_ksettings-ethtool-callback.patch :
* 0324-mlxbf_gige-fix-white-space-in-mlxbf_gige_eth_ioctl.patch :
* 0325-UBUNTU-SAUCE-mlxbf-bootctl-Fix-kernel-panic-due-to-b.patch :
* 0326-platform-mellanox-mlxreg-hotplug-Add-support-for-new.patch :
* 0327-platform-mellanox-mlx-platform-Change-register-name.patch :
* 0328-platform-mellanox-mlx-platform-Add-support-for-new-X.patch :

* [Mellanox] Don't populate arm64 Kconfig when integrating hw-mgmt

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>

* [Mellanox] Remove thermal zone related code and replace with new one

* Revert "Revert "[Mellanox] Align PSU temperature sysfs node name with hw-management change (#16820)" (#16956)"

This reverts commit c2edc6f9d5.

* Update copyright header

Signed-off-by: Kebo Liu <kebol@nvidia.com>

---------

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Co-authored-by: Vivek Reddy <vkarri@nvidia.com>
Co-authored-by: Junchao-Mellanox <junchao@nvidia.com>
Co-authored-by: Junchao-Mellanox <57339448+Junchao-Mellanox@users.noreply.github.com>
2024-01-16 08:33:50 -08:00
Junchao-Mellanox
0b511986ae
[202311][Mellanox] implement platform wait in python code (#17398) (#17719)
- Why I did it
New implementation of Nvidia platform_wait due to:
1. sysfs deprecated by hw-mgmt
2. new dependencies to SDK
3. For CMIS host management mode

- How I did it
wait hw-management ready
wait SDK sysfs nodes ready

- How to verify it
manual test
unit test
sonic-mgmt regression
2024-01-16 08:31:33 -08:00
Junchao-Mellanox
767944d7da [Mellanox] Fix race condition while creating SFP (#17441)
- Why I did it
Fix an issue where xcvrd crashes because it cannot import the name 'initialize_sfp_thermal':

Nov 27 09:47:16.388639 sonic ERR pmon#xcvrd: Exception occured at CmisManagerTask thread due to ImportError("cannot import name 'initialize_sfp_thermal' from partially initialized module 'sonic_platform.thermal' (most likely due to a circular import) (/usr/local/lib/python3.9/dist-packages/sonic_platform/thermal.py)")

- How I did it
Add lock for creating SFP object
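
The lock-based fix can be illustrated with a minimal sketch. This is not the actual sonic_platform code: `SfpRegistry`, `get_sfp`, and the `factory` argument are hypothetical names chosen for the example, and the real change may place the lock differently.

```python
import threading


class SfpRegistry:
    """Hypothetical container that creates each SFP object at most once."""

    def __init__(self, factory):
        self._factory = factory        # callable that builds an SFP object for a port index
        self._sfps = {}                # port index -> created object
        self._lock = threading.Lock()  # serializes creation across threads

    def get_sfp(self, index):
        # Hold the lock while checking and creating, so two threads cannot
        # both run the factory (and the imports it triggers) for the same port.
        with self._lock:
            if index not in self._sfps:
                self._sfps[index] = self._factory(index)
            return self._sfps[index]
```

With this shape, concurrent callers such as a CMIS manager thread and a change-event thread would all receive the same object, and the creation code runs only once per port.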

- How to verify it
Unit test
Manual Test
2024-01-09 14:34:47 +08:00
Junchao-Mellanox
8de7cb5988
[202311] [Mellanox] update asic and module temperature in a thread for CMIS management (#16955) (#17699)
- Why I did it
When a module is fully under software control, the driver cannot get the module temperature or temperature threshold from firmware. In this case, SONiC needs to get them from EEPROM. This PR creates a thermal updater thread that updates the module temperature and temperature threshold while software control is enabled.

- How I did it
Query ASIC temperature from SDK sysfs and update hw-management-tc periodically
Query Module temperature from EEPROM and update hw-management-tc periodically
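
A periodic updater thread of this kind could look roughly like the sketch below. `PeriodicUpdater` and its `update` callback are hypothetical; the real thermal updater reads SDK sysfs and EEPROM and writes to hw-management-tc, none of which is modeled here.

```python
import threading


class PeriodicUpdater:
    """Hypothetical helper that calls `update` every `interval` seconds in a thread."""

    def __init__(self, update, interval):
        self._update = update
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout (keep looping) and True once
        # stop() has been called, which ends the loop promptly.
        while not self._stop.wait(self._interval):
            self._update()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Using an `Event` instead of `time.sleep` lets `stop()` interrupt the wait immediately rather than after a full interval.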

- How to verify it
Manual test
New Unit tests
2024-01-08 10:50:59 -08:00
mssonicbld
4060f5ce5b
[Mellanox] Remove EEPROM write limitation if it is software control (#17030) (#17694) 2024-01-07 13:16:25 +08:00
mssonicbld
fb7bad2d11
[Mellanox] Implement low power mode for cmis host management (#17159) (#17693) 2024-01-06 07:55:41 +08:00
Junchao-Mellanox
7368df7839
[Mellanox] Enable CMIS host management (#16846) (#17684)
- Why I did it
Enable CMIS host management for Mellanox devices which are expected to support the feature

- How I did it
Add a new thread in a new file and change the platform logic in chassis.py, which calls this thread from get_change_event().
The thread handles the state machine for each port.
First, static detection takes place once the thread is up (during the switch bootup sequence) and runs until the final decision is made on whether a module is under FW control or SW control.
After that, dynamic detection takes place: the thread listens for changes on the per-port sysfs fds
so that it can detect cable plug-in and plug-out events.

- How to verify it
Enhanced unit tests
run sonic mgmt on Nvidia SN4700 with CMIS host management enabled

Co-authored-by: dbarashinvd <105214075+dbarashinvd@users.noreply.github.com>
2024-01-05 12:07:30 -08:00
Junchao-Mellanox
6d43d2f636 [Mellanox] Provide default implementation for sfp error description when CMIS host management is enabled (#17294)
- Why I did it
Provide a dummy implementation for SFP error description when CMIS host management is enabled. A future feature shall be raised to implement SFP error description for such mode.

- How I did it
if SFP is under software control, provide "Not supported" as error description
if SFP is under initialization, provide "Initializing" as error description
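
As a rough sketch of the two rules above (the function name, its parameters, and the order in which the two states are checked are assumptions, not taken from the PR):

```python
def get_error_description(under_software_control, initializing, firmware_description):
    """Hypothetical sketch: pick the error description string for an SFP."""
    # Assumption: "initializing" is checked first, because the control mode
    # may not be decided yet while the module is still being initialized.
    if initializing:
        return "Initializing"
    if under_software_control:
        return "Not supported"
    # Firmware-controlled modules keep whatever the existing path reported.
    return firmware_description
```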

- How to verify it
unit test
2024-01-04 10:38:38 +08:00
Kebo Liu
f96742fb98 [Mellanox] Revert LPM implementation to the old way (#17096)
- Why I did it
The current low power mode implementation requires the user to set the port to admin down before toggling LP mode, which is not backward compatible. Revert it to the old way so that the user can toggle LP mode regardless of the port admin status.

- How I did it
Revert the recent changes related to LPM in PR #14130 and #16545

- How to verify it
Run all sfputil and SFP platform API related tests on all the Mellanox platforms.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-12-04 22:14:02 +00:00
Junchao-Mellanox
aedffd333b
[Mellanox] wait reset cause ready (#16722)
Why I did it
The SONiC service determine-reboot-cause might run before the driver creates the reset cause files. In that case, the reset cause will be "Unknown". This PR introduces a mechanism that waits for the reset cause sysfs files to be ready.

How I did it
/run/hw-management/config/reset_attr_ready is the file that indicates all reset cause files are ready. The chassis.get_reboot_cause function waits for /run/hw-management/config/reset_attr_ready for up to 45 seconds.
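
A wait of this kind is typically a polling loop with a deadline. The sketch below is illustrative only: `wait_for_file` and the one-second polling interval are assumptions, while the 45-second default mirrors the timeout stated above.

```python
import os
import time


def wait_for_file(path, timeout=45.0, interval=1.0):
    """Return True once `path` exists, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could then fall back to an unknown reboot cause only when `wait_for_file("/run/hw-management/config/reset_attr_ready")` returns False.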

How to verify it
Manual test on master/202211/202205
2023-10-03 18:58:31 -07:00
Vivek
456a90e1ab
[Nvidia] Remove the dependency on python_sdk_api for sfp api (#16545)
Sfp api can now be called from the host which doesn't have the python_sdk_api installed. Also, sfp api has been migrated to use sysfs instead of sdk handle.

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2023-09-23 00:19:27 -07:00
Kebo Liu
e286869b24
[Mellanox] Update HW-MGMT package to new version V.7.0030.1011 (#16239)
- Why I did it
1. Update Mellanox HW-MGMT package to newer version V.7.0030.1011
2. Replace the SONiC PMON Thermal control algorithm with the one inside the HW-MGMT package on all Nvidia platforms
3. Support Spectrum-4 systems

- How I did it
1. Update the HW-MGMT package version number and submodule pointer
2. Remove the thermal control algorithm implementation from Mellanox platform API
3. Revise the patch to HW-MGMT package which will disable HW-MGMT from running on SIMX
4. Update the downstream kernel patch list

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-09-06 11:32:08 +03:00
Junchao-Mellanox
95f317a5e2
[Mellanox] Fix issue: watchdogutil command does not work (#16091)
- Why I did it
watchdogutil uses the platform API watchdog instance to control and query watchdog status. The Nvidia watchdog implementation caches the "armed" status in an object member, "WatchdogImplBase.armed". This does not work for the CLI infrastructure because each CLI invocation creates a new watchdog instance, so the status cached in the previous instance is lost. Consider the following commands:

admin@sonic:~$ sudo watchdogutil arm -s 100      =====> watchdog instance1, armed=True
Watchdog armed for 100 seconds
admin@sonic:~$ sudo watchdogutil status             ======> watchdog instance2, armed=False
Status: Unarmed
admin@sonic:~$ sudo watchdogutil disarm            =======> watchdog instance3, armed=False
Failed to disarm Watchdog

- How I did it
Use sysfs to query watchdog status
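
The difference from the cached approach can be sketched as follows. The `Watchdog` class and the choice of a state file containing "active" or "inactive" are assumptions for illustration (the Linux watchdog sysfs interface exposes such a `state` attribute, but the PR may read a different node).

```python
class Watchdog:
    """Hypothetical sketch: armed state is read from a file, not cached on the object."""

    def __init__(self, state_path):
        self._state_path = state_path  # assumed file holding "active" or "inactive"

    def is_armed(self):
        # Every call re-reads the file, so a fresh instance created by a new
        # CLI invocation sees the same state as the one that armed the watchdog.
        try:
            with open(self._state_path) as f:
                return f.read().strip() == "active"
        except OSError:
            return False
```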

- How to verify it
Manual test
Unit test
2023-08-23 09:30:58 +03:00
Junchao-Mellanox
91f3da018e
[Mellanox] Add more unit test coverage for platform API (#15842)
- Why I did it
Increase UT coverage for Nvidia platform API code

- How I did it
Focus on low coverage file:
1. component.py
2. watchdog.py
3. pcie.py

- How to verify it
Run the unit tests; the coverage has increased from 70% to 90%.
2023-08-03 13:54:31 +03:00
Stephen Sun
238e6ffcc1
[Mellanox] Adjust warning threshold implementation according to the latest algorithm update (#15092)
- Why I did it
Adjust the warning threshold implementation according to the latest algorithm update

- How I did it
Modify power warning and critical thresholds methods

- How to verify it
Unit test updated to cover the change

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-06-13 15:14:10 +03:00
Junchao-Mellanox
18cf719d6a
[Mellanox] Use sysfs for sfp reset/LPM/presence (#14130)
- Why I did it
The current implementation of SFP reset, LPM, and presence relies on the SDK API. This PR moves the implementation to SDK sysfs, which brings the following benefits:
1. SDK sysfs provides better performance.
2. Host side and container side share the same code.
3. Code is much cleaner.

- How I did it
Use SDK sysfs to implement SFP reset, LPM, and presence.

- How to verify it
1. Manual test.
2. Unit test.
2023-05-24 17:24:34 +03:00
daxia16
1175143af1
[Mellanox] Support UID LED in platform API (#11592)
- Why I did it
The UID LED is an indicator that helps users locate a switch in the lab, which makes it a useful LED on Mellanox switches.

- How I did it
Add a new member _led_uid in Mellanox/Chassis.py, and extend Mellanox/led.py to support the blue color.
Relevant platform-common PR sonic-net/sonic-platform-common#369

- How to verify it
Add unit test cases in test.py, and run manual tests including turn-on/off/show of the UID LED.

Signed-off-by: David Xia <daxia@nvidia.com>
2023-05-16 08:24:39 +03:00
Junchao-Mellanox
7962a5c0fa
[Mellanox] add PSU fan direction support (#14508)
- Why I did it
Add PSU fan direction support

- How I did it
Implement fan.get_direction for PSU fan

- How to verify it
Manual test
Unit test
2023-05-15 21:34:54 +03:00
Junchao-Mellanox
9deca05f9d
[Mellanox] get LED capability from capability file (#14584)
- Why I did it
Currently, the LED sysfs path is hardcoded, so the LED code must change whenever a new LED color is supported on a new platform. This PR improves this by deducing the LED sysfs path from the LED capability file.

- How I did it
Improve LED management on Nvidia platform:
get LED capability from capability file and deduce sysfs name according to the capability

- How to verify it
Unit test
Manual test
2023-05-10 20:53:50 +03:00
Junchao-Mellanox
f6d3615bb9
[Mellanox] Check system eeprom existence in a retry manner (#13884)
- Why I did it
On Mellanox platforms, the system EEPROM is a soft link provided by hw-management. There is a chance that the config-setup service accesses the EEPROM before hw-management creates it, which causes errors. This PR aims to fix that.

- How I did it
Wait for EEPROM creation in the platform API for up to 10 seconds.

- How to verify it
Manual test
2023-02-21 19:40:16 +02:00
Stephen Sun
71b5bb6f37
[Mellanox] Support per PSU slope value for PSU power threshold (#13757)
- Why I did it
Support per PSU slope value for PSU power threshold according to hardware team requirement

- How I did it
Pass the PSU number as a parameter when fetching the slope value of PSU.

- How to verify it
Running regression and manual test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-14 08:55:28 +02:00
Kebo Liu
7873a9131d
[Mellanox] Skip the leftover hardware reboot cause in case of last boot is warm/fast reboot (#13246)
- Why I did it
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because the CPLD is not touched in this flow. To avoid confusing the reboot cause determination logic, the platform API skips the leftover hardware reboot cause and returns 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.

- How I did it
Check the proc cmdline to see whether the last reboot was a warm or fast reboot; if so, skip checking the leftover hardware reboot cause.
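
A check like this can be sketched as a small parser over the kernel command line text. The key name `SONIC_BOOT_TYPE` and the values `warm` and `fast` are assumptions used for illustration; the PR does not state which cmdline tokens it looks for.

```python
def is_warm_or_fast_boot(cmdline, key="SONIC_BOOT_TYPE", values=("warm", "fast")):
    """Return True if `cmdline` contains `key=<value>` with a value in `values`.

    The default key and values are assumptions, not taken from the PR.
    """
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == key and value in values:
            return True
    return False
```

In practice the text would come from reading `/proc/cmdline`, and a True result would make the platform API return the non-hardware reboot cause without consulting the leftover hardware registers.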

- How to verify it
a. Manual test:
    - Perform a power loss
    - Perform a warm/fast reboot
    - Check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-01-11 16:50:46 +02:00
Stephen Sun
5d457596ba
[Mellanox] Support PSU power threshold checking (#11863)
* Support power threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* get_psu_power_warning_threshold => get_psu_power_warning_suppress_threshold

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Fix comments

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-11-21 14:47:43 -08:00
Junchao-Mellanox
20d885dbc2
[Mellanox] Add new thermal sensors for SN5600 (#12671)
- Why I did it
Add new thermal sensors for SN5600

- How I did it
Add new thermal sensors for SN5600: PCH and SODIMM

- How to verify it
Manual test
2022-11-14 11:10:33 -08:00
Kebo Liu
c8c2b7fc45
[Mellanox] [Platform API] Update SN2201 dynamic minimum fan speed table (#12602)
- Why I did it
Update SN2201 dynamic minimum fan speed table according to data provided by the thermal team.

- How I did it
Update the thermal table in device_data.py

- How to verify it
Run platform related regression

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-11-08 13:37:10 +02:00
Junchao-Mellanox
830b7d8cb4
[Mellanox] Use sdk sysfs instead of ethtool (#12480) 2022-11-03 11:17:44 -07:00
Mai Bui
648ca075c7
[device/mellanox] Mitigation for security vulnerability (#11877)
Signed-off-by: maipbui <maibui@microsoft.com>
Dependency: [PR (#12065)](https://github.com/sonic-net/sonic-buildimage/pull/12065) needs to merge first.
#### Why I did it
`subprocess.Popen()` and `subprocess.check_output()` are used with `shell=True`, which is vulnerable to shell injection.
#### How I did it
Replace `shell=True` with `shell=False`
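
A minimal sketch of the safer calling style, assuming a hypothetical wrapper named `run_command` (the actual call sites in the platform code are not shown here):

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments, without involving a shell."""
    # With shell=False (the default), each list element is passed to the
    # program verbatim, so shell metacharacters are never interpreted.
    return subprocess.check_output(args, shell=False, text=True)


# The argument below contains "; echo injected", which a shell would treat as
# a second command. Here it reaches the child process as one literal string.
output = run_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "a; echo injected"]
)
```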
#### How to verify it
Tested on DUT, compare and verify the output between the original behavior and the new changes' behavior.
[testresults.zip](https://github.com/sonic-net/sonic-buildimage/files/9550867/testresults.zip)
2022-10-06 17:51:31 -04:00
Dror Prital
44356fa8d7
[Mellanox] Add NVIDIA copyright header for NVIDIA added files (#12130)
- Why I did it
Add NVIDIA Copyright header for new "NVIDIA" files

- How I did it
Add the copyright header as remark at the head of the file
2022-10-02 11:34:24 +03:00
Junchao-Mellanox
1d69f0916e
[Mellanox] Provide dummy implementation for get_rx_los and get_tx_fault (#12231)
- Why I did it
get_rx_los and get_tx_fault are not supported via the existing interface, so a dummy implementation is needed for them.
NOTE: in later releases we will get them back via different interface.

- How I did it
Return False * lane_num for get_rx_los and get_tx_fault
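
Reading "False * lane_num" as a list with one False per lane (an assumption about the intent, since the exact code is not quoted), the dummy return value can be sketched like this, with `dummy_lane_status` as a hypothetical helper name:

```python
def dummy_lane_status(lane_num):
    """Return one False per lane, e.g. [False, False, False, False] for 4 lanes."""
    # Note the brackets: `False * lane_num` alone would evaluate to the integer 0.
    return [False] * lane_num
```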

- How to verify it
Added unit test
2022-09-30 09:38:05 +03:00
Junchao-Mellanox
46ebd06403
[Mellanox] Fix issue: set lpmode by platform API does not work (#11732)
- Why I did it
Fix issue: set lpmode by platform API does not work

- How I did it
Fix a missing return value in the code

- How to verify it
Manual test
2022-08-18 13:07:38 +03:00
orfarfara
aec1248258
[Mellanox] add PSU input voltage and current (#11510)
- Why I did it
Add PSU input voltage and input current to mlnx platform api.

- How I did it
Implement two functions for getting the PSU input voltage and PSU input current:
Get the values from "power/psu{}_curr_in" , "power/psu{}_volt_in"

- How to verify it
Manual test.
Run sonic-mgmt regression

Signed-off-by: orfar1994 <orfar1994@gmail.com>
2022-08-10 18:10:55 +03:00
Stephen Sun
8282d427e4
Fix chassis test issue (#11460)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-07-16 19:34:45 -07:00
Stephen Sun
81600fafe9
[Mellanox] Support new platform API get_port_or_cage_type for RJ45 ports (#11336)
- Why I did it
Support get_port_or_cage_type for RJ45 ports

- How I did it
Implement the new platform API get_port_or_cage_type
Fix the issue: unable to import SFP when chassis object is destructed

- How to verify it
Manually test and regression test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-07-14 12:20:16 +03:00
Junchao-Mellanox
2863945f7c
[Mellanox] Fix issue: failed to decode Json while there is no hwsku.json (#11436)
- Why I did it
Fix bug: pmon report error on start up because some SKUs do not have hwsku.json

- How I did it
If hwsku.json does not exist, do not extract RJ45 port information

- How to verify it
Manual test.
Unit test.
2022-07-14 09:24:39 +03:00
Sudharsan Dhamal Gopalarathnam
23d68883f5
[Mellanox]Check dmi file permission before access (#11309)
Signed-off-by: Sudharsan Dhamal Gopalarathnam sudharsand@nvidia.com

Why I did it
During system boot up, when the 'show platform status' or 'show version' command is executed before the STATE_DB CHASSIS_INFO table is populated, the show command falls back to the platform API. The DMI file on Mellanox platforms requires root permission for access. So if the show commands are executed as admin or any other user, the following error log appears in the syslog:

Jun 28 17:21:25.612123 sonic ERR show: Fail to decode DMI /sys/firmware/dmi/entries/2-0/raw due to PermissionError(13, 'Permission denied')

How I did it
Check the file permission before accessing it.
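
One way to express such a pre-check is with `os.access`. The helper name `read_if_permitted` is hypothetical, and the real code may check permissions differently.

```python
import os


def read_if_permitted(path):
    """Return the file's bytes, or None when the current user cannot read it."""
    # os.access is a pre-check, not a guarantee, so the open() below is
    # still guarded against errors.
    if not os.access(path, os.R_OK):
        return None
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        return None
```

Returning None quietly lets a non-root caller skip the DMI data instead of logging a PermissionError.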

How to verify it
Added UT to verify. Manually verified that the error log is no longer produced.
2022-07-01 17:29:07 -07:00
Kebo Liu
7ac590b5c5
[Mellanox] Enhance Platform API to support SN2201 - RJ45 ports and new components mgmt. (#10377)
* Support new platform SN2201 and RJ45 port

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* remove unused import and redundant function

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* fix error introduced by rebase

Signed-off-by: Kebo Liu <kebol@nvidia.com>

* Revert the special handling of RJ45 ports (#56)

* Revert the special handling of RJ45 ports

sfp.py
sfp_event.py
chassis.py

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove deadcode

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Support CPLD update for SN2201

A new class is introduced, deriving from ComponentCPLD and overloading _install_firmware
Change _install_firmware from private (starting with __) to protected, making it overloadable
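
The reason for the rename is Python's name mangling: a method whose name starts with two underscores is rewritten per class and cannot be overridden in the usual way. The sketch below uses hypothetical classes (`ComponentBase`, `ComponentChild`) and a hypothetical `__install` method to show the difference; it is not the actual component code.

```python
class ComponentBase:
    def __install(self, image):          # name-mangled to _ComponentBase__install
        return "base private: " + image

    def _install_firmware(self, image):  # single underscore: subclasses can override
        return "base: " + image

    def update(self, image):
        # self.__install is resolved to the base class version at compile
        # time, while _install_firmware is dispatched dynamically.
        return self.__install(image), self._install_firmware(image)


class ComponentChild(ComponentBase):
    def __install(self, image):          # becomes _ComponentChild__install; update() never calls it
        return "child private: " + image

    def _install_firmware(self, image):
        return "child: " + image
```

Calling `ComponentChild().update("x")` uses the base class `__install` but the child class `_install_firmware`, which is why the single-underscore form is needed for overloading.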

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Initialize component BIOS/CPLD

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove swb_amb which doesn't exist on DVT board any more

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Remove the nonexistent sensor - switch board ambient - from platform.json

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Do not report error on receiving unknown status on RJ45 ports

Translate it to disconnect for RJ45 ports
Report error for xSFP ports

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Add reinit for RJ45 to avoid exception

Signed-off-by: Stephen Sun <stephens@nvidia.com>

Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>
Co-authored-by: Stephen Sun <stephens@nvidia.com>
2022-06-20 19:12:20 -07:00
Junchao-Mellanox
af5e5c4c94
[Mellanox] Adjust PSU voltage WA (#10619)
- Why I did it
InvalidPsuVolWA.run might raise an exception if the user powers off the PSU while it is running. This exception is not caught and propagates to psud, which causes psud to fail to update PSU data to the DB.

- How I did it
1. Change the log level when the WA does not work. This can happen when the user powers off the PSU, so the log level is changed from error to warning
2. Change the wait time from 5 seconds to 1 second to avoid introducing too much delay in psud. 1 second is usually enough per my test
3. Give a default return value for the functions get_voltage_low_threshold and get_voltage_high_threshold to prevent exceptions from reaching psud

- How to verify it
Manual test.
Run sonic-mgmt regression
2022-04-22 11:02:30 +03:00
Junchao-Mellanox
0191300b96
[Mellanox] Auto correct PSU voltage threshold (WA) (#10394)
- Why I did it
A hardware bug causes the PSU voltage threshold sysfs to return an incorrect value. The workaround is to call "sensor -s" to refresh it.

- How I did it
Call "sensor -s" when the threshold value is incorrect and the PSU is "DELTA 1100"

- How to verify it
Unit test and Manual test
2022-04-14 08:14:40 +03:00
Andriy Yurkiv
1e2e493daa
[Mellanox] Credo Y-cable read_eeprom/write_eeprom API implementation (#10320)
- Why I did it
Implement read_eeprom/write_eeprom API for Credo Y-cable for Dual ToR Active-Standby

- How I did it
Use mlxreg utility for API implementation

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2022-03-30 20:41:31 +03:00
Dror Prital
8bc81206c5
[Mellanox] Update NVIDIA License header for files changed since 1.1.2022 (#10289)
- Why I did it
Update NVIDIA Copyright header to "mellanox" files which were changed since 1.1.2022

- How I did it
Update the copyright header

- How to verify it
Sanity tests and PR checkers.
2022-03-23 13:19:25 +02:00
Oleksandr Ivantsiv
9565ef7a9a
[Mellanox] Refactor SFP to use new APIs. (#10317)
- Why I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.

- How I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.

- How to verify it
Run sonic-mgmt/platform_tests/sfp tests
2022-03-23 09:16:03 +02:00
Junchao-Mellanox
f0ddd102d5
[Mellanox] Add CPU thermal control for Nvidia platforms (#10202)
Why I did it
Add CPU thermal control for Nvidia platforms, which will be enabled for platforms that have heavy CPU load. It is currently enabled only on 4800 and will be enabled on future platforms.

How I did it
Check CPU pack temperature and update cooling level accordingly

How to verify it
Manual test
Added sonic-mgmt test case, PR link will update later
2022-03-21 09:54:52 -07:00
Junchao-Mellanox
dc04d64219
[Mellanox] Fix issue: psu might use wrong voltage sysfs which causes invalid voltage value (#10231)
- Why I did it
Fix issue: psu might use wrong voltage sysfs which causes invalid voltage value. The flow is like:

1. User power off a PSU
2. All sysfs files related to this PSU are removed
3. User did a reboot/config reload
4. PSU will use wrong sysfs as voltage node

- How I did it
Always try to find an existing sysfs node.
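
Picking an existing node from a list of candidates can be sketched as below. The helper name `find_existing` is hypothetical, and the candidate paths are supplied by the caller rather than taken from the PR.

```python
import os


def find_existing(candidates):
    """Return the first path in `candidates` that exists, or None."""
    return next((p for p in candidates if os.path.exists(p)), None)
```

Evaluating this on each read, rather than once at object creation, avoids latching onto a path chosen while the PSU sysfs files were missing.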

- How to verify it
Manual test
2022-03-20 10:34:04 +02:00
Junchao-Mellanox
fe59e0f2c0
[Mellanox] Fix issue: thermal zone threshold value 0 causes fan speed stuck at 100% (#10057)
- Why I did it
The SONiC thermal control algorithm compares thermal zone temperature with the thermal zone threshold. Previously, a thermal zone with no thermal sensor could still get its threshold. However, a recent driver patch changes this behavior: a thermal zone with no thermal sensor now returns 0 for the threshold. We need to ignore such thermal zones.

- How I did it
Ignore thermal zones whose temperature is 0.

- How to verify it
Added unit test case and Manual test
2022-02-24 12:05:56 +02:00
Alexander Allen
8a07af95e5
[Mellanox] Modified Platform API to support all firmware updates in single boot (#9608)
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.

How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmwares in the correct order prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
2022-01-24 00:56:38 -08:00
Junchao-Mellanox
4ae504a813
[Mellanox] Optimize thermal control policies (#9452)
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.

- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed are coordinated
The minimum allowed fan speed is now calculated as the max of the expected fan speeds among all policies
Move some logic from fan.py to thermal.py to make it more readable
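
The minimum allowed fan speed rule in the list above amounts to taking a maximum over per-policy values. In this sketch, `minimum_allowed_speed` and the `floor` fallback for an empty input are assumptions for illustration.

```python
def minimum_allowed_speed(expected_speeds, floor=0):
    """Return the largest expected fan speed (in percent) among all policies.

    `floor` is a hypothetical fallback used when no policy reports a speed.
    """
    return max(expected_speeds, default=floor)
```

For example, if three policies expect 30, 60, and 40 percent, the fan may not run below 60 percent.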

- How to verify it
1. Manual test
2. Regression
2022-01-19 11:44:37 +02:00