- Why I did it
InvalidPsuVolWA.run might raise exception if user power off PSU when it is running. This exception is not caught and will be raised to psud which causes psud failed to update PSU data to DB.
- How I did it
1. Change the log level when WA does not work. This could happen when user power off PSU, hence changing the log level from error to warning is better
2. Change the wait time from 5 to 1 to avoid introduce too much delay in psud. 1 second is usually enough per my test
3. Give a default return value for function get_voltage_low_threshold and get_voltage_high_threshold to avoid exception reach to psud
- How to verify it
Manual test.
Run sonic-mgmt regression
- Why I did it
There is a hardware bug that PSU voltage threshold sysfs returns incorrect value. The workaround is to call "sensor -s" to refresh it.
- How I did it
Call "sensor -s" when the threshold value is not incorrect and PSU is "DELTA 1100"
- How to verify it
Unit test and Manual test
- Why I did it
Implement read_eeprom/write_eeprom API for Credo Y-cable for Dual ToR Active-Standby
- How I did it
Use mlxreg utility for API implementation
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
- Why I did it
Update NVIDIA Copyright header to "mellanox" files which were changed since 1.1.2022
- How I did it
Update the copyright header
- How to verify it
Sanity tests and PR checkers.
- Why I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.
- How I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.
- How to verify it
Run sonic-mgmt/platform_tests/sfp tests
Why I did it
Add CPU thermal control for Nvidia platforms which will be enabled for platforms that have heavy CPU load. Now it is only enabled on 4800, and it will be enabled on future platforms.
How I did it
Check CPU pack temperature and update cooling level accordingly
How to verify it
Manual test
Added sonic-mgmt test case, PR link will update later
- Why I did it
Fix issue: psu might use wrong voltage sysfs which causes invalid voltage value. The flow is like:
1. User power off a PSU
2. All sysfs files related to this PSU are removed
3. User did a reboot/config reload
4. PSU will use wrong sysfs as voltage node
- How I did it
Always try find an existing sysfs.
- How to verify it
Manual test
- Why I did it
In SONiC thermal control algorithm, it compares thermal zone temperature with thermal zone threshold. Previously, a thermal zone with no thermal sensor can still get its threshold. However, a recently driver patch changes this behavior: a thermal zone with no thermal sensor will return 0 for threshold. We need to ignore such thermal zone.
- How I did it
Ignore thermal zones whose temperature is 0.
- How to verify it
Added unit test case and Manual test
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.
How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmwares in the correct order with prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.
- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable
- How to verify it
1. Manual test
2. Regression
Why I did it
Recently additional sensors that were needed only for specific system added to all systems and caused errors.
How I did it
* Include CPU board and switch board sensors only on SN2201 system
* Fix issue in test_chassis_thermal, now it skips non existing thermals.
How to verify it
Run show platform temperature
Signed-off-by: liora <liora@nvidia.com>
Why I did it
Nvidia platform API does not support set LED to orange
How I did it
Allow user to set LED to orange
How to verify it
Added unit test
Manual test
Depends on #9358
Why I did it
Adjust LED logical according to hw-mgmt change.
How I did it
Add a trigger to set LED to blink.
How to verify it
Manual test
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.
- How I did it
When PSU is powered of, don't treat it as absent.
- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
- Why I did it
* To support systems with dynamic port configuration
* Apply lazy initialization to faster the speed of loading platform API
- How I did it
* Add module.py to implement dynamic port configuration (aka line card model)
* Adjust chassis.py, platform.py, thermal.py, sfp.py to support dynamic port configuration
* Optimize existing code
- How to verify it
Platform regression on MSN4700, MSN3800 and MSN2700, 100% pass
Unit test covers all new changes.
- Why I did it
Add NVIDIA Copyright header to "mellanox" files
- How I did it
Add NVIDIA Copyright header as a comment for Mellanox files
- How to verify it
Sanity tests and PR checkers.
Why I did it
The fwutil update all utility expects the auto_update_firmware method in the Platform API to execute the update_firmware() call and not the install_firmware() call.
How I did it
Changed the method in the mellanox platform API component implementation.
How to verify it
Run fwutil update all with a CPLD update on a Mellanox platform and verify that it properly updates the firmware using the MPFA file.
#### Why I did it
Currently, SONiC use a single value to represent SFP error, however, multiple SFP errors could exist at the same time. This PR is aimed to support it
#### How I did it
Return bitmap instead of single value when a SFP event occurs
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Why I did it
The Mellanox platform is required to support the fwutil auto-update feature defined here
This is to allow switches, when performing SONiC upgrades to choose whether to perform firmware upgrades that may interrupt the data plane through a cold boot.
How I did it
Two methods were added to the component implementations for mellanox.
In the base Component class we add a default function that chooses to skip the installation of any firmware unless the cold boot option is provided. This is because the Mellanox platform, by default, does not support installing firmware on ONIE, the CPLD, or the BIOS "on-the-fly".
In the ComponentSSD class we add a function that behaves similarly but uses the Mellanox specific SSD firmware upgrade tool to check if the current SSD supports being upgraded on the fly in order to decide whether to skip or perform the installation.
How to verify it
Unit tests are included with this PR. These test will run on build of target sonic-mellanox.bin
You may also perform fwutil auto-update ... commands after Azure/sonic-utilities#1242 is merged in.
#### Why I did it
According to thermalctld hld, each fan must belong to a fan drawer, if the fan drawer does not physically exist, put fan into a virtual fan drawer. This PR is to clear fan from chassis._fan_list
#### How I did it
1. Don't put fan to chassis._fan_list
2. Always query fan from fan_drawer
Originally, SFP modules were always accessed from platform daemons, and arbitrary SFP modules can be accessed in the daemon. So all SFP modules were initialized in one shot once one of the following chassis APIs called
- get_all_sfps
- get_sfp_numbers
- get_sfp
Recently, we noticed that SFP modules can also be accessed from CLI, eg. the latest refactor of `sfputil`.
In this case, only one SFP module is accessed in the chassis object's life cycle.
To initialize all SFP modules in one shot is waste of time and causes the CLI to take much more time to finish.
So we would like to optimize the initialization flow by introducing a two-phase initialization approach:
- Partial initialization, which means the `chassis._sfp_list` has been initialized with proper length and all elements being `None`
- Full initialization, which means all elements in `chassis._sfp_list` are created
If the relevant function is called,
- `get_sfp`, only partial initialization will be done, and then the specific SFP module is initialized.
- `get_all_sfps` or `get_num_sfps`, full initialization will be done, which means all SFP modules are initialized.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
**- Why I did it**
After migrating to python3, the operator '/' always get a float result, but it gets integer result in python2. Need fix this in thermal_conditions.
**- How I did it**
1. cast float value to int
2. change the unit test case to cover this situation
**- How to verify it**
Manually test and regression test
In order to support SONiC physical entity mib extension, a few new platform API are added to sonic-platform-common, this PR is to provide an mellanox platform implementation for those new APIs.
* [thermal control] Fix pmon docker stop issue on 3800
* [thermal fix] Fix QA test issue
* [thermal fix] change psu._get_power_available_status to psu.get_power_available_status
* [thermal fix] adjust log for PSU absence and power absence
* [thermal fix] add unit test for loading thermal policy file with duplicate conditions in different policies
* [thermal] fix fan.get_presence for non-removable SKU
* [thermal fix] fix issue: fan direction is based on drawer
* Fix issue: when fan is not present, should not read fan direction from sysfs but directly return N/A
* [thermal fix] add unit test for get_direction for absent FAN
* Unplugable PSU has no FAN, no need add a FAN object for this PSU
* Update submodules
Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>