- Why I did it
ethtool prints error logs when the EEPROM of an SFP is not available, for example:
INFO pmon#/supervisord: xcvrd Cannot get module EEPROM information: Input/output error
INFO pmon#/supervisord: xcvrd Cannot get Module EEPROM data: Invalid argument
However, these logs do not contain the relevant SFP index, which makes it hard for developers and QA to identify the exact SFP.
- How I did it
Redirect ethtool stderr to the subprocess and log it together with the relevant SFP index.
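The idea can be sketched as follows. This is a minimal illustration, not the code from this change: the helper name `run_with_sfp_context`, the log format, and the stand-in child process (used instead of a real ethtool call) are all assumptions.

```python
import subprocess
import sys


def run_with_sfp_context(cmd, sfp_index, log=print):
    """Run cmd, capture its stderr, and log every stderr line with the SFP index.

    Returns stdout on success, or None if the command exits with a non-zero code.
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    for line in result.stderr.splitlines():
        # Prefix the raw tool message so the failing port can be identified.
        log("SFP {}: {}".format(sfp_index, line.strip()))
    if result.returncode != 0:
        return None
    return result.stdout


# Stand-in for an ethtool invocation: a child process that writes to stderr and fails.
fake_tool = [sys.executable, "-c",
             "import sys; sys.stderr.write('Cannot get module EEPROM information\\n'); sys.exit(1)"]
run_with_sfp_context(fake_tool, sfp_index=5)
```

Because stderr is captured rather than inherited, the message no longer reaches the supervisor log without context; the caller decides how to format it.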
- How to verify it
Manual test
- Why I did it
Add PSU input voltage and input current to mlnx platform api.
- How I did it
Implement two functions that get the PSU input voltage and input current:
Get the values from "power/psu{}_curr_in" and "power/psu{}_volt_in".
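A rough sketch of such getters is shown below. Only the two relative node names come from the description; the root directory argument, the function names, and the divide-by-1000 scaling are assumptions made for illustration.

```python
import os
import tempfile

# Relative node names are taken from the description above. The root directory
# and the divide-by-1000 scaling are assumptions made for this sketch.
PSU_VOLT_IN = "power/psu{}_volt_in"
PSU_CURR_IN = "power/psu{}_curr_in"


def read_scaled(root, template, psu_index, scale=1000.0):
    """Read an integer sysfs node and divide it by scale; None if unreadable."""
    path = os.path.join(root, template.format(psu_index))
    try:
        with open(path) as f:
            return int(f.read().strip()) / scale
    except (OSError, ValueError):
        return None


def get_input_voltage(root, psu_index):
    return read_scaled(root, PSU_VOLT_IN, psu_index)


def get_input_current(root, psu_index):
    return read_scaled(root, PSU_CURR_IN, psu_index)


# Usage against a throwaway directory standing in for the real sysfs tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "power"))
    with open(os.path.join(root, "power", "psu1_volt_in"), "w") as f:
        f.write("12250\n")
    print(get_input_voltage(root, 1))   # 12.25
    print(get_input_current(root, 1))   # None: the node does not exist
```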
- How to verify it
Manual test.
Run sonic-mgmt regression
Signed-off-by: orfar1994 <orfar1994@gmail.com>
- Why I did it
Add more logging to sysfs reads to improve debuggability.
- How I did it
Log the relevant file path and error number when a sysfs read returns None.
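A minimal sketch of this kind of logging, assuming a hypothetical helper `read_str_from_file` and logger name (neither is taken from the actual change):

```python
import logging

logger = logging.getLogger("sysfs_reader")  # hypothetical logger name


def read_str_from_file(path, default=None):
    """Return the stripped file content, or default if the read fails.

    On failure, the path and errno are logged so the broken node can be found.
    """
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError as e:
        logger.error("Failed to read %s: errno=%s (%s)", path, e.errno, e.strerror)
        return default


logging.basicConfig(level=logging.INFO)
read_str_from_file("/nonexistent/sysfs/node")
```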
- How to verify it
Manual test
- Why I did it
Support get_port_or_cage_type for RJ45 ports
- How I did it
Implement the new platform API get_port_or_cage_type
Fix the issue: unable to import SFP when chassis object is destructed
- How to verify it
Manual test and regression test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Fix bug: pmon report error on start up because some SKUs do not have hwsku.json
- How I did it
If hwsku.json does not exist, do not extract RJ45 port information
- How to verify it
Manual test.
Unit test.
Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
Why I did it
During system boot, if the 'show platform status' or 'show version' command is executed before the STATE_DB CHASSIS_INFO table is populated, the show command falls back to the platform API. The DMI file on Mellanox platforms requires root permission to access, so if the show commands are executed as admin or any other user, the following error log appears in the syslog:
Jun 28 17:21:25.612123 sonic ERR show: Fail to decode DMI /sys/firmware/dmi/entries/2-0/raw due to PermissionError(13, 'Permission denied')
How I did it
Check the file permission before accessing it.
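One way to express such a check is `os.access` with `os.R_OK`. The function name below is hypothetical and the sketch only illustrates the approach; it is not the code from this change.

```python
import os


def read_dmi_raw(path):
    """Return the raw file content, or None when this user cannot read the file."""
    if not os.access(path, os.R_OK):
        # Skip silently instead of raising PermissionError and logging an error.
        return None
    with open(path, "rb") as f:
        return f.read()


print(read_dmi_raw("/nonexistent/dmi/entry"))   # None
```

Note that `os.access` reports what the current user may do at the time of the call, so a root process still reads the file as before.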
How to verify it
Added a unit test. Manually verified that the error log no longer appears.
* Support new platform SN2201 and RJ45 port
Signed-off-by: Kebo Liu <kebol@nvidia.com>
* remove unused import and redundant function
Signed-off-by: Kebo Liu <kebol@nvidia.com>
* fix error introduced by rebase
Signed-off-by: Kebo Liu <kebol@nvidia.com>
* Revert the special handling of RJ45 ports (#56)
* Revert the special handling of RJ45 ports
sfp.py
sfp_event.py
chassis.py
Signed-off-by: Stephen Sun <stephens@nvidia.com>
* Remove deadcode
Signed-off-by: Stephen Sun <stephens@nvidia.com>
* Support CPLD update for SN2201
A new class is introduced, deriving from ComponentCPLD and overloading _install_firmware
Change _install_firmware from private (starting with __) to protected, making it overloadable
Signed-off-by: Stephen Sun <stephens@nvidia.com>
* Initialize component BIOS/CPLD
Signed-off-by: Stephen Sun <stephens@nvidia.com>
* Remove swb_amb, which doesn't exist on the DVT board any more
Signed-off-by: Stephen Sun <stephens@nvidia.com>
* Remove the nonexistent sensor - switch board ambient - from platform.json
Signed-off-by: Stephen Sun <stephens@nvidia.com>
* Do not report error on receiving unknown status on RJ45 ports
Translate it to disconnect for RJ45 ports
Report error for xSFP ports
Signed-off-by: Stephen Sun <stephens@nvidia.com>
* Add reinit for RJ45 to avoid exception
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>
Co-authored-by: Stephen Sun <stephens@nvidia.com>
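For the `_install_firmware` change in the CPLD commit above, the reason a double-underscore method cannot be overridden is Python's name mangling. The sketch below uses hypothetical class names and return values purely to show the difference.

```python
class PrivateBase:
    def __install_firmware(self, image):      # mangled to _PrivateBase__install_firmware
        return "base:" + image

    def install(self, image):
        return self.__install_firmware(image)  # always resolves to the base version


class PrivateChild(PrivateBase):
    def __install_firmware(self, image):      # mangled to _PrivateChild__install_firmware
        return "child:" + image


class ProtectedBase:
    def _install_firmware(self, image):
        return "base:" + image

    def install(self, image):
        return self._install_firmware(image)   # normal lookup, so subclasses can override


class ProtectedChild(ProtectedBase):
    def _install_firmware(self, image):
        return "child:" + image


print(PrivateChild().install("cpld.bin"))     # base:cpld.bin
print(ProtectedChild().install("cpld.bin"))   # child:cpld.bin
```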
- Why I did it
"import sonic_platform" takes about 600 ms to 1000 ms, which is slow. After this optimization, it takes about 100 ms. CLIs that do not need the slow imports are now faster than before.
- How I did it
Find the slow imports and perform them only when needed.
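One possible way to defer an import is a small proxy that loads the module on first attribute access. The `LazyModule` class is a hypothetical illustration; the actual change may simply have moved import statements into the functions that use them.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module the first time an attribute is used."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not found on the proxy itself.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


json = LazyModule("json")        # nothing is imported here
print(json.dumps({"a": 1}))      # the import happens on this first use
```

Code paths that never touch the proxy never pay the import cost.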
- How to verify it
Measure the import time.
- Why I did it
Script fails when there is an exception while reading
- How I did it
Add more logs and checks. Fix wrong variable naming and messages.
- How to verify it
Provoke an exception in read_eeprom() and check that it is handled properly
- Why I did it
InvalidPsuVolWA.run might raise an exception if the user powers off the PSU while it is running. This exception is not caught and propagates to psud, which causes psud to fail to update PSU data in the DB.
- How I did it
1. Change the log level when the WA does not work. This can happen when the user powers off the PSU, so warning is a better level than error.
2. Change the wait time from 5 seconds to 1 second to avoid introducing too much delay in psud. 1 second is usually enough in my tests.
3. Give the functions get_voltage_low_threshold and get_voltage_high_threshold a default return value so that exceptions do not reach psud.
- How to verify it
Manual test.
Run sonic-mgmt regression
- Why I did it
Implement newly added reboot causes in PR Azure/sonic-platform-common#277
- How I did it
Map the reboot cause sysfs to the newly added reboot causes.
- How to verify it
Manual test: check whether the reboot cause is correct after rebooting the switch in various ways.
Run the community reboot test to check that the reboot cause checker passes.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
There is a hardware bug in which the PSU voltage threshold sysfs returns an incorrect value. The workaround is to call "sensor -s" to refresh it.
- How I did it
Call "sensor -s" when the threshold value is incorrect and the PSU is "DELTA 1100"
- How to verify it
Unit test and Manual test
- Why I did it
Implement read_eeprom/write_eeprom API for Credo Y-cable for Dual ToR Active-Standby
- How I did it
Use mlxreg utility for API implementation
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
- Why I did it
Update the NVIDIA copyright header in "mellanox" files that have changed since 1.1.2022
- How I did it
Update the copyright header
- How to verify it
Sanity tests and PR checkers.
- Why I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.
- How I did it
Refactor SFP code to remove code duplication and to be able to use the latest features available in new APIs.
- How to verify it
Run sonic-mgmt/platform_tests/sfp tests
Why I did it
Add CPU thermal control for Nvidia platforms which will be enabled for platforms that have heavy CPU load. Now it is only enabled on 4800, and it will be enabled on future platforms.
How I did it
Check CPU pack temperature and update cooling level accordingly
How to verify it
Manual test
Added sonic-mgmt test case, PR link will update later
- Why I did it
Fix issue: the PSU might use the wrong voltage sysfs node, which causes an invalid voltage value. The flow is:
1. The user powers off a PSU
2. All sysfs files related to this PSU are removed
3. The user does a reboot/config reload
4. The PSU uses the wrong sysfs node as its voltage node
- How I did it
Always try to find an existing sysfs node.
- How to verify it
Manual test
- Why I did it
The SONiC thermal control algorithm compares each thermal zone's temperature with the zone's threshold. Previously, a thermal zone with no thermal sensor could still report its threshold. However, a recent driver patch changes this behavior: a thermal zone with no thermal sensor now returns 0 for the threshold. We need to ignore such thermal zones.
- How I did it
Ignore thermal zones whose temperature is 0.
- How to verify it
Added unit test case and Manual test
- Why I did it
Fix issue: 'sx_port_mapping_t' object has no attribute 'slot_id'. sx_port_mapping_t only has attribute slot.
- How I did it
Change slot_id to slot.
- How to verify it
Manual test
- Why I did it
Python's select.select accepts an optional timeout value in seconds; however, the value passed to it was in milliseconds.
- How I did it
Convert the value from milliseconds to seconds.
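The unit conversion can be sketched as below. The wrapper name `wait_readable` is hypothetical, and the pipe stands in for whatever file descriptor the real code waits on (this example assumes a POSIX system).

```python
import os
import select


def wait_readable(fds, timeout_ms):
    """Wait for any fd to become readable; the timeout is given in milliseconds."""
    timeout_s = timeout_ms / 1000.0   # select.select() expects seconds
    readable, _, _ = select.select(fds, [], [], timeout_s)
    return readable


r, w = os.pipe()
print(wait_readable([r], 100))          # [] after about 0.1 s, nothing written yet
os.write(w, b"x")
print(wait_readable([r], 100) == [r])   # True
os.close(r)
os.close(w)
```

Without the division, a caller asking for 100 ms would block for up to 100 seconds.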
- How to verify it
Manual test
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.
How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmware in the correct order prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.
- How I did it
Reduce unused thermal policies
Add a periodic ASIC temperature check in the thermal policy to make sure ASIC temperature and fan speed are coordinated
The minimum allowed fan speed is now calculated as the maximum of the expected fan speeds among all policies
Move some logic from fan.py to thermal.py to make it more readable
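The minimum-allowed-speed rule above amounts to taking a maximum over what each policy requests. A tiny sketch, with hypothetical policy names and percentages:

```python
def minimum_allowed_speed(expected_speeds, default=0):
    """Return the highest expected fan speed (in percent) requested by any policy."""
    return max(expected_speeds.values(), default=default)


# Hypothetical policies and the speed each one currently expects.
requests = {"asic_temperature": 60, "psu_absent": 100, "default": 30}
print(minimum_allowed_speed(requests))   # 100
```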
- How to verify it
1. Manual test
2. Regression
Why I did it
Additional sensors that are needed only on a specific system were recently added to all systems, which caused errors.
How I did it
* Include CPU board and switch board sensors only on SN2201 system
* Fix an issue in test_chassis_thermal: it now skips nonexistent thermals.
How to verify it
Run show platform temperature
Signed-off-by: liora <liora@nvidia.com>
- Why I did it
Rename platform x86_64-mlnx_msn4800 to x86_64-nvidia_msn4800
- How I did it
Rename the platform folder as well as all code that references the platform name
- How to verify it
Manual test
- Why I did it
Add support for the new Spectrum-4 system SN5600 on top of the Nvidia ASIC simulator.
- How I did it
Add all relevant system and simulator SKU.
Updated syseeprom.hex and related directories to reflect Nvidia SN5600 brand name.
- How to verify it
Tested init flow, basic show commands, up interfaces, traffic test.
Signed-off-by: Raphael Tryster <raphaelt@nvidia.com>
Why I did it
The Nvidia platform API does not support setting the LED to orange
How I did it
Allow user to set LED to orange
How to verify it
Added unit test
Manual test
- Why I did it
Add support for SN2201 platform
- How I did it
Add required content for SN2201 platform
Note: kernel driver support for this system is still missing. It will be updated once everything is upstream.
- How to verify it
Install and basic sanity tests including traffic.
Signed-off-by: liora <liora@nvidia.com>
Depends on #9358
Why I did it
Adjust the LED logic according to the hw-mgmt change.
How I did it
Add a trigger to set LED to blink.
How to verify it
Manual test
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.
- How I did it
When a PSU is powered off, don't treat it as absent.
- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
- Why I did it
Support PSU voltage high/low thresholds and power max threshold
- How I did it
1. Add threshold support for voltage and power.
2. As thresholds are not supported on all platforms, check the capability first and fetch thresholds only if they are supported.
- How to verify it
Run regression test and manual test.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
* To support systems with dynamic port configuration
* Apply lazy initialization to speed up loading of the platform API
- How I did it
* Add module.py to implement dynamic port configuration (aka line card model)
* Adjust chassis.py, platform.py, thermal.py, sfp.py to support dynamic port configuration
* Optimize existing code
- How to verify it
Platform regression on MSN4700, MSN3800 and MSN2700, 100% pass
Unit test covers all new changes.
- Why I did it
Add NVIDIA Copyright header to "mellanox" files
- How I did it
Add NVIDIA Copyright header as a comment for Mellanox files
- How to verify it
Sanity tests and PR checkers.
Why I did it
Currently the mellanox platform API is validating the file extensions of firmware packages to be installed for basic sanity checking. However, ONIE packages do not have an extension and as such if there is a "." in the name it is taken to be an extension and then fails the sanity check.
How I did it
I removed the check which ensures that ONIE images don't have a file extension.
How to verify it
Name the ONIE updater file 2021.onie and attempt to install it via fwutil install fw 2021.onie --yes
Why I did it
The fwutil update all utility expects the auto_update_firmware method in the Platform API to execute the update_firmware() call and not the install_firmware() call.
How I did it
Changed the method in the mellanox platform API component implementation.
How to verify it
Run fwutil update all with a CPLD update on a Mellanox platform and verify that it properly updates the firmware using the MPFA file.
- Why I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high, so that thermal algorithm would set fan speed to minimum allowed earlier and save power.
- How I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high
- How to verify it
Manual test
#### Why I did it
A new PSU could have a different type of fan installed, so the fan max/min speed should be read per PSU
#### How I did it
The existing implementation reads the PSU max/min fan speed from a common file; change it to read from a per-PSU file
#### How to verify it
Manual test
Why I did it
In rare cases a BIOS upgrade cannot guarantee that the bus value remains the same across BIOS releases. Ignore this field so that pcied does not fail, but still verify the device id in a different way. The solution is future-proof and will not require code changes when a new BIOS version is available.
How I did it
Since the bus is not a fixed value (it is determined by the BIOS version), we ignore this field and instead check whether there is a device that matches on all other fields and also has a matching device id.
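The matching rule can be sketched as follows. The field names (`bus`, `dev`, `fn`, `id`), the sample values, and the function name are assumptions for illustration and may not match the real data layout.

```python
def find_device(expected, discovered, match_keys=("dev", "fn", "id")):
    """Return the first discovered device equal to expected on match_keys.

    The bus field is deliberately left out of match_keys.
    """
    for device in discovered:
        if all(device.get(k) == expected.get(k) for k in match_keys):
            return device
    return None


expected = {"bus": "06", "dev": "00", "fn": "0", "id": "cf6c"}
discovered = [
    {"bus": "00", "dev": "1f", "fn": "3", "id": "1f3c"},
    {"bus": "07", "dev": "00", "fn": "0", "id": "cf6c"},  # same device, bus moved
]
print(find_device(expected, discovered) is not None)   # True
```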
How to verify it
Verify no errors or failures in pcied on different BIOS version with the same code base.
Avoid initializing sfp/thermal/components/fan/psu/leds on simx and create vpd_info file on hw_management when we use mellanox simulator platform
- Why I did it
This is a fix for an issue on Mellanox simulator platforms: syseepromd failed in the pmon docker, and "decode-syseeprom" failed as well.
- How I did it
Before initializing thermal/components/fan/psu/leds, check whether we are running on simx.
Create the vpd_info file in the hw_management folder.
- How to verify it
Check that the syseepromd process is loaded properly in the pmon docker.
Check that decode-syseeprom works without errors or warnings.
- Why I did it
To prevent a Python exception when executing the warm-reboot command on the Mellanox simulator platform.
- How I did it
Return None in the watchdog Python script when the watchdog file does not exist.
- How to verify it
warm-reboot runs without the Python error. An error message still appears in the log in these cases.
To avoid this error message, we can simulate the watchdog on the Mellanox simulator platform.
#### Why I did it
Currently, SONiC uses a single value to represent an SFP error; however, multiple SFP errors can exist at the same time. This PR adds support for that.
#### How I did it
Return a bitmap instead of a single value when an SFP event occurs
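A bitmap lets several errors be reported in one value. The sketch below uses `enum.IntFlag`; the error names and bit positions are hypothetical and are not the ones defined by SONiC.

```python
from enum import IntFlag


class SfpErrorBit(IntFlag):
    # Hypothetical error names and bit positions, chosen only for illustration.
    POWER_BUDGET_EXCEEDED = 1 << 0
    HIGH_TEMPERATURE = 1 << 1
    BAD_EEPROM = 1 << 2


def decode_errors(bitmap):
    """Return the names of all error bits set in bitmap."""
    return [err.name for err in SfpErrorBit if bitmap & err]


event = SfpErrorBit.POWER_BUDGET_EXCEEDED | SfpErrorBit.BAD_EEPROM
print(int(event))             # 5
print(decode_errors(event))   # ['POWER_BUDGET_EXCEEDED', 'BAD_EEPROM']
```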
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
The methods get_model, get_serial, and get_revision have been implemented by reading relevant information from VPD and then recording the information into relevant fields.
However, there is no VPD data on platforms with fixed PSUs and the relevant fields haven't been initialized, which causes the methods to throw exceptions, which in turn prevents psud from inserting fields into the PSU table.
Eventually, this causes show platform psustatus to output incorrect info.
- How I did it
Initialize those fields as N/A on systems with fixed PSUs.
- How to verify it
Manual test.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Remove EEPROM cache file and use DB instead
- How I did it
Read EEPROM data from DB if possible
If data is not ready in DB, read from hardware using a visitor pattern
- How to verify it
Manual test and regression
Why I did it
The Mellanox platform is required to support the fwutil auto-update feature defined here
This allows switches, when performing SONiC upgrades, to choose whether to perform firmware upgrades that may interrupt the data plane through a cold boot.
How I did it
Two methods were added to the component implementations for mellanox.
In the base Component class we add a default function that chooses to skip the installation of any firmware unless the cold boot option is provided. This is because the Mellanox platform, by default, does not support installing firmware on ONIE, the CPLD, or the BIOS "on-the-fly".
In the ComponentSSD class we add a function that behaves similarly but uses the Mellanox specific SSD firmware upgrade tool to check if the current SSD supports being upgraded on the fly in order to decide whether to skip or perform the installation.
How to verify it
Unit tests are included with this PR. These tests will run on build of target sonic-mellanox.bin
You may also perform fwutil auto-update ... commands after Azure/sonic-utilities#1242 is merged in.
- Why I did it
Enhance the Python 3 support for the platform API. Originally, some platform APIs called SDK APIs that did not support Python 3. Now that Python 3 APIs are supported in SDK 4.4.3XXX, Python 3 is completely supported by the platform API.
- How I did it
Start all platform daemons from python3
1. Remove #!/usr/bin/env python from the beginning of each platform API file, as the platform API files are not started as daemons but are imported by other daemons.
2. Adjust SDK API calls accordingly
- How to verify it
Manually test and run regression platform test
Signed-off-by: Stephen Sun <stephens@nvidia.com>