Manual cherry-pick of https://github.com/sonic-net/sonic-buildimage/pull/14138
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1
- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK
- How to verify it
Manual testing
- Why I did it
sfp_event.py gets a PMPE message when a cable event is available. In PMPE message, there is no label port available. Current sfp_event.py is using sx_api_port_device_get to get 64 logical ports attributes, and find the label port from those 64 attributes. However, if there are more than 64 ports, sfp_event.py might not be able to find the label port and drop the PMPE message.
- How I did it
Don't use hardcoded 64, get logical port number instead.
- How to verify it
Manual test
- Why I did it
There are 3 tasks in xcvrd:
main task, run a loop to recover missing SFP static information to DB every 1 minute
SFP state task, a process which listens cable plug in/out event, insert SFP static information to DB while a cable is inserted
SFP DOM update task, a thread which handles cable DOM information update every 1 minute
Let assume user replaces QSFP with QSFP-DD. There are two issues:
Only SFP state task listens cable plug in/out event, main task and SFP DOM update task does not know SFP type has changed, they still “think” the SFP type is QSFP. So, main task and SFP DOM update task uses QSFP standard to parse QSFP-DD EEPROM which causes corrupted data.
There is a race condition between main task and SFP state task. They both insert SFP static information to DB. Depends on timing, it is possible that main task using wrong SFP type to override SFP static information.
The PR is to fix these two issues.
There is no such issue on 202205 and above because there is a refactor for xcvrd:
SFP state task was changed from process to thread, so that all 3 tasks share the same memory space, they always have correct SFP type.
Recover missing SFP information logical was moved from main task to SFP state task. There is no race condition anymore.
- How I did it
It is difficult to back port latest xcvrd because there are many refactor/new features in xcvrd after 202012 release. It will be huge effort to do so. Based on that, we decided to fix the issue on Nvidia platform API side. The fix is that: refreshing SFP type before any SFP API which accessing SFP EEPROM. Refreshing SFP type before any SFP API would cause a small performance down: Due to my test on 202012 branch, accessing transceiver INFO and DOM INFO for 32 ports takes 1.7 seconds before the change. The number changes to 2.4 seconds after the change. I suppose the performance down is acceptable.
- How to verify it
Manual test
Regression
- Why I did it
Backport #9795
Python select.select accept a optional timeout value in seconds, however, the value passes to it is a value in millisecond.
- How I did it
Transfer the value to millisecond.
- How to verify it
Manual test
#### Why I did it
Backport https://github.com/sonic-net/sonic-buildimage/pull/13246 to 202012 branch.
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.
#### How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.
#### How to verify it
a. Manual test:
> 1. Perform a power loss
> 2. Perform a warm/fast reboot
> 3. check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.
- Why I did it
ethtool is not able to read certain pages(eg. page 11h) of CMIS cables.
SDK provides a set of sysfs to expose the transceiver EEPROM, now we migrate from using ethtool to read these sysfs for transceiver EEPROM reading.
- How I did it
replace ethtool with accessing the SDK sysfs for cable EEPROM reading.
Adjust the offset according to the SDK sysfs memory map.
- How to verify it
run sonic-mgmt sfp-related regression test case.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
4600C is using wrong thermal profile and it displays 2 CPU core thermal in show platform temperature output, there should be 4 CPU core thermal.
- How I did it
Change 4600C to use thermal profile 10.
- How to verify it
Manual test
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.
- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable
- How to verify it
1. Manual test
2. Regression
Backport https://github.com/Azure/sonic-buildimage/pull/9259 to 202012
#### Why I did it
Nvidia platform API does not support set LED to orange.
#### How I did it
Allow user to set LED to orange
#### How to verify it
Manual test
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.
- How I did it
When PSU is powered of, don't treat it as absent.
- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
- Why I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high, so that thermal algorithm would set fan speed to minimum allowed earlier and save power.
- How I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high
- How to verify it
Manual test
#### Why I did it
New PSU could install different type of fan, so fan max/min speed should be read per PSU
#### How I did it
The existing implementation read PSU max/min fan speed from a common file, change it to read from per PSU file
#### How to verify it
Manual test
Why I did it
BIOS upgrade on rare cases cannot guarantee bus value remain the same on every BIOS release. Ignoring this field in order for pcied not to fail but still verify device id in a different way. The solution is future proof and will not require changes in code when new BIOS version is available
How I did it
Since bus is not a fixed value (it is determined by the bios version) we are ignoring this field, and instead checking if there is a device that match on all other fields that and in addition has a matching device id.
How to verify it
Verify no errors or failures in pcied on different BIOS version with the same code base.
- Why I did it
to prevent python exception error when executing warm-reboot command on mellanox simulator platform
- How I did it
return None on the watchdog python script on cases that watchdog file is not exist
- How to verify it
warm-reboot is running well without the python error. error message will appear on log on these cases.
in order to avoid this error message we can simulate the watchdog on mellanox simulator platform
- Why I did it
Remove EEPROM cache file and use DB instead
- How I did it
Read EEPROM data from DB if possible
If data is not ready in DB, read from hardware using a visitor pattern
- How to verify it
Manual test and regression
- Why I did it
This is to back-port Azure 7410 to 202012 branch.
Enhance the Python3 support for platform API. Originally, some platform APIs call SDK API which didn't support Python 3. Now the Python 3 APIs have been supported in SDK 4.4.3XXX, Python3 is completely supported by platform API
- How I did it
Start all platform daemons from python3
1. Remove #/usr/bin/env python at the beginning of each platform API file as the platform API won't be started as daemons but be imported from other daemons.
2. Adjust SDK API calls accordingly
Signed-off-by: Stephen Sun <stephens@nvidia.com>
#### Why I did it
According to thermalctld hld, each fan must belong to a fan drawer, if the fan drawer does not physically exist, put fan into a virtual fan drawer. This PR is to clear fan from chassis._fan_list
#### How I did it
1. Don't put fan to chassis._fan_list
2. Always query fan from fan_drawer
#### Why I did it
Adopt a single way to get fan direction for all ASIC types.
It depends on hw-mgmt V.7.0010.2000.2303. Depends on https://github.com/Azure/sonic-buildimage/pull/7419
#### How I did it
Originally, the get_direction was implemented by fetching and parsing `/var/run/hw-management/system/fan_dir` on the Spectrum-2 and the Spectrum-3 systems. It isn't supported on the Spectrum system.
Now, it is implemented by fetching `/var/run/hw-management/thermal/fanX_dir` for all the platforms.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
The following error message is observed during chassis object being destroyed
"Exception ignored in: <function Chassis.__del__ at 0x7fd22165cd08>
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 83, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
The chassis tries to import deinitialize_sdk_handle during being destroyed for the purpose of releasing the sdk_handle.
However, importing another module during shutting down can cause the error because some of the fundamental infrastructures are no longer available."
This error occurs when a chassis object is created and then destroyed in the Python shell.
- How I did it
To fix it, record the deinitialize_sdk_handle in the chassis object when sdk_handle is being initialized and call the deinitialize handler when the chassis object is being destroyed
- How to verify it
Manually test.
- Why I did it
There was a change to replace platform utils with sonic platform API in psuutil. However, psu API is not initialized on host side. The PR is to fix it.
Backport of #7016 to the 202012 branch.
- How I did it
Initialize PSU API on both host and non-host side
- How to verify it
Manual test
- Why I did it
The existing Fan led and Psu led object initialize itself to green color in init method. However, there are multiple daemons calls sonic platform API and there could be a case that:
A PSU is removed from system
Reboot switch
psud detects that 1 PSU is missing and set PSU led to red
Other daemon just start up and call sonic platform API, the API set PSU led to green by call PsuLed.init
This PR is a partial fix for the issue. As we also need guarantee that the led is initialized with a correct value. I checked existing psud and thermalctld code. psud always initialize the PSU led color on boot up, thermalcltd need some changes to initialize led color on the first run
- How I did it
Remove the led color initialization code from FanLed.init and PsuLed.init
- How to verify it
Manual test
- Why I did it
Add support for new 64x200G SN4600 systems
- How I did it
Add all relevant files (w/o platform.json and hwsku.json as they will come later) with default SKU.
- How to verify it
Install image on switch, verify all ports are up and configured properly, run full platform SONiC tests.
In preparation for the merging of Azure/sonic-platform-common#173, which properly defines class and instance members in the Platform API base classes.
It is proper object-oriented methodology to call the base class initializer, even if it is only the default initializer. This also future-proofs the potential addition of custom initializers in the base classes down the road.
Submodule commits included:
* src/sonic-platform-common 6ad0004...bd4dc03 (1):
> [sonic_sfp/qsfp_dd.py] Update DOM capability method name to align with other drivers (#163)
Also align all calling function names to match.
**- Why I did it**
After migrating to python3, the operator '/' always get a float result, but it gets integer result in python2. Need fix this in thermal_conditions.
**- How I did it**
1. cast float value to int
2. change the unit test case to cover this situation
**- How to verify it**
Manually test and regression test
**- Why I did it**
PR https://github.com/Azure/sonic-platform-common/pull/102 modified the name of the SFF-8436 (QSFP) method to align the method name between all drivers, renaming it from `parse_qsfp_dom_capability` to `parse_dom_capability`. Once the submodule was updated, the callers using the old nomenclature broke. This PR updates all callers to use the new naming convention.
**- How I did it**
Update the name of the function globally for all calls into the SFF-8436 driver.
Note that the QSFP-DD driver still uses the old nomenclature and should be modified similarly. I will open a PR to handle this separately.
In rare case can see that xcvrd failed due to "UnboundLocalError: local variable 'label_port' referenced before assignment"
Init "label_port" as None at the beginning of the function, to avoid the case that "label_port" not assigned.
python2 is end of life and SONiC is going to support python3. This PR is going to support:
1. Mellanox SONiC platform API python3 support
2. Install both python2 and python3 verson of Mellanox SONiC platform API or pmon and host side
EEPROM cache file is not refreshed after install a new ONIE version even if the eeprom data is updated. The current Eeprom class always try to read from the cache file when the file exists. The PR is aimed to fix it.
Submodule updates include the following commits:
* src/sonic-utilities 9dc58ea...f9eb739 (18):
> Remove unnecessary calls to str.encode() now that the package is Python 3; Fix deprecation warning (#1260)
> [generate_dump] Ignoring file/directory not found Errors (#1201)
> Fixed porstat rate and util issues (#1140)
> fix error: interface counters is mismatch after warm-reboot (#1099)
> Remove unnecessary calls to str.decode() now that the package is Python 3 (#1255)
> [acl-loader] Make list sorting compliant with Python 3 (#1257)
> Replace hard-coded fast-reboot with variable. And some typo corrections (#1254)
> [configlet][portconfig] Remove calls to dict.has_key() which is not available in Python 3 (#1247)
> Remove unnecessary conversions to list() and calls to dict.keys() (#1243)
> Clean up LGTM alerts (#1239)
> Add 'requests' as install dependency in setup.py (#1240)
> Convert to Python 3 (#1128)
> Fix mock SonicV2Connector in python3: use decode_responses mode so caller code will be the same as python2 (#1238)
> [tests] Do not trim from PATH if we did not append to it; Clean up/fix shebangs in scripts (#1233)
> Updates to bgp config and show commands with BGP_INTERNAL_NEIGHBOR table (#1224)
> [cli]: NAT show commands newline issue after migrated to Python3 (#1204)
> [doc]: Update Command-Reference.md (#1231)
> Added 'import sys' in feature.py file (#1232)
* src/sonic-py-swsssdk 9d9f0c6...1664be9 (2):
> Fix: no need to decode() after redis client scan, so it will work for both python2 and python3 (#96)
> FieldValueMap `contains`(`in`) will also work when migrated to libswsscommon(C++ with SWIG wrapper) (#94)
- Also fix Python 3-related issues:
- Use integer (floor) division in config_samples.py (sonic-config-engine)
- Replace print statement with print function in eeprom.py plugin for x86_64-kvm_x86_64-r0 platform
- Update all platform plugins to be compatible with both Python 2 and Python 3
- Remove shebangs from plugins files which are not intended to be executable
- Replace tabs with spaces in Python plugin files and fix alignment, because Python 3 is more strict
- Remove trailing whitespace from plugins files
In order to support SONiC physical entity mib extension, a few new platform API are added to sonic-platform-common, this PR is to provide an mellanox platform implementation for those new APIs.
New driver support fetching additional pages from the cable EEPROM.
There are additional information to parse now: RX/TX power, TX bias, TX fault and RX LOS.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
When detecting a new SFP insertion, read its SFP type and DOM capability from EEPROM again.
SFP object will be initialized to a certain type even if no SFP present. A case could be:
1. A SFP object is initialized to QSFP type by default when there is no SFP present
2. User insert a SFP with an adapter to this QSFP port
3. The SFP object fail to read EEPROM because it still treats itself as QSFP.
This PR fixes this issue.
Now we are reading base mac, product name from eeprom data, and the data read from eeprom contains multiple "\0" characters at the end, need trim them to make the string clean and display correct.
The `get_serial_number()` method in the ChassisBase and ModuleBase classes was redundant, as the `get_serial()` method is inherited from the DeviceBase class. This method was removed from the base classes in sonic-platform-common and the submodule was updated in https://github.com/Azure/sonic-buildimage/pull/5625.
This PR aligns the existing vendor platform API implementations to remove the `get_serial_number()` methods and ensure the `get_serial()` methods are implemented, if they weren't previously.
Note that this PR does not modify the Dell platform API implementations, as this will be handled as part of https://github.com/Azure/sonic-buildimage/pull/5609