Why I did it
Power cycle test case fails for Z9332f in sonic-mgmt framework(#8605).
How I did it
Modified the platform API to return expected strings.
How to verify it
Power cycle the device and verify the reboot reason.
Run sonic-mgmt test_reboot script.
To enable saiserver docker on different platforms, it needs different configuration files. make the saiserver docker mount them in hwsku folder.
Co-authored-by: Ubuntu <richardyu@richardyu-ubuntu-vm0.trsxrdzozv2e1czsze2t05vqzh.ix.internal.cloudapp.net>
Fix Chassis.get_name to return the same value than what's in platform.json
Fix Chassis.get_system_eeprom_info when running from within pmon.
Fix Watchdog.get_remaining_time (fixes [202012 platform_tests] TestWatchdogApi::test_remaining_time failure on vms20-t1-7050cx3-3.1 #8440 and [ 202012 platform_tests ] TestWatchdogApi::test_arm_disarm_states failure on vms20-t1-7050cx3-3.1 #8439)
Implement missing thermal infos and conditions (fixes [202012 platform_tests] test_platform_info.py::test_thermal_control_psu_absence error #8453)
Fix Chassis.set_status_led return value (fixes [2020 platform_tests] TestChassisApi::test_status_led failure on vms20-t0-7050cx3-1 #8464)
To fix failed test cases of Haliburton platform APIs that found on platform_tests script
- How I did it
- Add device/celestica/x86_64-cel_e1031-r0/platform.json
- Update functions to support python3.7
- Add more functions follow latest sonic_platform_base
- Fix the bug
Signed-off-by: Wirut Getbamrung [wgetbumr@celestica.com]
Why I did it
fix the dx010 system eeprom unavailable issue
How I did it
enable the i2c slave 30ms timeout mechanism
How to verify it
i2cstress test in DX010 iSMT controller bus
Co-authored-by: nicwu-cel <nicwu@celestica.com>
Why I did it
serial-getty service exited in Dell S6100 device randomly.
How I did it
Added serial-getty to monit services.
How to verify it
Stop serial-getty in ssh session and check whether the service restarts or not.
Why I did it
platform test suite failed for few API's in DellEMC Z9332f platform.
How I did it
Modified the API's to return the expected values in the script.
How to verify it
Run platform test suite after making the changes.
This is to pick up BRCM SAI 4.3.5.1 which contains the following fix:
CS00012201406: [4.3.3.9] SAI_STATUS_FAILURE on FDB flush after all ports flapped
Preliminary tests looks fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on z9332f (TH3) T0 DUT and all passed:
```
ipfwd/test_dir_bcast.py
fib/test_fib.py
```
Manually ran the following test cases on S6100 (TH) and all passed:
```
ipfwd/test_dir_bcast.py
fdb/test_fdb.py
```
This PR only contains backports from master
Fix leak discovered on master, though 202012 is not affected it's better to have the fix (fixes [master] thermalctld leak on Arista devices makes them unreachable when memory is exhausted #7515)
Fix EepromDecoderimplementation in the platform API (fixes syseepromd crashing repeatedly on SONiC.20201231.02 #8263)
Fix Mineral platform definition and configuration
Fix build issues in environments where /proc is not mounted/restricted (fixes PLATFORM=broadcom fails arista "ReloadCauseManagerTest" first time #7800)
Fix some pytest issues
Add sfp-eeprom C API and also mount it in pmon
- Removed the old function for detecting a faulty fan.
- Removed the old function for detecting excess temperature.
- Implement thermal_manager APIs based on ThermalManagerBase
- Implement thermal_conditions APIs based on ThermalPolicyConditionBase
- Implement thermal_actions APIs based on ThermalPolicyActionBase
- Implement thermal_info APIs based on ThermalPolicyInfoBase
- Add thermal_policy.json
Why I did it
To determine the revision of the pcie.yaml to be used based on BIOS version in DellEMC S6100 platform.
Depends on: Azure/sonic-platform-common#195
How I did it
Added two revisions of pcie.yaml pcie_1.yaml and pcie_2.yaml
Included a platform-specific Pcie class to provide the revision of the pcie.yaml to be used by pcieutil/pcied.
How to verify it
Execute pcieutil check (Azure/sonic-utilities#1672) command and verify the list of PCIe devices displayed.
Logs: UT_logs.txt
Use udevadm to trigger the udev rules on the first boot
How to verify:
- Connect C0 with E1031;
- Install or upgrade the sonic os to 202012 branch;
- When access to sonic check if /dev/C0-1 to /dev/C0-48 are existed.
#### Why I did it
Support API 2.0 for S5248F platform
#### How I did it
Making changes to S5248F platform specific directory
Co-authored-by: Arun LK <Arun_L_K@dell.com>
#### Why I did it
The debian install files are required for installing sonic_platform packages
#### How I did it
Add install files to under debian folder
Co-authored-by: robert.hong <robert.hong@qct.io>
Add device and platform code for ix7-bwde, ix8a-bwde.
Support platform API 2.0 for all quanta platforms except for ix1b
Co-authored-by: robert.hong <robert.hong@qct.io>
Why I did it
There were two regression issues introduced by BRCM SAI 4.3.3.8:
CS00012196056 [4.3.3.8][WARMBOOT] syncd[2584]: segfault at 5616ad6c3d80 ip 00007f61e0c6bc65 sp 00007fff0c5a7a90 error 4 in libsai.so.1.0[7f61e0a95000+3cd8000]
CS00012195956 [4.3.3.8] [TD3]Syncd Crash at brcm_sai_tnl_mp_create_tunnel()
How I did it
Patch for CS00012195956 from BRCM was validated to have addressed the tunnel creation issue.
Temporary worked around the issue by commenting out a portion of questionable code in BRCM SAI that seems to be the root cause of CS00012196056 .
How to verify it
See the BRCM cases for details.
There was an issue on master where `thermal.get_position_in_parent` in the platform API was returning -1 instead of a proper index. This is a backport of the fix for that issue.
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.
How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.
How to verify it
I verified this on the device str-7260cx3-acs-1.
#### Why I did it
- After [sonic-linux-kernel#177](https://github.com/Azure/sonic-linux-kernel/pull/177) changes, the I2C mux channels of Baseboard and Switchboard CPLDs are moved from i2c-4 and i2c-5 to i2c-36 and i2c-37 respectively.
- This caused QSFP driver initialization of i2c-36 to i2c-41 to fail causing the ports from Ethernet208 to Ethernet248 fail.
#### How I did it
- The fix to this problem is to change the order of QSFP driver initialization to I2C mux channels.
- Instead of the order i2c-10 to i2c-41, the order i2c-4 to i2c-35 is being utilized.
- Also, need to change the i2c-mux-channel number for Baseboard CPLD and switchboard CPLD in scripts to access them.
Why I did it
Microsoft reported occasional daemon crashes on devices running 201911. On close inspection it was due to PMBus reads failing on IOError on very rare occasions.
This is the fix for 202012 branch.
How I did it
Add try/except block on performing reads on PMBus GPIOs.
How to verify it
Which release branch to backport (provide reason below if selected)
Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
Why I did it
This SAI change is to Enable TD2 platforms to also be able to handle VFP-based sub-interfaces.
This is the feature that is also needed for TD2 platforms.
How to verify it
Guohan and Team has validated this feature change on TD2 platforms
#### Why I did it
xcvrd crashes when the application advertisement capability flag is not seen for few transceivers.
#### How I did it
Initialize the additional application capability in dunder init
Add support for Accton as9726-32d platform
This pull request is based on as9716-32d, so I reference as9716-32d to create new model: as9726-32d.
This module do not need led driver to control led, FPGA can handle it.
I also implement API2.0(sonic_platform) for this model, CPLD driver, PSU driver, Fan driver to control these HW behavior.
Why I did it
Fix issues below.
#7133#6602
So, remove the dps200 driver from the platform-specific driver.
Then, add the dps200 module driver to the Linux kernel tree.
How I did it
Remove the dps200 driver from the platform-specific driver and add the dps200 module driver to the Linux kernel.
How to verify it
Build an image with Azure/sonic-linux-kernel#207
Then, install to the Haliburton.
This is to pick up BRCM SAI 4.3.3.5-2 which contains 2 main changes:
1. hsdk-6.5.21-p1.patch (to address some field problems related to SDK issue.
2. Fix for CS00012097141 (remove some unnecessary debug setting that was causing non functional impacting problem at boot time)
Preliminary tests looks fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on S6100 DUT and all passed:
```
ipfwd/test_dir_bcast.py
fib/test_fib.py
vxlan/test_vxlan_decap.py
decap/test_decap.py
fdb/test_fdb.py
```
LED_PROC_INIT_SOC variable was incorrectly referenced as LED_SOC_INIT_SOC. Introduced in #5483
Rather than fixing the typo, I decided to simplify the script, removing the need for the conditional altogether by moving the bcmcmd call inside the conditional which checks for the presence of LED_SOC_INIT_SOC.
#### Why I did it
- PSU data is loaded into state DB.
Following errors are seen in syslogs:
"Failed to update PSU data - '<=' not supported between instances of 'float' and 'str'"
- Issue is not seen in master image as the PSU API return type is different.
#### How I did it
- Changed the return type in PSU API's.
#### Why I did it
400G media EEPROM and DOM information are not populated properly in DellEMC Z9332f platform.
#### How I did it
Handled QSFP_DD, QSFP28/QSFP+, SFP+ accordingly based on media type detected.
Incorporate the below changes in DellEMC Z9332F platform:
- Implemented watchdog platform API support
- Implement ‘get_position_in_parent’, ‘is_replaceable’ methods for all device types
- Change return type of SFP methods to match specification in sonic_platform_common/sfp_base.py
- Added platform.json file in device directory.
Co-authored-by: V P Subramaniam <Subramaniam_Vellalap@dell.com>
This is the SAI 4.3.3.5 code drop from BRCM to address 2 CSP case and initial MMU changes
Note the MMU changes is the same as that of SAI 4.3.3.4-1 (#7341) but with official patch.
- Case CS00012178716 [4.3] Polling SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES fails often on TH3
- Case CS00012159273 [4.3.3.3] [TD3][IPFWD] subnet broadcast flooding end with extra VLAN Tag if member port of VLAN interface deleted then added back to VLAN
Once we have this PR merged, will validate each one of the above to ensure they are indeed fixed...
Preliminary tests looks fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on TD3 DUT and all passed:
ipfwd/test_dir_bcast.py
fib/test_fib.py
vxlan/test_vxlan_decap.py
decap/test_decap.py
fdb/test_fdb.py
#### Why I did it
- xcvrd crash was seen in latest 201811 images.
- For Dell S6100,API 2.0 uses poll mode while 1.0 was still using interrupt mode.
#### How I did it
- Modified get_transceiver_change_event in 1.0 to poll mode.
* 7260cx3 DualToR config.bcm support based on DualToR setting in device metadata at boot time.
For HWSKU Arista-7260CX3-C64 the MMU setting SOC for T0/T1 is also combined into the config.bcm.j2 logic so use just one config file and adding delta based on Switch Roles.
To add latest SAI drop REL_4.3.3.3 to SONIC which addresses the following CSP cases:
CS00012058054: [4.3][IPinIP][TTL-PIPE] IPinIP TTL Pipe Mode is NOT working it is behaving UNIFORM mode even programed as PIPE mode
CS00011227466: [4.3] Warmboot support with tunnel encap
#### Why I did it
- The xcvrd service requires an event detection function, unplug or plug in the transceiver.
#### How I did it
- Add sysfs interrupt to notify userspace app of external interrupt
- Implement get_change_event() in chassis api.
- Also begin installing Python 3 sonic-platform package for Celestica platforms
- Provide `hw-management-generate-dump.sh` for `show techsupport`
- Load `optoe3` for OSFP and QSFP-DD transceivers
- Enhance reboot-cause caching robustness
Signed-off-by: Samuel Angebault <staphylo@arista.com>
In preparation for the merging of Azure/sonic-platform-common#173, which properly defines class and instance members in the Platform API base classes.
It is proper object-oriented methodology to call the base class initializer, even if it is only the default initializer. This also future-proofs the potential addition of custom initializers in the base classes down the road.
#### Why I did it
To incorporate the below changes in DellEMC S5232, Z9264, Z9332 platforms.
- Update thermal high threshold values
- Make watchdog API Python2 and Python3 compatible
- Fix LGTM alerts
- Z9264: Fix get_change_event timer value
#### How I did it
- Use 'universal_newlines=True' in subprocess.Popen call.
- Change the timeout in 'get_change_event' to milliseconds to match specification in sonic_platform_common/chassis_base.py
The S6000 devices, the cold reboot is abrupt and it is likely to cause issues which will cause the device to land into EFI shell. Hence the platform reboot will happen after graceful unmount of all the filesystems as in S6100.
Moved the platform_reboot to platform_reboot_override and hooked it to the systemd shutdown services as in S6100
- Add support for `DCS-7050SX3-48YC8` and `DCS-7050SX3-48C8` platform
- Add support for more variants of `DCS-7280CR3-32[PD]4`
- Add Supervisor to Linecard consutil support
- Complete Watchdog platform API support
- Fix some PSU behavior on `DCS-7050QX-32` and `DCS-7060CX-32S`
- Fix SEU management on `DCS-7060CX-32S`
- Allow kernel modules to build up to linux 5.10
- Rename led color `orange` to `amber`
- Miscellaneous fixes
Azure/sonic-utilities#1431 changes the path to the udevprefix.conf file. The file previously inappropriately resided in the <platform>/plugins/ directory. That directory is reserved for now-deprecated Python platform plugins, and will be removed in the near future.
This PR is needed to fix the show interface counters output issue where the counters are not correct due to an issue introduced in BRCM SAI 4.3.
- How to verify it
Without the fix if one injects packets, the expected counters for the interfaces involved do not show correct count values. The RX count looks to be TX count while TX count looks to be RX count but even that the values could not be trusted.
After the fix the counters started to look correct. Here is one sample output taken after the fix is applied where I manually injected 10,000 packets into Ethernet92 to be routed out of the port channel member port Ethernet40. Also injected 10,000 invalid packets into the same Ethernet92 and all 10,000 packets were shown RX_DRP correctly.
Merged bcmsai 4.3.0.13 code to top of master bcmsai 4.3.0.10-5.
- How to verify it
Ran nightly regression on T0 and T1 topology using bcmsai 4.3.0.13. Test results are better than previous runs.
For T0 -
New test passing - CRM, Decap, FDB, platform test, VxLAN
New test failing – PFC unknown MAC
For T1 –
New test passing – Port Channel
New test failing – platform test
This PR is to fix issue: #6603
The CP210x driver not attached properly after first login (cold-plug), lsusb and dmesg shown that the usb devices has been recognized but driver is missing.
Submodule commits included:
* src/sonic-platform-common 6ad0004...bd4dc03 (1):
> [sonic_sfp/qsfp_dd.py] Update DOM capability method name to align with other drivers (#163)
Also align all calling function names to match.
When Building syncd-rpc, libthrift has dependency on libboost-atomic1.71.0,
however the debian packager install version 1.67 instead. This PR
preinstalls libboost-atomic v 1.71 to avoid falling back to v 1.67.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
ACL entry set attribute updates all the entries in the table. The correct behavior is to set the attribute on single entry.
- How I did it
Current SDK code, while setting the new attribute, is going through all the entries and updating it. Added a logic to check for requested entry and only allow for that ACL entry.
A case has filed with BRCM. Once an official fix is provided by BRCM, we will then remove this in house fix and apply the official fix.
Accton util applies lsmod to check if drivers are installed.
But lsmod may return error on startup and skip module installation.
Signed-off-by: roy_lee <roy_lee@edge-core.com>
1. BRCM SAI Debian build need not have any Kernel version dependency - Starting with 4.3 BRCM made changes in SAI so that this dependency has been cleaned up. We can now remove the Kernel Version dependency from Azure Pipeline build script.
2. Bypass PEER_MODE p2mp setting causing SYNCd crash on non-TD3 SKUs - Temporarily patch BRCM SAI code to not cause SYNCd crash when Orchagent program SAI_TUNNEL_ATTR_PEER_MODE: SAI_TUNNEL_PEER_MODE_P2MP on Non-TD3 SKUs. Will remove this when BRCM provide proper fix to address this issue.
**- Why I did it**
PR https://github.com/Azure/sonic-platform-common/pull/102 modified the name of the SFF-8436 (QSFP) method to align the method name between all drivers, renaming it from `parse_qsfp_dom_capability` to `parse_dom_capability`. Once the submodule was updated, the callers using the old nomenclature broke. This PR updates all callers to use the new naming convention.
**- How I did it**
Update the name of the function globally for all calls into the SFF-8436 driver.
Note that the QSFP-DD driver still uses the old nomenclature and should be modified similarly. I will open a PR to handle this separately.
**- Why I did it**
To incorporate the below changes in DellEMC S6100, S6000 platforms.
- S6100, S6000:
- Enable 'thermalctld'
- Implement DeviceBase methods (presence, status, model, serial) for Fantray and Component
- Implement ‘get_position_in_parent’, ‘is_replaceable’ methods for all device types
- Implement ‘get_status’ method for Fantray
- Implement ‘get_temperature’, ‘get_temperature_high_threshold’, ‘get_voltage_high_threshold’, ‘get_voltage_low_threshold’ methods for PSU
- Implement ‘get_status_led’, ‘set_status_led’ methods for Chassis
- SFP:
- Make EEPROM read both Python2 and Python3 compatible
- Fix ‘get_tx_disable_channel’ method’s return type
- Implement ‘tx_disable’, ‘tx_disable_channel’ and ‘set_power_override’ methods
- S6000:
- Move PSU thermal sensors from Chassis to respective PSU
- Make available the data of both Fans present in each Fantray
**- How I did it**
- Remove 'skip_thermalctld:true' in pmon_daemon_control.json
- Implement the platform API methods in the respective device files
- Use `bytearray` for data read from transceiver EEPROM
- Change return type of 'get_tx_disable_channel' to match specification in sonic_platform_common/sfp_base.py
- Why I did it
Fix issue: ptf_nn_agent isn't able to start in syncd-rpc docker on buster.
- How I did it
The issue is fixed by installing python-dev, cffi and nnpy for python 2 explicitly.
- How to verify it
Run copp test on RPC image.
Starting with BRCM SAI 4.3.1.5 we see the following :ethtool not fount" error in syslog during boot up:
```
Jan 27 07:36:14.712472 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1:
Jan 27 07:36:14.712844 str-s6100-acs-1 INFO syncd#/supervisord: syncd ethtool: not found
Jan 27 07:36:14.713228 str-s6100-acs-1 INFO syncd#/supervisord: syncd #015
Jan 27 07:36:14.713840 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet39 speed 40000 rc:32512
Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet39
Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet39
Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initializePort: Initializing port alias:Ethernet36 pid:1000000000040
Jan 27 07:36:14.726793 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet36 admin:0 oper:0 addr:4c:76:25:f5:48:80 ifindex:75 master:0
Jan 27 07:36:14.727967 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet36(ok) to state db
Jan 27 07:36:14.729331 str-s6100-acs-1 NOTICE swss#orchagent: :- addHostIntfs: Create host interface for port Ethernet36
Jan 27 07:36:14.752398 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1: ethtool: not found#015
Jan 27 07:36:14.752689 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet36 speed 40000 rc:32512
Jan 27 07:36:14.756050 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet36
Jan 27 07:36:14.757585 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet36
```
It seems that starting with BRCM SAI 4.2.1.5 syncd is using ethtool to set the host interface speed and since this ethtool was not part of the syncd Docker, we observe these "ethtool not found" issue.
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
- combine docker-ptf-saithrift into docker-ptf docker
- build docker-ptf under platform vs
- remove docker-ptf for other platforms
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Fixes#6445
Because the ipmihelper.py script in the 9332 folder is slightly different than the common one (due to LGTM fixes), when the common one gets copied during build time it causes the workspace/build to become dirty.
Signed-off-by: Danny Allen <daall@microsoft.com>
**- Why I did it**
- The thermalctld daemon on the Pmon docker requires support from the thermal manager API.
**- How I did it**
- Removed the old function for detecting a faulty fan.
- Removed the old function for detecting excess temperature.
- Implement thermal_manager APIs based on ThermalManagerBase
- Implement thermal_conditions APIs based on ThermalPolicyConditionBase
- Implement thermal_actions APIs based on ThermalPolicyActionBase
- Implement thermal_info APIs based on ThermalPolicyInfoBase
- Add thermal_policy.json
Accton util applies lsmod to check if drivers are installed.
But lsmod may return error on startup and skip module installation.
Signed-off-by: Brandon Chuang <brandon_chuang@edge-core.com>
It's been reported that accton fan monitor process keeps consuming memory after few days.
The amount of memory occupied increases in linear and never leased.
Signed-off-by: roy_lee <roy_lee@edge-core.com>
Fix#119
when parallel build is enable, multiple dpkg-buildpackage
instances are running at the same time. /var/lib/dpkg is shared
by all instances and the /var/lib/dpkg/updates could be corrupted
and cause the build failure.
the fix is to use overlay fs to mount separate /var/lib/dpkg
for each dpkg-buildpackage instance so that they are not affecting
each other.
Signed-off-by: Guohan Lu <lguohan@gmail.com>