Why I did it
[Build] Support to use loosen version when failed to install python packages
It is to fix the issue #14012
How I did it
Try to use the installation command without constraint
How to verify it
Why I did it
Platform cases test_tx_disable, test_tx_disable_channel, test_power_override failed in dx010.
How I did it
Add i2c access algorithm for CPLD i2c adapters.
How to verify it
Verify it with platform_tests/api/test_sfp.py::TestSfpApi test cases.
#### Why I did it
Remove dialout as critical process as it is no longer used in prod. As part of future work, can remove dialout completely
#### How I did it
Remove from critical process list
Why I did it
Seastone does not have the psu fans' status led, need to reflect it in platform.json.
How I did it
Set the psu fans status led available to false.
How to verify it
Verify it with platform_tests/api/test_psu_fans.py::TestPsuFans::test_set_fans_led case.
- Why I did it
Update MFT version to 4.21.0-100 to include a fix for an issue reported using mlxlink on qsfp-dd
- How I did it
Update mft.mk
- How to verify it
Run regression on Mellanox platforms
Why I did it
Fix the docker-base-stretch build issue in nephos platform.
Collecting supervisord-dependent-startup==1.4.0
Downloading d386c3d2cf/supervisord_dependent_startup-1.4.0-py2.py3-none-any.whl
Collecting toposort>=1.5 (from supervisord-dependent-startup==1.4.0)
Downloading 44e51b4216/toposort-1.9.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '/tmp/pip-build-LnROQE/toposort/setup.py'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-LnROQE/toposort/
The other platforms have been upgraded to docker-base-buster, not impacted.
How I did it
Pin the toposort version to 1.7, the package supervisord-dependent-startup has dependency on it.
The toposort>=1.8 only for python3, is not applicable to python2.
Why I did it
Support to upgrade packages, do better cleanup after the build.
How I did it
Remove the no use preference version control file after the build.
How to verify it
- Why I did it
sfp_event.py gets a PMPE message when a cable event is available. In PMPE message, there is no label port available. Current sfp_event.py is using sx_api_port_device_get to get 64 logical ports attributes, and find the label port from those 64 attributes. However, if there are more than 64 ports, sfp_event.py might not be able to find the label port and drop the PMPE message.
- How I did it
Don't use hardcoded 64, get logical port number instead.
- How to verify it
Manual test
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:
admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.
- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.
- How to verify it
Upgrade from 201911 to 202012
202012 to 201911 downgrade
202012 -> 202012 reboot
ONIE -> 202012 boot (First FW burn)
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
Docker build has a low rate of hanging up.
It hangs on different steps. So, it looks like a bug in docker daemon.
How I did it
Start a daemon process to scan running time more than 1 hours, and kill the process.
How to verify it
- Why I did it
There are 3 tasks in xcvrd:
main task, run a loop to recover missing SFP static information to DB every 1 minute
SFP state task, a process which listens cable plug in/out event, insert SFP static information to DB while a cable is inserted
SFP DOM update task, a thread which handles cable DOM information update every 1 minute
Let assume user replaces QSFP with QSFP-DD. There are two issues:
Only SFP state task listens cable plug in/out event, main task and SFP DOM update task does not know SFP type has changed, they still “think” the SFP type is QSFP. So, main task and SFP DOM update task uses QSFP standard to parse QSFP-DD EEPROM which causes corrupted data.
There is a race condition between main task and SFP state task. They both insert SFP static information to DB. Depends on timing, it is possible that main task using wrong SFP type to override SFP static information.
The PR is to fix these two issues.
There is no such issue on 202205 and above because there is a refactor for xcvrd:
SFP state task was changed from process to thread, so that all 3 tasks share the same memory space, they always have correct SFP type.
Recover missing SFP information logical was moved from main task to SFP state task. There is no race condition anymore.
- How I did it
It is difficult to back port latest xcvrd because there are many refactor/new features in xcvrd after 202012 release. It will be huge effort to do so. Based on that, we decided to fix the issue on Nvidia platform API side. The fix is that: refreshing SFP type before any SFP API which accessing SFP EEPROM. Refreshing SFP type before any SFP API would cause a small performance down: Due to my test on 202012 branch, accessing transceiver INFO and DOM INFO for 32 ports takes 1.7 seconds before the change. The number changes to 2.4 seconds after the change. I suppose the performance down is acceptable.
- How to verify it
Manual test
Regression
#### Why I did it
Submodule update for sonic-swss-common with following change:
```
3e34309 2023-02-11 | RedisPipeline ignore flush when call dtor from another thread. (#736) [Hua Liu]
```
Why I did it
Some products might experience an occasional IO failure in the communication between CPU and SSD.
Based on some research it could be attributable to some device not handling ATA NCQ (Native Command Queue).
This issue currently affect 4 products:
DCS-7170-32C*
DCS-7170-64C
DCS-7060DX4-32
DCS-7260CX3-64
How I did it
This change disable NCQ on the affected drive for a small set of products.
How to verify it
When the fix is applied, these 2 patterns can be found in the dmesg.
ata1.00: FORCE: horkage modified (noncq)
NCQ (not used)
Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
with NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA)
READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec
WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec
without NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used))
READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec
WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec
Which release branch to backport (provide reason below if selected)
Why I did it
1. fix chassis test_set_fans_led case
2. fix chassis get_name case mismatch issue
3. fix fan_drawer test_set_fans_speed
4. fix component test_components test case
How I did it
Add corresponding configuration into chassis json file
How to verify it
Run platform tests cases to verify these failure cases
#### Why I did it
Include below commits:
```
7147354 2023-02-14 | Fix: zero route may have empty nexthop (#276) [Qi Luo]
e60a64c 2022-11-30 | Use github code scanning instead of LGTM (#274) [Liu Shilong]
```
#### Why I did it
1.57.x SDK based incremental drop that addresses:
Fix for MIGSMSFT-158
Support for VxLAN and BFD Serviceability CLI
sfputil reset platform fix to handle 100G optics
Added thermal management feature for ZR optics sensors
#### How I did it
Update cisco-8000 submodule to v0.2.5
Include following commits:
- c98b9f09 [202012][orchagent]: Get bridge port ID from FDBOrch cache instead of SAI API #2657 (#2658)
- 59886b8f [MuxOrch] Enabling neighbor when adding in active state (#2601)
```
39cdb49c7 [202012][show] Add bgpraw to show run all (#2639)
b3ebba2ca [202012][show] add new CLI to show tunnel route objects #2255 (#2659)
d08f59b9f Fixed a bug in "show vnet routes all" causing screen overrun. (#2644) (#2654)
a996abdb5 [202012][show] show logging CLI support for logs stored in tmpfs (#2652)
c60f771c0 [202012][show_bfd] add local discriminator in show bfd command (#2616)
```
#### Why I did it
Cherry pick from #10083
It is to fix the issue #10048
When building .raw image, for instance, target/sonic-broadcom.raw, it will generate a .bin image, target/sonic-broadcom.bin, as the intermediate file. The intermediate file is a build target which may contains different dependencies with the raw one.
#### How I did it
Rename the intermediate file.
#### Why I did it
Some tests under platform_tests/api were failing on the 7050CX3 due to outdated facts in platform.json
#### How I did it
Updated platform.json facts with appropriate values
#### How to verify it
Run tests under platform_tests/api to verify no failures
Why I did it
fix DX010 fan drawer and watchdog platform test case issues
How I did it
1. Add fan_drawer get_maximum_consumed_power support
2. Adjust maximum watchdog timeout value check
How to verify it
Run test_fan_drawer and test_watchdog test cases.
9551386 Jing Zhang Mon Jan 23 15:49:52 2023 -0800 [202012] Update link prober stats post logic #159 (#162)
e54f289 Liu Shilong Wed Nov 30 18:04:15 2022 +0800 Use github code scanning instead of LGTM (#157)
This is to reduce writes to disk, which then can use the SSD to get worn
out faster.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
(cherry picked from commit ee1c32a802)
Why I did it
Advance sonic-dhcp-relay submodule head on 202012 branch
How I did it
Added the following commits:
a4b15d8 jcaiMR Thu Dec 29 14:18:28 2022 +0800 fix relay-reply dhcpv6 packet counter issue (#29)
677543f jcaiMR Sat Dec 17 00:24:32 2022 +0800 fix handleSwssNotification crash in dhcp6relay (#28)
ed86546 jcaiMR Wed Dec 14 14:08:58 2022 +0800 Fix multiple vlan issue (#27)
5ec1f5b Vivek Thu Dec 8 09:44:15 2022 -0800 Made the Error log informative (#22)
063d41b jcaiMR Wed Nov 30 14:41:53 2022 +0800 disable cfg dynamic change (#25)
d4a51f6 kellyyeh Tue Jan 31 18:09:08 2023 -0800 Add unittest infrastructure (#5) (#31)
How to verify it
Ran full dhcpv6 test suite on lab device
Why I did it
When getting system mac of centec platform, it would increase by 1 the last byte of mac, but it could not consider the case of carry.
How I did it
Firstly, I would replace the ":" with "" of mac to a string.
And then, I would convert the mac from string to int and increase by 1, at last convert it to string with inserting ":".
- Why I did it
Added BIOS upgrade infra
- How I did it
Added new make target
- How to verify it
Copy msn3800_bios.tar.gz to platform/mellanox/bios
make configure PLATFORM=mellanox
make target/files/buster/msn3800_bios.tar.gz
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
Why I did it
Fix issue caused by dualtor support PR [dhcpmon] Open different socket for dual tor to enable interface filtering #11201
Improve code
How I did it
On single ToR, packets received count was duplicated due to socket filter set to "inbound"
Tx count not increasing due to filter set to "inbound". Added an outbound socket to count tx packets
Added vlan member interface mapping for Ethernet interface to vlan interface lookup in reference to PR Fix multiple vlan issue sonic-dhcp-relay#27
Exit when socket fails to initialize to allow dhcp_relay docker to restart
How to verify it
Tested on vstestbed single tor and dual tor, sent packets and verify printed out dhcpmon rx and tx counters is correct
Correct number of tx increases
Tx does not increase when ToR is on standby
#### Why I did it
Update submodule for sonic-swss:
e739e6c - 2023-01-27 : custom advertised prefix for primary vxlan tunnel [202012] (#2641) [siqbal1986]
sonic-restapi:
99c467d - 2023-01-24 : Add API support for adv prefix and custom monitoring (#133) [Prince Sunny]
347684a - 2022-11-30 : Use github code scanning instead of LGTM (#132) [Liu Shilong]
…(#12912)
Why I did it
As described in detail in #12753, the current FRR patch 0009-ignore-route-from-default-table.patch is causing unwanted FRR/zebra error logs. This change gets rid of the error messages for routes from kernel default table while these routes are ignored in prefix encoding.
How I did it
This fix updates the original 0009 patch by checking if the routes are from table default before printing the error logs. The original patch checks the same condition and ignores the routes from table default in prefix encoding.
How to verify it
Follow the steps to repro as described in #12753.
Also verify the test case ipfwd/test_nhop_count.py no longer fails due to the error messages.
#### Why I did it
Resolve cherry-pick conflict for https://github.com/sonic-net/sonic-buildimage/pull/12912
Why I did it
Fix the following issues for Seastone platform:
- system-health issue: show system-health detail will not complete #9530, Celestica Seastone DX010-C32: show system-health detail fails with 'Chassis' object has no attribute 'initizalize_system_led' #11322
- show platform firmware updates issue: Celestica Seastone DX010-C32: show platform firmware updates #11317
- other platform optimization
How I did it
Modify and optimize the platform implememtation.
How to verify it
Manual run the test commands described in these issues.
1.57.x SDK based incremental drop that addresses:
1) orchagent crash
2) Port LED issue
3) Tunnel endpoint stats
4) test_warm_reboot issue
5) nhop test failure
6) "show platform versions' CLI
- Why I did it
Backport #9795
Python select.select accept a optional timeout value in seconds, however, the value passes to it is a value in millisecond.
- How I did it
Transfer the value to millisecond.
- How to verify it
Manual test
#### Why I did it
Backport https://github.com/sonic-net/sonic-buildimage/pull/13246 to 202012 branch.
In case of warm/fast reboot, the hardware reboot cause will NOT be cleared because CPLD will not be touched in this flow. To not confuse the reboot cause determine logic, the leftover hardware reboot cause shall be skipped by the platform API, platform API will return the 'REBOOT_CAUSE_NON_HARDWARE' instead of the "hardware" reboot cause.
#### How I did it
Check the proc cmdline to see whether the last reboot is a warm or fast reboot, if yes skip checking the leftover hardware reboot cause.
#### How to verify it
a. Manual test:
> 1. Perform a power loss
> 2. Perform a warm/fast reboot
> 3. check the reboot cause should be "warm-reboot" or "fast-reboot" instead of "power loss"
b. Run reboot cause related regression test.