- Why I did it
Based on some research some products might experience an occasional IO failures in the communication between CPU and SSD because of NCQ.
There seems to be a problem between some kernel versions and some SATA controllers.
Syslog error message examples:
Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
Some vendors already disabled NCQ on their platforms in SONiC due to similar issue:
[Arista] Disable ATA NCQ for a few products #13739 [Arista] Disable ATA NCQ for a few products
[Arista] Disable SSD NCQ on DCS-7050CX3-32S #13964 [Arista] Disable SSD NCQ on DCS-7050CX3-32S
Also there are other discussions on Debian/Ubuntu forums about similar issues and it was suggested to disable NCQ:
https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous
- How I did it
Add a kernel parameter to tell libata to disable NCQ
- How to verify it
Use FIO tool - fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
Why I did it
Add the create_only_config_db_buffers attribute to the DEVICE_METADATA|localhost. If the "create_only_config_db_buffers" exists and is equal to "true" - the buffers will be created according to the config_db configuration (for example BUFFER_QUEUE|* table), otherwise the maximum available buffers (which are read from SAI) will be created, regardless of the CONFIG_DB buffers configuration.
Work item tracking
Microsoft ADO (number only):
How I did it
Add the create_only_config_db_buffers.json files for Mellanox devices (not MSFT SKU's), and inject the content to the CONFIG_DB during the swss docker container start.
How to verify it
Manual verification:
Install the image with this PR included on the not MSFT SKU switch
Check the show queue counters output and verify that only configured in CONFIG_DB buffers are created
root@sonic:/home/admin# show queue counters
Port TxQ Counter/pkts Counter/bytes Drop/pkts Drop/bytes
--------- ----- -------------- --------------- ----------- ------------
Ethernet0 UC0 0 0 0 N/A
Ethernet0 UC1 0 0 0 N/A
Ethernet0 UC2 0 0 0 N/A
Ethernet0 UC3 0 0 0 N/A
Ethernet0 UC4 0 0 0 N/A
Ethernet0 UC5 0 0 0 N/A
Ethernet0 UC6 0 0 0 N/A
Open the /usr/share/sonic/device/$DEVICE/$SKU/create_only_config_db_buffers.json and change it to:
"create_only_config_db_buffers": "false"
Do config reload
Check the show queue counters output and verify that all available buffers are created
root@sonic:/home/admin# show queue counters
Port TxQ Counter/pkts Counter/bytes Drop/pkts Drop/bytes
--------- ----- -------------- --------------- ----------- ------------
Ethernet0 UC0 0 0 0 N/A
Ethernet0 UC1 0 0 0 N/A
Ethernet0 UC2 0 0 0 N/A
Ethernet0 UC3 0 0 0 N/A
Ethernet0 UC4 0 0 0 N/A
Ethernet0 UC5 0 0 0 N/A
Ethernet0 UC6 0 0 0 N/A
Ethernet0 UC7 60 15346 0 N/A
Ethernet0 MC8 N/A N/A N/A N/A
Ethernet0 MC9 N/A N/A N/A N/A
Ethernet0 MC10 N/A N/A N/A N/A
Ethernet0 MC11 N/A N/A N/A N/A
Ethernet0 MC12 N/A N/A N/A N/A
Ethernet0 MC13 N/A N/A N/A N/A
Ethernet0 MC14 N/A N/A N/A N/A
Ethernet0 MC15 N/A N/A N/A N/A
Why I did it
To improve FAST reboot dataplane downtime
Work item tracking
N/A
How I did it
Updated SAI xml config file
How to verify it
Run sonic-mgmt tests of fastboot
- Why I did it
Because the Spectrum4 devices don't support mlxtrace utility.
- How I did it
Edit sai.profile and remove mlxtrace_spectrum4_itrace_*.cfg.ext files
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
Co-authored-by: Vadym Hlushko <62022266+vadymhlushko-mlnx@users.noreply.github.com>
- Why I did it
Enabled port late create on SN5600 Spectrum-4 switch boots up with no ports
Work item tracking
N/A
- How I did it
Updated SAI xml config file
- How to verify it
Run sonic-mgmt tests of fastboot
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
- Why I did it
Update SAI xml file to align with the default SKU
- How I did it
Update the SN5600 SAI xml file
- How to verify it
Install image on SN5600 device
- Why I did it
Update the sensors.conf and pcie.yaml according to the real hardware.
- How I did it
Update the sensors.conf and pcie.yaml
- How to verify it
run relevant sonic-mgmt test cases.
Signed-off-by: Kebo Liu <kebol@nvidia.com>