f1d6655004
- Why I did it Research indicates that some products may experience occasional I/O failures in communication between the CPU and the SSD because of NCQ (Native Command Queuing). The problem appears to arise between certain kernel versions and certain SATA controllers. Syslog error message examples: Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED". Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED". Some vendors have already disabled NCQ on their platforms in SONiC because of similar issues: [Arista] Disable ATA NCQ for a few products #13739; [Arista] Disable SSD NCQ on DCS-7050CX3-32S #13964. Discussions on Debian/Ubuntu forums report similar issues and suggest disabling NCQ: https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous - How I did it Add a kernel parameter that tells libata to disable NCQ. - How to verify it Use the FIO tool: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 |
||
---|---|---|
.. | ||
ACS-MSN3800 | ||
Mellanox-SN3800-C64 | ||
Mellanox-SN3800-D24C52 | ||
Mellanox-SN3800-D28C49S1 | ||
Mellanox-SN3800-D28C50 | ||
Mellanox-SN3800-D100C12S2 | ||
Mellanox-SN3800-D112C8 | ||
plugins | ||
default_sku | ||
installer.conf | ||
pcie.yaml | ||
platform_asic | ||
platform_components.json | ||
platform_reboot | ||
platform_wait | ||
platform.json | ||
pmon_daemon_control.json | ||
port_peripheral_config.j2 | ||
sensors.conf | ||
system_health_monitoring_config.json | ||
thermal_policy.json |
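
Before running the FIO stress test from the commit message, it can help to confirm that the kernel parameter actually took effect. The commit text does not name the parameter, so the sketch below assumes it is `libata.force=noncq`, the option documented for libata in the Linux kernel parameters list. The helper names `cmdline_disables_ncq` and `print_queue_depths` are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Minimal sketch: helpers to check whether NCQ was disabled at boot.
# Assumption: the kernel parameter is "libata.force=noncq" (the commit
# message only says "a kernel parameter", it does not spell it out).

# Return 0 if the kernel command line passed as $1 contains a
# libata.force option with a "noncq" entry, 1 otherwise.
cmdline_disables_ncq() {
    for arg in $1; do
        case "$arg" in
            libata.force=*)
                # The value is a comma-separated list; an entry may carry
                # a port prefix such as "1:noncq" or "2.00:noncq".
                for entry in $(printf '%s' "${arg#libata.force=}" | tr ',' ' '); do
                    case "$entry" in
                        noncq|*:noncq) return 0 ;;
                    esac
                done
                ;;
        esac
    done
    return 1
}

# Print the SCSI queue depth of every sd* block device that exposes one.
# With NCQ disabled, SATA disks are expected to report a depth of 1.
print_queue_depths() {
    for f in /sys/block/sd*/device/queue_depth; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

On a running switch, `cmdline_disables_ncq "$(cat /proc/cmdline)" && echo "NCQ disabled"` checks the boot command line, and `print_queue_depths` shows the per-device queue depth. If both look as expected, the FIO run can then be watched for the `FPDMA QUEUED` errors in syslog.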