**DEPENDS ON: [[graceful reboot] Add the pre_reboot_hook script execution, add the watchdog arm before the reboot](https://github.com/sonic-net/sonic-utilities/pull/3203)**
#### Why I did it
Add support for the `graceful reboot` instead of the `sysfs power cycle` to avoid filesystem corruption
### How I did it
Rename the `platform_reboot` script to the `pre_reboot_hook`.
Remove the sysfs power cycle function, from now on the Debian reboot (`/sbin/reboot`) will be executed instead of the sysfs power cycle.
#### How to verify it
1. Start watching logs by using `show log -f` and `journalctl -p debug -f`
2. Execute the `reboot` command from the switch CLI
3. Check in logs that all systemd services terminated
- Why I did it
Based on some research some products might experience an occasional IO failures in the communication between CPU and SSD because of NCQ.
There seems to be a problem between some kernel versions and some SATA controllers.
Syslog error message examples:
Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
Some vendors already disabled NCQ on their platforms in SONiC due to similar issue:
[Arista] Disable ATA NCQ for a few products #13739 [Arista] Disable ATA NCQ for a few products
[Arista] Disable SSD NCQ on DCS-7050CX3-32S #13964 [Arista] Disable SSD NCQ on DCS-7050CX3-32S
Also there are other discussions on Debian/Ubuntu forums about similar issues and it was suggested to disable NCQ:
https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous
- How I did it
Add a kernel parameter to tell libata to disable NCQ
- How to verify it
Use FIO tool - fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
* [buffers] Add create_only_config_db_buffers.json for MLNX devices (not MSFT SKU), inject it at the start of the swss docker
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
* [buffers] Align the sonic-device_metadata.yang
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
---------
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
- Why I did it
Add new breakout modes to be used in PAM4 supported cables
- How I did it
- How to verify it
Verified the 50G per lane breakout modes are applied properly on the switch
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
- Why I did it
Added the fwtrace config files in order to be able to call the mlxstrace utility during the show techsupport dump.
Work item tracking
Microsoft ADO (number only):
- How I did it
Added fwtrace config files. Added path to these files to sai.profile for each mlnx device.
- How to verify it
Execute the show techsupport command and check if mlxstrace output is in system dump.
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
- Why I did it
The chassis name in MSN4410 platform_components.json is not correct
- How I did it
Fix the chassis name
- How to verify it
Run relevant platform API test
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
For MSN4410/MSN4600/MSN4700 now they can support fetching PSU voltage threshold, no need to skip the psu voltage check in system health monitoring, so update the system health monitoring configuration file for these platforms.
- How I did it
remove skip PSU change config from the system_health_monitoring_config.json file
- How to verify it
Build image run on these platforms, system health monitoring will not report error against PSU voltage
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
The capability files were incorrect in comparison to the marketing spec of the SN4410 platform.
- How I did it
Aligned the capability files according to the marketing spec.
- How to verify it
Basic manual sanity checks:
1. Check if critical docker containers were UP
2. Check if interfaces were created and were UP
3. Check if interfaces created in the syncd docker container by executing – sx_api_ports_dump.py script
4. Check the logs from the start of the switch – everything was OK
5. Verified the port breakout
Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>
Signed-off-by: Stephen Sun stephens@nvidia.com
Why I did it
Support zero buffer profiles
Add buffer profiles and pool definition for zero buffer profiles
Support applying zero profiles on INACTIVE PORTS
Enable dynamic buffer manager to load zero pools and profiles from a JSON file
Dependency: It depends on Azure/sonic-swss#1910 and submodule advancing PR once the former merged.
How I did it
Add buffer profiles and pool definition for zero buffer profiles
If the buffer model is static:
Apply normal buffer profiles to admin-up ports
Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
Apply normal buffer profiles to all ports
buffer manager will take care when a port is shut down
Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists
Originally, all the macros to generate the above buffer objects took active ports only as an argument
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports need to be added
To be backward compatible, a new series of macros are introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:
The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/temlates folder in compiling time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:
One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbol link to the above files
Update sonic-cfggen test accordingly:
Adjust example output file of JSON template for unit test
Add unit test in for Mellanox's new buffer templates.
How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.
- Why I did it
Add NVIDIA Copyright header to "mellanox" files
- How I did it
Add NVIDIA Copyright header as a comment for Mellanox files
- How to verify it
Sanity tests and PR checkers.
#### Why I did it
Add platform_asic file to each platform folder in sonic-device-data package. The file content will be used as the ground truth of mapping from PLATFORM_STRING to switch ASIC family.
One use case of the mapping is to prevent installing a wrong image, which targets for other ASIC platforms. For example, currently we have several ONIE images naming as sonic-*.bin, it's easy to mistakenly install the wrong image. With this mapping built into image, we could fetch the ONIE platform string, and figure out which ASIC it is using, and check we are installing the correct image.
After this PR merged, each platform vendor has to add one mandatory text file `device/PLATFORM_VENDOR/PLATFORM_STRING/platform_asic`, with the content of the platform's switch ASIC family.
I will update https://github.com/Azure/SONiC/wiki/Porting-Guide after this PR is merged.
You can get a list of the ASIC platforms by `ls -b platform | cat`. Currently the options are
```
barefoot
broadcom
cavium
centec
centec-arm64
generic
innovium
marvell
marvell-arm64
marvell-armhf
mellanox
nephos
p4
vs
```
Also support
```
broadcom-dnx
```
#### How I did it
#### How to verify it
Test one image on DUT. And check the folders under `/usr/share/sonic/device`
- Why I did it
The MSN4410 platform was missing 2x100G and 4x50G supported breakout modes in platform.json
- How I did it
Added the aforementioned breakout modes to platform.json
- How to verify it
Run show interface breakout on a MSN4410 and verify that 2x100G and 4x50G are listed in the supported breakout modes for ports 1-24.
#### Why I did it
New SN410 A1 system has a different sensor layout with A0 system, needs a new sensor conf file to support it.
#### How I did it
Since the SN4410 A1 system use exactly the same sensor layout as the SN4700 A1 system, so add a symbol link linking to the SN4700 A1 sensor conf file to reuse.
#### How to verify it
Run sensor test against the SN4410 A1 system;
Run platform related regression test against the SN4410 A1 system
- Why I did it
To create SDK dump on Mellanox devices when SDK event has occurred.
- How I did it
Set the SKUs keys needed to initialize the feature in SAI.
- How to verify it
Simulate SDK event and check that dump is created in the expected path.
- Why I did it
The default breakout mode according to hwsku.json for the MSN4410 is 1x400G and this is not a supported breakout mode according to its platform.json
This causes a conflict on boot of this platform and no containers on the switch will init successfully.
- How I did it
Referenced the platform specification files and updated platform.json
- How to verify it
Install master version of SONiC on MSN4410
Boot switch and verify swss is successfully running using docker ps
#### Why I did it
Current platform.json lacks some peripheral device related facts, like chassis/fan/pasu/drawer/thermal/components names, numbers, etc.
#### How I did it
Add platform device facts to the platform.json file
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
Add missed files for dynamic buffer calculation for ACS-MSN3420 and ACS-MSN4410
- How I did it
asic_table.j2: Add mapping from platform to ASIC
Add buffer_dynamic.json.j2 for ACS-MSN4410.
- How to verify it
Check whether the dynamic buffer calculation daemon starts successfully.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
The file device/mellanox/x86_64-mlnx_msn4410-r0/plugins/sfputil.py is not a software link for device/mellanox/x86_64-mlnx_msn2700-r0/plugins/sfputil.py. And it is still using python2 syntex which causes some SFP CLI error. The PR is to change it to a softlink and add 4410 support in device/mellanox/x86_64-mlnx_msn2700-r0/plugins/sfputil.py.
To fix [DPB| wrong aliases for interfaces](https://github.com/Azure/sonic-buildimage/issues/6024) issue, implimented flexible alias support [design doc](https://github.com/Azure/SONiC/pull/749)
> [[dpb|config] Fix the validation logic of breakout mode](https://github.com/Azure/sonic-utilities/pull/1440) depends on this
#### How I did it
1. Removed `"alias_at_lanes"` from port-configuration file(i.e. platfrom.json)
2. Added dictionary to "breakout_modes" values. This defines the breakout modes available on the platform for this parent port, and it maps to the alias list. The alias list presents the alias names for individual ports in order under this breakout mode.
```
{
"interfaces": {
"Ethernet0": {
"index": "1,1,1,1",
"lanes": "0,1,2,3",
"breakout_modes": {
"1x100G[40G]": ["Eth1"],
"2x50G": ["Eth1/1", "Eth1/2"],
"4x25G[10G]": ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"],
"2x25G(2)+1x50G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"],
"1x50G(2)+2x25G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"]
}
}
}
```
#### How to verify it
`config interface breakout`
Signed-off-by: Sangita Maity <samaity@linkedin.com>
**- Why I did it**
PR https://github.com/Azure/sonic-platform-common/pull/102 modified the name of the SFF-8436 (QSFP) method to align the method name between all drivers, renaming it from `parse_qsfp_dom_capability` to `parse_dom_capability`. Once the submodule was updated, the callers using the old nomenclature broke. This PR updates all callers to use the new naming convention.
**- How I did it**
Update the name of the function globally for all calls into the SFF-8436 driver.
Note that the QSFP-DD driver still uses the old nomenclature and should be modified similarly. I will open a PR to handle this separately.
- Why I did it
platform.json and hwsku.json files are required for a feature called Dynamic Port Breakout
- How I did it
Created capability files according to platform specification SN4410
- How to verify it
Full qualification requires bugs fixes reported under sonic-buildimage
Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>