#### Why I did it
When a kernel crash occurs, the system will reboot to the kdump capture kernel if kdump is enabled (`config kdump enable`). In the kdump capture boot, it only stores the crash information, and then reboot the system to a normal boot.
In this boot, no SONiC service is started but it invokes `reboot` which is actually the SONiC reboot that depends on SONiC services. There is a logic to skip all SONiC stuff and invoke platform reboot in SONiC reboot to avoid issues.
However, on Nvidia platforms, the platform reboot still depends on SONiC services, which can cause issues.
So, the Debian reboot is called directly in platform reboot if it is invoked from the kdump capture boot.
#### How I did it
Manual test
- Why I did it
Added the fwtrace config files in order to be able to call the mlxstrace utility during the show techsupport dump.
Work item tracking
Microsoft ADO (number only):
- How I did it
Added fwtrace config files. Added path to these files to sai.profile for each mlnx device.
- How to verify it
Execute the show techsupport command and check if mlxstrace output is in system dump.
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
* [E1031] add platform specific reboot command support
Why I did it
E1031: add platform specific cold reboot support
How I did it
Use the CPLD to trigger board level power cycle when cold reboot
How to verify it
Do reboot stress test and check the reboot cause history
* [E1031] try to umount filesystem before power cycle reboot
* [E1031] remove fstrim in customized reboot script
Why I did it
Certain all-numeric device IDs of PCI devices in the pcie.yaml file are left unquoted, leading to false mismatch flags in the pcie daemon and subsequently leads to log flooding. This PR fixes that issue.
Work item tracking
Microsoft ADO (number only): 24578930
How I did it
Added quotes around numeric PCI devices in the pcie.yaml files of the following platforms:
x86_64-mlnx_msn2700-r0
x86_64-mlnx_msn4600c-r0
How to verify it
Install latest image after the merge and verify that syslogs are not flooded with PCI device mismatch errors
Why I did it
port_config.ini and hwsku.json are needed to generate the default config
switch_type needs to be "dpu" to spawn the right set of processes during dvs initialization and to make sure that DASH APIs can be handled properly
Work item tracking
Microsoft ADO 24375371:
How I did it
Use the same hwsku.json and port_config.ini for DPU-2P as the ones used for Nvidia-MBF2H536C SKU in nvidia-sonic sonic-buildimage repo.
Set switch_type to "dpu" in DEVICE_METADATA configuration to make sure DASH specific APIs are handled properly
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
* [202012][platform/barefoot] (#8543)
Why I did it
Pcied running by python 2.
How I did it
dropped python2 support and add python3 support for pcied in file docker-pmon.supervisord.conf.j2
How to verify it
docker exec pmon supervisorctl status
* [Netberg][nba710] Added initial support for Aurora 710
Signed-off-by: Andrew Sapronov <andrew.sapronov@gmail.com>
---------
Signed-off-by: Andrew Sapronov <andrew.sapronov@gmail.com>
Co-authored-by: Kostiantyn Yarovyi <kostiantynx.yarovyi@intel.com>
Define a generic 2-port NPU SKU for docker-sonic-vs to
enable DASH vstests to pass on azure pipelines
Work item tracking
Microsoft ADO 24375371:
How I did it
Define a generic 2-port NPU hwsku that is used only for DASH-specific vstests.
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
Added support data for fabric monitoring in CONFIG_DB
The CONFIG_DB now has the FABRIC_MONITOR|FABRIC_MONITOR_DATA table for default value for fabric port monitoring. An example output of getting this table is:
sonic-db-cli CONFIG_DB hgetall "FABRIC_MONITOR|FABRIC_MONITOR_DATA"
{'monErrThreshCrcCells': '1', 'monErrThreshRxCells': '61035156', 'monPollThreshIsolation': '1', 'monPollThreshRecovery': '8'}
The CONFIG_DB now also has a table for each fabric port for its isolate status.
An example output of getting this table is:
sonic-db-cli CONFIG_DB hgetall "FABRIC_PORT|Fabric20"
{'alias': 'Fabric20', 'isolateStatus': 'False', 'lanes': '20'}
Why I did it
sonic-mgmt is failing tests due to invalid test data in platform.json
Fwutil is upset the chassis name in the platform_component.json of the 7060CX-32S
How I did it
Fixed the aforementioned issues
Why I did it
Work item tracking
Microsoft ADO (24182162):
How I did it
update the config.bcm to set the default fec RS 100G Linecard
How to verify it
Tests on chassis
Why I did it
fix possible cpld race read issue between watchdog and reboot cause
process
How I did it
Use fcntl.flock to limit parallel access to cpld sys file
How to verify it
It can be simulated and verified with following python script
``` python3
import fcntl
import signal
import threading
exit_flag = False
def get_cpld_reg_value(getreg_path, register):
file = open(getreg_path, 'w+')
# Acquire an exclusive lock on the file
fcntl.flock(file, fcntl.LOCK_EX)
try:
file.write(register + '\n')
file.flush()
# Seek to the beginning of the file
file.seek(0)
# Read the content of the file
result = file.readline().strip()
finally:
# Release the lock and close the file
fcntl.flock(file, fcntl.LOCK_UN)
file.close()
return result
def cpld_read(thread_num, cpld_reg, expect_val):
while not exit_flag:
val
= get_cpld_reg_value("/sys/devices/platform/dx010_cpld/getreg",
cpld_reg)
#print(f"Thread {thread_num}: get cpld reg {cpld_reg}, value
{val}")
if val != expect_val:
print(f"Thread {thread_num}: get cpld reg {cpld_reg}, value
{val}, expect_val {expect_val}")
def signal_handler(sig, frame):
global exit_flag
print("Ctrl+C detected. Quitting...")
exit_flag = True
if __name__ == '__main__':
# Register the signal handler for Ctrl+C
signal.signal(signal.SIGINT, signal_handler)
t1 = threading.Thread(target=cpld_read, args=(1, '0x103', '0x11',))
t2 = threading.Thread(target=cpld_read, args=(2, '0x141', '0x00',))
t1.start()
t2.start()
t1.join()
t2.join()
```
Why I did it
Update the device data files to support 1024 LAGs for Nokia IXR7250E platform
fixes https://github.com/Nokia-ION/ndk/issues/15
How I did it
Update the lag_id_end=1024 in chassisdb.conf file and add the trunk_group_max_members=16 in the BCM config file
How to verify it
check to allow to create lag ids up to 1024 with 16 port members
Signed-off-by: mlok <marty.lok@nokia.com>
* [202205] Update SOC properties for DLR_INIT based pfcwd recovery (#15217)
Why I did it
Update soc properties for certain roles that need to use pfcwd dlr init based recovery mechanism
How to verify it
Updated the templates on a 7050cx3 dual tor and 7260 T1 which satisfies these conditions and validated pfcwd recovery which uses DLR_INIT based mechanism. Also validated that this mechanism is not used on 7050cx3 single tor with the updated templates
Signed-off-by: Neetha John <nejo@microsoft.com>
#### Why I did it
To add new SKU Mellanox-SN4700-O8C48 with following requirements:
| Port configuration | Value |
| ------ |--------- |
| Breakout mode for each port |**Defined in port mapping** |
| Speed of the port | **Defined in Port mapping** |
| Auto-negotiation enable/disable | **No setting required** |
| FEC mode | **No setting required** |
|Type of transceiver used | **Not needed**|
Buffer configuration | Value
------ |---------
Shared headroom | **Enabled**
Shared headroom pool factor | **2**
Dynamic Buffer | **Disable**
In static buffer scenario how many uplinks and downlinks? | **48x100G Downlinks and 8x400G uplinks**
2km cable support required? | **Yes**
Switch configuration | Value
------ |---------
Warmboot enabled? | **yes**
Should warmboot be added to SAI profile when enabled? | **yes**
Is VxLAN source port range set? | **No**
Should Vxlan source port range be added to SAI profile when set. | **No**
Is Static Policy Based Hashing enabled? | **No**
Port Mapping
| Ports | Mode |
| ------ |--------- |
| 1-12 | 2x100G |
| 13-20 | 1x400G |
| 21-32 | 2x100G |
Number of Uplinks / Downlinks:
T1 topology: **48x100G Downlinks 8x400G uplinks**.
Length of downlink: **40m**
Length of uplink: **2000m**
##### Work item tracking
- Microsoft ADO **(number only)**:
#### How I did it
Defined the SKU as per requirements
#### How to verify it
Load the SKU and verify if all links come up and traffic passes.
#### A picture of a cute animal (not mandatory but encouraged)
* [armhf][Nokia-7215]Add HWSKU files for new SAI
Add new easy bringup (EZB) files for new SAI 1.11.0
* [Nokia][devicedata]Modified the port autoneg default setting for Nokia-7215 platform
[armhf][Nokia-7215]Update profile.ini
Why I did it
Optimize Silverstone led init process, this linkscan = off can cause the sonic port link status async with bcm shell after reboot.
How I did it
Remove redundant code.
How to verify it
After reboot, the ports can linkup normally.
* Update PG headroom settings ports based on port speed/cable length
* Updated XOFF settings to use chip level numbers than core
* Updated PG headroom based on uplink/downlink side
* fix for sonic-config-gen tests
* More fixes for unit test cases
* more test fixes
* Merged multiple functions into one
Add new Nokia build target and establish an arm64 build:
Platform: arm64-nokia_ixs7215_52xb-r0
HwSKU: Nokia-7215-A1
ASIC: marvell
Port Config: 48x1G + 4x10G
How I did it
- Change make files for saiserver and syncd to use Bulleseye kernel
- Change Marvell SAI version to 1.11.0-1
- Add Prestera make files to build kernel, Flattened Device Tree blob and ramdisk for arm64 platforms
- Provide device and platform related files for new platform support (arm64-nokia_ixs7215_52xb-r0).
- Why I did it
Update SAI xml file to align with the default SKU
- How I did it
Update the SN5600 SAI xml file
- How to verify it
Install image on SN5600 device
Why I did it
Update ECN settings for T2 chassis
How I did it
Updated qos config file to load these settings during switch bootup
How to verify it
Verified on line card on T2 chassis
- Why I did it
Update the sensors.conf and pcie.yaml according to the real hardware.
- How I did it
Update the sensors.conf and pcie.yaml
- How to verify it
run relevant sonic-mgmt test cases.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
Today at most 128 LAGs are supported. This is not sufficient if there are many LAGs with just few ports.
How I did it
Increase LAG Ids to 1024 for DNX device.
Why I did it
When reboot the chassis by issuing "sudo reboot" on Supervisor card. The internal midplane communication xe0 should be shutdown to avoid double reboot on the linecard.
Added a udev link rule to disable the autoneg on AMD xgbe port Xe0 and Xe1 and make the setting in sync with the peer Broadcom greyhound ports.
How I did it
Modify the Nokia-7250IXRE specific reboot script on the Supervisor card to shutdown the internal interface xe0. Also move reboot linecard code to the top of the script to make sure the notification has been send to Linecard before shutdown the xe0 interface.
Introduced a new rule 80-net-by-driver.link to disable the autoneg on the AMD size. This change requires the latest NDK which contains the change to set the autoneg on the xe0 and xe1 port on the Greyhound.
Signed-off-by: mlok <marty.lok@nokia.com>
Why I did it
Support Egress Mirroring on supported Arista platforms
How I did it
Add necessary soc properties for egress mirroring recycle ports to be created
Signed-off-by: Nathan Wolfe <nwolfe@arista.com>
Updated asic_port_names for all Arista LC SKUs to follow latest naming
conventions to remove redundant ASICx suffix. For
Arista-7800R3-48CQ2-C48, added the asic_port_name mapping.
Why I did it
sonic-sfp based sfp impl would be deprecated in future, change to sfp-refactor based implementation.
How I did it
Use the new sfp-refactor based sfp implementation for seastone.
How to verify it
Manual test sfp platform api or run sfp platform test cases.
Why I did it
For better accounting purposes, updating the ingress lossy traffic profile to use static threshold. This change is only intended for Th devices using RDMA-CENTRIC profiles
How I did it
Update the buffer templates for Th devices in RDMA-CENTRIC folder to use the correct threshold
How to verify it
Verified the changes manually on a Th device.
Existing unit tests render Th template from the RDMA-CENTRIC folder. Updated the expected output to use the correct threshold
Why I did it
Update dynamic threshold to -1 to get optimal performance for RDMA traffic
How I did it
Modified pg_profile_lookup.ini to reflect the correct value
Signed-off-by: Neetha John <nejo@microsoft.com>
Provide platform-components.json for Clearwater2 and Wolverine
These files are needed for fwutil platform sonic-mgmt tests to pass.
Fix PikeZ platform_components.json
Co-authored-by: Patrick MacArthur <pmacarthur@arista.com>
Co-authored-by: Andy Wong <andywong@arista.com>
Why I did it
Platform cases test_tx_disable, test_tx_disable_channel, test_power_override failed in dx010.
How I did it
Add i2c access algorithm for CPLD i2c adapters.
How to verify it
Verify it with platform_tests/api/test_sfp.py::TestSfpApi test cases.
To support 64 cores on arista skus. Fixesaristanetworks/sonic#77
Remapped recycle ports to lowers core port ids and set appl_param_nof_ports_per_modid to 64.
Why I did it
Seastone does not have the psu fans' status led, need to reflect it in platform.json.
How I did it
Set the psu fans status led available to false.
How to verify it
Verify it with platform_tests/api/test_psu_fans.py::TestPsuFans::test_set_fans_led case.
Why I did it
The sensors and sensord processes were reporting data on unused sensors.
This lead to ALARM messages or erroneous values that could be misinterpreted.
How I did it
Ignore the affected sensors in the sensors.conf
How to verify it
Check that there are no longer ALARM messages from sensord in the syslog or in the output of sensors