* [swss] Chassis db clean up optimization and bug fixes
This commit includes the following changes:
- Fix for regression failure due to error in finding CHASSIS_APP_DB in
pizzabox (#PR 16451)
- After attempting to delete the system neighbor entries from
chassis db, before starting clearing the system interface entries,
wait for sometime only if some system neighbors were deleted.
If there are no system neighbors entries deleted for the asic coming up,
no need to wait.
- Similar changes for system lag delete. Before deleting the
system lag, wait for some time only if some system lag memebers were
deleted. If there are no system lag members deleted no need to wait.
- Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
is coming up, when system neigh entries are deleted from chassis ap
db (as part of chassis db clean up), there is no orchs/process running to
process the delete messages from chassis redis. Because of this, stale system
neigh are entries present in the local STATE_DB. The stale entries result in
creation of orphan (no corresponding data path/asic db entry) kernel neigh
entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
the local STATE_DB when sevice comes up.
Signed-off-by: vedganes <veda.ganesan@nokia.com>
* [swss] Chassis db clean up bug fixes review comment fix - 1
Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)
Signed-off-by: vedganes <veda.ganesan@nokia.com>
---------
Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41fc22)
Stop installing development packages from telemetry docker images to avoid unnecessary space usage.
### Why I did it
From 202305, libswsscommon-dev and the Boost headers were brought in telemetry docker image incorrectly, which result in unnecessary space usage.
##### Work item tracking
- Microsoft ADO **(number only)**:25176224
#### How I did it
Remove libswsscommon-dev accordingly.
#### How to verify it
Image building.
Signed-off-by: anamehra anamehra@cisco.com
Added a check for DEVICE_METADATA before accessing the data. This prevents the j2 failure when var is not available.
In #15080, there was a command added to re-add 127.0.0.1/8 to the lo
interface when the networking configuration is being brought down.
However, the trigger for that command is `down`, which, looking at
ifupdown2 configuration files, runs immediately after 127.0.0.1/16 is
removed. This means there may be a period of time where there are no
loopback addresses assigned to the lo interface, and redis commands will
fail.
Fix this by changing this to pre-down, which should run well before
127.0.0.1/16 is removed, and should always leave lo with a loopback
address.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Co-authored-by: Saikrishna Arcot <sarcot@microsoft.com>
* Change the CAK key length check in config plugin, macsec test profile changes
* Fix the format in add_profile api
The changes needed in various macsec unit tests and config plugin when we move to accept the type 7 encoded key format for macsec. This goes along with PR : sonic-net/sonic-swss#2892 raised earlier.
Co-authored-by: judyjoseph <53951155+judyjoseph@users.noreply.github.com>
- Why I did it
Because the Spectrum4 devices don't support mlxtrace utility.
- How I did it
Edit sai.profile and remove mlxtrace_spectrum4_itrace_*.cfg.ext files
Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
Co-authored-by: Vadym Hlushko <62022266+vadymhlushko-mlnx@users.noreply.github.com>
#### Why I did it
To enable qos config for a certain backend deployment mode, for resource-type "Compute-AI".
This deployment has the following requirement:
- Config below enabled if DEVICE_TYPE as one of backend_device_types
- Config below enabled if ResourceType is 'Compute-AI'
- 2 lossless TCs' (2, 3)
- 2 lossy TCs' (0,1)
- DSCP to TC map uses 4 DSCP code points and maps to the TCs' as follows:
"DSCP_TO_TC_MAP": {
"AZURE": {
"48" : "0",
"46" : "1",
"3" : "3",
"4" : "4"
}
}
- WRED profile has green {min/max/mark%} as {2M/10M/5%}
This required template change <as in the PR> in addition to the vendor qos.json.j2 file (not included here).
### How I did it
#### How to verify it
- with the above change and the vendor config change, generated the qos.json file and verified that the objective stated in "Why I did it" was met
- verified no error
### Description for the changelog
Update qos_config.j2 for Comptue-AI deployment on one of backend device type roles
- Why I did it
1. Update Mellanox HW-MGMT package to newer version V.7.0030.1011
2. Replace the SONiC PMON Thermal control algorithm with the one inside the HW-MGMT package on all Nvidia platforms
3. Support Spectrum-4 systems
- How I did it
1. Update the HW-MGMT package version number and submodule pointer
2. Remove the thermal control algorithm implementation from Mellanox platform API
3. Revise the patch to HW-MGMT package which will disable HW-MGMT from running on SIMX
4. Update the downstream kernel patch list
Signed-off-by: Kebo Liu <kebol@nvidia.com>
How I did it
Update Yang definition of ACL_TABLE_TYPE.
Update existing testcase.
Add new testcase to cover lowercase key scenario.
How to verify it
Verified by building sonic_yang_models-1.0-py3-none-any.whl. While building the target package, unit tests were run and passed.
On S6100 we are seeing almost 100K interrupts per second on intels i801 SMBUS controller which affects systems performance.
We now disable the i801 driver interrupt and instead enable polling
Microsoft ADO (number only): 24910530
How I did it
Disable the interrupt by passing the interrupt disable feature argument to i2c-i801 driver
How to verify it
This fix is NOT applicable for ARM based platforms. Applicable only for intel based platforms:-
- On SN2700 its already disabled in Mellanox hw-mgmt
- Celestica DX010 and E1031
- Dell S6100 verified the interrupts are no longer incrementing.
- Arista 7260CX3
Signed-off-by: Prince George <prgeor@microsoft.com>
Why I did it
sonic-mgmt test failure is seen for update_firmware component API
Microsoft ADO: 25208748
How I did it
Edited API 2.0 to fix this issue.
How to verify it
Run sonic-mgmt test after the fix and verify it passes.
* [Mellanox] Update SDK/FW/SAI to 4.6.1020/2012.1020/SAIBuild2305.25.0.3 (#16096)
SONiC changes:
1. Support Spectrum4 ASIC FW binary building.
2. Support new SDK sx-obj-desc lib building since new SAI need it.
3. Remove SX_SCEW debian package from Mellanox SDK build since we are no longer using it (we use libxml2 instead).
4. Update SAI, SDK, FW to version 4.6.1020/2012.1020/SAIBuild2305.25.0.3
SDK/FW bug fixes
1. In SPC-1 platforms: Fastboot mode is not operational for Split port with Force mode in 50G speed
SFP modules are kept in disabled state after set LPM (low power mode) on/off for at least 3 minutes.
2. When preforming fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.
SDK/FW Features
1. On SN2700 all ports can support y cable by credo
SAI bug Fixes
1. When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 will remain programmed for the rule Issue resolved after the fix
2. Allow the max scale of virtual routers to be configure for SPC-1, SPC-2, SPC-3 when fastboot enable
3. Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE
SAI features
1. Port init profile
- How I did it
Update SDK/FW/SAI make files
- How to verify it
Run full sonic-mgmt regression on Mellanox platform
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Conflicts:
platform/mellanox/mlnx-sai.mk
* Fix issue: unprintable character is rendered when handling comments in j2
Use "{#-" and "-#}" to mark comments in jinja template
Signed-off-by: Stephen Sun <stephens@nvidia.com>
---------
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Co-authored-by: Stephen Sun <stephens@nvidia.com>
#### Why I did it
src/sonic-linux-kernel
```
* 9cb7ea0 - (HEAD -> 202305, origin/202305) arm64: dts: marvell: Add Nokia 7215-IXS-A1 board (#321) (24 hours ago) [Pavan-Nokia]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Why I did it
Advance dhcpmon to a3c5381 in 202305 branch.
a3c5381 - (HEAD, origin/master, origin/HEAD, master) Merge pull request src: Add libnl3 build.sh script #11 from jcaiMR/dev/jcai_fix_err_log (11 days ago) [StormLiangMS]
c5ef7e7 - Change common_libs dependencies from buster to bullseye (Updating docker-orchagent/syncd Dockerfile and start.sh #9)
824a144 - replace atoi with strtol (Rename hostname #6) (10 weeks ago) [Mai Bui]
32c0c3f - Fix libswsscommon package installation for non-amd64 (README.md leaves out docker-database #7) (10 weeks ago) [Saikrishna Arcot]
Work item tracking
Microsoft ADO (25048723):
How I did it
How to verify it
Run test_dhcp_relay.py, no failure
- Why I did it
Fixed build failure when flag ENABLE_SFLOW_DROPMON=y set
- How I did it
Fixed sflow dropmon patch to align with hsflowd version 2.0.45
Signed-off-by: rajkumar38 <rpennadamram@marvell.com>
Why I did it
Update the platform_reboot of Nokia Platform IXR-7250E-36x400G to displays the correct reboot-cause history when reboot from supervisor card.
Work item tracking
Microsoft ADO (number only):
How I did it
Modify the platform_reboot script to copy the correct reboo-cause.txt file from NDK to the /host/reboot-cause directory at the down cycle when the reboot is issued from Supervisor (for both reboot right after install a new image and normal reboot)
Signed-off-by: mlok <marty.lok@nokia.com>
- Why I did it
watchdogutil uses platform API watchdog instance to control/query watchdog status. In Nvidia watchdog status, it caches "armed" status in a object member "WatchdogImplBase.armed". This is not working for CLI infrastructure because each CLI will create a new watchdog instance, the status cached in previous instance will totally lose. Consider following commands:
admin@sonic:~$ sudo watchdogutil arm -s 100 =====> watchdog instance1, armed=True
Watchdog armed for 100 seconds
admin@sonic:~$ sudo watchdogutil status ======> watchdog instance2, armed=False
Status: Unarmed
admin@sonic:~$ sudo watchdogutil disarm =======> watchdog instance3, armed=False
Failed to disarm Watchdog
- How I did it
Use sysfs to query watchdog status
- How to verify it
Manual test
Unit test
It appears that this was initially added to provide the git-retry
command (which doesn't appear to be used today). However, this repo is
now also providing bazel (which is actually used in our build today),
and this command (along with git-retry) expects some vpython3 binary to
be set up/installed.
Rather than going through that, just get rid of this repo.
- Why I did it
Update Mellanox MFT tool to version 4.25.0-62
- How I did it
Update the MFT tool make file
- How to verify it
Run full sonic-mgmt regression.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Bmc is a valid neighbor type in minigraph, however it was missing from the YANG model definition. Usually, the Bmc type device can be neighbor of BmcMgmtToRRouter. This PR is to introduce this type.
Why I did it
Dell S6100 Platform components needs to be updated.
How I did it
Modified platform.json to fix the issue.
How to verify it
Run sonic-mgmt component test and check whether it passes.
Why I did it
According to ACL-Table-Type-HLD, the value type of MATCHES, ACTIONS and BIND_POINTS should be list instead of string. Opening this PR to update the definition of BMCDATA and BMCDATAV6.
How I did it
Update the definition of BMCDATA and BMCDATAV6 in minigraph-parser.
How to verify it
Verified by UT and build SONiC image.
#### Why I did it
src/sonic-swss
```
* c869c1df - (HEAD -> 202305, origin/202305) update portStatIds for cisco (#2876) (11 minutes ago) [Zhixin Zhu]
* dd152288 - [Dynamic Buffer][Mellanox] Skip PGs in pending deleting set while checking accumulative headroom of a port (#2871) (12 minutes ago) [Stephen Sun]
* 97068ff1 - Fix error in peer response time when headroom is calculated for 800G (#2860) (16 minutes ago) [Stephen Sun]
```
#### How I did it
#### How to verify it
#### Description for the changelog