src/sonic-swss
* 17c4d731 - (HEAD -> 202205, origin/202205) Remove system neighbor DEL operation in m_toSync if SET operation for (#2853) (3 hours ago) [Song Yuan]
src/sonic-platform-daemons
* 8147e25 - (HEAD -> 202205, origin/202205) Revert "Added PCIe transaction check for all peripherals on the bus (#331)" (12 hours ago) [Ying Xie]
sonic-build image side change to fix source interface selection in dual tor scenario.
dhcprelay related PR:
[master]fix dhcpv6 relay dual tor source interface selection issue sonic-dhcp-relay#42
Announce dhcprelay submodule to 6a6ce24 to include PR #42
Graceful restart is a key event for bgpd, related log print is debug level. To change it to info level to get more visibilities when this kind of event is triggered.
Why I did it
When sonic is managed by k8s, the sonic container is managed by k8s daemonset, daemonset identifies its members by labels. Currently when restarting a sonic service by systemctl, if the service's container is already managed by k8s, systemd script stops the container by removing the feature label to make it disjoin from k8s daemonset, and then starts it by adding the label to make it join k8s daemonset again.
This behavior would cause problem during k8s container upgrade. Containers in daemonset are upgraded in a rolling fashion, that means the daemonset version is updated first, then rollout the new version to containers with precheck/postcheck one by one. However, if a sonic device joins a daemonset, k8s will directly deploy a pod with the current version of daemonset, it is expected when a device joins k8s cluster at first time.
But for a device which has already joined k8s cluster, the re-joining daemonset will cause the container upgraded to new version without precheck, so if a systemd service is restarted during daemonset upgrade, the container may be upgraded without precheck and break rolling update policy. To fix it, we need to remove the logic about dropping k8s label in systemd service stop script for kube mode.
Work item tracking
Microsoft ADO (number only): 24304563
How I did it
Don't drop label in systemd service stop script when feature's set_owner is kube. Only drop label when feature's set_owner is local.
How to verify it
The label feature_enabled should be always true if the feature's set owner is kube.
Why I did it
When do clean up container images, current code has two bugs need to be fixed. And some variables' name maybe cause confused, change the variables' name.
Work item tracking
Microsoft ADO (number only): 24502294
How I did it
We do clean up after tag latest successfully. But currently tag latest function only return 0 and 1, 0 means succeed and 1 means failed, when we get 1, we will retry, when we get 0, we will do clean up. Actually the code 0 includes another case we don't need to do clean up. The case is that when we are doing tag latest, the container image we want to tag maybe not running, so we can not tag latest and don't need to cleanup, we need to separate this case from 0, return -1 now.
When local mode(v1) -> kube mode(v2) happens, one problem is how to handle the local image, there are two cases. one case is that there was one kube v1 container dry-run(cause we don't relace the local if kube version = local version), we will remove the kube v1 image and tag the local version with ACR prefix and remove local v1 local tag. Another case is that there was no kube v1 container dry-run, we remove the local v1 image directly, cause the local v1 image should not be the last desire version.
About the docker_id variable, it may cause confused, it's actually docker image id, so rename the variable. About the two dicts and the list, rename them to be more readable.
How to verify it
Check tag latest and image clean up result.
Why I did it
During the upgrade process via k8s, the feature's systemd service will restart as well, all of the feature systemd service has restart number limit, and the limit number is too small, only three times. if fallback happens when upgrade, the start count will be 2, just once again, the systemd service will be down. So, need to bypass this. This restart function will be called when do local -> kube, kube -> kube, kube ->local, each time call this function, we indeed need to restart successfully, so do reset-failed every time we do restart.
When need to go back to local mode, we do systemd restart immediately without waiting the default restart interval time so that we can reduce the container down time.
Work item tracking
Microsoft ADO (number only):
24172368
How I did it
Before every restart for upgrade, do reset feature's restart number. The restart number will be reset to 0 to bypass the restart limit.
When need to go back to local mode, we do systemd restart immediately.
How to verify it
Feature's systemd service can be always restarted successfully during upgrade process via k8s.
src/sonic-platform-daemons
* bef58aa - (HEAD -> 202205, origin/202205) Added PCIe transaction check for all peripherals on the bus (#331) (10 hours ago) [Ashwin Srinivasan]
#### Why I did it
After k8s upgrade a container, k8s can only know the container is running, don't know the service's status inside container. So we need a probe inside container, k8s will call the probe to check whether the container is really ready.
##### Work item tracking
- Microsoft ADO **(number only)**: 22453004
#### How I did it
Add a health check probe inside config engine container, the probe will check whether the start service exit normally or not if the start service exists and call the python script to do container self-related specific checks if the script is there. The python script should be implemented by feature owner if it's needed.
more details: [design doc](https://github.com/sonic-net/SONiC/blob/master/doc/kubernetes/health-check.md)
#### How to verify it
Check path /usr/bin/readiness_probe.sh inside container.
#### Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [x] 202205
- [x] 202211
#### Tested branch (Please provide the tested image version)
- [x] 20220531.28
src/sonic-platform-common
* 56f227a - (HEAD -> 202205, origin/202205) More prevention of fatal exception caused by VDM dictionary missing fields when a transceiver has just been pulled (#376) (3 hours ago) [snider-nokia]
Why I did it
To reduce the container's dependency from host system
Work item tracking
Microsoft ADO (number only):
17713469
How I did it
Move the k8s container startup script to config engine container, other than mount it from host.
How to verify it
Check file path(/usr/share/sonic/scripts/container_startup.py) inside config engine container.
Signed-off-by: Yun Li <yunli1@microsoft.com>
Co-authored-by: Qi Luo <qiluo-msft@users.noreply.github.com>
How I did it
Free up Multiprocessing Manager resource at task stop request
[self.mpmgr.shutdown() in task_stop]
How to verify it
time systemctl stop system-health.service
* [chassis][lldp] Fix the lldp error log in host instance which doesn't contain front pannel ports
---------
Signed-off-by: mlok <marty.lok@nokia.com>
Co-authored-by: Marty Y. Lok <76118573+mlok-nokia@users.noreply.github.com>
* [buildsystem] Fix hiredis package version: 0.14.1-1 (#15461)
- Why I did it
To fix hiredis compilation
- How I did it
Changed package version: 0.14.0-3~bpo9+1 -> 0.14.1-1
- How to verify it
make configure PLATFORM=mellanox
make target/sonic-mellanox.bin
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
* Update Makefile
---------
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
Co-authored-by: Nazarii Hnydyn <nazariig@nvidia.com>
* Updated default ECN settings for T2 chassis (#14388)
Why I did it
Update ECN settings for T2 chassis
How I did it
Updated qos config file to load these settings during switch bootup
How to verify it
Verified on line card on T2 chassis
* Fix for test failures
* Test case failures
* test case fix
Why I did it
Update the definition of acl table type BMCDATA and BMCDATAV6 in minigraph parser.
Work item tracking
Microsoft ADO (number only): 24101023
How I did it
Update the definition of acl table type BMCDATA and BMCDATAV6 in minigraph parser.
How to verify it
Ran unittest to verify this update:
Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
* [YANG] Add MUX_CABLE yang model (#11797)
Why I did it
Address issue #10970
sign-off: Jing Zhang zhangjing@microsoft.com
How I did it
Add sonic-mux-cable.yang and unit tests.
How to verify it
Compile Compile target/python-wheels/sonic_yang_mgmt-1.0-py3-none-any.whl and target/python-wheels/sonic_yang_models-1.0-py3-none-any.whl.
Pass sonic-config-engine unit test.
Which release branch to backport (provide reason below if selected)
201811
201911
202006
202012
202106
202111
202205
Description for the changelog
Link to config_db schema for YANG module changes
f8fe41a023/src/sonic-yang-models/doc/Configuration.md (mux_cable)
* [YANG] add peer switch model (#11828)
Why I did it
Address issue #10966
sign-off: Jing Zhang zhangjing@microsoft.com
How I did it
Add sonic-peer-switch.yang and unit tests.
How to verify it
Compile Compile target/python-wheels/sonic_yang_mgmt-1.0-py3-none-any.whl and target/python-wheels/sonic_yang_models-1.0-py3-none-any.whl.
Which release branch to backport (provide reason below if selected)
201811
201911
202006
202012
202106
202111
202205
Description for the changelog
Link to config_db schema for YANG module changes
b721ff87b9/src/sonic-yang-models/doc/Configuration.md (peer-switch)
src/sonic-swss
* 6a193e0 - (HEAD -> 202205, origin/202205) [Dual-ToR][ACL] bind LAG to ACL table in order to guarantee rule coverage if lag menber will be added to LAG after binding (#2749) (5 hours ago) [Andriy Yurkiv]
* Re-add 127.0.0.1/8 when bringing down the interfaces
With #5353, 127.0.0.1/16 was added to the lo interface, and then
127.0.0.1/8 was removed. However, when bringing down the lo interface,
like during a config reload, 127.0.0.1/16 gets removed, but 127.0.0.1/8
isn't added back to the interface. This means that there's a period of
time where 127.0.0.1 is not available at all, and services that need to
connect to 127.0.01 (such as for redis DB) will fail.
To fix this, when going down, add 127.0.0.1/8. Add this address before
the existing configuration gets removed, so that 127.0.0.1 is available
at all times.
Note that running `ifdown lo` doesn't actually bring down the loopback
interface; the interface always stays "physically" up.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
src/sonic-utilities
* 0b878087 - (HEAD -> 202205, origin/202205) Add display support for serial field in show chassis modules status CLI (#2858) (4 days ago) [amulyan7]
src/sonic-swss
* 2aec547 - (HEAD -> 202205, origin/202205) [pfcwd] Enhance DLR_INIT based recovery and DLR_PACKET_ACTION for broadcom platforms (#2807) (4 days ago) [Neetha John]
* fa4acd3 - Fix to substract the macsec sectag size from port MTU during InitializePort (#2789) (5 days ago) [judyjoseph]
src/sonic-utilities
* cd08aa69 - (HEAD -> 202205, origin/202205) [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (#2853) (2 days ago) [vdahiya12]
* 4b96bd7e - Update pcieutil error message on loading common pcie module (#2786) (6 days ago) [cytsao1]
* 77b725ca - Fix the show interface counters throwing exception on device with no external interfaces (#2851) (6 days ago) [abdosi]
Why I did it
Update cable length for uplink/downlink ports for chassis and and update PG/pool headroom size accordingly.
Work item tracking
17880812
How I did it
Updated cable length as well as buffer config in HWSKU files.
Why I did it
To improve readability of config.bcm, fixed the alignment of soc properties
How to verify it
Build sonic_config_engine-1.0-py3-none-any.whl successfully
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
Update soc properties for certain roles that need to use pfcwd dlr init based recovery mechanism
How to verify it
Updated the templates on a 7050cx3 dual tor and 7260 T1 which satisfies these conditions and validated pfcwd recovery which uses DLR_INIT based mechanism. Also validated that this mechanism is not used on 7050cx3 single tor with the updated templates
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
Our k8s feature will pull new version container images for each upgrade, the container images inside sonic will be more and more, but for now we don’t have a way to clean up the old version container images, the disk may be filled up. Need to add cleaning up the old version container images logic.
Work item tracking
Microsoft ADO (number only):
17979809
How I did it
Remove the old version container images besides the feature's current version and last version image, last version image is saved for supporting fallback.
How to verify it
Check whether the old version images are removed
Co-authored-by: lixiaoyuner <35456895+lixiaoyuner@users.noreply.github.com>
Why I did it
Introduce a new valid neighbor element type to YANG.
Work item tracking
Microsoft ADO (number only): 23994521
How I did it
Add MgmtLeafRouter to element network type list.
How to verify it
Passes UTs
Co-authored-by: Jing Kan <jika@microsoft.com>
This is backport of #14757
SONiC Yang model support for IPv6 link local
What I did
Created SONiC Yang model for IPv6 link local
How I did it
Defined Yang models for IPv6 link local based on https://github.com/sonic-net/SONiC/blob/master/doc/ipv6/ipv6_link_local.md
How to verify it
Added enable test case.
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
Why I did it
We need to store information of power shelf in config_db for SONiC MX switch. Current minigraph parser cannot parse rack_mgmt_map field.
Work item tracking
Microsoft ADO (number only): 22179645
How I did it
Add support for parsing rack_mgmt_map.
src/sonic-platform-common
* ff72811 - (HEAD -> 202205, origin/202205) Fix issue '<' not supported between instances of 'NoneType' and 'int' (#371) (5 hours ago) [Junchao-Mellanox]
* f2a419d - Render Media lane and Media assignment options info from Application Code (#368) (8 hours ago) [rajann]
* d8bad10 - Retrieve FW version using CDB command for CMIS transceivers + handle single bank FW versioning (#372) (8 hours ago) [mihirpat1]
Why I did it
This PR is to backport PR #11117 into 202205 branch.
This PR is to define Yang model for SYSTEM_DEFAULTS table.
The table was introduced in PR sonic-net/SONiC#982
The table will be like
"SYSTEM_DEFAULTS": {
"tunnel_qos_remap": {
"status": "enabled"
}
}
Work item tracking
Microsoft ADO (https://msazure.visualstudio.com/One/_workitems/edit/23037078)
How I did it
Add a new yang file sonic-system-defaults. Yang.
How to verify it
Verified by UT
What I did:
In FRR command update source <interface-name> is not at address-family level. Because of this
internal peer route-map for ipv6 were getting applied to ipv4 address family. As a result
TSA over iBGP for Ipv6 was not getting applied.
How I verify:
Manual Verification of TSA over both ipv4 and ipv6 after fix works fine.
Updated UT for this.
Added sonic-mgmt test gap: sonic-net/sonic-mgmt#8170
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Why I did it
systemd stop event on service with timers can sometime delete the state_db entry for the corresponding service.
Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers
root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp
root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status
System is not ready - one or more services are not up
Service-Name Service-Status App-Ready-Status Down-Reason
---------------------- ---------------- ------------------ -------------
<Truncated>
ssh OK OK -
swss OK OK -
syncd OK OK -
sysstat OK OK -
teamd OK OK -
telemetry OK OK -
what-just-happened OK OK -
ztp OK OK -
<Truncated>
Expected
Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB
root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status
System is not ready - one or more services are not up
Service-Name Service-Status App-Ready-Status Down-Reason
---------------------- ---------------- ------------------ -------------
<Truncated>
snmp Down Down Inactive
ssh OK OK -
swss OK OK -
syncd OK OK -
sysstat OK OK -
teamd OK OK -
telemetry OK OK -
what-just-happened OK OK -
ztp OK OK -
<Truncated>
How I did it
Happens because the timer is usually a PartOf service and thus a stop on service is propagated to timer. Fixed the logic to handle this
Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ]
Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead]
Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ]
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
- Why I did it
There are chassis-packet and Single asic platforms which support this 400G to 100G/40G speed change via config.
Enabling this feature for all platforms which can support this. Keeping it enabled for all does not affect the platforms
which do not support this feature yet.
Work item tracking
Microsoft ADO (number only):
17952356
- How I did it
Removed switch_type and role type check.
- How to verify it
Loaded router with default 400G config. Loaded minigraph to convert 400G to 100G speed.
Signed-off-by: anamehra <anamehra@cisco.com>
src/sonic-platform-common
* c7ce1a5 - (HEAD -> 202205, origin/202205) Prevent VDM dictionary related KeyError when a transceiver module is pulled while a bulk get method is interrogating said module (#360) (5 days ago) [snider-nokia]
* Support ACL interface type BmcData in minigraph parser
* Support ACL interface type BmcData in minigraph parser
* add unittest
* Add a global dict for storing the defination of custom acl tables
Fix per-command authorization failed issue when a command with wildcard match more than hundred files.
#### Why I did it
When user enable TACACS per-command authorization, and run a command with wildcard , if the command match more than hundreds of files, the per-command authorization will failed with following message:
*** authorize failed by TACACS+ with given arguments, not executing
The root cause of this issue is because bash will match files with wildcard and replace with wildcard args with matched files. when there are too many files, TACACS plugin will generate a big authorization request, which will be reject by server side.
##### Work item tracking
- Microsoft ADO **(number only)**: 18074861
#### How I did it
Fix bash patch file, use original user inputs as authorization parameters.
#### How to verify it
Pass all UT.
Create new UT to validate the TACACS authorization request are using original command arguments.
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8115
#### Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [X] 202205
- [X] 202211
#### Tested branch (Please provide the tested image version)
- [x] 202205.258490-412b83d0f
- [x] 202211.71966120-1b971c54b5
#### Description for the changelog
Fix per-command authorization failed issue when a command with wildcard match more than hundred files.
* yang mode support for neighbor metadata
* add description in leaf node
* modify description
Co-authored-by: jcaiMR <111116206+jcaiMR@users.noreply.github.com>
* [minigraph] add support for changing T1 ports speed from 400G to 100G and vice-versa (#14505)
Open
[minigraph] add support for changing T1 ports speed from 400G to 100G and vice-versa
vdahiya12 wants to merge 9 commits into sonic-net:master from vdahiya12:dev/vdahiya/minigraph_parser
Conversation 10
Commits 9
Checks 18
Files changed 5
Conversation
vdahiya12
@vdahiya12 vdahiya12 commented 2 weeks ago •
On SONiC T1 cisco 8101 HwSku, the speed changes are done from 400G to 100G needs to be supported on 400G ports.
To enable this, along with speed change the port lanes need to be changed. This PR has the changes to update the port lanes when such speed change happens.
Basically if Banwidth in minigraph.xml intends to enable a 100G speed on a 400G port, then the appropriate lane change and speed change needs to be invoked in mingraph parser
Example if port_config.ini dicatates the speed to be 400G and minigraph has 100G speed, then this changeneeds to be accommodated
Ethernet96 1536,1537,1538,1539,1540,1541,1542,1543 etp12 12 400000 0
<DeviceLinkBase>
<ElementType>DeviceInterfaceLink</ElementType>
<EndDevice>ARISTA01T2</EndDevice>
<EndPort>Ethernet1</EndPort>
<StartDevice>Device-8101-01</StartDevice>
<StartPort>etp12</StartPort>
<Bandwidth>100000</Bandwidth>
</DeviceLinkBase>
These platforms today have 400g port with 8 serdes lines, and 100g will operate with 4 serdes lane. When the port speed changes from 400G to 100G the first 4 lanes will be used for 100G port.
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
* add all
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
* fix unit
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
---------
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>