Why I did it
This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime.
How I did it
How to verify it
Tested manually with these cases:
Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect
Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds)
Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason
Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count)
Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
What I did:
Workaround for the issue seen here : FRRouting/frr#13682
It seems there is timing issue where there are multiple recursive lookup needed to resolve nexthop of the route it's possible that it does not happen correctly causing route to remain in inactive state
Issue is seen on chassis-packet as there 2 level of recursive lookup needed for a given e-BGP learnt route
- Level1 to resolve e-BGP peer (connected route via bgp ) over Loopback4096 (i-BGP peering)
- Level 2 Loopback4096 over backend port-channels next-hops
For VOQ chassis there is no e-BGP peer (connected route via bgp ) resolution as route is added as Static route by orchagent over Ethernet-IB.
Also as part of this remove route-map policy from instance.conf.j2 as same is define in peer-group.j2.
Microsoft ADO: https://msazure.visualstudio.com/One/_workitems/edit/24198507
How I verify:
Functional Verification manually
Updated UT.
We will be adding sanity check in sonic-mgmt to make sure none of route are in inactive state.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
During build, lots of pip packages are getting installed through pip
install command.
This feature adds support for caching all the pip packages into local
cache path, so that subsequent build always loads from the cache.
When a package is referenced from the web through wget command,
it downloads the package for every build.
This feature caches all the packages that are being downloaded from the
web, so that subsequent build always loads the cache instead of from
web.
* [202205] Update SOC properties for DLR_INIT based pfcwd recovery (#15217)
Why I did it
Update soc properties for certain roles that need to use pfcwd dlr init based recovery mechanism
How to verify it
Updated the templates on a 7050cx3 dual tor and 7260 T1 which satisfies these conditions and validated pfcwd recovery which uses DLR_INIT based mechanism. Also validated that this mechanism is not used on 7050cx3 single tor with the updated templates
Signed-off-by: Neetha John <nejo@microsoft.com>
* AclInterface and Management Interfaces are parsed on finding first valid node for it.
Above logic works for multi-asic scenarios where ACL Interface and Management Interfaces are present in DPG order {Host, Asicx, Asicy} but not when DPG is in {Asicx, Asicy, Host} order.
* [static_route][staticroutebfd]fix an issue on deleting a non-bfd static route
Fix an issue for deleting a non-bfd static route also remove the staticroutebfd from critical_processes list and make it auto restart in the case of crash.
What I did:
Added change to add 'peerType' as element in NEIGH_STATE_TABLE.
'peerType' can be i-BGP vs e-BGP determined based on local and remote AS number.
Why I did:
This is useful to filter neighbors in SONiC as internal vs external in chassis use-case (example: telemetry)
Verification:
Manual Verification
127.0.0.1:6379[6]> hgetall "NEIGH_STATE_TABLE|10.0.0.5"
1) "state"
2) "Established"
3) "peerType"
4) "e-BGP"
127.0.0.1:6379[6]> hgetall "NEIGH_STATE_TABLE|2603:10e2:400::4"
1) "state"
2) "Established"
3) "peerType"
4) "i-BGP"
Also sonic-mgmt test case test_bgp_fact.py is enhanced: Enhanced bgp_fact to validate NEIGH_STATE_TABLE element 'peerType' sonic-mgmt#8462
Stop authorization after user being rejected by server.
#### Why I did it
Fix nss_tacplus bug: after user being rejected by one TACACS+ server, nss_tacplus will try with next TACACS+ server.
##### Work item tracking
- Microsoft ADO :15276692
#### How I did it
Check authorization result, stop authorization after user being rejected by server.
#### How to verify it
Pass all E2E test.
Create new UT: https://github.com/sonic-net/sonic-mgmt/pull/8345
#### Description for the changelog
Stop authorization after user being rejected by server.
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
- Why I did it
If you enable feature and then disable it, System Ready status change to Not Ready
A disabled feature should not affect the system ready status.
- How I did it
During the disable flow of dhcp_relay, it entered the dnsrvs_name list, which caused the SYSTEM_STATE key to be set to DOWN. Right after that, the dhcp_relay service was removed from the full service list, however, but, when it was removed from the dnsrvs_name, there was no flow to reset the system state back to UP even though there was no more services in down state.
- How to verify it
root@qa-eth-vt01-2-3700v:/home/admin# config feature state dhcp_relay enabled
root@qa-eth-vt01-2-3700v:/home/admin# show system-health sysready-status
root@qa-eth-vt01-2-3700v:/home/admin# config feature state dhcp_relay disabled
root@qa-eth-vt01-2-3700v:/home/admin# show system-health sysready-status
Should see
System is ready
- Why I did it
interfaces-config service restarts networking service, which in-turn results in loopback interface address is being removed and reassigned back
If the system-health happens to start during that instance expections and logs like this are seen:
Apr 15 18:14:49.357869 r-panther-20 ERR healthd: update system status exception:Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:49.429778 r-panther-20 ERR healthd: subscribe_statedb exited- Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:52.218594 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:52.219714 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218636 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218722 r-panther-20 ERR healthd: system_service_Map_base::at
- How I did it
use unix socket path
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Why I did it
Introduce a new valid neighbor element type to YANG.
Work item tracking
Microsoft ADO (number only): 23994521
How I did it
Add MgmtLeafRouter to element network type list.
How to verify it
Passes UTs
- Run pre-commit tox profile to trim all trailing blanks
- Use several commits with a per-folder based strategy
to ease their merge
Issue #15114
Signed-off-by: Guillaume Lambert <guillaume.lambert@orange.com>
Why I did it
We need to store information of power shelf in config_db for SONiC MX switch. Current minigraph parser cannot parse rack_mgmt_map field.
Work item tracking
Microsoft ADO (number only): 22179645
How I did it
Add support for parsing rack_mgmt_map.
What I did:
Updated Static Route Attribute in Minigraph. NGS Minigraph has define semantics of static route differently.
See below for differences:-
Microsoft ADO: 17956325
Before
<AssociatedTo>8.0.0.1/32</AssociatedTo>
<Address>192.168.1.2,192.168.2.2</Address>
<AttachTo>PortChannel40,PortChannel50</AttachTo>
Now:
<Address>8.0.0.1</Address>
<AttachTo>PortChannel40,192.168.1.2;PortChannel50,192.168.2.2</AttachTo>
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Update PG headroom settings ports based on port speed/cable length
* Updated XOFF settings to use chip level numbers than core
* Updated PG headroom based on uplink/downlink side
* fix for sonic-config-gen tests
* More fixes for unit test cases
* more test fixes
* Merged multiple functions into one
Add new Nokia build target and establish an arm64 build:
Platform: arm64-nokia_ixs7215_52xb-r0
HwSKU: Nokia-7215-A1
ASIC: marvell
Port Config: 48x1G + 4x10G
How I did it
- Change make files for saiserver and syncd to use Bulleseye kernel
- Change Marvell SAI version to 1.11.0-1
- Add Prestera make files to build kernel, Flattened Device Tree blob and ramdisk for arm64 platforms
- Provide device and platform related files for new platform support (arm64-nokia_ixs7215_52xb-r0).
Following changes are done:
Added Support where if asic configuration is not present in minigraph sonic-cfggen do not error out but instead process it gracefully.
Use Case: In Supervisor we have number of asic are define as max possible but in minigraph configuration of only valid/available asics only are present. Without this change load_minigraph fails.
Microsoft ADO: 17956325
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Why I did it
Our k8s feature will pull new version container images for each upgrade, the container images inside sonic will be more and more, but for now we don’t have a way to clean up the old version container images, the disk may be filled up. Need to add cleaning up the old version container images logic.
Work item tracking
Microsoft ADO (number only):
17979809
How I did it
Remove the old version container images besides the feature's current version and last version image, last version image is saved for supporting fallback.
How to verify it
Check whether the old version images are removed
Why I did it
Fix#15000
isc-dhcp 4.4.1-2.3+deb11u1 is no longer available in debian repository
How I did it
update isc-dhcp to new version 4.4.1-2.3+deb11u2
Why I did it
replace yaml.load to yaml.safe_load because yaml.safe_load is more secure
Work item tracking
Microsoft ADO (number only): 15022050
How I did it
How to verify it
Verified in DUT 201911 which yaml version < 5.1
- Why I did it
Add fan direction check to system health, all fans should be in the same direction
- How I did it
Add fan direction check to system health, all fans should be in the same direction
- How to verify it
Manual test
Unit test
Added sonic-mgmt test case to verify
Why I did it
To support BGPMon sessions from each T2 linecard ASIC
Work item tracking
Microsoft ADO (number only): 17873174
How I did it
Added change in BGPMon configuration to use Loopback4096 as source interface, since this has a unique IP per ASIC.
How to verify it
Tested by manually setting up BGPMon session on T2 LC and verified that Loopback4096 could be used as source
What I did:
In FRR command update source <interface-name> is not at address-family level. Because of this
internal peer route-map for ipv6 were getting applied to ipv4 address family. As a result
TSA over iBGP for Ipv6 was not getting applied.
How I verify:
Manual Verification of TSA over both ipv4 and ipv6 after fix works fine.
Updated UT for this.
Added sonic-mgmt test gap: sonic-net/sonic-mgmt#8170
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
There are chassis-packet and Single asic platforms which support this 400G to 100G/40G speed change via config.
Enabling this feature for all platforms which can support this. Keeping it enabled for all does not affect the platforms
which do not support this feature yet.
Signed-off-by: anamehra anamehra@cisco.com
Why I did it
Do not require leafref as part of yang. Only need string to compare whether string received from event matches what is possible for ifname.
How I did it
How to verify it
Run UT
Why I did it
Update ECN settings for T2 chassis
How I did it
Updated qos config file to load these settings during switch bootup
How to verify it
Verified on line card on T2 chassis
Why I did it
Support for SONIC chassis isolation using TSA and un-isolation using TSB from supervisor module
Work item tracking
Microsoft ADO (number only): 17826134
How I did it
When TSA is run on the supervisor, it triggers TSA on each of the linecards using the secure rexec infrastructure introduced in sonic-net/sonic-utilities#2701. User password is requested to allow secure login to linecards through ssh, before execution of TSA/TSB on the linecards
TSA of the chassis withdraws routes from all the external BGP neighbors on each linecard, in order to isolate the entire chassis. No route withdrawal is done from the internal BGP sessions between the linecards to prevent transient drops during internal route deletion. With these changes, complete isolation of a single linecard using TSA will not be possible (a separate CLI/script option will be introduced at a later time to achieve this)
Changes also include no-stats option with TSC for quick retrieval of the current system isolation state
This PR also reverts changes in #11403
How to verify it
These changes have a dependency on sonic-net/sonic-utilities#2701 for testing
Run TSA from supervisor module and ensure transition to Maintenance mode on each linecard
Verify that all routes are withdrawn from eBGP neighbors on all linecards
Run TSB from supervisor module and ensure transition to Normal mode on each linecard
Verify that all routes are re-advertised from eBGP neighbors on all linecards
Run TSC no-stats from supervisor and verify that just the system maintenance state is returned from all linecards
Why I did it
systemd stop event on service with timers can sometime delete the state_db entry for the corresponding service.
Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers
root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp
root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status
System is not ready - one or more services are not up
Service-Name Service-Status App-Ready-Status Down-Reason
---------------------- ---------------- ------------------ -------------
<Truncated>
ssh OK OK -
swss OK OK -
syncd OK OK -
sysstat OK OK -
teamd OK OK -
telemetry OK OK -
what-just-happened OK OK -
ztp OK OK -
<Truncated>
Expected
Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB
root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status
System is not ready - one or more services are not up
Service-Name Service-Status App-Ready-Status Down-Reason
---------------------- ---------------- ------------------ -------------
<Truncated>
snmp Down Down Inactive
ssh OK OK -
swss OK OK -
syncd OK OK -
sysstat OK OK -
teamd OK OK -
telemetry OK OK -
what-just-happened OK OK -
ztp OK OK -
<Truncated>
How I did it
Happens because the timer is usually a PartOf service and thus a stop on service is propagated to timer. Fixed the logic to handle this
Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ]
Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead]
Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ]
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
* Support ACL interface type BmcData in minigraph parser
* Support ACL interface type BmcData in minigraph parser
* add unittest
* Add a global dict for storing the defination of custom acl tables
- Why I did it
Extend the CRM YANG model with DASH attributes.
- How I did it
Add new attributes to the existing CRM YANG model.
Implement tests for DASH CRM attributes.
- How to verify it
Build sonic_yang_models packages. The tests will be run automatically.
Fix per-command authorization failed issue when a command with wildcard match more than hundred files.
#### Why I did it
When user enable TACACS per-command authorization, and run a command with wildcard , if the command match more than hundreds of files, the per-command authorization will failed with following message:
*** authorize failed by TACACS+ with given arguments, not executing
The root cause of this issue is because bash will match files with wildcard and replace with wildcard args with matched files. when there are too many files, TACACS plugin will generate a big authorization request, which will be reject by server side.
##### Work item tracking
- Microsoft ADO **(number only)**: 18074861
#### How I did it
Fix bash patch file, use original user inputs as authorization parameters.
#### How to verify it
Pass all UT.
Create new UT to validate the TACACS authorization request are using original command arguments.
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8115
#### Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [X] 202205
- [X] 202211
#### Tested branch (Please provide the tested image version)
- [x] 202205.258490-412b83d0f
- [x] 202211.71966120-1b971c54b5
#### Description for the changelog
Fix per-command authorization failed issue when a command with wildcard match more than hundred files.
#### Why I did it
sonic-config-engine unit test needs to detect missing yang models.
#### How I did it
Update unit test, return error for missing yang models.
#### How to verify it
Run unit test for sonic-config-engine
#### Why I did it
Yang model for NEIGH table was missing
Fixed https://github.com/sonic-net/sonic-buildimage/issues/13971
#### How I did it
added sonic-neigh.yang model
#### How to verify it
make buildimage
#### Description for the changelog
Adding NEIGH yang model
Signed-off-by: Stepan Blyschak stepanb@nvidia.com
DEPENDS: #12852
Why I did it
To support BGP pending FIB suppression.
How I did it
I backported patches from FRR 8.4 feature that allows communicating ASIC route status back to FRR.
Also, added a new field in DEVICE_METADATA YANG model table. Added UT for YANG model changes.
How to verify it
Run on the switch.