This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737, which must be merged first.
**What I did**
Add an orchagent watchdog to monitor orchagent and alert when it is stuck.
**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit cannot detect or report it.
**How I verified it**
Passed all UT.
Added new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check that the watchdog works correctly.
Manually tested: after pausing orchagent with 'kill -STOP <pid>', checked that the following message appears in the log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
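
The stuck detection described above can be illustrated with a minimal sketch. This is not the actual `supervisor-proc-watchdog-listener` code; the function names, the heartbeat dictionary, and the 60-second timeout are assumptions chosen only to mirror the log line shown above.

```python
import time
from typing import Dict, List, Optional


def find_stuck_processes(last_heartbeat: Dict[str, float],
                         timeout_s: float,
                         now: Optional[float] = None) -> List[str]:
    """Return names of processes whose last heartbeat is older than timeout_s.

    last_heartbeat maps a process name to the time (in seconds) at which
    its most recent heartbeat was seen. All names here are hypothetical.
    """
    if now is None:
        now = time.monotonic()
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


def format_alert(name: str, namespace: str, stuck_s: float) -> str:
    """Format an alert in the same shape as the syslog line shown above."""
    return "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        name, namespace, stuck_s / 60.0)
```

A process paused with `kill -STOP` stops sending heartbeats, so its last-seen time falls behind and it is reported once the timeout is exceeded.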
**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
In PR sonic-net/sonic-utilities#2850, the paramiko package is installed in sonic-utilities to support remote access to linecards. libffi-dev needs to be installed to be able to compile for the armhf image.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
Why I did it
Fix the issue where db_migrator is called before the DB is loaded with config. This leads to db_migrator:
Not finding anything, and then incorrectly migrating every missing config.
This is not expected. Migration should happen after the old config is loaded, and only new schema changes need migration.
Since the DB does not have anything when the migrator is called, db_migrator fails when some APIs return None.
The reason for the incorrect call is that:
The database service starts db_migrator as part of its startup sequence.
The config-setup service loads data from old-config/minigraph. However, it has Requires=database.service.
Hence, config-setup starts only after the database service has started, and the database service is considered started only when db_migrator has completed.
Fixed by:
Check if this is first time boot by checking pending_config_migration flag.
If pending_config_migration is enabled, then do not call db_migrator as part of database service startup.
Let database service start which triggers config-setup service to start.
Now call db_migrator after the config-setup service loads old-config/minigraph.
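
The ordering change can be summarized with a small sketch. The step names and the `startup_steps` function are hypothetical labels for illustration only; they are not the actual service scripts.

```python
# Hypothetical step labels, used only to illustrate the ordering.
START_DB = "start database service"
LOAD_OLD_CONFIG = "config-setup loads old-config/minigraph"
RUN_MIGRATOR = "run db_migrator"


def startup_steps(pending_config_migration: bool) -> list:
    """Return the order of startup steps for the given flag value.

    When a config migration is pending, db_migrator is deferred until
    config-setup has loaded the old configuration into the DB.
    Otherwise it runs as part of database service startup, as before.
    """
    if pending_config_migration:
        return [START_DB, LOAD_OLD_CONFIG, RUN_MIGRATOR]
    return [START_DB, RUN_MIGRATOR]
```

With the flag set, the migrator always sees the old config, so it only has to migrate real schema changes.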
* Update PG headroom settings for ports based on port speed/cable length
* Updated XOFF settings to use chip-level numbers rather than core-level
* Updated PG headroom based on uplink/downlink side
* fix for sonic-config-gen tests
* More fixes for unit test cases
* more test fixes
* Merged multiple functions into one
Add new Nokia build target and establish an arm64 build:
Platform: arm64-nokia_ixs7215_52xb-r0
HwSKU: Nokia-7215-A1
ASIC: marvell
Port Config: 48x1G + 4x10G
How I did it
- Change make files for saiserver and syncd to use Bullseye kernel
- Change Marvell SAI version to 1.11.0-1
- Add Prestera make files to build kernel, Flattened Device Tree blob and ramdisk for arm64 platforms
- Provide device and platform related files for new platform support (arm64-nokia_ixs7215_52xb-r0).
Some devices running SONiC have a small storage device (2G and 4G mainly)
The SONiC image growth over time has made it impossible to install
2 images on a single device.
Some mitigations have been implemented in the past for some devices but
there is a need to do more.
One such mitigation is `docker_inram` which creates a `tmpfs` and
extracts `dockerfs.tar.gz` in it.
This all happens in the SONiC initramfs and by ensuring the installation
process does not extract `dockerfs.tar.gz` on the flash but keep the file as is.
This mitigation does a tradeoff by using more RAM to reduce the disk footprint.
It however creates new issues for devices with 4G of system memory, since
the extracted `dockerfs.tar.gz` approaches 1.6G.
Considering debian upgrades (with dual base images) and the continuous
stream of features this is only going to get bigger.
This change introduces an alternative to the `tmpfs` by allowing a system
to extract the `dockerfs.tar.gz` inside a `zram` device thus bringing
compression in play at the detriment of performance.
Introduce 2 new optional kernel parameters to be consumed by SONiC initramfs.
- `docker_inram_size` which represents the max physical size of the
`zram` or `tmpfs` volume (defaults to DOCKER_RAMFS_SIZE)
- `docker_inram_algo` which is the method to use to extract the
`dockerfs.tar.gz` (defaults to `tmpfs`);
other values are considered to be compression algorithms for `zram`
(e.g. `zstd`, `lzo-rle`, `lz4`)
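
As a rough illustration of how these two parameters combine, the sketch below parses a kernel command line and applies the defaults described above. The real logic lives in the initramfs shell scripts; the function name and the `default_size` argument (standing in for DOCKER_RAMFS_SIZE, whose value is not given here) are assumptions.

```python
def parse_docker_inram(cmdline: str, default_size: str) -> dict:
    """Extract docker_inram_size and docker_inram_algo from a kernel cmdline.

    default_size stands in for DOCKER_RAMFS_SIZE. Any algo other than
    'tmpfs' is treated as a zram compression algorithm.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only key=value tokens
            params[key] = value
    algo = params.get("docker_inram_algo", "tmpfs")
    return {
        "size": params.get("docker_inram_size", default_size),
        "algo": algo,
        "use_zram": algo != "tmpfs",
    }
```

For example, a command line containing `docker_inram_algo=zstd` and no explicit size selects a `zram` volume with `zstd` compression at the default size.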
Refactored the logic to mount the docker fs in the SONiC initramfs under
the `union-mount` script.
Moved the code into a function to make it cleaner and separated the
inram volume creation and docker extraction.
On Arista platform with a flash smaller or equal to 4GB set
`docker_inram_algo` to `zstd` which produces the best compression ratio
at the detriment of a slower write performance and a similar read
performance to other `zram` compression algorithms.
Enable docker_inram for all systems with 4GB or less of flash.
This is mandatory to allow these systems to store 2 SONiC images.
This change also fixes the missing docker_inram attribute when
installing a new image from SONiC.
Because the SWI image can ship with additional kernel parameters within,
such as `sonic_fips=`, this led to a conflict.
To prevent the conflict, the extra kernel parameters from the SWI are
now stored in the file `kernel-cmdline-append` which isn't used anywhere.
* Resolve NEIGH table entries present in CONFIG_DB. Without this change, arp/ndp entries that are configured via CONFIG_DB and that we wish to resolve are not resolved.
Why I did it
The current regex is not able to capture logs; modify the regex to capture syslog messages.
Work item tracking
Microsoft ADO (number only): 13366345
How I did it
Code change
How to verify it
sonic-mgmt test case
Why I did it
Support for SONIC chassis isolation using TSA and un-isolation using TSB from supervisor module
Work item tracking
Microsoft ADO (number only): 17826134
How I did it
When TSA is run on the supervisor, it triggers TSA on each of the linecards using the secure rexec infrastructure introduced in sonic-net/sonic-utilities#2701. The user password is requested to allow secure login to the linecards through ssh before TSA/TSB is executed on them.
TSA of the chassis withdraws routes from all the external BGP neighbors on each linecard, in order to isolate the entire chassis. No route withdrawal is done from the internal BGP sessions between the linecards to prevent transient drops during internal route deletion. With these changes, complete isolation of a single linecard using TSA will not be possible (a separate CLI/script option will be introduced at a later time to achieve this)
Changes also include a no-stats option for TSC for quick retrieval of the current system isolation state.
This PR also reverts changes in #11403
How to verify it
These changes have a dependency on sonic-net/sonic-utilities#2701 for testing
Run TSA from supervisor module and ensure transition to Maintenance mode on each linecard
Verify that all routes are withdrawn from eBGP neighbors on all linecards
Run TSB from supervisor module and ensure transition to Normal mode on each linecard
Verify that all routes are re-advertised from eBGP neighbors on all linecards
Run TSC no-stats from supervisor and verify that just the system maintenance state is returned from all linecards
- Why I did it
We suspect that issue #13791 is caused by the redis server being temporarily unavailable during system initialization. For now, we do not use -d in sonic-cfggen, to avoid accessing the redis server.
- How I did it
Provide a string containing required json data when calling sonic-cfggen
- How to verify it
Manually test it
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Part of sonic-net/sonic-utilities#2760
Similar to #14295
- Why I did it
To clear the teamd timer when fast-reboot is finalized, to prevent any further effect.
- How I did it
Deleted the teamd timer from config-db in the fast-reboot finalizer.
The config save call is moved to after clearing the teamd timer, so it won't have any further effect either.
- How to verify it
Verified manually that the entry was deleted after fast-reboot was finalized.
rasdaemon is a tool to log hardware errors. It takes 100% CPU during
boot for a few seconds. It impacts fast/warm boot by delaying control
plane restoration for 5 sec on some platforms.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
#### Why I did it
Implementing code changes for https://github.com/sonic-net/SONiC/pull/1203
#### How I did it
Removed the timers and the delayed target, since the delayed services now start using an event-driven approach.
Cleared the port table during config reload and cold reboot scenarios.
Modified the yang model and init_cfg.json to change has_timer to delayed.
#### How to verify it
Run regression tests.
Why I did it
Fixes #14179
chassis-packet: missing arp entries for static routes causing high orchagent cpu usage
It is observed that some sonic-mgmt test cases call sonic-clear arp, which clears the static arp entries as well. Neither orchagent nor the arp_update process tries to resolve the missing arp entries after the clear.
How I did it
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping if any
found to resolve it.
How to verify it
After boot or config reload, check ipv4 and ipv6 neigh entries to make sure all static route entries are present.
manual validation:
Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries
run arp_update
Check for neigh entries. All entries should be present.
Testing on T0 setup for route/test_static_route.py
The test sets the STATIC_ROUTE entry in config db without ifname:
sonic-db-cli CONFIG_DB hmset 'STATIC_ROUTE|2.2.2.0/24' nexthop 192.168.0.18,192.168.0.25,192.168.0.23
    "STATIC_ROUTE": {
        "2.2.2.0/24": {
            "nexthop": "192.168.0.18,192.168.0.25,192.168.0.23"
        }
    },
Validate that arp_update gets the proper ARP_UPDATE_VARS using the arp_update_vars.j2 template from config db and does not crash:
{ "switch_type": "", "interface": "", "pc_interface" : "PortChannel101 PortChannel102 PortChannel103 PortChannel104 ", "vlan_sub_interface": "", "vlan" : "Vlan1000", "static_route_nexthops": "192.168.0.18 192.168.0.25 192.168.0.23 ", "static_route_ifnames": "" }
Validate that the route/test_static_route.py test case passes.
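
The check for missing static route entries can be sketched as follows. The real arp_update is a shell script; this Python version is only an illustration, and the function name and the `resolved_neighbors` argument are assumptions. The JSON keys are the ones shown in the rendered template output above.

```python
import json


def missing_nexthops(arp_update_vars: str, resolved_neighbors: set) -> list:
    """Return static route nexthops that have no neighbor entry yet.

    arp_update_vars is the JSON string rendered from the template.
    resolved_neighbors is the set of IPs already present in the
    neighbor table. An absent or empty nexthop list yields [].
    """
    data = json.loads(arp_update_vars)
    # The rendered value is space separated and may carry a trailing space.
    nexthops = data.get("static_route_nexthops", "").split()
    return [nh for nh in nexthops if nh not in resolved_neighbors]
```

Each address returned would then be pinged to trigger neighbor resolution. An empty `static_route_ifnames`, as in the T0 case above, does not affect the result.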
Why I did it
Add the SONiC OS Version to device info.
It will be used to display the version info in the SONiC command "show version". The version is used for FIPS certification, which is done against the SONiC OS Version rather than a specific release.
SONiC Software Version: SONiC.master-13812.218661-7d94c0c28
SONiC OS Version: 11
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
How I did it
- Why I did it
To solve an issue with upgrades via fast-reboot that include a FW upgrade, which was introduced when fast-reboot moved onto the warm-reboot infrastructure.
This also introduces fast-reboot finalizing logic to determine when fast-reboot is done.
- How I did it
Added logic to the finalize-warmboot script to handle fast-reboot as well. This makes sense because this script is invoked when fast-reboot runs over the warm-reboot infrastructure. The script clears the fast-reboot entry from state-db, instead of the previous implementation that relied on a timer. In some scenarios the timer could expire before fast-reboot finished, causing a fallback to cold-reboot and possible crashes.
This PR also updates all services/scripts that read the fast-reboot state-db entry to look for the updated value indicating that fast-reboot is active.
- How to verify it
Run fast-reboot and check that the fast-reboot entry exists in state-db right after startup and is cleared when warm-reboot is finalized, not due to a timer.
#### Why I did it
Enhance the error message output mechanism during swss docker creation.
#### How I did it
Capture the output to stderr of `sonic-cfggen` and output it using `echo` to make sure the error message will be logged in syslog.
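
The same capture-and-echo pattern, shown here in Python rather than the shell used by the actual script, might look like the sketch below. The function name and the message format are assumptions. The point is only that stderr is captured and re-emitted on stdout so it reaches the log.

```python
import subprocess
import sys


def run_and_report(cmd: list) -> tuple:
    """Run cmd, capture stdout and stderr, and echo stderr on failure.

    Returns (returncode, stdout, stderr). On a non-zero exit, the captured
    stderr is printed to stdout, which is assumed to be forwarded to syslog.
    """
    result = subprocess.run(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode != 0 and result.stderr:
        print("Error running {}: {}".format(cmd[0], result.stderr.strip()),
              file=sys.stdout)
    return result.returncode, result.stdout, result.stderr
```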
#### How to verify it
Manually test
Why I did it
Orchagent sometimes takes additional time to execute Tunnel tasks. This causes the write_standby script to error out, so the mux state machines are not initialized. As a result, show mux status is missing some ports in its output.
Mar 13 20:36:52.337051 m64-tor-0-yy41 INFO systemd[1]: Starting MUX Cable Container...
Mar 13 20:37:52.480322 m64-tor-0-yy41 ERR write_standby: Timed out waiting for tunnel MuxTunnel0, mux state will not be written
Mar 13 20:37:58.983412 m64-tor-0-yy41 NOTICE swss#orchagent: :- doTask: Tunnel(s) added to ASIC_DB.
How I did it
Increase timeout from 60s to 90s
How to verify it
Verified that mux state machine is initialized and show mux status has all needed ports in it.
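
In the log excerpt above, the tunnel was added to ASIC_DB roughly 66 seconds after the container started, which is past the old 60s limit but within 90s. A minimal polling sketch of such a timeout follows. It is not the write_standby implementation; the function name and the injectable `clock` and `sleep` parameters are assumptions that make the behavior easy to exercise.

```python
import time


def wait_for(condition, timeout_s=90.0, poll_s=1.0,
             clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll condition() until it returns True or timeout_s elapses.

    Returns True if the condition was met in time, False on timeout.
    clock and sleep are injectable so the loop can run on simulated time.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With a simulated clock where the condition becomes true at 66 seconds, a 60-second timeout returns False while a 90-second timeout returns True.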
Why I did it
At service startup time, there is a chance that the networking service is being restarted by the interface-config service. When that happens, write_standby could fail to make DB connections because the loopback interface is being reconfigured.
How I did it
Force the db connector to use unix socket to avoid loopback reconfig timing window.
How to verify it
Run config reload test 20+ times and no issue encountered.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* use unix socket instead
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Why I did it
All 3 of these services started after the swss service, which used to start after the interface-config service. But #13084 removed the time constraints for swss.
After that, these 3 services have a chance of starting earlier, while the interface-config service is restarting the networking service, which could cause DB connection requests to fail.
How I did it
Delay the mux/sflow/snmp timers until after the interface-config service.
How to verify it
PR test.
Config reload can repro the issue in 1-3 retries. With this change, config reload ran 30+ iterations without hitting the issue.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping to
resolve the missing entry.
Why I did it
Fixes #14179
chassis-packet: missing arp entries for static routes causing high orchagent cpu usage
It is observed that some sonic-mgmt test cases call sonic-clear arp, which clears the static arp entries as well. Neither orchagent nor the arp_update process tries to resolve the missing arp entries after the clear.
How I did it
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping if any
found to resolve it.
How to verify it
After boot or config reload, check ipv4 and ipv6 neigh entries to make sure all static route entries are present.
manual validation:
Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries
run arp_update
Check for neigh entries. All entries should be present.
Signed-off-by: anamehra <anamehra@cisco.com>
Improve sudo cat command for RO user.
#### Why I did it
RO users can use the sudo cat command to show non-syslog files.
#### How I did it
Improve sudo cat command for RO user.
#### How to verify it
Pass all UT.
Manually checked that the fixed code works correctly.
#### Description for the changelog
Improve sudo cat command for RO user.
Why I did it
After warm reboot, show environment prints the following error:
failed to import plugin show.plugins.macsec: [Errno 13] Permission denied: '/tmp/cache/macsec'
How I did it
Set owner back to admin after restoring counters folder.
How to verify it
Run sudo warm-reboot, then ensure show environment does not print errors.
Signed-off-by: Oleksandr Kolomeiets <oleksandrx.kolomeiets@intel.com>