- Why I did it
The recent change #15685 (comment) removed the db migration for non first reboots.
This is problematic for many deployments which doesn't rely on ZTP and push a custom config_db.json
Port to older branches after #15685 is ported back
- How I did it
Re-introduce the logic to run the db_migrator on non-first boots
- How to verify it
Verified reboot and warm-reboot cases
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
SONiC changes:
1. Support Spectrum4 ASIC FW binary building.
2. Support new SDK sx-obj-desc lib building since new SAI need it.
3. Remove SX_SCEW debian package from Mellanox SDK build since we are no longer using it (we use libxml2 instead).
4. Update SAI, SDK, FW to version 4.6.1020/2012.1020/SAIBuild2305.25.0.3
SDK/FW bug fixes
1. In SPC-1 platforms: Fastboot mode is not operational for Split port with Force mode in 50G speed
SFP modules are kept in disabled state after set LPM (low power mode) on/off for at least 3 minutes.
2. When preforming fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.
SDK/FW Features
1. On SN2700 all ports can support y cable by credo
SAI bug Fixes
1. When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 will remain programmed for the rule Issue resolved after the fix
2. Allow the max scale of virtual routers to be configure for SPC-1, SPC-2, SPC-3 when fastboot enable
3. Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE
SAI features
1. Port init profile
- How I did it
Update SDK/FW/SAI make files
- How to verify it
Run full sonic-mgmt regression on Mellanox platform
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Fixes#15667 and #13293
Work item tracking
Microsoft ADO 24472854:
How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.
How to verify it
Tests on chassis supervisor and LC
Why I did it
The hw resources should be released before updating firmware.
How I did it
Added logic to release hw resources in syncd.sh script
Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>
* Fix CONFIG_DB_INITIALIZED flag check logic and set/reset flag for warm-reboot
* Fix db-cli usage
* Handle same image warm-reboot and generalize handling of INIT flag
* Cover boot from ONIE case: set config init flag when minigraph, config_db are missing
* Handle case: first boot of SONiC
* Check for config init flag
* Simplify logic, and do not call db_migrator for same image reboot
Why I did it
Support default DNS configuration
How I did it
Use j2 template to generate default DNS configuration.
How to verify it
Run sonic-config-engine unit test.
Why I did it
Support FIPS DB configuration
Design Doc: sonic-net/SONiC#1372
Work item tracking
Microsoft ADO (number only): 24411148
How I did it
Add the FIPS Yang model to make FIPS configurable in ConfigDB.
How to verify it
See TestPlan: sonic-net/sonic-mgmt#9092
Build the image and run the tests: sonic-net/sonic-mgmt#9091
Why I did it
Currently, k8s master image is generated from a separate branch which we created by ourselves, not release ones. We need to commit these k8s master related code to master branch for a better way to do k8s master image build out.
Work item tracking
Microsoft ADO (number only):
19998138
How I did it
Install k8s dashboard docker images
Install geneva mds and mdsd and fluentd docker images and tag them as latest, tagging latest will help create container always with the latest version
Install azure-storage-blob and azure-identity, this will help do etcd backup and restore.
Install kubernetes python client packages, this will help read worker and container state, we can send these metric to Geneva.
Remove mdm debian package, will replace it with the mdm docker image
Add k8s master entrance script, this script will be called by rc-local service when system startup. we have some master systemd services in compute-move repo, when VMM service create master VM, VMM will copy all master service files inside VM, the entrance script will setup all services according to the service files.
When the entrance script content changed, the PR build will set include_kubernetes_master=y to help do validation for k8s master related code change. The default value of include_kubernetes_master should be always n for public master branch. We will generate master image from internal master branch
How to verify it
Build with INCLUDE_KUBERNETES_MASTER = y
There is a redundant line in init_cfg.json.j2. It would cause pmon service always has "delayed=False". However, we know that PMON has a timer now. So, I try to fix it here.
Why I did it
For route registry service, in order to block hijacked routes, IBGP session needs to be set up from BGP sentinel service to SONiC, and BGP sentinel service advertise the same route with higher local-preference and no export community. So that SONiC takes the route from BGP sentinel as the best path and does not advertise the route to EBGP peers.
In order to do that, new route-maps are needed. So this change adds a new set of templates, keeping BGPSentinel peers out of the other templates.
Work item tracking
Microsoft ADO (number only): 24451346
How I did it
Add sentinel_community in constants.yml, route from BGPSentinel do not match this community will be denied.
Add support to convert BGPSentinel related configuration in the BGPPeerPassive element of the minigraph to a new BGP_SENTINELS table in CONFIG_DB
Add a new set of "sentinels" templates to docker-fpm-frr
Add a new BGP peer manager to bgpcfgd, to add neighbors from the BGP_SENTINELS table using the "sentinels" templates
Add a test case for minigraph.py, making sure the BGPSentinel and BGPSentinelV6 elements create BGP_SENTINELS DB entry.
Add a set of test cases for the new sentinels templates in sonic-bgpcfgd tests.
Add sonic-bgp-sentinel.yang and a set of testcases for the yang file.
How to verify it
Testcases and UT newly added would pass.
Setup IPv4 and IPv6 BGPSentinel services in minigraph, and load minigraph, show CONFIG_DB and "show runningconfig bgp", configuration would be loaded successfully.
Using t1-lag topo and setup IBGP session from BGPSentinel to SONiC loopback address, IBGP session would up.
Advertise route from BGPSentinel to T1 with sentinel_community, higher local-preference and no-export communiyt. In T1, show bgp route, the result is "Not advertise to any EBGP peer".
Withdraw the route in BGPSentinel, in T1, route would advertise to EBGP peers.
Advertise route from T1 that does not match sentinel_community, in T1, would not see the route in show bgp route.
Why I did it
For some devices whose log folder size is larger than 200M, for example, 256M, the LOG_FILE_ROTATE_SIZE_KB should be 16M. and
THRESHOLD_KB=$((USABLE_SPACE_KB - (NUM_LOGS_TO_ROTATE * LOG_FILE_ROTATE_SIZE_KB * 2)))
= $(( (VAR_LOG_SIZE_KB * 90 / 100) - RESERVED_SPACE_KB)) - (NUM_LOGS_TO_ROTATE * LOG_FILE_ROTATE_SIZE_KB * 2)))
= $(( (256M * 90 / 100) - 4096)) - (8 * 16M * 2)))
the result would be a negative value
Work item tracking
Microsoft ADO (number only):
24524827
How I did it
Add a case for 400M, if the log folder size is between 200M and 400M, set the log file size to 2M
How to verify it
Do cmd "sudo logrotate -f /etc/logrotate.conf" on DUT which val/log folder size is 256M, and check the syslog.
#### Why I did it
Support reset factory in Sonic OS
[Reset Factory HLD](https://github.com/sonic-net/SONiC/pull/1231)
[Sonic-mgmt tests](https://github.com/sonic-net/sonic-mgmt/pull/7652)
#### How I did it
- Added new script "/usr/bin/reset-factory"
* It generates a new config_db.json files with factory configurations
* It clears system files and logs
* It removes all docker containers on system except database
* It clears non-default users and restores default users password
- Dump the default users info to a new file during build "/etc/sonic/default_users.json"
- Supported new type "Keep-basic" in "config-setup factory"
- Add new conf file for config-setup "/etc/config-setup/config-setup.conf
#### How to verify it
- Run reset-factory script with all types: < none | keep-all-config | only-config | keep-basic >
- Run config-setup factory with parameters < none | keep-basic >
#### Description for the changelog
Support reset factory in Sonic OS
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
* Add an ability to configure remote syslog servers
* Add an initial configuration for remote syslog
* Extend YANG module and add unit tests
#### Why I did it
Adding the following functionality to rsyslog feature:
- Configure remote syslog servers: protocol, filter, severity level
- Update global syslog configuration: severity level, message format
#### How I did it
added parameters to syslog server and global configuration.
#### How to verify it
create syslog server using CLI/adding to Redis-DB
verify server is added to file /etc/rsyslog.conf and server is functional.
#### Description for the changelog
extend rsyslog capabilities, added server and global configuration parameters.
#### Link to config_db schema for YANG module changes
https://github.com/iavraham/sonic-buildimage/blob/master/src/sonic-yang-models/yang-models/sonic-syslog.yang
Why I did it
To reduce the container's dependency from host system
Work item tracking
Microsoft ADO (number only):
17713469
How I did it
Move the k8s container startup script to config engine container, other than mount it from host.
How to verify it
Check file path(/usr/share/sonic/scripts/container_startup.py) inside config engine container.
Signed-off-by: Yun Li <yunli1@microsoft.com>
Co-authored-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Why I did it
Sharing the storage of syncd with other proprietary application extensions allows them to communicate with syncd in differnt ways.
If one container wants to pass some information to syncd then shared storage can be used. However, today the shared storage isn't cleaned on restarts making it possible for syncd to read out-of-date information generated in the past.
NOTE: No plans to use it for standard SONIC dockers and we are working on removing the SDK dependency from PMON docker
How I did it
Implemented new service to clean the shared storage.
How to verify it
Do reboot/fast-reboot/warm-reboot/config-reload/systemctl restart swss and verify /tmp/ is cleaned after each restart in syncd container.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
* Added VNET_MONITOR_TABLE, BFD_SESSION_TABLE, to the listof tables to be cleaned up after swss restart.
* Added VNET_ROUTE* table in cleanup. This should cover VNET_ROUTE_TUNNEL_TABLE as well.
- Why I did it
Add support for static DNS configuration. According to sonic-net/SONiC#1262 HLD.
- How I did it
Add a new resolv-config.service that is responsible for transferring configuration from Config DB into /etc/resolv.conf file that is consumed by various subsystems in Linux to resolve domain names into IP addresses.
- How to verify it
Run the image compilation. Each component related to the static DNS feature is covered with the unit tests.
Run sonic-mgmt tests. Static DNS feature will be covered with the system tests.
Install the image and run manual tests.
This reverts commit 02b17839c3.
Reverts #14933
The earlier commit caused a race condition that particularly broke cross branch warm upgrade.
Issue happens when db_migrator is still migrating the DB and finalizer is checking DB for list of components to reconcile.
If migration is not complete, finalizer get an empty list to wait for. Due to this, finalizer concludes warmboot (deletes system wide warmboot flag) and cause all the services to do cold restart.
ADO: 24274591
* Re-add 127.0.0.1/8 when bringing down the interfaces
With #5353, 127.0.0.1/16 was added to the lo interface, and then
127.0.0.1/8 was removed. However, when bringing down the lo interface,
like during a config reload, 127.0.0.1/16 gets removed, but 127.0.0.1/8
isn't added back to the interface. This means that there's a period of
time where 127.0.0.1 is not available at all, and services that need to
connect to 127.0.01 (such as for redis DB) will fail.
To fix this, when going down, add 127.0.0.1/8. Add this address before
the existing configuration gets removed, so that 127.0.0.1 is available
at all times.
Note that running `ifdown lo` doesn't actually bring down the loopback
interface; the interface always stays "physically" up.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Add watchdog mechanism to swss service and generate alert when swss have issue.
**Work item tracking**
Microsoft ADO (number only): 16578912
**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.
**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.
**How I verified it**
Pass all UT.
Manually test process_monitoring/test_critical_process_monitoring.py can pass.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
For T2 systems using packet mode, the backplane interfaces (Ethernet-BP#) and the fabric card ethernet interfaces are not visible as neighbor interfaces.
In packet mode, these interfaces needs qos and buffer config as well.
This fix addresses that issue and adds the backplane interfaces to the PORTS_ACTIVE list
Why I did it
After docker_inram is enabled, the docker folder's default max size is 1.5G.
It's not big enough for some tests which need to install additional docker images or install extra packages.
Work item tracking
Microsoft ADO 24199761:
How I did it
add docker_inram into cmdline_allowlist
How to verify it
sudo sh -c 'echo "docker_inram_size=3000M" >> kernel-cmdline-append'
sudo reboot and check the docker folder size
This reverts commit 44427a2f6b.
Docker image not updated during PR validation and caused PR check failures.
Force merge this revert. After cache is updated after this PR is merged, issue should be fixed.
This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737 merge first.
**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.
**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.
**How I verified it**
Pass all UT.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306