#### Why I did it
When CPU is busy, the sonic_ax_impl may not have sufficient speed to handle the notification message sent from REDIS.
Thus, the message will keep stacking in the memory space of sonic_ax_impl.
If the condition continues, the memory usage will keep increasing.
#### How I did it
Add a monit file to check if the SNMP container where sonic_ax_impl resides in use more than 4GB memory.
If yes, restart the sonic_ax_impl process.
#### How to verify it
Run a lot of this command: `while true; do ret=$(redis-cli -n 0 set LLDP_ENTRY_TABLE:test1 test1); sleep 0.1; done;`
And check the memory used by sonic_ax_impl keeps increasing.
After a period, make sure the sonic_ax_impl is restarted when the memory usage reaches the 4GB threshold.
And verify the memory usage of sonic_ax_impl drops down from 4GB.
#### Why I did it
Fixes#14277.
Fixes the inconsistent fallback behaviour for RADIUS authentication when AAA authentication is configured as "radius, local".
#### How I did it
Modified common-auth-sonic.j2 template to make sure that when no RADIUS servers are configured (with AAA authentication login method set to radius, local), the system falls back to local authentication successfully.
#### How to verify it
1. Configure authentication based on RADIUS and local.
config aaa authentication login radius local
2. Configure an unreachable RADIUS server.
config radius add 6.6.6.6
3. Try to login to switch with existing admin user credentials. This is successful.
4. Remove RADIUS server configuration.
config radius delete 6.6.6.6
5. Try to login to switch with admin user credentials. This is successful.
Based on the port breakout HLD, we are now using subport instead of channel in the CONFIG_DB PORT table to handle port breakout. The yang schema needs to be modified accordingly to handle the corresponding change.
The corresponding code changes have been merged through sonic-net/sonic-platform-daemons/pull/342 merged
Signed-off-by: Mihir Patel <patelmi@microsoft.com>
Why I did it
Support Egress Mirroring on supported Arista platforms
How I did it
Add necessary soc properties for egress mirroring recycle ports to be created
Signed-off-by: Nathan Wolfe <nwolfe@arista.com>
Updated asic_port_names for all Arista LC SKUs to follow latest naming
conventions to remove redundant ASICx suffix. For
Arista-7800R3-48CQ2-C48, added the asic_port_name mapping.
Why I did it
It is to fix the issue #13773
It only has impact on the build triggered manually inside of the slave container. Developers can go to the slave container do a build, it will print a skippable error message complaining the variable not found.
How I did it
Add the default value for variable SLAVE_DRI.
How to verify it
[S6100] Improve S6100 serial-getty monitor, wait and re-check when getty not running to avoid false alert.
#### Why I did it
On S6100, the serial-getty service some time can't auto-restart by systemd. So there is a monit unit to check serial-getty service status and restart it.
However, this monit will report false alert, because in most case when serial-getty not running, systemd can restart it successfully.
To avoid the false alert, improve the monitor to wait and re-check.
Steps to reproduce this issue:
1. User login to device via console, and keep the connection.
2. User login to device via SSH, check the serial-getty@ttyS1.service service, it's running.
3. Run 'monit reload' from SSH connection.
4. Check syslog 1 minutes later, there will be false alert: ' 'serial-getty' process is not running'
#### How I did it
Add check-getty.sh script to recheck again later when getty service not running.
And update monit unit to check serial-getty service status with this script to avoid false alert.
#### How to verify it
Pass all UT.
Manually check fixed code work correctly:
```
admin@***:~$ sudo systemctl stop serial-getty@ttyS1.service
admin@***:~$ sudo /usr/local/bin/check-getty.sh
admin@***:~$ echo $?
1
admin@***:~$ sudo systemctl status serial-getty@ttyS1.service
● serial-getty@ttyS1.service - Serial Getty on ttyS1
Loaded: loaded (/lib/systemd/system/serial-getty@.service; enabled-runtime; vendor preset: enabled)
Active: inactive (dead) since Tue 2023-03-28 07:15:21 UTC; 1min 13s ago
admin@***:~$ sudo /usr/local/bin/check-getty.sh
admin@***:~$ echo $?
0
admin@***:~$ sudo systemctl status serial-getty@ttyS1.service
● serial-getty@ttyS1.service - Serial Getty on ttyS1
Loaded: loaded (/lib/systemd/system/serial-getty@.service; enabled-runtime; vendor preset: enabled)
```
syslog:
```
Mar 28 07:10:37.597458 *** INFO systemd[1]: serial-getty@ttyS1.service: Succeeded.
Mar 28 07:12:43.010550 *** ERR monit[593]: 'serial-getty' status failed (1) -- no output
Mar 28 07:12:43.010744 *** INFO monit[593]: 'serial-getty' trying to restart
Mar 28 07:12:43.010846 *** INFO monit[593]: 'serial-getty' stop: '/bin/systemctl stop serial-getty@ttyS1.service'
Mar 28 07:12:43.132172 *** INFO monit[593]: 'serial-getty' start: '/bin/systemctl start serial-getty@ttyS1.service'
Mar 28 07:13:43.286276 *** INFO monit[593]: 'serial-getty' status succeeded (0) -- no output
```
#### Description for the changelog
[S6100] Improve S6100 serial-getty monitor.
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
Why I did it
All these 3 services started after swss service, which used to start after interface-config service. But #13084 remove the time constraints for swss.
After that, these 3 services has the chance of start earlier when the inteface-config service is restarting the networking service, which could cause db connect request to fail.
How I did it
Delay mux/sflow/snmp timer after the interface-config service.
How to verify it
PR test.
Config reload can repro the issue in 1-3 retries. With this change. config reload run 30+ iterations without hitting the issue.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Change references to use bullseye instead of buster
Why I did it
Almost all daemons in 202211 and master uses bullseye, and sflow was easy to migrate.
How I did it
Replaced the references, built and tested in 202211.
How to verify it
Build with the changes, enable sflow:
admin@sonic:~$ sudo config sflow collector add test 1.2.3.4
admin@sonic:~$ sudo config sflow collector enable
tcpdump on 1.2.3.4 and see that UDP sFlow are being sent.
Signed-off-by: Christian Svensson <blue@cmd.nu>
Change references to use bullseye instead of buster
Why I did it
Almost all daemons in 202211 and master uses bullseye, and NAT seems easy to migrate.
How I did it
Replaced the references, built with 202211 branch.
How to verify it
Not sure, it builds and tests pass as far as I can tell but I don't use the feature myself.
Signed-off-by: Christian Svensson <blue@cmd.nu>
Why I did it
After sonic-install install a new image, print_menu is set echo without any data. No image info between Hit any key to stop autoboot: 0 and Start USB
Board configuration detected:
Net:
| port | Interface | PHY address |
|--------|-----------|--------------|
No ethernet found.
Hit any key to stop autoboot: 0
(Re)start USB...
USB0: Port (usbActive) : 0 Interface (usbType = 2) : USB EHCI 1.00
scanning bus 0 for devices... 3 USB Device(s) found
scanning usb for storage devices... 0 Storage Device(s) found
How I did it
The fw_setenv print_menu is missing the double quotes. That causes the value is truncated. Using double quotes to in the environment setting.
How to verify it
Install new image with this fix. And reboot the system. The following section should be shown:
Signed-off-by: mlok <marty.lok@nokia.com>
Why I did it
We found a bug when pilot, the tag function doesn't remove the ACR domain when do tag, it makes the latest tag not work. And in the original tag function, it calls os.system and os.popen which are not recommend, need to refactor.
How I did it
Do a split("/") when get image_rep to fix the acr domain bug
Refactor the tag function code and add test cases
How to verify it
Check whether container images are tagged as latest when in kube mode.
Why I did it
Add platform files for critical processes and default qos config for Innovium platforms
How I did it
Added default files for critical processes and qos config
How to verify it
Tested with autorestart/test_container_autorestart.py::test_containers_autorestart
Signed-off-by: rck-innovium rck@innovium.com
Why I did it
mmh3's new version 3.1.0 breaks pipeline build.
bullseye/buster/jessie pined the version to 2.5.1
How I did it
Pin mmh3's version as other dists.
How to verify it
Why I did it
832ef9c4 - Fix bug in GCU vlanintf_validator ([Bcm SAI] ugprade Broadcom SAI to version 3.3.5.4m-1 #2765) (5 minutes ago) [jingwenxie]
53f611b7 - Revert "Convert IPv6 addresses to lowercase in apply-patch (Add Pegatron project to branch 201807 #2299)" (Add note for running out of disk space in /var/lib/docker to README.md #2758) (20 hours ago) [jingwenxie]
79a21cef - Revert frr route check ([mlnx] fix url inconsistency in fw.mk #2761) (8 minutes ago) [StormLiangMS]
824680ed - Resolved rc!=0 problem by replacing fgrep with awk. Added ipv4 filtering to get only v4 peers in case of show ip bgp neighbors (Improve eeprom access reliability #2756) (30 hours ago) [saurabh17g]
10f31ea6 - Revert "Replace pickle by json (Add autoneg to 7170-Q59S20 #2636)" ([hostcfgd] Default value of fallthrough for authentication set to be False. #2746) (7 days ago) [Mai Bui]
05fa7513 - Fix the show interface counters throwing exception on device with no external interfaces ([docker-platform-monitor]: Add smartmontools 6.6-1 #2703) (11 days ago) [abdosi]
f27dea0c - [route_check] remove check-frr_patch mock ([minigraph]: Mark both ERSPAN and ERSPANv6 as mirror ACL tables #2732) (11 days ago) [Stepan Blyshchak]
2d95529d - Revert "Update load minigraph to load backend acl (mlnx msn2010: default config_db.json generation with sonic-cfggen is not working #2236)" (swss stretch update broke restore_neighbors.py for neigh service #2735) (12 days ago) [Neetha John]
c869c970 - (master) Update the ref guide to reflect the vlan brief output ([teamd] update teamd docker to stretch and fix teamd_init failure #2731) (2 weeks ago) [Vivek]
76457141 - Fix fast-reboot DB migration ([teamd]: update teamd docker to stretch #2734) (2 weeks ago) [Aryeh Feigin]
f7f783bc - Enhance the logic to wait for all buffer tables to be removed in _clear_qos ([sfputil] Not able to read out values of voltage/temp/power on some cables #2720) (2 weeks ago) [Stephen Sun]
e6179afa - Remove timer from FAST_REBOOT STATE_DB entry and use finalizer (Rollback kernel submodule update. #2621) (3 weeks ago) [Aryeh Feigin]
ff688323 - [route_check] fix IPv6 address handling ([docker pmon] install fancontrol & sensord #2722) (3 weeks ago) [Stepan Blyshchak]
7a604c51 - update fast-reboot ([201811][sairedis][swss] advance sub module head of sairedis and swss #2728) (3 weeks ago) [jhli-cisco]
9f83ace9 - [GCU] Add vlanintf-validator (Revert "[device/celestica] blacklist gpio_ich kernel module on haliburton" #2697) (3 weeks ago) [jingwenxie]
338d1c05 - Check SONiC dependencies before installation. ([sonic-slave]: Add iproute2 dependencies in stretch docker #2716) (3 weeks ago) [Liu Shilong]
64d2efd2 - Improve show acl commands ([sonic-utilities] update submodule #2667) (3 weeks ago) [bingwang-ms]
2ef5b31e - [GCU] Add PFC_WD RDMA validator ([sub module] advance sonic-utilities sub module for 201811 branch #2619) (3 weeks ago) [isabelmsft]
c7aa8416 - [show][muxcable] increase timeout for displaying HW_STATUS (Fixing get_transceiver_change_event #2712) (3 weeks ago) [vdahiya12]
2fc2b826 - YANG validation for ConfigDB Updates: MIRROR_SESSION use case ([mellanox] Update SDK to 4.3.0132 #2430) (3 weeks ago) [isabelmsft]
e16bdaae - Fix non-zero status exit on non secure boot system ([service] add warmboot finializer service #2715) (3 weeks ago) [kellyyeh]
90d70152 - [route_check] implement a check for FRR routes not marked offloaded (Feature to run an option platform specific script on the first boot #2531) (3 weeks ago) [Stepan Blyshchak]
c2bc150a - [warm/fast-reboot] Backup logs from tmpfs to disk during fast/warm shutdown ([swss]: update swss docker to stretch #2714) (3 weeks ago) [Vaibhav Hemant Dixit]
a015834d - [db_migrator] Add missing attribute 'weight' to route entries in APPL DB ([device/celestica] blacklist gpio_ich kernel module on seastone #2691) (4 weeks ago) [Vaibhav Hemant Dixit]
cd519aac - [ci] Fix pipeline issue caused by sonic-slave-* change. ([201803] Modify Debian apt repos to reflect changes made by maintainers #2709) (4 weeks ago) [Liu Shilong]
2680e6f3 - [dhcp_relay] Fix dhcp_relay restart error while add/del vlan ([thrift] add a patch to revert THRIFT-3650 #2688) (4 weeks ago) [Yaqiang Zhu]
How I did it
How to verify it
Why I did it
This PR is to update the check of IP_TYPE from sonic-acl.yang.
It's because if the ACL rule is added by loading a json file with acl-loader, there is no IP_TYPE for ACL rule. If such rule exists in ACL_RULE table, the GCU (generic config updater) refuses to update any ACL rules because the existing one is invalid.
This PR updates the yang model for ACL. If the IP_TYPE leaf doesn't exist, then we don't check the field.
How I did it
Accept the rule if IP_TYPE is absent.
How to verify it
The change is verified by UT.
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping to
resolve the missing entry.
Why I did it
Fixes#14179
chassis-packet: missing arp entries for static routes causing high orchagent cpu usage
It is observed that some sonic-mgmt test case calls sonic-clear arp, which clears the static arp entries as well. Orchagent or arp_update process does not try to resolve the missing arp entries after clear.
How I did it
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping if any
found to resolve it.
How to verify it
After boot or config reload, check ipv4 and ipv4 neigh entries to make sure all static route entries are present
manual validation:
Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries
run arp_update
Check for neigh entries. All entries should be present.
Signed-off-by: anamehra <anamehra@cisco.com>
Description for the changelog
Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
Why I did it
We don't need to install rsync in every docker container if vcache is disabled.
How I did it
Install rsync in pre_run_buildinfo script only if vcache is enabled.
How to verify it
Why I did it
Fix the installation candidate not found issue when building docker-sonic-vs
How I did it
Need to run the command "apt-get update" to update the mirror indexes before installing the package gnupg
How to verify it
Improve sudo cat command for RO user.
#### Why I did it
RO user can use sudo command show none syslog files.
#### How I did it
Improve sudo cat command for RO user.
#### How to verify it
Pass all UT.
Manually check fixed code work correctly.
#### Description for the changelog
Improve sudo cat command for RO user.
New docker versions use stderr instead of stdout to print info when build image.
As a resullt we got empty log files.
the fix is to redirect stderr to stdout when build sonic-slave images.
#### Why I did it
Fixes#9216
#### How I did it
Add support for python2 redis to the sonic-slave-stretch and sonic-slave-buster Dockerfiles
#### How to verify it
Run build steps documented in Issue #9216
Why I did it
After warm reboot, show environment prints the following error:
failed to import plugin show.plugins.macsec: [Errno 13] Permission denied: '/tmp/cache/macsec'
How I did it
Set owner back to admin after restoring counters folder.
How to verify it
sudo warm-reboot, then ensure show environement does not print errors.
Signed-off-by: Oleksandr Kolomeiets <oleksandrx.kolomeiets@intel.com>
catch system error and log as warning level instead of
error level in case interface was already deleted.
Why I did it
sflow process exited when failed to convert the interface index from interface name
How I did it
Added exception handling code and logged when OSError exception.
How to verify it
Recreated the bug scenario #11437 and ensured that sflow process not exited.
Description for the changelog
catch system error and log as warning level instead of
error level in case interface was already deleted.
Logs
steps :
root@sonic:~# sudo config vlan member del 4094 PortChannel0001
root@sonic:~# sudo config vlan member del 4094 Ethernet2
root@sonic:~# sudo config vlan del 4094
root@sonic:~#
"WARNING sflow#port_index_mapper: no interface with this name" is seen but no crash is reported
syslogs :
Jan 23 09:17:24.420448 sonic NOTICE swss#orchagent: :- removeVlanMember: Remove member Ethernet2 from VLAN Vlan4094 lid:ffe vmid:27000000000a53
Jan 23 09:17:24.420710 sonic NOTICE swss#orchagent: :- flushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 3
Jan 23 09:17:24.420847 sonic NOTICE swss#orchagent: :- recordFlushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 3
Jan 23 09:17:24.426082 sonic NOTICE syncd#syncd: :- processFdbFlush: fdb flush succeeded, updating redis database
Jan 23 09:17:24.426242 sonic NOTICE syncd#syncd: :- processFlushEvent: received a flush port fdb event, portVid = oid:0x3a000000000a52, bvId = oid:0x26000000000a51
Jan 23 09:17:24.426374 sonic NOTICE syncd#syncd: :- processFlushEvent: pattern ASIC_STATE:SAI_OBJECT_TYPE_FDB_ENTRY:*oid:0x26000000000a51*, portStr oid:0x3a000000000a52
Jan 23 09:17:24.427104 sonic NOTICE bgp#fpmsyncd: :- onRouteMsg: RouteTable del msg for route with only one nh on eth0/docker0: fe80::/64 :: eth0
Jan 23 09:17:24.427182 sonic NOTICE bgp#fpmsyncd: :- onRouteMsg: RouteTable del msg for route with only one nh on eth0/docker0: fd00::/80 :: docker0
Jan 23 09:17:24.428502 sonic NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: processing consolidated fdb flush event of type: SAI_FDB_ENTRY_TYPE_DYNAMIC
Jan 23 09:17:24.429058 sonic NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: fdb flush took 0.000606 sec
Jan 23 09:17:24.431496 sonic NOTICE swss#orchagent: :- setHostIntfsStripTag: Set SAI_HOSTIF_VLAN_TAG_STRIP to host interface: Ethernet2
Jan 23 09:17:24.431675 sonic NOTICE swss#orchagent: :- flushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 2
Jan 23 09:17:24.431797 sonic NOTICE swss#orchagent: :- recordFlushFdbEntries: flush key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000, fields: 2
Jan 23 09:17:24.437009 sonic NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: processing consolidated fdb flush event of type: SAI_FDB_ENTRY_TYPE_DYNAMIC
Jan 23 09:17:24.437532 sonic NOTICE swss#orchagent: :- meta_sai_on_fdb_flush_event_consolidated: fdb flush took 0.000514 sec
Jan 23 09:17:24.437942 sonic NOTICE syncd#syncd: :- processFdbFlush: fdb flush succeeded, updating redis database
Jan 23 09:17:24.438065 sonic NOTICE syncd#syncd: :- processFlushEvent: received a flush port fdb event, portVid = oid:0x3a000000000a52, bvId = oid:0x0
Jan 23 09:17:24.438173 sonic NOTICE syncd#syncd: :- processFlushEvent: pattern ASIC_STATE:SAI_OBJECT_TYPE_FDB_ENTRY:*, portStr oid:0x3a000000000a52
Jan 23 09:17:24.440348 sonic NOTICE swss#orchagent: :- removeBridgePort: Remove bridge port Ethernet2 from default 1Q bridgeJan 23 09:17:29.782554 sonic NOTICE swss#orchagent: :- removeVlan: VLAN Vlan4094 still has 1 FDB entries
Jan 23 09:17:29.791373 sonic WARNING sflow#port_index_mapper: no interface with this name
Signed-off-by: Gokulnath-Raja <Gokulnath_R@dell.com>
Why I did it
sonic-sfp based sfp impl would be deprecated in future, change to sfp-refactor based implementation.
How I did it
Use the new sfp-refactor based sfp implementation for seastone.
How to verify it
Manual test sfp platform api or run sfp platform test cases.
Why I did it
There is rare condition, emc2305 hold SMBus and cause SMBus completion wait timed out.
How I did it
Enable EMC2305 SMBus timeout feature, 30ms period of inactivity will reset the interface.
How to verify it
Use 'i2cget -y -f 23 0x4d 0x20 b' to read EMC2305 configuration register and check DIS_TO bit not set.
Signed-off-by: Eric Zhu <erzhu@celestica.com>
Why I did it
sonic-slave-stretch build failed for mmh3 version update to 3.10 on Mar 24.
How I did it
Enable reproducible build for vhdx image.
How to verify it
Why I did it
Change static route expiry timer max timeout value from 1800 to 172800.
To keep same value range as defined in sonic-restapi/sonic_api.yaml
How I did it
How to verify it
apply change to bgpcfd, restart bgp container see if the value take action.
Why I did it
This is a fix for c63e9fe
SLAVE_TAG should include all dependencies used for SLAVE_BASE_TAG
How I did it
Take sources.list.* into account when calculate SLAVE_TAG
How to verify it
Update sonic-py-common, add missing dependency to redis-dump-load.
#### Why I did it
The script sonic_db_dump_load.py in sonic-py-common is depends on redis-dump-load, however the dependency is missing.
#### How I did it
Add redis-dump-load dependency.
#### How to verify it
Pass all E2E test case.
#### Description for the changelog
Update sonic-py-common, add missing dependency to redis-dump-load.
Why I did it
For better accounting purposes, updating the ingress lossy traffic profile to use static threshold. This change is only intended for Th devices using RDMA-CENTRIC profiles
How I did it
Update the buffer templates for Th devices in RDMA-CENTRIC folder to use the correct threshold
How to verify it
Verified the changes manually on a Th device.
Existing unit tests render Th template from the RDMA-CENTRIC folder. Updated the expected output to use the correct threshold
Why I did it
The demo_part_size should be initialized before creating partition.
How I did it
Move the initializing setting to the line before using it.
How to verify it