Improve sudo cat command for RO user.
#### Why I did it
RO user can use sudo command show none syslog files.
#### How I did it
Improve sudo cat command for RO user.
#### How to verify it
Pass all UT.
Manually check fixed code work correctly.
#### Description for the changelog
Improve sudo cat command for RO user.
Why I did it
This PR addresses the issue mentioned above by loading the acl config as a service on a storage backend device
How I did it
The new acl service is a oneshot service which will start after swss and does some retries to ensure that the SWITCH_CAPABILITY info is present before attempting to load the acl rules. The service is also bound to sonic targets which ensures that it gets restarted during minigraph reload and config reload
How to verify it
Build an image with the following changes and did the following tests
Verified that acl is loaded successfully on a storage backend device after a switch boot up
Verified that acl is loaded successfully on a storage backend ToR after minigraph load and config reload
Verified that acl is not loaded if the device is not a storage backend ToR or the device does not have a DATAACL table
Signed-off-by: Neetha John <nejo@microsoft.com>
#### Why I did it
Following the PR https://github.com/sonic-net/sonic-swss-common/pull/739 increasing netlink buffer size in linux kernel
As error is seen in fdbsyncd with netlink reports "out of memory on reading a netlink socket" It is seen when kernel is sending 10k remote mac to fdbsyncd.
#### How I did it
Increase the buffer size of the netlink buffer from 3MB to 16MB
#### How to verify it
Verified with 10k remote mac, and restarting the fdbsyncd process. So that kernel send the bridge fdb dump to the fdbsyncd.
Verified that the netlink buffer error is not reported in the sys log.
- Why I did it
In to-sonic and multi-asic KVM-test, pretest sometimes failed. Reason is rsyslogd process can not start in teamd container. Because rsyslog.conf is empty caused by sonic-cfggen execute failed
- How I did it
If sonic-cfggen -d execute failed, execute without -d because the template file has the default value.
- How to verify it
Build image and test it over 40 times, all passed pretest.
Signed-off-by: Chun'ang Li <chunangli@microsoft.com>
- Why I did it
fixes#12907
When the management interface IP address configuration changes from dynamic to static the DNS configuration (retrieved from the DHCP server) in /etc/resolv.conf remains uncleared. This leads to a DNS configuration pointing to the wrong nameserver. To make the behavior clear DNS configuration received from DHCP should be cleared.
- How I did it
Use resolvconf package for managing DNS configuration. It is capable of tracking the source of DNS configuration and puts the configuration retrieved from the DHCP servers into a separate file. This allows the implementation of DNS configuration cleanup retrieved from DHCP during networking reconfiguration.
- How to verify it
Ensure that the management interface has no static configuration.
Check that /etc/resolv.conf has DNS configuration.
Configure a static IP address on the management interface.
Verify that /etc/resolv.conf has no DNS configuration.
Remove the static IP address from the management interface.
Verify that /etc/resolv.conf has DNS configuration retrieved form DHCP server.
Adding /usr/local/bin/storyteller to READ_ONLY_CMDS. So no write access or prompt for password is needed to run storyteller.
Tested on 202205 clusters, user who didn't request write access was able to grep log using storyteller.
sign-off: Jing Zhang zhangjing@microsoft.com
Debian is shipping a systemd timer unit for logrotate, but we're also
packaging in a cron job, which means both of them will run, potentially
at the same time. Remove our cron file, and add an override to the
shipped timer file to have it be run every 10 minutes.
Fixes#12392.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
#### Why I did it
Segfault was occuring when running memory_checker
#### How I did it
Deinit publisher immediately after publishing
#### How to verify it
Manual testing
Backport of https://github.com/sonic-net/sonic-buildimage/pull/12490 into 202211
- Why I did it
Support syslog rate limit configuration feature
- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration
- How to verify it
Manual test
New sonic-mgmt regression cases
The main issue is the pip/pip3 command cannot be found when the package is being installed by apt-get.
When using the dpkg install, the searching path is PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When using the apt-get install, the searching path is PATH=/usr/sbin:/usr/bin:/sbin:/bin
But the pip/pip3 default path is at /usr/local/bin, so dpkg works, but apt-get not work.
How I did it
Export the path /usr/local/bin for pip/pip3.
Make the deb packages can be installed by apt-get.
Why I did it
The current lazy installer relies on a filename sort for both unpack and configuration steps. When systemd services are configured [started] by multiple packages the order is by filename not by the declared package dependencies. This can cause the start order of services to differ between first-boot and subsequent boots. Declared systemd service dependencies further exacerbate the issue (e.g. blocking the first-boot script).
The current installer leaves packages un-configured if the package dependency order does not match the filename order.
This also fixes a trivial bug in [Build]: Support to use symbol links for lazy installation targets to reduce the image size #10923 where externally downloaded dependencies are duplicated across lazy package device directories.
How I did it
Changed the staging and first-boot scripts to use apt-get:
dpkg -i /host/image-$SONIC_VERSION/platform/$platform/*.deb
becomes
apt-get -y install /host/image-$SONIC_VERSION/platform/$platform/*.deb
when dependencies are detected during image staging.
How to verify it
Apt-get critical rules
Add a Depends= to the control information of a package. Grep the syslog for rc.local between images and observe the configuration order of packages change.
Why I did it
nameserver and domain entries from build system fsroot gets into sonic image.
How I did it
Clear /etc/resolv.conf before building image
How to verify it
Built image with it and verified with install that /etc/resolv.conf is empty
- Why I did it
Fix logrotate firstaction script to reflect correct size. The size was modified to change dynamically based on disk size. However this variable was not updated
#9504
- How I did it
Updated the variable based on disk size
- How to verify it
Verify in the generated rsyslog file if the variable is correctly generated from jinja template
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`subprocess` is used with `shell=True`, which is very dangerous for shell injection.
`os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content
#### How I did it
remove `shell=True`, use `shell=False`
Replace `os` by `subprocess`
* Fix to improve hostname handling
If config_db.json is missing hostname entry, hostname-config.sh ends
up deleting existing entry too and hostname changes to default 'localhost'
* default hostname to 'sonic` if missing in config file
There's an odd crash that intermittently happens after the teamd container
exits, and a signal is raised to the main thread to exit. This thread (watching
teamd) continues execution because it's in a `while True`. The subsequent wait
call on the teamd container very likely returns immediately, and it calls
`is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these
cases, sometimes, there is a crash in the transition from C code to Python code
(after the function gets executed). Python sees that this thread got a signal
to exit, because the main thread is exiting, and tells pthread to exit the
thread. However, during the stack unwinding, _something_ is telling the
unwinder to call `std::terminate`. The reason is unknown.
This then results in a python3 SIGABRT, and systemd then doesn't call the stop
script to actually stop the container (possibly because the main process exited
with a SIGABRT, so it's a hard crash). This means that the container doesn't
actually get stopped or restarted, resulting in an inconsistent state
afterwards.
The workaround appears to be that if we know the main thread needs to exit,
just return here, and don't continue execution. This at least tries to avoid it
from getting into the problematic code path. However, it's still feasible to
get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals
the main thread to exit, and then syncd exits, and syncd calls one of the two C
functions, potentially hitting the issue).
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
It could happen that a container has already crashed but docker-wait-any
will wait forever till it starts. It should, however, immediately exit
to make the serivce restart.
#### Why I did it
It is observed in some circumstances that the auto-restart mechanism does not work. Specifically for ```swss.service```, ```orchagent``` had crashed before ```docker-wait-any``` started in ```swss.sh```. This led ```docker-wait-any``` wait forever for ```swss``` to be in ```"Running"``` state and it results in:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1abef1ecebff bcbca2b74df6 "/usr/local/bin/supe…" 22 hours ago Up 22 hours what-just-happened
3c924d405cd5 docker-lldp:latest "/usr/bin/docker-lld…" 22 hours ago Up 22 hours lldp
eb2b12a98c13 docker-router-advertiser:latest "/usr/bin/docker-ini…" 22 hours ago Up 22 hours radv
d6aac4a46974 docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours mgmt-framework
d880fd07aab9 docker-platform-monitor:latest "/usr/bin/docker_ini…" 22 hours ago Up 22 hours pmon
75f9e22d4fdd docker-snmp:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours snmp
76d570a4bd1c docker-sonic-telemetry:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours telemetry
ee49f50344b3 docker-syncd-mlnx:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours syncd
1f0b0bab3687 docker-teamd:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours teamd
917aeeaf9722 docker-orchagent:latest "/usr/bin/docker-ini…" 22 hours ago Exited (0) 22 hours ago swss
81a4d3e820e8 docker-fpm-frr:latest "/usr/bin/docker_ini…" 22 hours ago Up 22 hours bgp
f6eee8be282c docker-database:latest "/usr/local/bin/dock…" 22 hours ago Up 22 hours database
```
The check for ```"Running"``` state is not needed because for cold boot case we do ```start_peer_and_dependent_services``` and for warm boot case the loop will retry to wait for container if this container is doing warm boot:
d01a91a569/files/image_config/misc/docker-wait-any (L56)
#### How I did it
Removed the check for ```"Running"```.
#### How to verify it
Kill swss before ```docker-wait-any``` is reached and verify auto restart will restart swss serivce.
#### Why I did it
To deprecate swsssdk, remove all dependency to it.
#### How I did it
Remove swsssdk from rules and build image scripts.
#### How to verify it
Pass all UT and E2E test case
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Remove swsssdk from rules and build image scripts.
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.
The monit 'container_checker' fails in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).
* Add k8s master feature
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Update kubernetes version mistake and make variable passing clear
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Add CRI-dockerd package
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Update version variable passing logic
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Upgrade the worker kubernetes version
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Install xml file parse tool
Signed-off-by: Yun Li <yunli1@microsoft.com>
Signed-off-by: Yun Li <yunli1@microsoft.com>
Spanning from sonic-net/sonic-linkmgrd#76, this PR is to update warm restart finalizer to wait for linkmgrd to be reconciled.
sign-off: Jing Zhang zhangjing@microsoft.com
Why I did it
To make sure finalizer save config after linkmgrd's reconciliation.
How I did it
Add linkmgrd to the reconciliation wait list of warmboot finalizer.
How to verify it
Verified on lab device, linkmgrd reconciled as expected.
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.
- Why I did it
To fix#9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.
- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.
- How to verify it
Run togather with Azure/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit).
This PR fixes issue #11472.
Signed-off-by: liora liora@nvidia.com
Why I did it
In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check.
How I did it
Use systemctl is-active to check if Docker engine is still running.
How to verify it
Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log.
Which release branch to backport (provide reason below if selected)
The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
- Why I did it
To implement Syslog Source IP feature
In order to include the following commit: 8e5d478 [ssip]: Add CLI (#2191)
- How I did it
Updated syslog config template
Advanced submodule sonic-utilities
ea11b22 [sonic-bootchart] add sonic-bootchart (#2195)
8e5d478 [ssip]: Add CLI (#2191)
1dacb7f Replace pyswsssdk with swsscommon (#2251)
- How to verify it
make configure PLATFORM=mellanox
make target/sonic-mellanox.bin
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
Refactors the SONiC Installer to support greater flexibility in building for a given architecture and bootloader.
#### Why I did it
Currently the SONiC installer assumes that if a platform is ARM based that it uses the `uboot` bootloader and uses the `grub` bootloader otherwise. This is not a correct assumption to make as ARM is not strictly tied to uboot and x86 is not strictly tied to grub.
#### How I did it
To implement this I introduce the following changes:
* Remove the different arch folders from the `installer/` directory
* Merge the generic components of the ARM and x86 installer into `installer/installer.sh`
* Refactor x86 + grub specific functions into `installer/default_platform.conf`
* Modify installer to call `default_platform.conf` file and also call `platform/[platform]/patform.conf` file as well to override as needed
* Update references to the installer in the `build_image.sh` script
* Add `TARGET_BOOTLOADER` variable that is by default `uboot` for ARM devices and `grub` for x86 unless overridden in `platform/[platform]/rules.mk`
* Update bootloader logic in `build_debian.sh` to be based on `TARGET_BOOTLOADER` instead of `TARGET_ARCH` and to reference the grub package in a generic manner
#### How to verify it
This has been tested on a ARM test platform as well as on Mellanox amd64 switches as well to ensure there was no impact.
#### Description for the changelog
[arm] Refactor installer and build to allow arm builds targeted at grub platforms
#### Link to config_db schema for YANG module changes
N/A
Why I did it
Currently interfaces.j2 hardcodes to eth0 even when there are multiple interfaces in MGMT_INTERFACE. This change adds support to generate /e/n/i when there are multiple interfaces in MGMT_INTERFACE.
How I did it
By removing hardcoded eth0 when looping through MGMT_INTERFACE.
How to verify it
Verified through unit test.
Which release branch to backport (provide reason below if selected)
201811
201911
202006
202012
202106
202111
202205
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)
This reverts commit 90a849ea85.
#### Why I did it
The interfaces unit test did not cover some of the conditions in interfaces.j2 that was changed in #11204. Therefore reverting the change and add the tests before making the change to interfaces.j2.
#### How I did it
Git revert.
#### How to verify it
#### Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
#### Link to config_db schema for YANG module changes
#### A picture of a cute animal (not mandatory but encouraged)
* [Interfaces] Modify template to support multiple management interfaces
* Modify minigraph to process interfaces in sorted order
Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>
* Add UT minigraph
Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>
* make case insensitve comparison
Signed-off-by: George Chen <gechen@microsoft.com>
* Use natural sort
Signed-off-by: George Chen <gechen@microsoft.com>
Co-authored-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to fix an issue (#10088) by enhancing the script memory_checker.
Specifically, if container is not created successfully during device is booted/rebooted, then memory_checker do not need check its memory usage.
How I did it
In the script memory_checker, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage.
How to verify it
I tested on a lab device by following the steps:
Stops telemetry container with command sudo systemctl stop telemetry.service
Removes telemetry container with command docker rm telemetry
Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
Why I did it
The dhcp_graph_url used by internal service is always set as "N/A". So we can make the updategraph logic short.
How I did it
Shorten 'if statement' logic for /tmp/dhcp_graph_url
What/Why I did:
Issue1: By setting up of ipvlan interface in interface-config.sh we are not tolerant to failures. Reason being interface-config.service is one-shot and do not have restart capability.
Scenario: For example if let's say database service goes in fail state then interface-services also gets failed because of dependency check but later database service gets restart but interface service will remain in stuck state and the ipvlan interface nevers get created.
Solution: Moved all the logic in database service from interface-config service which looks more align logically also since the namespace is created here and all the network setting (sysctl) are happening here.With this if database starts we recreate the interface.
Issue 2: Use of IPVLAN vs MACVLAN
Currently we are using ipvlan mode. However above failure scenario is not handle correctly by ipvlan mode. Once the ipvlan interface is created and ip address assign to it and if we restart interface-config or database (new PR) service Linux Kernel gives error "Error: Address already assigned to an ipvlan device." based on this:https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978Reason being if we do not do cleanup of ip address assignment (need to be unique for IPVLAN) it remains in Kernel Database and never goes to free pool even though namespace is deleted.
Solution: Considering this hard dependency of unique ip macvlan mode is better for us and since everything is managed by Linux Kernel and no dependency for on user configured IP address.
Issue3: Namespace database Service do not check reachability to Supervisor Redis Chassis Server.
Currently there is no explicit check as we never do Redis PING from namespace to Supervisor Redis Chassis Server. With this check it's possible we will start database and all other docker even though there is no connectivity and will hit the error/failure late in cycle
Solution: Added explicit PING from namespace that will check this reachability.
Issue 4:flushdb give exception when trying to accces Chassis Server DB over Unix Sokcet.
Solution: Handle gracefully via try..except and log the message.
Why I did it
To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration.
How I did it
New option 'ssd-upgrader-part’ is introduced in grub command line, to indicate the partition and its filesystem type in which the SSD firmware updater is present. ‘ssd-upgrader-part’ syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4
A new initramfs script ‘ssd-upgrade’ is included in init-premount and it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part'
How to verify it
In SONiC, the SSD firmware updater is copied to “/host/” directory.
Fast-reboot is to be initiated with the ‘-u’ option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150)
After reboot, while booting into SONiC the SSD firmware updater will be executed in initramfs.
Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:
Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.
How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.
How to verify it
Manual test.