1. remove container feature table
2. do not generate feature entry if the feature is not included
in the image
3. rename ENABLE_* to INCLUDE_* for better clarity
4. rename feature status to feature state
5. [submodule]: update sonic-utilities
* 9700e45 2020-08-03 | [show/config]: combine feature and container feature cli (#1015) (HEAD, origin/master, origin/HEAD) [lguohan]
* c9d3550 2020-08-03 | [tests]: fix drops_group_test failure on second run (#1023) [lguohan]
* dfaae69 2020-08-03 | [lldpshow]: Fix input device is not a TTY error (#1016) [Arun Saravanan Balachandran]
* 216688e 2020-08-02 | [tests]: rename sonic-utilitie-tests to tests (#1022) [lguohan]
Signed-off-by: Guohan Lu <lguohan@gmail.com>
As part of consolidating all common Python-based functionality into the new sonic-py-common package, this pull request:
1. Redirects all Python applications/scripts in sonic-buildimage repo which previously imported sonic_device_util or sonic_daemon_base to instead import sonic-py-common, which was added in https://github.com/Azure/sonic-buildimage/pull/5003
2. Replaces all calls to `sonic_device_util.get_platform_info()` to instead call `sonic_py_common.get_platform()` and removes any calls to `sonic_device_util.get_machine_info()` which are no longer necessary (i.e., those which were only used to pass the results to `sonic_device_util.get_platform_info()`.
3. Removes unused imports to the now-deprecated sonic-daemon-base package and sonic_device_util.py module
This is the next step toward resolving https://github.com/Azure/sonic-buildimage/issues/4999
Also reverted my previous change in which device_info.get_platform() would first try obtaining the platform ID string from Config DB and fall back to gathering it from machine.conf upon failure because this function is called by sonic-cfggen before the data is in the DB, in which case, the db_connect() call will hang indefinitely, which was not the behavior I expected. As of now, the function will always reference machine.conf.
* [platform] Add Support For Environment Variable
This PR adds the ability to read environment file from /etc/sonic.
the file contains immutable SONiC config attributes such as platform,
hwsku, version, device_type. The aim is to minimize calls being made
into sonic-cfggen during boot time.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* Changes to add template support for copp.json.
This is needed so that we can install differnt type of
Traps based on Device Role (Tor/Leaf/Mgmt/etc...).
Initial use case is to install DHCP/DHCPv6 tarp only
for tor router.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fixed based on review comments.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fixed based on review comment.
virtual-chassis test uses multiple vs instances to simulate a
modular switch and a redis-chassis service is required to run on
the vs instance that represents a supervisor card.
This change allows vs docker start redis-chassis service according
to external config file.
**- Why I did it**
To support virtual-chassis setup, so that we can test distributed forwarding feature in virtual sonic environment, see `Distributed forwarding in a VOQ architecture HLD` pull request at https://github.com/Azure/SONiC/pull/622
**- How I did it**
The sonic-vs start.sh is enhanced to start new redis_chassis service if external chassis config file found. The config file doesn't exist in current vs environment, start.sh will behave like before.
**- How to verify it**
The swss/test still pass. The chassis_db service is verified in virtual-chassis topology and tests which are in following PRs.
Signed-off-by: Honggang Xu <hxu@arista.com>
(cherry picked from commit c1d45cf81ce3238be2dcbccae98c0780944981ce)
Co-authored-by: Honggang Xu <hxu@arista.com>
Consolidate common SONiC Python-language functionality into one shared package (sonic-py-common) and eliminate duplicate code.
The package currently includes three modules:
- daemon_base
- device_info
- logger
Fix for the host unmount issue through PR https://github.com/Azure/sonic-buildimage/pull/4558 and https://github.com/Azure/sonic-buildimage/pull/4865 creates the timeout of syslog.socket closure during reboot since the journald socket closure has been included in syslog.socket
Removed the journal socket closure. The host unmount is fixed with just stopping the services which gets restarted only after /var/log unmount and not causing the unmount issues.
`sonic-installer list` is a read-only command. Specify it as such in the sudoers file.
This will also ensure the new `show boot` command, which calls `sudo sonic-installer list` under the hood doesn't fail due to permissions.
Otherwise, it may cause issues for warm restarts, warm reboot.
Warm restart of swss will start nat which is not expected for warm
restart. Also it is observed that during warm-reboot script execution
nat container gets started after it was killed. This causes removal of
nat dump generated by nat previously:
A check [ -f /host/warmboot/nat/nat_entries.dump ] || echo "NAT dump
does not exists" was added right before kexec:
```
Fri Jul 17 10:47:16 UTC 2020 Prepare MLNX ASIC to fastfast-reboot:
install new FW if required
Fri Jul 17 10:47:18 UTC 2020 Pausing orchagent ...
Fri Jul 17 10:47:18 UTC 2020 Stopping nat ...
Fri Jul 17 10:47:18 UTC 2020 Stopped nat ...
Fri Jul 17 10:47:18 UTC 2020 Stopping radv ...
Fri Jul 17 10:47:19 UTC 2020 Stopping bgp ...
Fri Jul 17 10:47:19 UTC 2020 Stopped bgp ...
Fri Jul 17 10:47:21 UTC 2020 Initialize pre-shutdown ...
Fri Jul 17 10:47:21 UTC 2020 Requesting pre-shutdown ...
Fri Jul 17 10:47:22 UTC 2020 Waiting for pre-shutdown ...
Fri Jul 17 10:47:24 UTC 2020 Pre-shutdown succeeded ...
Fri Jul 17 10:47:24 UTC 2020 Backing up database ...
Fri Jul 17 10:47:25 UTC 2020 Stopping teamd ...
Fri Jul 17 10:47:25 UTC 2020 Stopped teamd ...
Fri Jul 17 10:47:25 UTC 2020 Stopping syncd ...
Fri Jul 17 10:47:35 UTC 2020 Stopped syncd ...
Fri Jul 17 10:47:35 UTC 2020 Stopping all remaining containers ...
Warning: Stopping telemetry.service, but it can still be activated by:
telemetry.timer
Fri Jul 17 10:47:37 UTC 2020 Stopped all remaining containers ...
NAT dump does not exists
Fri Jul 17 10:47:39 UTC 2020 Rebooting with /sbin/kexec -e to
SONiC-OS-201911.140-08245093 ...
```
With this change, executed warm-reboot 10 times without hitting this
issue, while without this change the issue is easily reproducible almost
every warm-reboot run.
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
* [sonic-buildimage] Move BGP warm reboot scripts into BGP service /usr/local/bin
* Revert "[sonic-buildimage] Move BGP warm reboot scripts into BGP service /usr/local/bin"
This reverts commit d16d163fc4.
* [sonic-buildimage] Move BGP warm reboot script to BGP service
* [sonic-buildimage] Move BGP warm reboot script to BGP service
* [sonic-buildimage] Move BGP warm reboot script to BGP service
- access DB correctly
* Address code review comments, also change file mode of bgp.sh (+x)
* Address code review comments, also change the file mode of bgp.sh (+x)
* BGP warm reboot script to service, also handle fast boot as indicated by flag saved in StateDB
* BGP warm reboot script to service, code review comments on space alignment
* BGP warm reboot script to service: remove uncesseary space
* BGP warm reboot script to service: replace tab with space
* Code review comments: -) use new multi-db api -) add ignore error from zebra in case it's not configured
* Integrate with multi-ASIC changes committed recently
Co-authored-by: heidi.ou@alibaba-inc.com <heidi.ou@alibaba-inc.com>
* Defect 2082949: Handling Control Plane ACLs so that IPv4 rules and IPv6 rules are not added to the same ACL table
* Previous code review comments of coming up with functions for is_ipv4_rule and is_ipv6_rule is addressed and also raising Exceptions instead of simply aborting when the conflict occurs is handled
* Addressed code review comment to replace duplicate code with already existing functions
* removed raising Exception when rule conflict in Control plane ACLs are found
* added code to remove the rule_props if it is conflicting ACL table versioning rule
* addressed review comment to add ignoring rule in the error statement
Co-authored-by: Madhan Babu <madhan@arc-build-server.mtr.labs.mlnx>
* PCIe Monitor service
* Add rescan to pcie-mon.service when it fails to get all pcie devices
* space
* Clean up
* review comments
* update the pcie status in state db
* update the failed pcie status once at the end
* Update the pcie_status in STATE_DB and rename the service
* Add log to exit the service if the configuration file doesn't exist.
* fix the build failure
* Redo the pcie rescan for pcie-check failed case.
* review comments
* review comments
* review comments
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
This PR has changes to support accessing the bcmsh and bcmcmd utilities on multi ASIC devices
Changes done
- move the link of /var/run/sswsyncd from docker-syncd-brcm.mk to docker_image_ctl.j2
- update the bcmsh and bcmcmd scripts to take -n [ASIC_ID] as an argument on multi ASIC platforms
This pull request was cherry picked from "#1238" to resolve the conflicts.
- Why I did it
Add support to specify source address for TACACS+
- How I did it
Add patches for libpam-tacplus and libnss-tacplus. The patches parse the new option 'src_ip' and store the converted addrinfo. Then the addrinfo is used for TACACS+ connection.
Add a attribute 'src_ip' for table "TACPLUS|global" in configDB
Add some code to adapt to the attribute 'src_ip'.
- How to verify it
Config command for source address PR in sonic-utilities
config tacacs src_ip <ip_address>
- Description for the changelog
Add patches to specify source address for the TACACS+ outgoing packets.
- A picture of a cute animal (not mandatory but encouraged)
**UT logs: **
UT_tacacs_source_intf.txt
* [sonic-buildimage] Changes to make network specific sysctl
common for both host and docker namespace (in multi-npu).
This change is triggered with issue found in multi-npu platforms
where in docker namespace
net.ipv6.conf.all.forwarding was 0 (should be 1) because of
which RS/RA message were triggered and link-local router were learnt.
Beside this there were some other sysctl.net.ipv6* params whose value
in docker namespace is not same as host namespace.
So to make we are always in sync in host and docker namespace
created common file that list all sysctl.net.* params and used
both by host and docker namespace. Any change will get applied
to both namespace.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Address Review Comments and made sure to invoke augtool
only one and do string concatenation of all set commands
* Address Review Comments.
Add changes for syslog support for containers running in namespaces on multi ASIC platforms.
On Multi ASIC platforms
Rsyslog service is only running on the host. There is no rsyslog service running in each namespace.
On multi ASIC platforms the rsyslog service on the host will be listening on the docker0 ip address instead of loopback address.
The rsyslog.conf on the containers is modified to have omfwd target ip to be docker0 ipaddress instead of loopback ip
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
* Changes to make default route programming
correct in multi-asic platform where frr is not running
in host namespace. Change is to set correct administrative distance.
Also make NAMESPACE* enviroment variable available for all dockers
so that it can be used when needed.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fix review comments
* Review comment to check to add default route
only if default route exist and delete is successful.
* [systemd-generator]: Fix the code to make sure that dependencies
of host services are generated correctly for multi-asic platforms.
Add code to make sure that systemd timer files are also modified
to add the correct service dependency for multi-asic platforms.
Signed-off-by: SuvarnaMeenakshi <sumeenak@microsoft.com>
* [systemd-generator]: Minor fix, remove debug code and
remove unused variable.
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.
Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.
**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
When building the SONiC image, used systemd to mask all services which are set to "disabled" in init_cfg.json.
This PR depends on https://github.com/Azure/sonic-utilities/pull/944, otherwise `config load_minigraph will fail when trying to restart disabled services.
* Add secureboot support in boot0
* Initramfs changes for secureboot on Aboot devices
* Do not compress squashfs and gz in fs.zip
It doesn't make much sense to do so since these files are already
compressed.
Also not compressing the squashfs has the advantage of making it
mountable via a loop device.
* Add loopoffset parameter to initramfs-tools
- Ensure all features (services) are in the configured state when hostcfgd starts
- Better functionalization of code
- Also replace calls to deprecated `has_key()` method in `tacacs_server_handler()` and `tacacs_global_handler()` with `in` keyword.
This PR depends on https://github.com/Azure/sonic-utilities/pull/944, otherwise `config load_minigraph` will fail when trying to restart disabled services.
While migrating to SONiC 20181130, identified a couple of issues:
1. union-mount needs /host/machine.conf parameters for vendor specific checks : however, in case of migration, the /host/machine.conf is extracted from ONIE only in https://github.com/Azure/sonic-buildimage/blob/master/files/image_config/platform/rc.local#L127.
2. Since grub.cfg is updated to have net.ifnames=0 biosdevname=0, 70-persistent-net.rules changes are no longer required.
Don't limit iptables connection tracking to TCP protocol; allow connection tracking for all protocols. This allows services like NTP, which is UDP-based, to receive replies from an NTP server even if the port is blocked, as long as it is in reply to a request sent from the device itself.
* Fix the Build on 201911 (Stretch) where the directory
/usr/lib/systemd/system/ does not exist so creating
manually. Change should not harm Master (buster) where
the directory is created by Linux
* Fix as per review comments
* Support rw files allowlist for Sonic Secure Boot
* Improve the performance
* fix bug
* Move the config description into a md file
* Change to use a simple way to remove the blank line
* Support chmod a-x in rw folder
* Change function name
* Change some unnecessary words
**- Why I did it**
To ensure telemetry service is enabled by default after installing a fresh SONiC image
**- How I did it**
Set telemetry feature status to "enabled" when generating init_cfg.json file
Found another syncd timing issue related to clock going backwards.
To be safe disable the ntp long jump.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
**- Why I did it**
When I tested auto-restart feature of swss container by manually killing one of critical processes in it, swss will be stopped. Then syncd container as the peer container should also be
stopped as expected. However, I found sometimes syncd container can be stopped, sometimes
it can not be stopped. The reason why syncd container can not be stopped is the process
(/usr/local/bin/syncd.sh stop) to execute the stop() function will be stuck between the lines 164 –167. Systemd will wait for 90 seconds and then kill this process.
164 # wait until syncd quit gracefully
165 while docker top syncd$DEV | grep -q /usr/bin/syncd; do
166 sleep 0.1
167 done
The first thing I did is to profile how long this while loop will spin if syncd container can be
normally stopped after swss container is stopped. The result is 5 seconds or 6 seconds. If syncd
container can be normally stopped, two messages will be written into syslog:
str-a7050-acs-3 NOTICE syncd#dsserve: child /usr/bin/syncd exited status: 134
str-a7050-acs-3 INFO syncd#supervisord: syncd [5] child /usr/bin/syncd exited status: 134
The second thing I did was to add a timer in the condition of while loop to ensure this while loop will be forced to exit after 20 seconds:
After that, the testing result is that syncd container can be normally stopped if swss is stopped
first. One more thing I want to mention is that if syncd container is stopped during 5 seconds or 6 seconds, then the two log messages can be still seen in syslog. However, if the execution
time of while loop is longer than 20 seconds and is forced to exit, although syncd container can be stopped, I did not see these two messages in syslog. Further, although I observed the auto-restart feature of swss container can work correctly right now, I can not make sure the issue which syncd container can not stopped will occur in future.
**- How I did it**
I added a timer around the while loop in stop() function. This while loop will exit after spinning
20 seconds.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
I found that with IPv4Network types, calling list(ip_ntwrk.hosts()) is reliable. However, when doing the same with an IPv6Network, I found that the conversion to a list can hang indefinitely. This appears to me to be a bug in the ipaddress.IPv6Network implementation. However, I could not find any other reports on the web.
This patch changes the behavior to call next() on the ip_ntwrk.hosts() generator instead, which returns the IP address of the first host.