#### Why I did it
I made this change to support warm/fast reboot for SONiC extension packages as per HLD Azure/SONiC#682.
#### How I did it
I extended manifest.json.j2 with new warm/fast reboot related fields and also extended sonic_debian_extension.j2 script template to generate the shutdown order files for warm and fast reboot.
- Why I did it
Currently dhcp packets are disabled by the COPP manager for non ToRRouter type switches.
Even if the feature is enabled, DHCP packets wont hook to the CPU since the COPP manager will not trap this packets.
This change is to disable dhcp_relay by default for non ToRRouter switches from init_cfg.json.
With this approach, if the user want to enable the feature for non ToRRouter switches, manual enablement is required by the 'feature' configuration.
This is to keep the current approach for MSFT production issue with dhcp relay for non ToRRouter switched and allow the user to decide if to use it or not.
- How I did it
Configure dhcp_relay 'disabled' by default on init_cfg.json for non ToRRouter switches.
Remove the exclusion of dhcp packets on copp_cfg.json
- How to verify it
Enable dhcp_relay feature on a non ToRRouter switch.
Unit-tests modified so the default values on mocked CONFIG DB in 'test_vectors.py' for dhcp_relay will be 'disabled'.
This is by the change for 'init_cfg.json.j2'.
For ToRRouter the state will change from 'disabled' to 'enabled'.
Another test case added for a 'ToR' switch type, this is to test the state is 'enabled' if the user configured it to be so.
dash doesn't support += operation to append to a variable's value. Use KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} " instead
The below error message is seen when a reboot is issued.
[ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found
Why I did it
Support multiple pcie configuration file and change the pcie status table name
This is to match with below two PRs.
Azure/sonic-platform-common#195Azure/sonic-platform-daemons#189
How I did it
Check pcie configuration file with wild card and change the device status table name
How to verify it
Restart with changes and see if the pcie check works as expected.
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.
How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.
How to verify it
I verified this on the device str-7260cx3-acs-1.
Why I did it
In upgrade scenarios, where config_db.json is not carry forwarded to new image, it could be left w/o TACACS credentials.
Added a service to trigger 5 minutes after boot and restore TACACS, if /etc/sonic/old_config/tacacs.json is present.
How I did it
By adding a service, that would fire 5 mins after boot.
This service apply tacacs if available.
How to verify it
Upgrade and watch status of tacacs.timer & tacacs.service
You may create /etc/sonic/old_config/tacacs.json, with updated credentials
(before 5mins after boot) and see that appears in config & persisted too.
Which release branch to backport (provide reason below if selected)
201911
202006
202012
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.
How I did it
I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container.
How to verify it
I verified this implementation on device str-7260cx3-acs-1.
#### Why I did it
If a process limits using nodes by mempolicy/cpusets, and those nodes become memory exhaustion status, one process may be killed by oom-killer.
No panic occurs in this case, because other node's memory may be free.
This means system total status may be not fatal yet.
#### How I did it
Remove 'vm.panic_on_oom=1' kernel flag from 'vmcore-sysctl.conf '
Why I did it
Currently, there is a bug in the ntp.conf jinja2 template where it will ignore the src_intf directive in CONFIG_DB if there are multiple IP addresses associated with an interface. This code change fixes that bug and allows the template to select the correct source interface for NTP.
How I did it
I did this by modifying the macro in ntp.conf.j2 which determines if there is an ip address associated with an interface to set a state variable when it detects a valid interface entry in CONFIG_DB instead of outputting "true" directly (which could result in multiple "trues" outputted for interfaces with multiple valid IP addresses).
How to verify it
Add two ipv4 addresses to an interface in SONiC
Add the following configuration to config_db.json
{
"NTP": {
"global": {
"src_intf": "Ethernet1"
}
}
}
Replace Ethernet1 with the interface name of the one you assigned the IP addresses to.
Run sudo config reload -y
Open /etc/ntp.conf and verify that the following line exists
...
interface listen Ethernet1
...
The interface specified should be the one set in the previous steps.
Description for the changelog
[ntp] Fix ntp.conf template to allow setting of source port in CONFIG_DB
Why I did it
start pcie-check.service after config-setup.service since pcie_util depends on device_info which is available with config db metadata.
How I did it
Add config-setup.service as a dependency of pcie-check.service
How to verify it
Upon reboot, check if the pcie-check.sh throws the platform api error which is dependent on DEVICE_METADATA
Why I did it
Finding running containers through "docker ps" breaks when kubernetes deploys container, as the names are mangled.
How I did it
The data is is available from FEATURE table, which takes care of kubernetes deployment too.
How to verify it
Deploy a feature via kubernetes and don't expect error from container_check.
Why I did it
Support readonly version of the command vtysh
How I did it
Check if the command starting with "show", and verify only contains single command in script.
Fix#7364
99-default.link - was always in SONiC, but previous systemd (<247) had an issue and it did not work due to issue systemd/systemd#3374. Now systemd 247 works.
However, such policy overrides teamd provided mac address which causes teamd netdev to use a random mac
address. Therefore, needs to be disabled.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Encounter error during "config-setup boot" if the updategraph is enabled.
How I did it
Correct the code inside the config-setup script.
Remove the space between the assignment operator.
How to verify it
Remove the /etc/sonic/config_db.json and reboot the device.
Originally, it will return following error after boot up.
rv: command not found
After modification, it can correctly parse the status of updategraph without error.
This commit has following changes:
* Add templates and code to support VoQ chassis iBGP peers
* Add support to convert a new VoQChassisInternal element in the
BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR
table in CONFIG_DB.
* Add a new set of "voq_chassis" templates to docker-fpm-frr
* Add a new BGP peer manager to bgpcfgd to add neighbors from the
BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates.
* Add a test case for minigraph.py, making sure the VoQChassisInternal
element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its
value is "false".
* Add a set of test cases for the new voq_chassis templates in
sonic-bgpcfgd tests.
Note that the templates expect the new
"bgp bestpath peer-type multipath-relax" bgpd configuration to be
available.
Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Since we introduced a new value always_disabled for the state field in FEATURE table, the expected running container list
should exclude the always_diabled containers. This bug was found by nightly test and posted at here: issue. This PR fixes#7210.
How I did it
I added a logic condition to decide whether the value of state field of a container was always_disabled or not.
How to verify it
I verified this on the device str-dx010-acs-1.
Which release branch to backport (provide reason below if selected)
201811
201911
202006
[ x] 202012
- Why I did it
Group all SONiC services together and able to manage them together. Will be used in config reload command as much simpler and generic way to restart services.
- How I did it
Add services to sonic.target
- How to verify it
Together with Azure/sonic-utilities#1199
config reload -y
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
show ip interfaces is enhanced recently to support multi ASIC platforms in this PR- https://github.com/Azure/sonic-utilities/pull/1396 .
The ipintutil script as to run as sudo user, to get the ip interface from each namespace.
Add this script to the sudoer file so that show ip interface command is available for user with read-only permissions
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
- Why I did it
The pcie configuration file location is under plugin directory not under platform directory.
#6437
- How I did it
Move all pcie.yaml configuration file from plugin to platform directory.
Remove unnecessary timer to start pcie-check.service
Move pcie-check.service to sonic-host-services
- How to verify it
Verify on the device
Update topology script to retrieve hwsku from minigraph
if hwsku information is not available in config_db.
Fix clean up of interfaces in msft_multi_asic_vs hwsku
topology script.
- Why I did it
When bringing up multi-asic VS switch, topology service is started during boot up.
Topology service starts a shell script which runs the topology script present in /usr/share/sonic/device// directory. To invoke hwsku specific script, the topology script tries to retrieve hwsku information from config_db.
During initial boot up config_db might not be populated. In order to start topology service before config_db is updated,
update topology script to get hwsku information from minigraph.xml if it is available.
This will be helpful to bring up multi-asic VS testbed by loading minigraph and starting topology service.
- How I did it
Update topology.sh script to retrieve hwsku information from minigraph.xml.
Fix clean up function on msft_multi_asic_vs toplogy script.
- How to verify it
single-asic VS - no change; topology service is only enabled for multi-asic VS.
multi-asic VS - Bring up multi-asic VS image, copy minigraph to vs image, start topology service. Topology service should be successful.
to test clean up function fix, start topology service - make sure interfaces are created and moved to the right namespaces.
stop topology service - make sure namespace do not have any interface and all front end interfaces are present in default namespace.
- Why I did it
As of Azure/sonic-utilities#1297, subcommands of pcieutil have changed to remove the redundant pcie- prefix. This PR adapts calling applications (pcie-check) to the new syntax.
Resolves#6676
- How I did it
Remove pcie- prefix from pcieutil subcommands in calling applications
Also add pcieutil * to sudoers file, as pcieutil requires elevated permissions
A few issues where discovered with crashkernel on Arista platforms.
1) platforms using `docker_inram=on` would end up OOM in kdump environment.
This happens because the same initramfs is used by SONiC and the crashkernel.
With `docker_inram=on` the `dockerfs.tar.gz` is extracted in a `tmpfs` created for the occasion.
Since `dockerfs.tar.gz` weights more than 1.5G, it doesn't fit into the kdump environment and ends up OOM.
This OOM event can in turn trigger a panic.
2) Arista platforms with `secureboot` enabled would fail to load the crashkernel because the kernel parameter would be discarded on boot.
This happens because the `boot0` in secureboot mode is strict about kernel parameter injection.
3) The secureboot path allowlist would remove kernel crash reports.
4) The kdump service would fail on Arista products since `/boot/` is empty in `secureboot`
**- How I did it**
1) To prevent an OOM event in the crashkernel the fix is to avoid the codepaths in `union-mount` that create tmpfs and populate them. Some more codepath specific to Arista devices are also skipped to make the kdump process faster.
This relies on detecting that the initramfs is starting in a kdump environment and skipping some initialization.
The `/usr/sbin/kdump-config` tool appends a few kernel cmdline arguments when loading the crashkernel.
The most unique one is `systemd.unit=kdump-tools.service` which is used in a few initramfs hooks to set `in_kdump`.
2) To allow `kdump` to work in `secureboot` environment the cmdline generation in boot0 was slightly modified.
The codepath to load kernel parameters changed by SONiC is now running for booting in secure mode.
It was altered to prevent an append only behavior which would grow the `kernel-cmdline` at every reboot.
This ever growing behavior would lead `kexec` to fail to load the kernel due to a too long cmdline.
3) To get the kernel crash under /var/crash this path has to be added to `allowlist_paths`
4) The `/host/image-XXX/boot` folder is now populated in `secureboot` mode but not used.
**- How to verify it**
Regular boot:
- enable kdump
- enable docker_inram=on via kernel-params
- reboot
- generate a crash `echo c > /proc/sysrq-trigger`
- before: witness OOM events on the console
- after: crash kernel works and crash available under /var/crash
Secure boot:
- enable kdump
- reboot
- generate a crash `echo c > /proc/sysrq-trigger`
- before: witness no kdump
- after: crash kernel works and crash available under /var/crash
Co-authored-by: Boyang Yu <byu@arista.com>
fixesAzure/sonic-utilities#1389
With the recent changes in sudoer files. The show commands fails for the read-only users.
The problem here is the 'docker ps' is failing in the function [get_routing_stack()](8a1109ed30/show/main.py (L54)) therefore all the CLI commands are failing.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
- Why I did it
The command sudo ip netns identify <pid> is used in function get_current_namespace
to check in the cli command is running in host context or within a namespace.
This function is used for every CLI command and command sudo ip netns identify <pid> needs to be added in sudoer files to allow users with RO access to run show cli commands
This problem is not there on single asic platforms.
- How I did it
Add ip netns identify [0-9]* to sudoers file.
Following changes were done for ebtables:
- Support for Multi-asic platforms. Ebtable filters are installed in namespace for multi-asic and not host. On Single asic installed on host.
- For Multi-asic platforms we don't want to install on host otherwise Namespace-to-Namespace communication does not happens since ARP Request are not forwarded.
- Updated to use text file to restore ebtables rules then the binary format. Rules are restore as part of Database docker init instead of rc.local
- Removed the ebtable service files for buster as not needed as filters are restored/installed as part of database docker init.
All the binaries are pre-installed with ebtables* binary are same as ebatbles-legacy-*
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com
- Why I did it
This PR has the changes to support having different swss.rec and sairedis.rec for each asic.
The logrotate script is updated as well
- How I did it
Update the orchagent.sh script to use the logfile name options in these PRs(Azure/sonic-swss#1546 and Azure/sonic-sairedis#747)
In multi asic platforms the record files will be different for each asic, with the format swss.asic{x}.rec and sairedis.asic{x}.rec
Update the logrotate script for multiasic platform .
* [warm boot finalizer] only wait for enabled components to reconcile
Define the component with its associated service. Only wait for components that have associated service enabled to reconcile during warm reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
**- Why I did it**
This PR aims to monitor the running status of each container. Currently the auto-restart feature was enabled. If a critical process exited unexpected, the container will be restarted. If the container was restarted 3 times during 20 minutes, then it will not run anymore unless we cleared the flag using the command `sudo systemctl reset-failed <container_name>` manually.
**- How I did it**
We will employ Monit to monitor a script. This script will generate the expected running container list and compare it with the current running containers. If there are containers which were expected to run but were not running, then an alerting message will be written into syslog.
**- How to verify it**
I tested this feature on a lab device `str-a7050-acs-3` which has single ASIC and `str2-n3164-acs-3` which has a Multi-ASIC. First I manually stopped a container by running the command `sudo systemctl stop <container_name>`, then I checked whether there was an alerting message in the syslog.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
In scenario where upgrade gets config from minigraph, it could miss tacacs credentials as they are not in minigraph. Hence restore explicitly upon load-minigraph, if present.
- Why I did it
Upon boot, when config migration is required, the switch could load config from minigraph. The config-load from minigraph would wipe off TACACS key and disable login via TACACS, which would disable all remote user access. This change, would re-configure the TACACS if there is a saved copy available.
- How I did it
When config is loaded from minigraph, look for a TACACS credentials back up (tacacs.json) under /etc/sonic/old_config. If present, load the credentials into running config, before config-save is called.
- How to verify it
Remove /etc/sonic/config_db.json and do an image update. Upon reboot, w/o this change, you would not be able ssh in as remote user. You may login as admin and check out, "show tacacs" & "show aaa" to verify that tacacs-key is missing and login is not enabled for tacacs.
With this change applied, remove /etc/sonic/config_db.json, but save tacacs & aaa credentials as tacacs.json in /etc/sonic/. Upon reboot, you should see remote user access possible.
Added source interface support for NTP.
Also made NTP start on Mgmt-VRF by default when configured.
**- How I did it**
1) Updated hostcfg to listen to global config NTP and NTP_SERVER tables and restart ntp when ever the configuration changes. NTP table includes source interface configuration.
2) The ntp script updated to by default start on Mgmt-VFT when configured.
Signed-off-by: Prabhu Sreenivasan <prabhu.sreenivasan@broadcom>
Certain platform specific packages sonic-platform-xyz, installs files onto rootfs, which would be placed on read-write mount path on /host/image-name/rw/...
when ntpd starts it tries to do read access on /usr/bin /usr/sbin/ /usr/local/bin , which inturn links further to the read-write mount path also.
Where ntpd would get below Apparmor Warning message
LOG:-
audit: type=1400 audit(1606226503.240:21): apparmor="DENIED" operation="open" profile="/usr/sbin/ntpd" name="/image-HEAD-dirty-20201111.173951/rw/usr/local/bin/" pid=3733 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
audit: type=1400 audit(1606226503.240:22): apparmor="DENIED" operation="open" profile="/usr/sbin/ntpd" name="/image-HEAD-dirty-20201111.173951/rw/usr/sbin/" pid=3733 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
audit: type=1400 audit(1606226503.240:23): apparmor="DENIED" operation="open" profile="/usr/sbin/ntpd" name="/image-HEAD-dirty-20201111.173951/rw/usr/bin/" pid=3733 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Fix:
Add rw/.. mount path similar to root path access provided for ntpd in /etc/apparmor.d/usr.sbin.ntpd
Signed-off-by: Antony Rheneus <arheneus@marvell.com>
Create new file to "sysctl.d" with desired panic conditions.
It will trigger a vmcore dump using kdump-tools on these situations.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
The default /etc/default/kdump-tools file provided by the kdump-tools
package doesn't set a value for KDUMP_CMDLINE_APPEND.
The default kdump command line arguments need to be set in order
to extend them to use additional arguments required for SONiC
platforms.
Signed-off-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
- Why I did it
Move frr logs from syslog from the directory /var/log/quagga/.log to /var/log/frr/log
- How I did it
Updated the rsyslog config files.
- How to verify it
Verified the logs come into the file zebra.log and bgpd.log in the DIR /var/log/frr/log
- Allow platform specific reboot script to be called after crash kernel has
finished copying the kernel vmcore
- Disable pcie advanced features when running crash kernel. This improves
reliability of the crash kernel to successfully create a vmcore and also
reboot
- Allow crash kernel to reboot if a panic is seen while it is generating a
vmcore
- Fix crash kernel to use the SONiC specific /usr/local/bin/reboot script
instead of the Linux reboot command /sbin/reboot
- Use sonic_platform as the kernel command line parameter to pass platform identifier string
Signed-off-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
Make sure ntp-config service is executed before ntpd
Updated ntp-config service files to force dependency with ntp service. Also resolved circular dependency with --no-block flag. (needed as ntp-config service internally invokes systemd to restart ntp which in turn waits for ntp-config to complete)
Signed-off-by: Prabhu Sreenivasan <prabhu.sreenivasan@broadcom.com>
**- Why I did it**
Align style with slightly modified PEP8 standards (extend maximum line length to 120 chars). This will also help in the transition to Python 3, where it is more strict about whitespace, plus it helps unify style among the SONiC codebase. Will tackle other directories in separate PRs.
**- How I did it**
Using `autopep8 --in-place --max-line-length 120` and some manual tweaks.
* [bash.bashrc] Add reverse SSH script to bash.bashrc
* Fix command issue and add emptt line before EOF
* Add checks for SSH_TARGET_CONSOLE_LINE
Signed-off-by: Jing Kan jika@microsoft.com
- Why I did it
Add reboot history to State db so that can be used telemetry service
- How I did it
Split the process-reboot-cause service to determine-reboot-cause and process-reboot-cause
determine-reboot-cause to determine the reboot cause
process-reboot-cause to parse the reboot cause files and put the reboot history to state db
Moved to sonic-host-service* packages
- How to verify it
Performed unit test and tested on DUT
Fix 259 alerts reported by the LGTM tool:
- 245 for Unused import
- 7 for Testing equality to None
- 5 for Duplicate key in dict literal
- 1 for Module is imported more than once
- 1 for Unused local variable
- Why I did it
There is a issue for counters after warm-reboot:
If I clear counters by command "sonic-clear counters", then execute 'warm-reboot' and whenSONiC is restart, the counters showed with command "show interface counters" is still old counters before "sonic-clear". It is not the right counters because the counters file in '/tmp' is lost in warm-reboot process.
- How I did it
I fixed it by saving '/tmp/portstat-0' folders in '/host/' before executing 'warm-reboot' (in pull request Azure/sonic-utilities#1099 ), and restore the counters folders back to '/tmp/' after warm-reboot process is finished.
- How to verify it
Clear counters by command 'sonic-clear'
sonic-clear counters
sonic-clear dropcounters
sonic-clear pfccounters
sonic-clear queuecounters
sonic-clear rifcounters
Execute 'warm-reboot'
Use command ‘show interface counters’ to see if the counters is right.