Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.
Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.
The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.
Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:
Program 'container_memory_telemetry'
status Status ok
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 0
last output -
data collected Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:
Program 'container_memory_telemetry'
status Status failed
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 0
last output -
data collected Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.
How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.
How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
- Why I did it
Fixes#9628
During bootup, this error log is seen
Dec 22 04:26:29 sonic interfaces-config.sh[2546]: error: main exception: cannot find interfaces: eth0 (interface was probably never up ?)
This is of non-functional nature and doesn't affect the flow.
- How I did it
Dont take the ifdown if not needed
- How to verify it
Verified during reboot. Log did not appear and IP was acquired on eth0 as expected
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Why I did it
To reduce the processing time of rc.local, refactoring s6100 platform initialization.
Porting changes from 202012 branch [202012] Refactoring DELL platform init to reduce rc.local processing time #10171
- Why I did it
To implement blocking feature state change.
- How I did it
Record the actual feature state in STATE DB from hostcfg.
- How to verify it
UT + verification by running on the switch and checking STATE DB.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
In parallel of this change Arista added a custom logrotate configuration as part of its driver library.
Having 2 logrotate configuration for the same log file triggers an issue.
Fixesaristanetworks/sonic#38
How I did it
Arista merged a few changes in sonic-buildimage which added a logrotate configuration aristanetworks/sonic@e43c797
It is therefore the right path to remove the arista.log line from the logrotate.d/rsyslog configuration.
How to verify it
Logrotate works without any error message, arista log rotation happens and arista daemons still append logs once file was truncated.
Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes#9042
How I did it
Added database-chassis to the always running docker list if platform is supervisor card.
How to verify it
Execute the CLI command "sudo monit status container_checker"
Signed-off-by: mlok <marty.lok@nokia.com>
* Update container_checker for multi-asic devices
Update container_checker for multi-asic devices to add database containers in always_running_containers.
Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db.
* Update container_checker
Update an indent
This issue causes negative threshold value and thus deleting log files even when there is enough space.
This issue causes negative threshold value and thus deleting log files even when there is enough space.
- Why I did it
To fix an issue when log files get deleted even if there is enough space.
- How I did it
Fixed an typo.
- How to verify it
Run the portion of the script that calculates threshold, see that the threshold is calculated correctly.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
Eliminate benign firsttime boot error reported when running on platforms that do not support kdump.
How I did it
Change rc.local to check for presence of the file /etc/default/kdump-tools before referencing it.
How to verify it
Install a new image on an armhf or arm64 platform and check for a failed reference to /etc/default/kdump-tools on firsttime boot.
[image]: Prevent radius passkey and snmp community string into syslog. (#9727)
#### Why I did it
Prevent radius passkey and snmp community string into syslog.
#### How I did it
Add radius and snmp config command to PASSWD_CMDS
#### How to verify it
Run and pass all UTs.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
#### Description for the changelog
Add radius and snmp config command to PASSWD_CMDS to prevent radius passkey and snmp community string into syslog.
#### A picture of a cute animal (not mandatory but encouraged)
Why I did it
The existing log file size in sonic is 1 Mb. Over a period of time this leads to huge number of log files which becomes difficult for monitoring applications to handle.
Instead of large number of small files, the size of the log file is not set to 16 Mb which reduces the number of files over a period of time.
How I did it
Changed the size parameter and related macros in logrotate config for rsyslog
How to verify it
Execute logrotate manually and verify the limit when the file gets rotated.
Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
1. Fix build for armhf and arm64
2. upgrade centec tsingma bsp support to 5.10 kernel
3. modify centec platform driver for linux 5.10
Co-authored-by: Shi Lei <shil@centecnetworks.com>
Python 2 is no longer available, so remove those packages, and remove
the pip2 commands. For picocom and systemd, just install from the
regular repo, since there's no backports yet.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Add code to interfaces-config.sh to configure eth1 in multi-asic
containers so that they can access midplane subnet.
Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
copp-config service needs to be started after sonic.target so that it could
render the copp-config with the latest information.
It also needs to be restarted when config reload or load_minigraph is invoked.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Added logrotate file for wtmp and btmp to override default conf and set size cap as 100K as done in
PR: #865. For buster this is control by separate file wtmp and btmp.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
*Removed execute permissions from the systemd copp-config.service file.
Without this we will get a warning: "Configuration file /lib/systemd/system/copp-config.service is marked executable. Please remove executable permission bits. Proceeding anyway."
Why I did it
fstrim has dependency on pmon docker.
How I did it
start fstrim timer after sonic.target.
How to verify it
local test and PR test.
Signed-off-by: Ying Xie ying.xie@microsoft.com
This commit adds support for changing the default console baud rate configured
within the U-Boot bootloader. That default baud rate is exposed via the value
of the U-Boot 'baudrate' environment variable. This commit removes logic that
hardcoded the console baud rate to 115200 and instead ensures that the U-Boot
'baudrate' variable is always used when constructing the Linux kernel boot
arguments used when booting Sonic.
A change is also made to rc.local to ensure that the specified baud rate is set
correctly in the serial getty service.
*To run VNET route consistency check periodically.
*For any failure, the monit will raise alert based on return code.
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
Enable fib_multipath_use_neigh for v4
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Why I did:
This is helpful if the neighbor are not directly connected then Kernel forward to unreachable neighbor option. With this option forwarding using neighbor state to be valid.
In version 3.0.0, If a broadcast address is specified in
/etc/network/interfaces, then when ifup is run, it will fail with an
error saying `'str' object has no attribute 'packed'`. This appears to
be because it expects all attributes for an interface to be "packable"
into a compact binary representation. However, it doesn't actually
convert the broadcast address into an IPNetwork object (other addresses
are handled).
Therefore, convert the broadcast address it reads in from a str to an
IPNetwork object.
Also explicitly specify the scope of the loopback address in
/etc/network/interfaces as host scope. Otherwise, it will get added as
global scope by default. As part of this, use JSON to parse ip's output
instead of text, for robustness.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
#### Why I did it
I made this change to support warm/fast reboot for SONiC extension packages as per HLD Azure/SONiC#682.
#### How I did it
I extended manifest.json.j2 with new warm/fast reboot related fields and also extended sonic_debian_extension.j2 script template to generate the shutdown order files for warm and fast reboot.
- Why I did it
Currently dhcp packets are disabled by the COPP manager for non ToRRouter type switches.
Even if the feature is enabled, DHCP packets wont hook to the CPU since the COPP manager will not trap this packets.
This change is to disable dhcp_relay by default for non ToRRouter switches from init_cfg.json.
With this approach, if the user want to enable the feature for non ToRRouter switches, manual enablement is required by the 'feature' configuration.
This is to keep the current approach for MSFT production issue with dhcp relay for non ToRRouter switched and allow the user to decide if to use it or not.
- How I did it
Configure dhcp_relay 'disabled' by default on init_cfg.json for non ToRRouter switches.
Remove the exclusion of dhcp packets on copp_cfg.json
- How to verify it
Enable dhcp_relay feature on a non ToRRouter switch.
Unit-tests modified so the default values on mocked CONFIG DB in 'test_vectors.py' for dhcp_relay will be 'disabled'.
This is by the change for 'init_cfg.json.j2'.
For ToRRouter the state will change from 'disabled' to 'enabled'.
Another test case added for a 'ToR' switch type, this is to test the state is 'enabled' if the user configured it to be so.
dash doesn't support += operation to append to a variable's value. Use KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} " instead
The below error message is seen when a reboot is issued.
[ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found
Why I did it
Support multiple pcie configuration file and change the pcie status table name
This is to match with below two PRs.
Azure/sonic-platform-common#195Azure/sonic-platform-daemons#189
How I did it
Check pcie configuration file with wild card and change the device status table name
How to verify it
Restart with changes and see if the pcie check works as expected.
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.
How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.
How to verify it
I verified this on the device str-7260cx3-acs-1.
Why I did it
In upgrade scenarios, where config_db.json is not carry forwarded to new image, it could be left w/o TACACS credentials.
Added a service to trigger 5 minutes after boot and restore TACACS, if /etc/sonic/old_config/tacacs.json is present.
How I did it
By adding a service, that would fire 5 mins after boot.
This service apply tacacs if available.
How to verify it
Upgrade and watch status of tacacs.timer & tacacs.service
You may create /etc/sonic/old_config/tacacs.json, with updated credentials
(before 5mins after boot) and see that appears in config & persisted too.
Which release branch to backport (provide reason below if selected)
201911
202006
202012
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.
How I did it
I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container.
How to verify it
I verified this implementation on device str-7260cx3-acs-1.
#### Why I did it
If a process limits using nodes by mempolicy/cpusets, and those nodes become memory exhaustion status, one process may be killed by oom-killer.
No panic occurs in this case, because other node's memory may be free.
This means system total status may be not fatal yet.
#### How I did it
Remove 'vm.panic_on_oom=1' kernel flag from 'vmcore-sysctl.conf '
Why I did it
Currently, there is a bug in the ntp.conf jinja2 template where it will ignore the src_intf directive in CONFIG_DB if there are multiple IP addresses associated with an interface. This code change fixes that bug and allows the template to select the correct source interface for NTP.
How I did it
I did this by modifying the macro in ntp.conf.j2 which determines if there is an ip address associated with an interface to set a state variable when it detects a valid interface entry in CONFIG_DB instead of outputting "true" directly (which could result in multiple "trues" outputted for interfaces with multiple valid IP addresses).
How to verify it
Add two ipv4 addresses to an interface in SONiC
Add the following configuration to config_db.json
{
"NTP": {
"global": {
"src_intf": "Ethernet1"
}
}
}
Replace Ethernet1 with the interface name of the one you assigned the IP addresses to.
Run sudo config reload -y
Open /etc/ntp.conf and verify that the following line exists
...
interface listen Ethernet1
...
The interface specified should be the one set in the previous steps.
Description for the changelog
[ntp] Fix ntp.conf template to allow setting of source port in CONFIG_DB
Why I did it
start pcie-check.service after config-setup.service since pcie_util depends on device_info which is available with config db metadata.
How I did it
Add config-setup.service as a dependency of pcie-check.service
How to verify it
Upon reboot, check if the pcie-check.sh throws the platform api error which is dependent on DEVICE_METADATA
Why I did it
Finding running containers through "docker ps" breaks when kubernetes deploys container, as the names are mangled.
How I did it
The data is is available from FEATURE table, which takes care of kubernetes deployment too.
How to verify it
Deploy a feature via kubernetes and don't expect error from container_check.
Why I did it
Support readonly version of the command vtysh
How I did it
Check if the command starting with "show", and verify only contains single command in script.
Fix#7364
99-default.link - was always in SONiC, but previous systemd (<247) had an issue and it did not work due to issue systemd/systemd#3374. Now systemd 247 works.
However, such policy overrides teamd provided mac address which causes teamd netdev to use a random mac
address. Therefore, needs to be disabled.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Encounter error during "config-setup boot" if the updategraph is enabled.
How I did it
Correct the code inside the config-setup script.
Remove the space between the assignment operator.
How to verify it
Remove the /etc/sonic/config_db.json and reboot the device.
Originally, it will return following error after boot up.
rv: command not found
After modification, it can correctly parse the status of updategraph without error.
This commit has following changes:
* Add templates and code to support VoQ chassis iBGP peers
* Add support to convert a new VoQChassisInternal element in the
BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR
table in CONFIG_DB.
* Add a new set of "voq_chassis" templates to docker-fpm-frr
* Add a new BGP peer manager to bgpcfgd to add neighbors from the
BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates.
* Add a test case for minigraph.py, making sure the VoQChassisInternal
element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its
value is "false".
* Add a set of test cases for the new voq_chassis templates in
sonic-bgpcfgd tests.
Note that the templates expect the new
"bgp bestpath peer-type multipath-relax" bgpd configuration to be
available.
Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>