sonic-buildimage

Author	SHA1	Message	Date
geogchen	90a849ea85	Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204) * [Interfaces] Modify template to support multiple management interfaces * Modify minigraph to process interfaces in sorted order Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net> * Add UT minigraph Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net> * Make case-insensitive comparison Signed-off-by: George Chen <gechen@microsoft.com> * Use natural sort Signed-off-by: George Chen <gechen@microsoft.com> Co-authored-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>	2022-06-21 10:16:10 -07:00
jingwenxie	fdc65d7600	Remove minigraph loading in updategraph script (#11146 ) Why I did it Minigraph will be deprecated in the future. So minigraph related reload should be deleted. How I did it Remove unused load_minigraph	2022-06-21 08:57:57 +08:00
Stepan Blyshchak	42576d2664	[auto-ts] add memory check (#10433) Why I did it To support automatic techsupport invocation in case memory usage is too high. How I did it Implemented according to https://github.com/Azure/SONiC/pull/939 How to verify it UT, manual test on the switch. DEPENDS on https://github.com/Azure/sonic-utilities/pull/2116	2022-06-20 09:39:05 -07:00
yozhao101	241f4454b4	[memory_checker] Do not check memory usage of containers which are not created (#11129) Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it This PR fixes an issue (#10088) by enhancing the memory_checker script. Specifically, if a container was not created successfully while the device booted or rebooted, memory_checker does not need to check its memory usage. How I did it A function is added to memory_checker to get the names of running containers. If the specified container name is not in the list of currently running containers, the script exits without checking its memory usage. How to verify it I tested on a lab device with the following steps: (1) Stop the telemetry container with the command sudo systemctl stop telemetry.service. (2) Remove the telemetry container with the command docker rm telemetry. (3) Check whether the memory_checker script run by Monit generates a syslog message saying it will exit without checking the memory usage of telemetry.	2022-06-17 12:13:18 -07:00
jingwenxie	cca3b5be5b	Reduce logic in updategraph (#11010 ) Why I did it The dhcp_graph_url used by internal service is always set as "N/A". So we can make the updategraph logic short. How I did it Shorten 'if statement' logic for /tmp/dhcp_graph_url	2022-06-14 22:18:47 +08:00
abdosi	0285bfe42e	[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. (#10500) What/Why I did: Issue 1: Setting up the ipvlan interface in interface-config.sh is not tolerant to failures, because interface-config.service is one-shot and has no restart capability. Scenario: if the database service goes into a failed state, the interface-config service also fails because of the dependency check; when the database service is later restarted, the interface-config service remains stuck and the ipvlan interface never gets created. Solution: Moved all the logic from the interface-config service into the database service, which is also a better logical fit since the namespace is created there and all the network settings (sysctl) are applied there. With this, the interface is recreated whenever database starts. Issue 2: Use of IPVLAN vs MACVLAN. Currently we are using ipvlan mode. However, the above failure scenario is not handled correctly by ipvlan mode. Once the ipvlan interface is created and an IP address is assigned to it, restarting the interface-config or database (new PR) service makes the Linux kernel return the error "Error: Address already assigned to an ipvlan device." based on this: https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978 The reason is that if the IP address assignment (which needs to be unique for IPVLAN) is not cleaned up, it remains in the kernel database and never returns to the free pool even though the namespace is deleted. Solution: Considering this hard dependency on a unique IP, macvlan mode is better for us, since everything is managed by the Linux kernel and there is no dependency on the user-configured IP address. Issue 3: The namespace database service does not check reachability to the Supervisor Redis Chassis Server. Currently there is no explicit check, as we never do a Redis PING from the namespace to the Supervisor Redis Chassis Server. Without this check it is possible that we start database and all other dockers even though there is no connectivity, and hit the error/failure late in the cycle. Solution: Added an explicit PING from the namespace that checks this reachability. Issue 4: flushdb raises an exception when trying to access the Chassis Server DB over a Unix socket. Solution: Handle it gracefully via try..except and log the message.	2022-05-24 16:54:12 -07:00
Arun Saravanan Balachandran	f4b22f67a4	[initramfs]: SSD firmware upgrade in initramfs (#10748 ) Why I did it To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration. How I did it New option 'ssd-upgrader-part’ is introduced in grub command line, to indicate the partition and its filesystem type in which the SSD firmware updater is present. ‘ssd-upgrader-part’ syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4 A new initramfs script ‘ssd-upgrade’ is included in init-premount and it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part' How to verify it In SONiC, the SSD firmware updater is copied to “/host/” directory. Fast-reboot is to be initiated with the ‘-u’ option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150) After reboot, while booting into SONiC the SSD firmware updater will be executed in initramfs.	2022-05-12 08:11:02 -07:00
Junchao-Mellanox	681c24878b	Fix race condition between networking service and interface-config service (#10573) Why I did it This PR fixes a bug where the mgmt port eth0 may lose its IP even if the user configured a static IP for eth0. The issue is not always reproducible; the reproducing flow is: (1) Systemd starts the networking service, which runs a DHCP-based configuration and is assigned an IP from DHCP. (2) Systemd starts the interface-config service, which depends on the networking service. (3) The interface-config service runs the command "ifdown --force eth0", but the networking service is still running, so this command fails with the error "error: Another instance of this program is already running." This error is printed by the ifupdown2 lib, which is the main process of the networking service. So ifdown does not actually work here, and the IP of eth0 is not brought down. (4) The interface-config service updates /etc/network/interfaces to the static configuration. (5) The interface-config service runs the command "systemctl restart networking". This command kills the previous networking-related processes (log: networking.service: Main process exited, code=killed, status=15/TERM) and tries to reconfigure the IP address with the static configuration. But it detects that the configured IP and the existing IP are the same, and it does not actually configure the IP in the kernel. Hence, the IP is still the one obtained from DHCP. (This could be a bug in ifupdown2: the previous IP is from DHCP and the new IP is static, yet it treats them as the same instead of reconfiguring the IP.) (6) When the lease of the IP expires, the IP of eth0 is removed by the kernel and the issue reproduces. The issue is not always reproducible because the networking service usually runs fast enough that it does not hit step 3. How I did it Check the networking service state before running "ifdown --force eth0", and wait for it to finish if it is activating. How to verify it Manual test.	2022-05-05 15:21:44 -07:00
yozhao101	e24fe9bc60	[Monit] Fix the issue which shows Monit can not reset its counter. (#10288) Signed-off-by: Yong Zhao <yozhao@microsoft.com> Why I did it This PR fixes the issue where Monit cannot reset its counter when monitoring the memory usage of the telemetry container. Specifically, the Monit configuration for monitoring memory usage of the telemetry container is as follows: check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" If the memory usage of the telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), it will be restarted. Recently we observed that, after the telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted again during this 1-hour sliding window. The reason is that Monit resets its counter if and only if the status of the monitored service changes from Status failed to Status ok, and during this 1-hour sliding window the status did not change from Status failed to Status ok. Currently, for each service monitored by Monit, there is an entry showing the monitoring status, monitoring mode, etc. For example, the following output from the command sudo monit status shows the status of the service that monitors the memory usage of telemetry: Program 'container_memory_telemetry' status Status ok monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Sat, 19 Mar 2022 19:56:26 Every 1 minute, Monit runs the script to check the memory usage of telemetry and updates the counter if the memory usage is larger than 400MB. If Monit checks the counter and finds that the memory usage of telemetry was larger than 400MB for 10 times within 20 minutes, the telemetry container is restarted. The following is an example status of the monitored service: Program 'container_memory_telemetry' status Status failed monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Tue, 01 Feb 2022 22:52:55 After the telemetry container was restarted, we found that its memory usage increased rapidly from around 100MB to more than 400MB within 1 minute, and the status of the monitored service did not have a chance to change from Status failed to Status ok. How I did it As a workaround for this issue, Monit recently introduced another syntax format, repeat every <n> cycles, related to exec. This new syntax enables Monit to repeat executing the background script if the error persists for a given number of cycles. How to verify it I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in the sonic-mgmt repo for review.	2022-04-20 18:08:06 -07:00
Vivek R	ed14eb5263	[interfaces-config] "main exception: cannot find interfaces: eth0" error log avoided (#10463) - Why I did it Fixes #9628 During bootup, this error log is seen: Dec 22 04:26:29 sonic interfaces-config.sh[2546]: error: main exception: cannot find interfaces: eth0 (interface was probably never up ?) This is non-functional in nature and doesn't affect the flow. - How I did it Don't run ifdown if it is not needed. - How to verify it Verified during reboot. The log did not appear and an IP was acquired on eth0 as expected. Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>	2022-04-06 16:59:47 +03:00
Santhosh Kumar T	e2502edefd	Refactoring DELL platform init to reduce rc.local processing time porting changes in master (#10318 ) Why I did it To reduce the processing time of rc.local, refactoring s6100 platform initialization. Porting changes from 202012 branch [202012] Refactoring DELL platform init to reduce rc.local processing time #10171	2022-03-24 11:14:37 -07:00
xumia	1017ee6002	[Build]: Use one debian mirror config (#10274 ) Why I did it Use one debian mirror config. The empty config in https://github.com/Azure/sonic-buildimage/blob/master/files/image_config/apt/sources.list overrides the file https://github.com/Azure/sonic-buildimage/blob/master/files/apt/sources.list.amd64 (armhf/arm64), it does not make sense. All the content in files/image_config/apt is no use, any one wants to add mirror config, please add in files/apt. How I did it Remove files/image_config/apt and the reference.	2022-03-21 16:47:20 +08:00
Stepan Blyshchak	2919b4820f	[hostcfgd] record feature state in STATE DB (#9842 ) - Why I did it To implement blocking feature state change. - How I did it Record the actual feature state in STATE DB from hostcfg. - How to verify it UT + verification by running on the switch and checking STATE DB. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>	2022-03-14 13:45:27 +02:00
Samuel Angebault	8d419ca2c5	[Arista] Remove arista.log from rsyslog default logrotate (#9731 ) Why I did it In parallel of this change Arista added a custom logrotate configuration as part of its driver library. Having 2 logrotate configuration for the same log file triggers an issue. Fixes aristanetworks/sonic#38 How I did it Arista merged a few changes in sonic-buildimage which added a logrotate configuration aristanetworks/sonic@e43c797 It is therefore the right path to remove the arista.log line from the logrotate.d/rsyslog configuration. How to verify it Logrotate works without any error message, arista log rotation happens and arista daemons still append logs once file was truncated.	2022-03-11 08:09:07 -08:00
Marty Y. Lok	c40f04f0e2	[chassis][supervisor]monit container-checker failed due to unexpected "database-chassis" docker running #9042 (#9043 ) Why I did it Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042 How I did it Added database-chassis to the always running docker list if platform is supervisor card. How to verify it Execute the CLI command "sudo monit status container_checker" Signed-off-by: mlok <marty.lok@nokia.com>	2022-03-03 17:56:08 -08:00
wenyiz2021	2d0b063191	Update container_checker for multi-asic devices when state is 'always_enabled' (#10067 ) * Update container_checker for multi-asic devices Update container_checker for multi-asic devices to add database containers in always_running_containers. Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db. * Update container_checker Update an indent	2022-02-23 18:06:30 -08:00
Stepan Blyshchak	fb752a4ae5	[rsyslog.j2] fix typo in VAR_LOG_SIZE_KB (#9954) This issue causes a negative threshold value and thus deletes log files even when there is enough space. - Why I did it To fix an issue where log files get deleted even if there is enough space. - How I did it Fixed a typo. - How to verify it Run the portion of the script that calculates the threshold and see that the threshold is calculated correctly. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>	2022-02-17 10:16:44 +02:00
Prince George	ff14aebef9	Close console session due to user inactivity (#9890 ) Signed-off-by: Prince George <prgeor@microsoft.com>	2022-02-02 09:41:21 +05:30
dflynn-Nokia	b6939b9927	[firsttime boot] suppress error message on platforms not supporting kdump (#9521 ) Why I did it Eliminate benign firsttime boot error reported when running on platforms that do not support kdump. How I did it Change rc.local to check for presence of the file /etc/default/kdump-tools before referencing it. How to verify it Install a new image on an armhf or arm64 platform and check for a failed reference to /etc/default/kdump-tools on firsttime boot.	2022-01-20 18:27:10 -08:00
liuh-80	f166b991a7	[image]: Prevent radius passkey and snmp community string into syslog. (#9727) Why I did it Prevent the radius passkey and snmp community string from being logged to syslog. How I did it Add radius and snmp config commands to PASSWD_CMDS. How to verify it Run and pass all UTs. Description for the changelog Add radius and snmp config commands to PASSWD_CMDS to prevent the radius passkey and snmp community string from being logged to syslog.	2022-01-17 16:26:22 +08:00
Sudharsan Dhamal Gopalarathnam	bd0a19aa17	[rsyslog]Setting log file size to 16Mb (#9504) Why I did it The existing log file size in SONiC is 1 MB. Over a period of time this leads to a huge number of log files, which becomes difficult for monitoring applications to handle. Instead of a large number of small files, the size of the log file is now set to 16 MB, which reduces the number of files over a period of time. How I did it Changed the size parameter and related macros in the logrotate config for rsyslog. How to verify it Execute logrotate manually and verify the limit when the file gets rotated. Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>	2022-01-14 10:24:07 -08:00
noaOrMlnx	0908f9ec49	[CoPP] Add always_enabled field (#9302 ) *Add the "always_enabled" field to copp_cfg.j2 file, in order to allow traps without an entry in features table, to be installed automatically.	2021-11-30 11:04:15 -08:00
Brian O'Connor	002827f08e	[PINS] Add APPL_STATE_DB and response path log (#9082 ) - Add APPL_STATE_DB to database_config.json - Clear APPL_STATE_DB during SwSS container restarts - Add response path log file to logrotate config: responsepublisher.rec Co-authored-by: PINS Working Group <sonic-pins-subgroup@googlegroups.com>	2021-11-24 10:31:06 -08:00
Renuka Manavalan	a685fe1765	add arista.log to logrotate (#9245 )	2021-11-15 07:29:30 -08:00
Guohan Lu	5f11eb320e	Revert "sysready (#8889 )" This reverts commit `d7e5372e54`.	2021-11-10 15:36:20 -08:00
LuiSzee	5b284767f6	Update Centec platform support for Bullseye and 5.10 kernel (#7 ) 1. Fix build for armhf and arm64 2. upgrade centec tsingma bsp support to 5.10 kernel 3. modify centec platform driver for linux 5.10 Co-authored-by: Shi Lei <shil@centecnetworks.com>	2021-11-10 15:27:22 -08:00
Saikrishna Arcot	b8a7a6355b	Update the base Debian system installation script to get Bullseye Python 2 is no longer available, so remove those packages, and remove the pip2 commands. For picocom and systemd, just install from the regular repo, since there's no backports yet. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2021-11-10 15:27:22 -08:00
Senthil Kumar Guruswamy	d7e5372e54	sysready (#8889 )	2021-11-10 14:52:52 -08:00
abdosi	ea91a72b79	[multi-asic] fix syslog not getting generated. (#9160 ) Fixes #9159	2021-11-03 18:29:09 -07:00
Maxime Lorrillere	81f4fca3dc	Allow database instances on multi-asic linecards to connect to chassis DB (#8583 ) Add code to interfaces-config.sh to configure eth1 in multi-asic containers so that they can access midplane subnet. Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>	2021-10-26 18:27:09 -07:00
Ying Xie	638c287837	[copp] bind copp-config.service to sonic.target (#8969 ) copp-config service needs to be started after sonic.target so that it could render the copp-config with the latest information. It also needs to be restarted when config reload or load_minigraph is invoked. Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2021-10-13 21:07:44 -07:00
abdosi	13ec43bc68	[baseimage]: Logrotate for wtmp and btmp files. (#8743 ) Added logrotate file for wtmp and btmp to override default conf and set size cap as 100K as done in PR: #865. For buster this is control by separate file wtmp and btmp. Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>	2021-09-15 23:28:27 -07:00
Sudharsan Dhamal Gopalarathnam	db529af203	Removing execute permission from copp config file (#8680 ) *Removed execute permissions from the systemd copp-config.service file. Without this we will get a warning: "Configuration file /lib/systemd/system/copp-config.service is marked executable. Please remove executable permission bits. Proceeding anyway."	2021-09-13 09:10:21 -07:00
Ying Xie	41643a9729	[202012][fstrim] delay fstrim timer after sonic.target (#8737 ) Why I did it fstrim has dependency on pmon docker. How I did it start fstrim timer after sonic.target. How to verify it local test and PR test. Signed-off-by: Ying Xie ying.xie@microsoft.com	2021-09-13 07:37:46 -07:00
dflynn-Nokia	7bae388e2f	[Nokia ixs7215] Add support for changing the console baud rate (#8595 ) This commit adds support for changing the default console baud rate configured within the U-Boot bootloader. That default baud rate is exposed via the value of the U-Boot 'baudrate' environment variable. This commit removes logic that hardcoded the console baud rate to 115200 and instead ensures that the U-Boot 'baudrate' variable is always used when constructing the Linux kernel boot arguments used when booting Sonic. A change is also made to rc.local to ensure that the specified baud rate is set correctly in the serial getty service.	2021-08-26 07:14:34 -07:00
byu343	cdfb4855dc	[macsec] Add eapol to copp config (#8416 ) This change enables the control packets of MACsec to be processed by CPU.	2021-08-23 18:56:23 -07:00
Volodymyr Samotiy	e3a30deea9	[monit] Periodically monitor VNET route consistency (#8266 ) To run VNET route consistency check periodically. For any failure, the monit will raise alert based on return code. Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>	2021-08-19 16:29:25 -07:00
abdosi	2348794ef0	Enable sysctl fib_multipath_use_neigh (#8502) Enable fib_multipath_use_neigh for v4 https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt Why I did: Without this option, the kernel may forward multipath traffic to a nexthop whose neighbor is unreachable. With this option, the kernel uses the neighbor state when selecting a nexthop, so that forwarding goes to a neighbor in a valid state.	2021-08-18 15:53:17 -07:00
Saikrishna Arcot	c8b5daed27	Upgrade to ifupdown2 3.0.0 with a patch to fix using broadcast addresses In version 3.0.0, If a broadcast address is specified in /etc/network/interfaces, then when ifup is run, it will fail with an error saying `'str' object has no attribute 'packed'`. This appears to be because it expects all attributes for an interface to be "packable" into a compact binary representation. However, it doesn't actually convert the broadcast address into an IPNetwork object (other addresses are handled). Therefore, convert the broadcast address it reads in from a str to an IPNetwork object. Also explicitly specify the scope of the loopback address in /etc/network/interfaces as host scope. Otherwise, it will get added as global scope by default. As part of this, use JSON to parse ip's output instead of text, for robustness. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2021-08-12 23:18:01 -07:00
Blueve	3da6f12b0b	[port_config] Introduce ad-hoc mport_config.json file (#8066 ) Signed-off-by: Jing Kan jika@microsoft.com	2021-07-15 08:56:35 +08:00
Stepan Blyshchak	3a2b8c6ba5	[SONiC Application Extension] support warm/fast reboot for extension packages (#7286 ) #### Why I did it I made this change to support warm/fast reboot for SONiC extension packages as per HLD Azure/SONiC#682. #### How I did it I extended manifest.json.j2 with new warm/fast reboot related fields and also extended sonic_debian_extension.j2 script template to generate the shutdown order files for warm and fast reboot.	2021-07-11 06:58:05 -07:00
shlomibitton	776a446d76	[dhcp_relay] Disable dhcp_relay for ToRRouter switches type by the feature manager (#7789) - Why I did it Currently DHCP packets are disabled by the COPP manager for non ToRRouter type switches. Even if the feature is enabled, DHCP packets won't reach the CPU, since the COPP manager will not trap these packets. This change disables dhcp_relay by default for non ToRRouter switches in init_cfg.json. With this approach, if the user wants to enable the feature on non ToRRouter switches, it must be enabled manually through the 'feature' configuration. This keeps the current approach for the MSFT production issue with dhcp relay on non ToRRouter switches and lets the user decide whether to use it. - How I did it Configure dhcp_relay as 'disabled' by default in init_cfg.json for non ToRRouter switches. Remove the exclusion of DHCP packets in copp_cfg.json. - How to verify it Enable the dhcp_relay feature on a non ToRRouter switch. Unit tests were modified so that the default value in the mocked CONFIG DB in 'test_vectors.py' for dhcp_relay is 'disabled', following the change to 'init_cfg.json.j2'. For ToRRouter the state will change from 'disabled' to 'enabled'. Another test case was added for the 'ToR' switch type, to test that the state is 'enabled' if the user configured it to be so.	2021-07-08 09:10:46 +03:00
rajendra-dendukuri	f4b0c8fe4e	[kdump] Fix kdump error message when a reboot is issued (#7985 ) dash doesn't support += operation to append to a variable's value. Use KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} " instead The below error message is seen when a reboot is issued. [ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found	2021-07-01 11:52:38 -07:00
xumia	5c503b81ae	Fix vtysh shell-ingestion security issue (#7759 ) Fix vtysh shell-ingestion security issue Only expose the limited parameters of the command vtysh show.	2021-06-28 09:57:08 +08:00
Sujin Kang	ecc5073731	Support multiple pcie configuration file and change the pcie status table name to match with pcied changes (#7886 ) Why I did it Support multiple pcie configuration file and change the pcie status table name This is to match with below two PRs. Azure/sonic-platform-common#195 Azure/sonic-platform-daemons#189 How I did it Check pcie configuration file with wild card and change the device status table name How to verify it Restart with changes and see if the pcie check works as expected.	2021-06-16 16:05:48 -07:00
yozhao101	1a3cab43ac	[Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676 ) Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit. How I did it I removed the script process_checker and corresponding Monit configuration entries of critical processes. How to verify it I verified this on the device str-7260cx3-acs-1.	2021-06-04 10:16:53 -07:00
Renuka Manavalan	73447efc31	Add service to restore TACACS from old config (#7560) Why I did it In upgrade scenarios where config_db.json is not carried forward to the new image, the device could be left without TACACS credentials. Added a service that triggers 5 minutes after boot and restores TACACS if /etc/sonic/old_config/tacacs.json is present. How I did it By adding a service that fires 5 minutes after boot. This service applies the TACACS config if available. How to verify it Upgrade and watch the status of tacacs.timer & tacacs.service. You may create /etc/sonic/old_config/tacacs.json with updated credentials (within 5 minutes after boot) and see that they appear in the config and are persisted too. Which release branch to backport (provide reason below if selected) 201911 202006 202012	2021-06-03 20:07:17 -07:00
Andriy Kokhan	6931a45ecf	Fixed typos in config-setup (#7754 ) Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com>	2021-06-03 08:59:38 -07:00
yozhao101	37863ac854	[Monit] Restart telemetry container if memory usage is beyond the threshold (#7645 ) Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold. How I did it I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container. How to verify it I verified this implementation on device str-7260cx3-acs-1.	2021-05-28 11:13:44 -07:00
Renuka Manavalan	2cd61bc136	Invoke disk check periodically. (#7374 ) Why I did it Helps with periodic scan of disk for RO state. If found, this script makes transient fix and raise error message.	2021-05-26 17:59:08 -07:00
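
Commit 90a849ea85 above mentions processing management interfaces in sorted order with a case-insensitive, natural sort. The following is a minimal sketch of what such an ordering could look like; it is not the code from that commit, and the helper name `natural_sort_key` and the sample interface names are hypothetical.

```python
import re


def natural_sort_key(name):
    """Build a sort key that orders embedded numbers numerically.

    Hypothetical helper for illustration only; not taken from the commit.
    """
    # re.split with a capturing group alternates text and digit chunks,
    # always starting with a (possibly empty) text chunk, so keys from
    # different names compare str with str and int with int.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


# Sample names are assumptions, chosen to show numeric and case handling.
mgmt_interfaces = ["eth10", "Eth2", "eth0", "eth1"]
print(sorted(mgmt_interfaces, key=natural_sort_key))
# → ['eth0', 'eth1', 'Eth2', 'eth10']
```

A plain `sorted()` would instead place `Eth2` first (uppercase sorts before lowercase) and `eth10` before `eth2`, which is presumably the kind of ordering the commit set out to avoid.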


412 Commits