sonic-buildimage

Archived

Author	SHA1	Message	Date
Jing Zhang	5d03b5d0df	Avoid write_standby in warm restart context (#11283 ) Avoid write_standby in warm restart context. sign-off: Jing Zhang zhangjing@microsoft.com Why I did it In warm restart context, we should avoid mux state change. How I did it Check warm restart flag before applying changes to app db. How to verify it Ran write_standby in table missing, key missing, field missing scenarios. Did a warm restart, app db changes were skipped. Saw this in syslog: WARNING write_standby: Taking no action due to ongoing warmrestart.	2022-06-29 21:34:02 -07:00
andywongarista	6e0559d5fa	[Arista] Add initial support for 720DT-48S (#10656 ) Added initial set of config files to allow for booting and partial traffic testing in SONiC on the 720DT-48S. How to verify it - Switch boots - show interfaces status shows links up on interfaces Ethernet24-51 - Traffic flows with no errors on interfaces Ethernet24-51	2022-06-29 09:56:24 -07:00
davidpil2002	8b7d069880	Add support for Password Hardening (#10323 ) - Why I did it New security feature that enforces strong passwords when users log in to the switch or change the password of an existing user. - How I did it Mainly by using the Linux package pam-cracklib, which supports enforcing user password rules; the hostcfgd daemon supports adding and modifying the password policies that enforce and strengthen user passwords. - How to verify it Manual verification: 1. Enable the feature using the new sonic-cli command passw-hardening, or manually add the password hardening table as shown in the HLD using the redis-cli command. 2. Change password policies manually as in step 1. Note: the password hardening CLI can be found in the sonic-utilities repo, PR: Add support for Password Hardening sonic-utilities#2121; config code path: config/plugins/sonic-passwh_yang.py; show code path: show/plugins/sonic-passwh_yang.py. 3. Create a new user (using the adduser command) or modify an existing password using the passwd command in the terminal. It will now request a strong password instead of applying the default Linux policies. Automatic verification (unit tests): this PR contains unit tests that cover: 1. default init values of the feature in PAM files 2. all the types of class policies supported by the feature in PAM files 3. aging policy configuration in PAM files	2022-06-29 15:34:56 +03:00
bingwang-ms	ac86f71287	Add extra lossy PG profile for ports between T1 and T2 (#11157 ) Signed-off-by: bingwang <wang.bing@microsoft.com> Why I did it This PR brings two changes. 1. Add lossy PG profile for PG2 and PG6 on T1 for ports between T1 and T2. After PR Update qos config to clear queues for bounced back traffic #10176, the DSCP_TO_TC_MAP and TC_TO_PG_MAP are updated when remapping is enabled. DSCP_TO_TC_MAP (before -> after; why): "2" : "1" -> "2" : "2" (only changed for leaf router, to map DSCP 2 to TC 2, as TC 2 will be used for lossless TC); "6" : "1" -> "6" : "6" (only changed for leaf router, to map DSCP 6 to TC 6, as TC 6 will be used for lossless TC). TC_TO_PRIORITY_GROUP_MAP (before -> after; why): "2" : "0" -> "2" : "2" (only changed for leaf router, to map TC 2 to PG 2, as PG 2 will be used for lossless PG); "6" : "0" -> "6" : "6" (only changed for leaf router, to map TC 6 to PG 6, as PG 6 will be used for lossless PG). So we have two new lossy PGs (2 and 6) for the T2 facing ports on T1, and two new lossless PGs (2 and 6) for the T0 facing ports on T1. However, there is no lossy PG profile for the T2 facing ports on T1. The lossless PGs for ports between T1 and T0 are already handled by buffermgrd. Therefore, we need to add lossy PG profiles for the T2 facing ports on T1. We don't have this issue on T0 because PG 2 and PG 6 are lossless PGs there, and no lossy traffic is mapped to PG 2 and PG 6. 2. Map port level TC7 to PG0. Before the PCBB change, DSCP48 -> TC 6 -> PG 0. After the PCBB change, DSCP48 -> TC 7 -> PG 7. We can instead map TC7 to PG0 to save a lossy PG. How I did it Update the qos and buffer templates. How to verify it Verified by UT.	2022-06-28 12:50:33 -07:00
judyjoseph	c9f36957db	Update include_macsec flag if type is SpineRouter (#11141 ) Add the support to enable macsec when type is SpineRouter	2022-06-24 10:32:02 -07:00
geogchen	6a9c058a92	Revert "Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204 )" (#11241 ) This reverts commit `90a849ea85`. #### Why I did it The interfaces unit test did not cover some of the conditions in interfaces.j2 that were changed in #11204. Therefore, revert the change and add the tests before changing interfaces.j2 again. #### How I did it Git revert.	2022-06-24 06:30:33 -07:00
Sudharsan Dhamal Gopalarathnam	9452095e25	[lldp]Fix lldp spawned after reboot when disabled (#11080 ) - Why I did it When LLDP is disabled through the feature command, it still gets spawned after reboot. - How I did it In syncd.sh, check whether the service is enabled before spawning it automatically during cold reboot. - How to verify it Disable the lldp feature. Perform a cold reboot and verify that it is not spawned.	2022-06-22 03:11:41 +03:00
geogchen	90a849ea85	Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204 ) * [Interfaces] Modify template to support multiple management interfaces * Modify minigraph to process interfaces in sorted order Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net> * Add UT minigraph Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net> * Make case-insensitive comparison Signed-off-by: George Chen <gechen@microsoft.com> * Use natural sort Signed-off-by: George Chen <gechen@microsoft.com> Co-authored-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>	2022-06-21 10:16:10 -07:00
xumia	fdef1f0342	[Build]: Support to use symbol links for lazy installation targets to reduce the image size (#10923 ) Why I did it Support using symbolic links in the platform folder to reduce the image size. The current solution copies each lazy installation target (xxx.deb files) to each of the folders in the platform folder. The size keeps growing as more packages are added to the platform folder. For cisco-8000, for example, the size reaches up to 2G, and most of that is duplicate packages in the platform folder. How I did it Create a new folder, platform/common, and copy all the deb packages into it; any other folder that uses the packages holds symbolic links to the common folder. Why platform.tar? We implemented a patch for this, see #10775, but ONIE uses a really old unzip version that does not support symbolic links. The current solution is similar to PR 10775, but packs the platform folder into a tar package, which ONIE can handle. During installation, the package.tar is extracted to the original folder and then removed.	2022-06-21 13:03:55 +08:00
jingwenxie	fdc65d7600	Remove minigraph loading in updategraph script (#11146 ) Why I did it Minigraph will be deprecated in the future. So minigraph related reload should be deleted. How I did it Remove unused load_minigraph	2022-06-21 08:57:57 +08:00
Stepan Blyshchak	42576d2664	[auto-ts] add memory check (#10433 ) #### Why I did it To support automatic techsupport invocation when memory usage is too high. #### How I did it Implemented according to https://github.com/Azure/SONiC/pull/939 #### How to verify it UT, manual test on the switch. DEPENDS on https://github.com/Azure/sonic-utilities/pull/2116	2022-06-20 09:39:05 -07:00
yozhao101	241f4454b4	[memory_checker] Do not check memory usage of containers which are not created (#11129 ) Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it This PR fixes an issue (#10088) by enhancing the memory_checker script. Specifically, if a container is not created successfully while the device boots or reboots, memory_checker does not need to check its memory usage. How I did it A function is added to memory_checker to get the names of running containers. If the specified container name is not in the list of currently running containers, the script exits without checking its memory usage. How to verify it I tested on a lab device with the following steps: 1. Stop the telemetry container with the command sudo systemctl stop telemetry.service. 2. Remove the telemetry container with the command docker rm telemetry. 3. Check whether the memory_checker script run by Monit generates the syslog message saying it will exit without checking the memory usage of telemetry.	2022-06-17 12:13:18 -07:00
bingwang-ms	83f23e26ff	Generate switch level dscp_to_tc_map entry from qos_config template (#11087 ) * Generate switch level dscp_to_tc_map Signed-off-by: bingwang <wang.bing@microsoft.com>	2022-06-17 08:44:30 +08:00
byu343	89020f53e4	[Arista] Add support support for 7060dx5_64s and 7060px5_64s (#10888 ) Why I did it This change adds the support for Arista 7060dx5_64s and 7060px5_64s How I did it How to verify it We verified the platform driver is working and the ports are up on 7060dx5_64s and 7060px5_64s.	2022-06-16 09:51:42 -07:00
Samuel Angebault	30bfed92fd	[Arista] Add configuration files for 7050X4-32S platform (#10799 ) Add most configuration files for the DCS-7050PX4-32S and DCS-7050DX4-32S. This review only contains platform configuration files, dataplane ones will follow in future change. Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>	2022-06-16 09:42:10 -07:00
shlomibitton	1474ad76d8	[Mellanox] [pmon] Fix for PMON service not starting when restarting SWSS service after fast/warm reboot (#10901 ) - Why I did it A recent change that delays the PMON service after fast/warm reboot introduced an issue when only the SWSS service is restarted after a fast/warm reboot on Nvidia platforms. The timer is triggered only when the system boots, so if the system has gone through a fast/warm reboot and the user then restarts the SWSS service, the syncd.sh script stops the PMON service but the timer does not start again. - How I did it In the syncd.sh script, when there is a fast/warm indication, check whether pmon.timer is running. If it is running, we are at the first boot and continue normally. If it is not running, the service was restarted, so start the timer to keep the system behavior consistent. - How to verify it Run fast/warm reboot. Run service swss restart. Observe the PMON service starting. Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>	2022-06-16 12:15:09 +03:00
jingwenxie	cca3b5be5b	Reduce logic in updategraph (#11010 ) Why I did it The dhcp_graph_url used by internal service is always set as "N/A". So we can make the updategraph logic short. How I did it Shorten 'if statement' logic for /tmp/dhcp_graph_url	2022-06-14 22:18:47 +08:00
judyjoseph	0b1ae9c43c	Cleanup macsec stateDB tables on restart (#11066 ) Clean macsec tables in STATE_DB on start	2022-06-09 15:32:24 -07:00
bingwang-ms	1cc602c6af	Add two extra lossless queues for bounced back traffic (#10496 ) Signed-off-by: bingwang <bingwang@microsoft.com> Why I did it This PR adds two extra lossless queues for bounced back traffic. HLD: sonic-net/SONiC#950. SKUs include Arista-7050CX3-32S-C32, Arista-7050CX3-32S-D48C8, Arista-7260CX3-D108C8, Arista-7260CX3-C64, Arista-7260CX3-Q64. How I did it Update the buffers.json.j2 template and the buffers_config.j2 template to generate the new BUFFER_QUEUE table. For T1 devices, queue 2 and queue 6 are set as lossless queues on T0 facing ports. For T0 devices, queue 2 and queue 6 are set as lossless queues on T1 facing ports. Queue 7 is added as a new lossy queue, as DSCP 48 is mapped to TC 7 and then mapped into Queue 7. How to verify it Verified by UT. Verified by copying the new template and generating the buffer config with sonic-cfggen.	2022-06-02 13:03:27 -07:00
bingwang-ms	0c9bbee735	Update qos template to support SYSTEM_DEFAULT table (#10936 ) * Update qos template to support SYSTEM_DEFAULT table Signed-off-by: bingwang <wang.bing@microsoft.com>	2022-06-02 21:48:57 +08:00
xumia	0552d6b172	Support symcrypt fips config for aboot/uboot (#10729 ) Why I did it Support symcrypt fips config for aboot/uboot	2022-06-02 15:35:17 +08:00
Hua Liu	96954f0134	[swsscommon] Add c++ version sonic-db-cli from sonic-swss-common (#10825 ) #### Why I did it Fix the sonic-db-cli high CPU usage on SONiC startup issue: https://github.com/Azure/sonic-buildimage/issues/10218 ETA of this issue will be 2022/05/31 #### How I did it Re-write sonic-db-cli in C++ in sonic-swss-common: https://github.com/Azure/sonic-swss-common/pull/607 Modify swss-common rules and slave.mk to install the C++ version of sonic-db-cli. #### How to verify it Pass all E2E test scenarios. #### Description for the changelog Build and install c++ version sonic-db-cli from swss-common.	2022-06-01 08:05:53 +08:00
Lukas Stockner	c9b27cde71	[swss] Clear VXLAN tunnel table from State DB on startup (#10822 ) * When reloading config after crashes, VTEP interfaces are sometimes not created since the tunnel still exists in the STATE_DB. * Adding VXLAN_TUNNEL_TABLE to the list of tables to be cleaned in swss.sh fixes the problem.	2022-05-31 08:54:31 -07:00
davidpil2002	ab0930313b	[YANG] Add support for Password Hardening (#10322 ) - Why I did it YANG model for the password hardening feature; the SONiC CLI of this feature was autogenerated from this YANG model. - How I did it Created a new YANG model in src/sonic-yang-models/yang-models/sonic-passwh.yang. - How to verify it There are unit tests (yang tests) in this PR covering all the password policies with good and bad value cases. It is also possible to verify manually using the config/show password commands that were autogenerated from this YANG model (this CLI code was added in sonic-utilities).	2022-05-29 13:54:51 +03:00
xumia	f0dfd398a6	Revert "Reduce image size for lazy installation packages (#10775 )" (#10916 ) This reverts commit `15cf9b0d70`. Why I did it Revert PR #10775 because it affects ONIE installation: symbolic links are not supported by the unzip version in some ONIE environments. We will re-enable it after fixing the issue, see #10914	2022-05-26 09:39:48 +08:00
abdosi	0285bfe42e	[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. (#10500 ) What/Why I did: Issue 1: Because the ipvlan interface is set up in interface-config.sh, we are not tolerant to failures, since interface-config.service is one-shot and has no restart capability. Scenario: if the database service goes into a failed state, the interface-config service also fails because of the dependency check; when the database service is later restarted, the interface service remains stuck and the ipvlan interface never gets created. Solution: Moved all the logic from the interface-config service into the database service, which is also a better logical fit since the namespace is created there and all the network settings (sysctl) are applied there. With this, the interface is recreated whenever database starts. Issue 2: Use of IPVLAN vs MACVLAN. Currently we use ipvlan mode. However, the failure scenario above is not handled correctly by ipvlan mode. Once the ipvlan interface is created and an IP address is assigned to it, restarting the interface-config or database (new PR) service makes the Linux kernel return "Error: Address already assigned to an ipvlan device.", based on this: https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978 The reason is that if the IP address assignment (which must be unique for IPVLAN) is not cleaned up, it remains in the kernel database and never returns to the free pool, even though the namespace is deleted. Solution: Given this hard dependency on a unique IP, macvlan mode is better for us, since everything is managed by the Linux kernel and there is no dependency on the user-configured IP address. Issue 3: The namespace database service does not check reachability to the Supervisor Redis Chassis Server. Currently there is no explicit check, as we never do a Redis PING from the namespace to the Supervisor Redis Chassis Server. Without this check, it is possible to start database and all the other dockers even though there is no connectivity, and to hit the error/failure late in the cycle. Solution: Added an explicit PING from the namespace that checks this reachability. Issue 4: flushdb raises an exception when trying to access the Chassis Server DB over a Unix socket. Solution: Handle it gracefully via try..except and log the message.	2022-05-24 16:54:12 -07:00
Maxime Lorrillere	392899682f	[Arista] Add support for Wolverine linecards (#8887 ) Add support for WolverineQCpu, WolverineQCpuMs, WolverineQCpuBk, WolverineQCpuBkMs Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>	2022-05-20 14:11:06 -07:00
Senthil Kumar Guruswamy	f37dd770cd	System Ready (#10479 ) Why I did it At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness. How I did it A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any. How to verify it "show system-health sysready-status" click CLI Syslogs for system ready	2022-05-20 13:25:11 -07:00
Arun Saravanan Balachandran	f4b22f67a4	[initramfs]: SSD firmware upgrade in initramfs (#10748 ) Why I did it To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration. How I did it A new option 'ssd-upgrader-part' is introduced in the grub command line to indicate the partition, and its filesystem type, in which the SSD firmware updater is present. The 'ssd-upgrader-part' syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4 A new initramfs script 'ssd-upgrade' is included in init-premount; it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part'. How to verify it In SONiC, the SSD firmware updater is copied to the "/host/" directory. Fast-reboot is to be initiated with the '-u' option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150). After reboot, while booting into SONiC, the SSD firmware updater will be executed in initramfs.	2022-05-12 08:11:02 -07:00
Marty Y. Lok	23f9126f59	[VoQ][config] Multiasic Supervisor card fails to load config_db#.json in chassis when system is reboot (#10106 ) The Supervisor card fails to load config_db#.json in a chassis when the system reboots. This is an intermittent issue. Fixes #10105	2022-05-09 11:06:11 -07:00
xumia	15cf9b0d70	Reduce image size for lazy installation packages (#10775 ) Why I did it The image size is too large when there are multiple lazy packages and multiple platforms. It is not necessary to keep multiple copies of the lazy installation packages. For the cisco image, the image size is reduced from 3.5G to 1.7G. How I did it Use symbolic links to keep only one copy of each lazy package. Make a new folder fsroot/platform/common and copy the lazy packages into it. When a package is used in one of the platforms, such as x86_64-grub, x86_64-8800_rp-r0, x86_64-8201_on-r0, etc., only make a symbolic link to the package in the common folder.	2022-05-09 08:26:09 -07:00
xumia	8ec8900d31	Support SONiC OpenSSL FIPS 140-3 based on SymCrypt engine (#9573 ) Why I did it Support OpenSSL FIPS 140-3, see design doc: https://github.com/Azure/SONiC/blob/master/doc/fips/SONiC-OpenSSL-FIPS-140-3.md. How I did it Install the fips packages. To build the fips packages, see https://github.com/Azure/sonic-fips Azure pipelines: https://dev.azure.com/mssonic/build/_build?definitionId=412 How to verify it Validate the SymCrypt engine: admin@sonic:~$ dpkg-query -W | grep openssl openssl 1.1.1k-1+deb11u1+fips symcrypt-openssl 0.1 admin@sonic:~$ openssl engine -v | grep -i symcrypt (symcrypt) SCOSSL (SymCrypt engine for OpenSSL) admin@sonic:~$	2022-05-06 07:21:30 +08:00
Junchao-Mellanox	681c24878b	Fix race condition between networking service and interface-config service (#10573 ) Why I did it This PR fixes a bug where the mgmt port eth0 may lose its IP even if the user configured a static IP for eth0. The issue is not always reproducible; the reproducing flow is: 1. Systemd starts the networking service, which runs a DHCP-based configuration and assigns an IP from DHCP. 2. Systemd starts the interface-config service, which depends on the networking service. 3. The interface-config service runs the command "ifdown --force eth0", but the networking service is still running, so the command fails with the error "error: Another instance of this program is already running.". This error is printed by the ifupdown2 lib, which is the main process of the networking service. So ifdown does not actually work here, and the IP of eth0 is not brought down. 4. The interface-config service updates /etc/network/interfaces to the static configuration. 5. The interface-config service runs the command "systemctl restart networking". This command kills the previous networking-related processes (log: networking.service: Main process exited, code=killed, status=15/TERM) and tries to reconfigure the IP address with the static configuration. But it detects that the configured IP and the existing IP are the same, so it does not actually configure the IP in the kernel. Hence, the IP is still the one obtained from DHCP. (This could be a bug in ifupdown2: the previous IP is from DHCP and the new IP is static, yet it treats them as the same instead of reconfiguring the IP.) 6. When the lease of the IP expires, the kernel removes the IP of eth0 and the issue reproduces. The issue is not always reproducible because the networking service usually finishes quickly enough that step 3 is not hit. How I did it Check the networking service state before running "ifdown --force eth0", and wait for it to finish if it is activating. How to verify it Manual test.	2022-05-05 15:21:44 -07:00
shlomibitton	4ec3af86af	[Fastboot] Delay PMON service for better fastboot performance (#10567 ) - Why I did it Profiling the system state on init after fast-reboot, during create_switch function execution, shows a few python scripts running at the same time. This parallel execution consumes CPU time, and create_switch takes longer than it should. Following this finding, and to ensure these services will not interfere in the future, PMON is delayed by 90 seconds until the system finishes the init flow after fastboot. - How I did it Add a timer for the PMON service. For the MLNX platform, exclude the start trigger of PMON when SYNCD starts in case of fastboot. Copy the timer file to the host bin image. - How to verify it Run fast-reboot on an MLNX platform and observe faster create_switch execution time.	2022-05-02 10:44:17 +03:00
shlomibitton	1d84e0d7df	[Fastboot] Delay LLDP service for better fastboot performance (#10568 ) - Why I did it Profiling the system state on init after fast-reboot, during create_switch function execution, shows a few python scripts running at the same time. This parallel execution consumes CPU time, and create_switch takes longer than it should. Following this finding, and to ensure these services will not interfere in the future, LLDP is delayed by 90 seconds until the system finishes the init flow after fastboot. - How I did it Add a timer for the LLDP service. Copy the timer file to the host bin image. - How to verify it Run fast-reboot on an MLNX platform and observe faster create_switch execution time. This PR depends on PR #10567	2022-04-28 10:35:14 +03:00
ganglv	9d7387a18e	[sonic-host-services]: Fix import and invalid path (#10660 ) Why I did it Can not start sonic-hostservice How I did it Install python3-dbus and systemd-python, and replace invalid path How to verify it Start the service with below commands: sudo systemctl start sonic-hostservice sudo systemctl status sonic-hostservice Signed-off-by: Gang Lv ganglv@microsoft.com	2022-04-27 07:14:51 +08:00
Saikrishna Arcot	64187a1b15	Remove SSH host keys after installing the custom version of sshd (#10633 ) * Remove SSH host keys after installing the custom version of sshd Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Use an override for sshd instead of overwriting the service file Don't overwrite upstream's .service file, and instead use an override file for making sure the host key(s) are generated. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2022-04-25 10:38:52 -07:00
bingwang-ms	3fc3259a35	Define qos map `AZURE_TUNNEL` for QoS remapping of tunnel traffic (#10565 ) * Add AZURE_TUNNEL map Signed-off-by: bingwang <wang.bing@microsoft.com>	2022-04-25 15:06:10 +08:00
yozhao101	e24fe9bc60	[Monit] Fix the issue which shows Monit can not reset its counter. (#10288 ) Signed-off-by: Yong Zhao <yozhao@microsoft.com> Why I did it This PR fixes the issue where Monit can't reset its counter when monitoring the memory usage of the telemetry container. Specifically, the Monit configuration for monitoring the memory usage of the telemetry container is as follows: check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" If the memory usage of the telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), the container is restarted. Recently we observed that, after the telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted again during this 1 hour sliding window. The reason is that Monit resets its counter if and only if the status of the monitored service changes from Status failed to Status ok, and during this 1 hour sliding window that change never happened. For each service monitored by Monit, there is an entry showing the monitoring status, monitoring mode, etc. For example, the following output from the command sudo monit status shows the status of the monitored service that checks the memory usage of telemetry: Program 'container_memory_telemetry' status Status ok monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Sat, 19 Mar 2022 19:56:26 Every minute, Monit runs the script to check the memory usage of telemetry and updates the counter if memory usage is larger than 400MB. If Monit checks the counter and finds that the memory usage of telemetry was larger than 400MB for 10 times within 20 minutes, the telemetry container is restarted. The following is an example status of the monitored service: Program 'container_memory_telemetry' status Status failed monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Tue, 01 Feb 2022 22:52:55 After the telemetry container was restarted, we found that the memory usage of telemetry increased rapidly from around 100MB to more than 400MB within 1 minute, so the status of the monitored service did not have a chance to change from Status failed to Status ok. How I did it As a workaround for this issue, Monit recently introduced another syntax format, repeat every <n> cycles, related to exec. This new syntax makes Monit repeat executing the background script if the error persists for a given number of cycles. How to verify it I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in the sonic-mgmt repo for review.	2022-04-20 18:08:06 -07:00
Samuel Angebault	fb147764b5	[Arista] Fix arista-net initramfs hook (#10624 ) The interface renaming logic fails if one interface is missing. Because of the `set -e`, the whole initramfs hook would abort early on error. This change fixes the current behavior so that missing interfaces are properly skipped and existing interfaces are renamed.	2022-04-20 10:03:05 -07:00
Junhua Zhai	128d762af3	[gearbox] Add peer gbsyncd for swss if gearbox exists (#10504 ) Fix the issues #10501 and #9733 If having gearbox, we need: * add gbsyncd as a peer since swss also has dependency on gbsyncd * add service gbsyncd to FEATURE table if it is missing	2022-04-20 19:02:49 +08:00
kellyyeh	2a516a7763	[dhcp_relay] Enable dhcp_relay on EPMS, MgmtTsTor, MgmtToRRouter and BackEndToRRouter (#10474 )	2022-04-15 18:01:24 -07:00
Yakiv Huryk	d9117d9411	[Mellanox][asan] add address sanitizer support for syncd (#10266 ) Why I did it To support address sanitizer for Mellanox syncd. How I did it /var/log/asan is mapped for the syncd container (the same as for swss); container stop() has a timeout (60s) for syncd (the same as for swss) so that libasan has enough time to generate a report; added ASAN's log path to Mellanox syncd supervisord.conf; added "asan: yes" to sonic_version.yml. How to verify it Added artificial memory leaks; compiled with ENABLE_ASAN=y; installed the image on the DUT; rebooted the DUT; verified that /var/log/asan/syncd-asan.log contains the leaks. Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>	2022-04-14 15:00:32 -07:00
Saikrishna Arcot	12ebe3ffa0	Run tune2fs during initramfs instead of image install (#10536 ) If it is run during image install, it's not guaranteed that the installation environment will have tune2fs available. Therefore, run it during initramfs instead. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2022-04-12 16:24:13 -07:00
byu343	f7a6553933	[docker-syncd]: Add optional shm-size to syncd container (#10516 ) Why I did it In the bringup of tomahawk4/trident4, we realized that such chips need a larger size of /dev/shm in syncd container, so we added the option --shm-size to the docker create for syncd. The default value for shm-size is 64m; after this change, people can add SYNCD_SHM_SIZE=128m to platform_env.conf to change it to 128m. How to verify it We verified that after this change, 1) on existing platforms without platform_env.conf, the size of /dev/shm in syncd container (df -h \| grep shm) is still the default 64M; 2) after we add SYNCD_SHM_SIZE=128m to platform_env.conf, /dev/shm in syncd becomes 128M.	2022-04-09 10:47:18 -07:00
Vivek R	ed14eb5263	[interfaces-config] "main exception: cannot find interfaces: eth0" error log avoided (#10463 ) - Why I did it Fixes #9628 During bootup, this error log is seen: Dec 22 04:26:29 sonic interfaces-config.sh[2546]: error: main exception: cannot find interfaces: eth0 (interface was probably never up ?) The error is non-functional and doesn't affect the flow. - How I did it Don't run ifdown if it is not needed. - How to verify it Verified during reboot. The log did not appear and the IP was acquired on eth0 as expected. Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>	2022-04-06 16:59:47 +03:00
bingwang-ms	b9dd1df372	Update qos config to clear queues for bounced back traffic (#10176 ) * Update qos config to clear queues for bounced back traffic Signed-off-by: bingwang <bingwang@microsoft.com>	2022-04-05 22:32:25 +08:00
judyjoseph	8e642848c2	Introduce the asic_subtype field for adding the sub platform variants. (#10235 ) * Introduce the asic_subtype field for adding the sub platform variants. It uses the value of TARGET_MACHINE variable in slave.mk.	2022-03-28 11:22:32 -07:00
Santhosh Kumar T	e2502edefd	Refactoring DELL platform init to reduce rc.local processing time porting changes in master (#10318 ) Why I did it To reduce the processing time of rc.local, refactoring s6100 platform initialization. Porting changes from 202012 branch [202012] Refactoring DELL platform init to reduce rc.local processing time #10171	2022-03-24 11:14:37 -07:00
xumia	e9ac13678d	[Build]: Fix armhf mirrors not existing issue (#10312 ) Why I did it The mirror endpoint debian-archive.trafficmanager.net does not support armhf, so change to use deb.debian.org and security.debian.org.	2022-03-22 15:24:15 +08:00
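
Commit f4b22f67a4 above gives the boot option syntax as ssd-upgrader-part=<partition>,<filesystem type>. As a rough illustration only, the Python sketch below pulls that option out of a kernel command line string. The function name parse_ssd_upgrader_part and the sample command line are hypothetical and are not taken from the commit; the actual initramfs script is not shown in this log.

```python
from typing import Optional, Tuple

# Option name as given in the commit message for f4b22f67a4.
OPTION = "ssd-upgrader-part"


def parse_ssd_upgrader_part(cmdline: str) -> Optional[Tuple[str, str]]:
    """Return (partition, fstype) if the option is present and well formed.

    Hypothetical helper for illustration; returns None when the option is
    missing or does not match <partition>,<filesystem type>.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key != OPTION or not sep:
            continue
        partition, comma, fstype = value.partition(",")
        if not comma or not partition or not fstype:
            return None  # malformed value, e.g. missing filesystem type
        return partition, fstype
    return None


if __name__ == "__main__":
    # Made-up command line; only the last token follows the commit's example.
    sample = "root=/dev/sda3 rw quiet ssd-upgrader-part=/dev/sda8,ext4"
    print(parse_ssd_upgrader_part(sample))
```

Returning None for a malformed value, instead of guessing a default filesystem type, keeps the caller free to skip the upgrade step when the option is unusable.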


995 Commits