sonic-buildimage

Author	SHA1	Message	Date
Lawrence Lee	a41c15a329	[swss]: Listen for undeliverable tunnel packets (#9348 ) - Create a script in the orchagent docker container which listens for these encapsulated packets which are trapped to CPU (indicating that they cannot be routed/no neighbor info exists for the inner packet). When such a packet is received, the script will issue a ping command to the packet's inner destination IP to start the neighbor learning process. - This script is also resilient to portchannel status changes (i.e. interface going up or down). An interface going down does not affect traffic sniffing on interfaces which are still up. When an interface comes back up, we restart the sniffer to start capturing traffic on that interface again.	2021-12-16 11:59:34 -08:00
Joe LeVeque	dd9be59cd1	[202012][dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7203 ) #### Why I did it Backport of https://github.com/Azure/sonic-buildimage/pull/7083 to the 202012 branch. To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged: ``` Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46 ``` This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10. This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802). I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.	2021-04-01 12:52:19 -07:00
yozhao101	cdef77f4c5	[SwSS] Disabled the autorestart of process `coppmgrd`. (#6774 ) The coppmgrd process does not need to be auto-restarted if it exits unexpectedly. Signed-off-by: Yong Zhao <yozhao@microsoft.com>	2021-02-16 15:32:36 -08:00
yozhao101	cc9c3f567e	[supervisord] Monitoring the critical processes with supervisord. (#6242 ) - Why I did it Initially, we used Monit to monitor critical processes in each container. If one of the critical processes was not running or had crashed, Monit would write an alert message to syslog periodically. If we added a new process to a container, the corresponding Monit configuration file also needed updating, which made maintenance harder. We now use a Supervisord event listener to do this monitoring. Since the processes in each container are managed by Supervisord, we can focus only on the monitoring logic. - How I did it We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener takes the following steps when it is notified that a critical process exited unexpectedly: The event listener first checks whether the auto-restart mechanism is enabled for this container. If auto-restart is enabled, the event listener kills the Supervisord process, which should cause the container to exit and subsequently get restarted. If auto-restart is not enabled for this container, the event listener enters a loop which first sleeps 1 minute and then checks whether the process is running. If yes, the event listener exits. If no, an alert message is written to syslog. - How to verify it First, check whether the auto-restart mechanism of a container is enabled by running the command show feature status. If it is enabled, select one critical process, kill it manually, and check whether the container is restarted. Second, disable the auto-restart mechanism if it was enabled in step 1 by running the command sudo config feature autorestart <container_name> disabled. Then select and kill one critical process. After that, an alert message should appear in syslog every minute. - Which release branch to backport (provide reason below if selected) 201811 201911 [x ] 202006	2021-01-28 09:28:27 -08:00
Prince Sunny	8fd50e895c	[submodule]: swss Tunnel Manager changes (#5843 ) Introduce tunnel manager daemon. Start the process as part of swss container Submodule update for swss: 9ed3026 - 2020-12-24 : [NAT] ACL Rule with DO_NOT_NAT action is getting failed. (#1502) [Akhilesh Samineni] c39a4b1 - 2020-12-23 : Mux/IPTunnel orchagent changes (#1497) [Prince Sunny] bc8df0e - 2020-12-23 : Add support for headroom pool watermark (#1567) [Neetha John]	2020-12-26 11:17:18 -08:00
KISHORE KUNAL	4bb8ab3495	Add support to start fdbsyncd when orchagent docker starts (#5979 ) Add support to start fdbsyncd when swss docker starts. A new daemon is added to sync MACs from the kernel to the DB and vice versa.	2020-12-24 18:36:01 -08:00
Stephen Sun	e010d83fc3	[Dynamic buffer calc] Support dynamic buffer calculation (#6194 ) - Why I did it To support dynamic buffer calculation. This PR also depends on the following PRs for sub modules - [sonic-swss: [buffermgr/bufferorch] Support dynamic buffer calculation #1338](https://github.com/Azure/sonic-swss/pull/1338) - [sonic-swss-common: Dynamic buffer calculation #361](https://github.com/Azure/sonic-swss-common/pull/361) - [sonic-utilities: Support dynamic buffer calculation #973](https://github.com/Azure/sonic-utilities/pull/973) - How I did it 1. Introduce field `buffer_model` in `DEVICE_METADATA\|localhost` to represent which buffer model is running in the system currently: - `dynamic` for the dynamic buffer calculation model - `traditional` for the traditional model in which the `pg_profile_lookup.ini` is used 2. Add the tables required for the feature: - ASIC_TABLE in platform/\<vendor\>/asic_table.j2 - PERIPHERAL_TABLE in platform/\<vendor\>/peripheral_table.j2 - PORT_PERIPHERAL_TABLE on a per-platform basis in device/\<vendor\>/\<platform\>/port_peripheral_config.j2 for each platform with gearbox installed. - DEFAULT_LOSSLESS_BUFFER_PARAMETER and LOSSLESS_TRAFFIC_PATTERN in files/build_templates/buffers_config.j2 - Add lossless PGs (3-4) for each port in files/build_templates/buffers_config.j2 3. Copy the newly introduced j2 files into the image and rendering them when the system starts 4. Update the CLI options for buffermgrd so that it can start with dynamic mode 5. Fetches the ASIC vendor name in orchagent: - fetch the vendor name when creates the docker and pass it as a docker environment variable - `buffermgrd` can use this passed-in variable 6. Clear buffer related tables from STATE_DB when swss docker starts 7. Update the src/sonic-config-engine/tests/sample_output/buffers-dell6100.json according to the buffer_config.j2 8. Remove buffer pool sizes for ingress pools and egress_lossy_pool Update the buffer settings for dynamic buffer calculation	2020-12-13 11:35:39 -08:00
Sudharsan Dhamal Gopalarathnam	98a434e8c1	Copp Manager Changes (#4861 ) *Introduce CoPP Manager infrastructure Copp service to generate initial copp config template file Co-authored-by: dgsudharsan <sudharsan_gopalarat@dell.com>	2020-11-23 09:31:42 -08:00
Joe LeVeque	7bf05f7f4f	[supervisor] Install vanilla package once again, install Python 3 version in Buster container (#5546 ) - Why I did it We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0. - How I did it - Remove Makefiles and patches for building supervisor package from source - Install Python 3 supervisor package version 4.2.1 in Buster base container - Also install Python 3 version of supervisord-dependent-startup in Buster base container - Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again. - Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers	2020-11-19 23:41:32 -08:00
Syd Logan	0311a4a037	Add gearbox phy device files and a new physyncd docker to support VS gearbox phy feature (#4851 ) * buildimage: Add gearbox phy device files and a new physyncd docker to support VS gearbox phy feature * scripts and configuration needed to support a second syncd docker (physyncd) * physyncd supports gearbox device and phy SAI APIs and runs multiple instances of syncd, one per phy in the device * support for VS target (sonic-sairedis vslib has been extended to support a virtual BCM81724 gearbox PHY). HLD is located at `b817a12fd8/doc/gearbox/gearbox_mgr_design.md` - Why I did it This work is part of the gearbox phy joint effort between Microsoft and Broadcom, and is based on multi-switch support in sonic-sairedis. - How I did it Overall feature was implemented across several projects. The collective pull requests (some in late stages of review at this point): https://github.com/Azure/sonic-utilities/pull/931 - CLI (merged) https://github.com/Azure/sonic-swss-common/pull/347 - Minor changes (merged) https://github.com/Azure/sonic-swss/pull/1321 - gearsyncd, config parsers, changes to orchagent to create gearbox phy on supported systems https://github.com/Azure/sonic-sairedis/pull/624 - physyncd, virtual BCM81724 gearbox phy added to vslib - How to verify it In a vslib build: root@sonic:/home/admin# show gearbox interfaces status PHY Id Interface MAC Lanes MAC Lane Speed PHY Lanes PHY Lane Speed Line Lanes Line Lane Speed Oper Admin -------- ----------- --------------- ---------------- --------------- ---------------- ------------ ----------------- ------ ------- 1 Ethernet48 121,122,123,124 25G 200,201,202,203 25G 204,205 50G down down 1 Ethernet49 125,126,127,128 25G 206,207,208,209 25G 210,211 50G down down 1 Ethernet50 69,70,71,72 25G 212,213,214,215 25G 216 100G down down In addition, docker ps \| grep phy should show a physyncd docker running. Signed-off-by: syd.logan@broadcom.com	2020-09-25 08:32:44 -07:00
Tamer Ahmed	b43f1129b4	[swss] Start Restore Neighbor After SWSS Config (#5451 ) SWSS config script restore ARP/FDB/Routes. Restore neighbor script uses config DB ARP information to restore ARP entries and so needs to be started after swssconfig exits. signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>	2020-09-24 14:57:42 -07:00
Joe LeVeque	5b3b4804ad	[dockers][supervisor] Increase event buffer size for dependent-startup (#5247 ) When stopping the swss, pmon or bgp containers, log messages like the following can be seen: ``` Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34 Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35 Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36 Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37 ``` This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100. Resolves https://github.com/Azure/sonic-buildimage/issues/5241	2020-09-08 23:36:38 -07:00
lguohan	cebb85b161	[docker-orchagent]: start portsyncd before orchagent (#4845 ) When portsyncd starts, it first enumerates all front panel ports and marks them as old interfaces. Then, for new front panel ports, it checks whether their indexes exist in the previous set. If yes, it treats them as old interfaces and ignores them. The reason we have this check is that Broadcom SAI only removes front panel ports after SAI switch init. So, if portsyncd starts after orchagent, new interfaces could be created before portsyncd starts and be treated as old interfaces. Signed-off-by: Guohan Lu <lguohan@gmail.com>	2020-06-24 22:48:37 -07:00
Guohan Lu	b8da6c3588	[docker-orchagent]: use service dependency in supervisord to start services	2020-05-22 11:01:28 -07:00
yozhao101	91e5fb5602	[Service] Enable/disable container auto-restart based on configuration. (#4073 )	2020-02-07 12:34:07 -08:00
zhenggen-xu	c23aac1581	[swss] Remove "-p port_config.ini" option from the portsyncd (#3671 ) * [portsyncd] Remove "-p port_config.ini" option from the portsyncd Signed-off-by: Zhenggen Xu <zxu@linkedin.com>	2019-10-27 21:15:39 -07:00
Joe LeVeque	6eca27e564	[services] Restart SwSS service upon unexpected critical process exit (#2845 ) * [service] Restart SwSS Docker container if orchagent exits unexpectedly * Configure systemd to stop restarting swss if it attempts to restart more than 3 times in 20 minutes * Move supervisor-proc-exit-listener script * [docker-dhcp-relay] Enhance wait_for_intf.sh.j2 to utilize STATEDB * Ensure dependent services stop/start/restart with SwSS * Change 'StartLimitInterval' to 'StartLimitIntervalSec', as Stretch installs systemd 232 (>= v230) * Also update journald.conf options * Remove 'PartOf' option from unit files * Add '$(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)' to new shared docker-orchagent makefile * Make supervisor-proc-exit-listener script read from 'critical_processes' file inside container * Update critical_processes file for swss container	2019-05-01 08:02:38 -07:00
Ze Gan	2e86caaedb	[vxlanmgrd]: Add vxlanmgrd start command (#2705 ) * Add bridge-utils to orchagent image - Add vxlanmgrd to supervisorctl in docker-orchagent Signed-off-by: Ze Gan zegan@microsoft.com * Update submodule pointer for swss to include Vxlanmgrd changes	2019-04-23 20:38:08 -07:00
Marian Pritsak	178764e3fa	[swss][supervisord.conf] Remove intfsyncd	2019-01-13 16:04:39 +02:00
Prince Sunny	43f6df4654	Add nbrmgr to supervisor control (#2265 ) * Add nbrmgr to supervisord conf * Corrected priority values [Fix typo] * Submodule update for Neighbor manager daemon Submodule update sonic-swss-common: edbfeec - Remove default docker name value of swss. (#250) 9728462 - Corrected configDB name for neigh table (#251) 6decc65 - Add NEIGH_TABLE to configDB for neighbor configuration (#249) 9918ae6 - Add ProducerStateTable temp view implementation and UT (#247) 41408f2 - Update README on dependencies d9c0ba4 - Update README on the section 'Build with Google Test' bb7fa5b - [ut]: explicit convert is to bool type (#248) 661b82c - Add gtest instruction in README Submodule update sonic-swss 705b092 - Support ConfigDB neighbor configuration, introduce nbrmgr daemon (#693) 8522390 - Add vxlan switch attributes to switch orch (#712) b123fa0 - [schema] update WARM_RESTART_TABLE:process_name schema document (#707) 2d7ab0c - Revert "Align default MTU value as SAI default (#705)" (#710) 836a58c - Align default MTU value as SAI default (#705) bffa01f - VNET/VXLAN changes (#643) b750a4b - [watermarkorch] add watermarkorch, extend queue and pg counters with wat… (#629)	2018-11-28 21:58:59 -08:00
zhenggen-xu	51a76614a3	Restore neighbor table to kernel during system warm-reboot (#2213 ) * Restore neighbor table to kernel during system warm-reboot Added a service: "restore_neighbors" to restore neighbor table into kernel during system warm reboot. The service is started by supervisord in swss docker when the docker is started. In case system warm reboot is enabled, it will try to restore the neighbor table from appDB into kernel through netlink API calls and update the neighbor table by sending arp/ns requests to all neighbor entries, then it sets the stateDB flag for neighsyncd to continue the reconciliation process. -- Added tcpdump python-scapy debian package into orchagent and vs dockers. -- Added python module: pyroute2 netifaces into orchagent and vc dockers. -- Worked around the tcpdump issue in the vs docker Signed-off-by: Zhenggen Xu <zxu@linkedin.com> * Move the restore_neighbors.py to sonic-swss submodule Made changes to makefiles accordingly Make dockerfile.j2 changes and supervisord config changes Add python monotonic lib for time access Signed-off-by: Zhenggen Xu <zxu@linkedin.com> * Added PYTHON_SWSSCOMMON as swss runtime dependency Signed-off-by: Zhenggen Xu <zxu@linkedin.com>	2018-11-09 17:06:09 -08:00
Qi Luo	709cd5a9f5	Set swssconfig.sh startsecs=0 for quick exit (#2181 ) The default startsecs is 1 second. However, swssconfig.sh will quickly exit with expected exit code 0 during warm starting. This case should not be treated as a failure	2018-10-22 23:40:24 -07:00
Marian Pritsak	8a5e6ac47d	[docker-orchagent]: Add vrfmgrd to supervisorctl (#2055 ) * [docker-orchagent]: Add vrfmgrd to supervisorctl Signed-off-by: Marian Pritsak <marianp@mellanox.com> * [sonic-vs]: Add vrfmgrd to supervisorctl Signed-off-by: Marian Pritsak <marianp@mellanox.com>	2018-09-19 22:18:39 -07:00
Shuotian Cheng	9413fa9a7b	[interfaces]: Move IP/MTU information from interfaces file into database (#1908 ) - Move front panel ports and port channels MTU and IP configurations out of the current /etc/network/interfaces file and store them in the configuration database. - The default MTU value for both front panel ports and the port channels is 9100. They are set via the minigraph or 9100 by default. - Introduce portmgrd which will pick up the MTU configurations from the configuration database. - The updated intfmgrd will pick up IP address changes from the configuration database. - Update sonic-swss submodule Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>	2018-08-20 11:19:16 -07:00
pavel-shirshov	10b4bbcae8	[swss]: Start counter from swss container (#1875 ) * sonic-quagga update. Don't spam with 'Vtysh connected from' message * Enable counters inside swss container. systemd is not flexible enough to follow our business rules	2018-07-26 13:39:08 -07:00
pavel-shirshov	c52fb762dd	Convert arp_update into a 'start-it-once' mode (#1864 ) * Run arp_update just once, don't restart it. It will run continuously with 5 min pauses	2018-07-18 13:04:57 -07:00
Andriy Moroz	58d8302b53	Buffers configuration update on port speed change (#1345 ) * Move buffer configuration to ConfigDB Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Converted Dell and Arista configs Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Add buffer configs for ACS-MSN2740 Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Updated buffers template Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Fixed j2 unit test Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Update buffers config for Force10-S6100 Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Update VS docker to support speed and buffers test Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Update buffers config generation - fixed support of sonic-to-sonic install Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Update submodules pointers for buffers config Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>	2018-01-29 08:11:05 -08:00
Ying Xie	2b91c9681d	Revert "Buffers configuration update on port speed change (#1250 )" (#1340 ) This reverts commit `814e50fd5e`.	2018-01-26 10:13:43 -08:00
Andriy Moroz	814e50fd5e	Buffers configuration update on port speed change (#1250 ) * Move buffer configuration to ConfigDB Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Converted Dell and Arista configs Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Add buffer configs for ACS-MSN2740 Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Updated buffers template Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Fixed j2 unit test Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Update buffers config for Force10-S6100 Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Update VS docker to support speed and buffers test Signed-off-by: Andriy Moroz <c_andriym@mellanox.com> * Update buffers config generation - fixed support of sonic-to-sonic install Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>	2018-01-26 08:09:31 -08:00
JipanYanga	7406d3709b	[configdb]: Add support for vlanconfd and intfconfd (#1063 ) * Add support for vlanconfd and intfconfd Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Change name to vlanmgrd and intfmgrd Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Add missing vlan_members for parse_dpg result Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Remove cfgmgr debug CLI from image Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Update swss and swss-common submodules for VLAN trunk support Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>	2017-11-05 22:37:16 -08:00
Qi Luo	554114cfaa	Make swssconfig status FATAL when it fails (#1009 ) * Make supervisor controlled one-shot program autorestart 0 time, so the status will become FATAL instead of EXITED if failure happens Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com> * Make swssconfig.sh strictly exit on any failure Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com> * Tune startretries, tested in supervisor 3.3.2-1 Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>	2017-10-04 01:02:30 -07:00
Shuotian Cheng	d1b12dc0ca	[swss]: Sleep 5 min regardless of arp_update return code (#743 ) - arp_update return code is not guaranteed to be true/false. When there is no VLAN, arp_update will return true. When there are VLANs, arp_update will return false because the command arping returns 1 due to the option '-w 0'. - This script should be run every 5 minutes regardless of the return code.	2017-06-22 21:26:33 -07:00
Shuotian Cheng	8af03fd0f9	[orchagent]: Add ARP update script to maintain VLAN neighbors (#401 ) - Extend ARP reachable time to 30min - Add arping to docker-swss - Add arp_update script to routinely probe neighbors Signed-off-by: Shuotian Cheng <shuche@microsoft.com>	2017-05-15 17:06:19 -07:00
Joe LeVeque	6e45307a49	[docker-orchagent]: Properly manage with supervisord (#589 )	2017-05-11 11:18:10 -07:00
Joe LeVeque	d5c13c0a83	[dockers]: Disable autorestart on all supervisor processes inside containers (#580 )	2017-05-09 17:37:08 -07:00
Joe LeVeque	8f348399f5	[Dockers]: Manage all Docker containers with Supervisord (#573 ) - Consolidate config.sh and start.sh scripts into one script (start.sh) - Solve issue #435 - All dockers now run supervisord as their ENTRYPOINT - All stdout/stderr output from processes managed by supervisord is now sent to syslog instead of their own files - Supervisord log messages are now also sent to syslog - Removed unused smartmontools package from docker-platform-monitor	2017-05-08 15:43:31 -07:00

36 Commits
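
Commit cc9c3f567e above describes a supervisord event listener that reacts when a critical process exits unexpectedly. The decision it describes can be sketched in Python as follows. This is a minimal illustration, not the script shipped in the repository: the function names, the action strings, and the choice to skip exits flagged `expected:1` are assumptions made for the example. Only the READY/RESULT handshake and the `key:value` token format follow Supervisor's documented event listener protocol.

```python
def parse_tokens(line):
    """Parse a Supervisor 'key:value key:value' token line into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def decide_action(event_name, payload, critical_processes, autorestart_enabled):
    """Return a hypothetical action name for one event.

    'kill_supervisord' : auto-restart is enabled, so the container should exit.
    'alert_loop'       : auto-restart is disabled, so poll and log to syslog.
    'ignore'           : the event does not concern a critical process.
    """
    if event_name != "PROCESS_STATE_EXITED":
        return "ignore"
    # Assumption: an exit with an expected exit code is not a failure.
    if payload.get("expected") == "1":
        return "ignore"
    if payload.get("processname") not in critical_processes:
        return "ignore"
    return "kill_supervisord" if autorestart_enabled else "alert_loop"


def handle_one_event(stdin, stdout, critical_processes, autorestart_enabled):
    """Run one READY/RESULT cycle of the event listener protocol."""
    # Tell Supervisor that this listener can accept an event.
    stdout.write("READY\n")
    stdout.flush()
    # The header line carries the event name and the payload length.
    header = parse_tokens(stdin.readline())
    payload = parse_tokens(stdin.read(int(header["len"])))
    action = decide_action(
        header["eventname"], payload, critical_processes, autorestart_enabled
    )
    # Acknowledge the event: "RESULT <length>\n" followed by the body "OK".
    stdout.write("RESULT 2\nOK")
    stdout.flush()
    return action
```

A real listener would call `handle_one_event(sys.stdin, sys.stdout, ...)` in a loop, load the process names from the container's `critical_processes` file, and act on the returned value; those parts are left out here because their details are not given in the commit messages.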