When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.
Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.
**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
libpython2.7, libdaemon0, libdbus-1-3, libjansson4 are common
across different containers. move them into docker-base-stretch
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Put a flag for fast-reboot to the db using EXPIRE feature. Using this flag in other part of SONiC to start in Fast-reboot mode. If we reload a config, the state in the db will be removed.
- create a dockerfile-marcros.j2 file with all common operations
written as j2 macro
- use single dockerfile instruction for COPY and RUN commands
when possible to improve build time
- reorganize dockerfile instructions to make more cache friendly
(in case someday we will remove --no-cache to build docker images)
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
* Restore neighbor table to kernel during system warm-reboot
Added a service: "restore_neighbors" to restore neighbor table into
kernel during system warm reboot. The service is started by supervisord
in swss docker when the docker is started.
In case system warm reboot is enabled, it will try to restore the neighbor
table from appDB into kernel through netlink API calls and update the neighbor
table by sending arp/ns requests to all neighbor entries, then it sets the
stateDB flag for neighsyncd to continue the reconciliation process.
-- Added tcpdump python-scapy debian package into orchagent and vs dockers.
-- Added python module: pyroute2 netifaces into orchagent and vc dockers.
-- Workarounded tcpdump issue in the vs docker
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
* Move the restore_neighbors.py to sonic-swss submodule
Made changes to makefiles accordingly
Make dockerfile.j2 changes and supervisord config changes
Add python monotonic lib for time access
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
* Added PYTHON_SWSSCOMMON as swss runtime dependency
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
Remove the teamd.j2 templates used for starting the teamd. Add
teammgrd instead to manage all port channel related configuration
changes. Remove front panel port related configurations in
interfaces.j2 templates as well.
Remove teamd.sh script and use teammgrd to start all the teamd
processes. Remove all the logics in the start.sh script as well.
Update the sonic-swss submodule.
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Previously use / to separate container name and program name.
However, in rsyslogd:
Precisely, the programname is terminated by either (whichever occurs first):
end of tag
nonprintable character
‘:’
‘[‘
‘/’
The above definition has been taken from the FreeBSD syslogd sources.
Signed-off-by: Guohan Lu <gulv@microsoft.com>
* Use MAC from EEPROM for PortChannels
Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>
* Use MAC from EEPROM in DEVICE_METADATA
Will affect MAC for VLAN interfaces
Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>
* Get MAC via decode-syseeprom
Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>
* hw-management is now a service
Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>
* Add error handling for MAC fetch process
Signed-off-by: Andriy Moroz <c_andriym@mellanox.com>
* [libteam] Add fallback support for single-member-port LAG
* Allow the port to be selected if the LAG is configured
with fallback and port is in defaulted state due to missing
LACP PDUs from remote end
* Only enable port if LAG is admin up and the member port
is link up
* [team] Add lacp fallback config to teamd.j2 template
* [teamd] Resolve config conflict between fallback and minlink
* Remove min_link config if fallback is configured
* Add support for fallback config in minigraph
* [teamd] Only enable fallback if it is single-member-port LAG
Signed-off-by: Haiyang Zheng <haiyang.z@alibaba-inc.com>
* [teamd] Removing the admin status check in lacp_port_link_update
Will submit another pull request to fix this issue.
Signed-off-by: Haiyang Zheng <haiyang.z@alibaba-inc.com>
This reverts commit a6edef2fa5.
The reason to revert this commit is that it breaks the current nightly test as
no port channel interfaces are get created after boot. teamd failed to start and
complained about 'Cannot allocate memory' possibly due to nlmsg_alloc function
failure.
Will revert this commit to investigate it further before moving to supervisor.
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
With the fixes in /etc/network/interfaces file, host interfaces
could be added into the corresponding LAGs automatically. Thus,
the logic of checking if port initialization is ready is no longer
needed.
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Modify minigraph parser output format so it fit DB schema
Modify configuration templates to fit new schema
Systemd services dependencies are modified so database starts before any configuration consumer
In Jinja2, '|' cannot be treated directly as piping operator. The
operator precedence of '|' is higher than '*'. The filter only applies
to the value just before it. Group the expression to make sure that the
filter is applied to the outcome of the expression.
Update the unit test to add such case.
* [docker-teamd]: Explicitly set LAG hwaddr
Team device is initially created without any members and has a random HW
address, which is later changed to port's address. This configuration
sets team device's address explicitly to base MAC to avoid reassignment.
Signed-off-by: marian-pritsak <marianp@mellanox.com>
* Update teamd config tests with hwaddr
Signed-off-by: marian-pritsak <marianp@mellanox.com>
* Align HW addr byte for Centec and Mellanox
Signed-off-by: marian-pritsak <marianp@mellanox.com>
* Change HW addr to unicast in config tests
Signed-off-by: marian-pritsak <marianp@mellanox.com>
- Consolidate config.sh and start.sh scripts into one script (start.sh)
- Solve issue #435 - All dockers now run supervisord as their ENTRYPOINT
- All stdout/stderr output from processes managed by supervisord is now sent to syslog instead of their own files
- Supervisord log messages are now also sent to syslog
- Removed unused smartmontools package from docker-platform-monitor
- Add -p --port-config option to feed sonic-cfggen with port_config.ini
file when necessary.
- Update minigraph.py file to accept the -p option
- Add test_j2files.py test to test config.sh and all .j2 templates
* Currently test_teamd is added to test both the config.sh and teamd.j2
file works well with the t0 sample minigraph and sample port config
file
* The sample output is added to the folder sample_output for comparison
Signed-off-by: Shuotian Cheng <shuche@microsoft.com>
- minigraph_portchannel_interfaces and minigraph_vlan_interfaces are lists
of interfaces and the name could duplicate due to multiple IPs
- Add minigraph_portchannels and minigraph_vlans dictionaries to support
querying port channels and vlans via the name
- Update teamd.j2 template and config.sh file in docker-teamd
- Update zebra.conf.j2 template to add port channel interfaces
Signed-off-by: Shuotian Cheng <shuche@microsoft.com>
This change should be temporary because the current teamd cannot
re-create net devices acrosss restart. Basically, it will fail
when there're files in /var/run/teamd/ folder or the previously
created net devices are still there. Thus, the current workaround
is to remove the obsolete files to restart the docker-teamd.
This workaround cannot resolve the swss restart issue. Before
restarting swss, docker teamd needs to be stopped manually. After
swss starts, docker teamd needs to be restarted manually.
This change will only make sure that rebooting the switch will
make the switch at the correct state.
Signed-off-by: Shuotian Cheng <shuche@microsoft.com>
CMD is not longer a file name but a command that needs to be executed,
thus /bin/bash is not enough for the entrypoint and -c is needed.
Signed-off-by: Shuotian Cheng <shuche@microsoft.com>
* [docker-config-engine]: introduce docker sonic config engine
sonic config engine provide the sonic configure engine for all sonic
dockers that rely on the engine to generate runtime configuration.