2017-10-05 01:35:43 -05:00
[supervisord]
logfile_maxbytes=1MB
logfile_backups=2
nodaemon=true

2020-05-16 10:27:31 -05:00
[eventlistener:dependent-startup]
2020-11-20 01:41:32 -06:00
command=python3 -m supervisord_dependent_startup
2020-05-16 10:27:31 -05:00
autostart=true
autorestart=unexpected
startretries=0
exitcodes=0,3
events=PROCESS_STATE
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083)
To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:
```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```
This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.
This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).
I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
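The overflow described above can be illustrated with a small simulation. This is only a sketch, assuming a listener that does not drain its queue while events arrive and assuming the oldest queued event is the one discarded on overflow; `simulate_event_buffer` and its parameters are hypothetical names, not part of supervisord.

```python
from collections import deque


def simulate_event_buffer(buffer_size, num_events):
    """Queue num_events events into a bounded buffer that is never drained.

    Returns (kept, discarded) lists of event ids. Hypothetical helper for
    illustration only; it does not call into supervisord.
    """
    buffer = deque()
    discarded = []
    for event_id in range(num_events):
        if len(buffer) >= buffer_size:
            # Buffer is full: drop the oldest queued event to make room.
            discarded.append(buffer.popleft())
        buffer.append(event_id)
    return list(buffer), discarded


# With the default size of 10, a burst of 46 events loses most of them.
kept_small, lost_small = simulate_event_buffer(10, 46)
# With buffer_size=1024, the same burst fits with plenty of headroom.
kept_large, lost_large = simulate_event_buffer(1024, 46)
print(len(lost_small), len(lost_large))
```

The numbers are illustrative; how large a burst a real container produces depends on how many processes change state at once.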
2021-03-27 23:14:24 -05:00
buffer_size=1024
2020-05-16 10:27:31 -05:00

2019-11-05 20:32:14 -06:00
[eventlistener:supervisor-proc-exit-listener]
2020-02-07 14:34:07 -06:00
command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of the critical processes was not running or had crashed for some reason, Monit would periodically write an alerting message into syslog. Whenever a new process was added to a container, the corresponding Monit configuration file also had to be updated, which made maintenance harder.
We now use a Supervisord event listener for this monitoring. Since the processes in each container are already managed by Supervisord, we only need to focus on the monitoring logic.
- How I did it
We use the Supervisord event listener mechanism to monitor critical processes in containers. When the event listener is notified that a critical process exited unexpectedly, it takes the following steps:
The event listener first checks whether the auto-restart mechanism is enabled for this container. If it is enabled, the event listener kills the Supervisord process, which should cause the container to exit and subsequently be restarted.
If the auto-restart mechanism is not enabled for this container, the event listener enters a loop that sleeps for 1 minute and then checks whether the process is running. If yes, the event listener exits. If no, an alerting message is written into syslog.
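The steps above can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not the actual supervisor-proc-exit-listener code: the function name, its parameters, the return strings, and the `max_checks` cap are all hypothetical and exist only so the logic can be exercised in isolation.

```python
import time


def handle_critical_exit(autorestart_enabled, kill_supervisord, is_running,
                         alert, sleep=time.sleep, interval_secs=60,
                         max_checks=None):
    """Hypothetical sketch of the listener's reaction to an unexpected exit.

    All callables are injected by the caller; none of them are real
    supervisord or SONiC APIs.
    """
    if autorestart_enabled:
        # Killing supervisord should make the container exit and be restarted.
        kill_supervisord()
        return "killed_supervisord"

    checks = 0
    while max_checks is None or checks < max_checks:
        # Wait one interval, then look at the process again.
        sleep(interval_secs)
        checks += 1
        if is_running():
            return "process_recovered"
        # Still down: emit an alert (syslog in the described design).
        alert("critical process is not running")
    return "stopped_checking"


# Usage: the process comes back on the third check, so two alerts are raised.
states = iter([False, False, True])
alerts = []
result = handle_critical_exit(False, lambda: None, lambda: next(states),
                              alerts.append, sleep=lambda secs: None)
print(result, len(alerts))
```

Injecting `sleep` keeps the sketch testable without waiting a real minute per iteration.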
- How to verify it
First, check whether the auto-restart mechanism of a container is enabled by running the command show feature status. If it is enabled, select one critical process and kill it manually, then check whether the container is restarted.
Second, disable the auto-restart mechanism if it was enabled in step 1 by running the command sudo config feature autorestart <container_name> disabled. Then select one critical process and kill it. After that, the alerting message should appear in syslog every minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
2021-01-21 14:57:49 -06:00
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
2019-11-05 20:32:14 -06:00
autostart=true
autorestart=unexpected
2021-03-27 23:14:24 -05:00
buffer_size=1024
2019-11-05 20:32:14 -06:00

2020-05-16 10:27:31 -05:00
[program:rsyslogd]
command=/usr/sbin/rsyslogd -n -iNONE
2017-10-05 01:35:43 -05:00
priority=1
2020-05-16 10:27:31 -05:00
autostart=false
2017-10-05 01:35:43 -05:00
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
2020-05-16 10:27:31 -05:00
dependent_startup=true
2017-10-05 01:35:43 -05:00

2020-05-16 10:27:31 -05:00
[program:start]
command=/usr/bin/start.sh
2017-10-05 01:35:43 -05:00
priority=2
autostart=false
autorestart=false
2020-05-16 10:27:31 -05:00
startsecs=0
2017-10-05 01:35:43 -05:00
stdout_logfile=syslog
stderr_logfile=syslog
2020-05-16 10:27:31 -05:00
dependent_startup=true
dependent_startup_wait_for=rsyslogd:running
2017-10-05 01:35:43 -05:00
{# If our configuration has VLANs... #}
2021-01-25 12:48:48 -06:00
{% if VLAN_INTERFACE %}
2017-10-05 01:35:43 -05:00
{# Count how many VLANs require a DHCP relay agent... #}
2021-07-16 09:31:05 -05:00
{% set ipv4_num_relays = { 'count': 0 } %}
{% set ipv6_num_relays = { 'count': 0 } %}
2021-01-25 12:48:48 -06:00
{% for vlan_name in VLAN_INTERFACE %}
2021-03-04 22:43:08 -06:00
{% if VLAN and vlan_name in VLAN and 'dhcp_servers' in VLAN[vlan_name] and VLAN[vlan_name]['dhcp_servers']|length > 0 %}
2021-07-16 09:31:05 -05:00
{% set _dummy = ipv4_num_relays.update({'count': ipv4_num_relays.count + 1}) %}
{% endif %}
2022-06-04 13:37:04 -05:00
{% if DHCP_RELAY and vlan_name in DHCP_RELAY and DHCP_RELAY[vlan_name]['dhcpv6_servers']|length > 0 %}
2021-07-16 09:31:05 -05:00
{% set _dummy = ipv6_num_relays.update({'count': ipv6_num_relays.count + 1}) %}
2017-10-05 01:35:43 -05:00
{% endif %}
{% endfor %}
{# If one or more of the VLANs require a DHCP relay agent... #}
2021-07-16 09:31:05 -05:00
{% if ipv4_num_relays.count > 0 or ipv6_num_relays.count > 0 %}
{% include 'dhcp-relay.programs.j2' %}
2017-10-05 01:35:43 -05:00
{# Create a program entry for each DHCP relay agent instance #}
2019-09-06 14:01:08 -05:00
{% set relay_for_ipv4 = { 'flag': False } %}
2021-07-16 09:31:05 -05:00
{% set relay_for_ipv6 = { 'flag': False } %}
2021-01-25 12:48:48 -06:00
{% for vlan_name in VLAN_INTERFACE %}
2021-07-16 09:31:05 -05:00
{% include 'dhcpv4-relay.agents.j2' %}
2020-01-07 19:48:03 -06:00
{% endfor %}

2022-06-04 13:37:04 -05:00
{% include 'dhcpv6-relay.agents.j2' %}
2022-04-12 12:44:17 -05:00
{% include 'dhcp-relay.monitors.j2' %}
[dhcp_relay] Use dhcprelayd to manage critical processes (#17236)
Modify the j2 template files in docker-dhcp-relay. Add dhcprelayd to the dhcp-relay group instead of isc-dhcp-relay-VlanXXX, which makes dhcprelayd a critical process.
In dhcprelayd, subscribe to the FEATURE table to check whether the dhcp_server feature is enabled.
2.1 If the dhcp_server feature is disabled, the original dhcp_relay functionality is needed and dhcprelayd does nothing. Because the dhcrelay/dhcpmon configuration is generated in the supervisord configuration, those processes run automatically.
2.2 If the dhcp_server feature is enabled, dhcprelayd stops the dhcpmon/dhcrelay processes started by supervisord and subscribes to the dhcp_server related tables in config_db to start dhcpmon/dhcrelay processes.
2.3 While dhcprelayd is running, it regularly checks the feature status (every 5s by default) and can encounter the following 4 state changes of the dhcp_server feature:
A) disabled -> enabled
In this scenario, dhcprelayd subscribes to the dhcp_server related tables, stops the dhcpmon/dhcrelay processes started by supervisord, and starts a new pair of dhcpmon/dhcrelay processes. After this, the dhcpmon/dhcrelay processes are managed entirely by dhcprelayd.
B) enabled -> enabled
In this scenario, dhcprelayd monitors db changes in the dhcp_server related tables to determine whether to restart the dhcpmon/dhcrelay processes.
C) enabled -> disabled
In this scenario, dhcprelayd unsubscribes from the dhcp_server related tables and kills the dhcpmon/dhcrelay processes it started itself. It then starts the dhcpmon/dhcrelay processes via supervisorctl.
D) disabled -> disabled
dhcprelayd checks whether the running status of the dhcrelay processes is consistent with the supervisord configuration file. If it is not, dhcprelayd kills itself, and the dhcp_relay container then stops because dhcprelayd is a critical process.
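The four transitions above can be summarized as a lookup from (previous state, current state) to a list of actions. This is only an illustrative sketch: `plan_actions` and the action names are hypothetical labels for the behavior described in this message, not identifiers from dhcprelayd itself.

```python
def plan_actions(was_enabled, is_enabled, processes_match_config=True):
    """Map a dhcp_server feature transition to the described actions.

    Hypothetical helper; the returned strings are illustrative labels only.
    """
    if not was_enabled and is_enabled:
        # A) disabled -> enabled: dhcprelayd takes over process management.
        return ["subscribe_dhcp_server_tables",
                "stop_supervisord_started_processes",
                "start_own_processes"]
    if was_enabled and is_enabled:
        # B) enabled -> enabled: keep watching the tables for changes.
        return ["monitor_dhcp_server_tables"]
    if was_enabled and not is_enabled:
        # C) enabled -> disabled: hand the processes back to supervisord.
        return ["unsubscribe_dhcp_server_tables",
                "kill_own_processes",
                "start_processes_via_supervisorctl"]
    # D) disabled -> disabled: exit only if the running processes have
    # drifted from the supervisord configuration.
    return [] if processes_match_config else ["kill_self"]


# Usage: a feature that stays disabled with mismatched processes.
print(plan_actions(False, False, processes_match_config=False))
```

Keeping the mapping as a pure function separates the decision from the side effects, which would live in whatever code polls the feature status.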
2023-11-27 11:30:01 -06:00
{% endif %}
{% endif %}
2023-11-02 10:09:01 -05:00
[program:dhcprelayd]
command=/usr/local/bin/dhcprelayd
priority=3
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=start:exited