sonic-buildimage/dockers/docker-fpm-frr/frr/supervisord/supervisord.conf.j2

181 lines
4.1 KiB
Plaintext
Raw Normal View History

[supervisord]
logfile_maxbytes=1MB
logfile_backups=2
nodaemon=true
[eventlistener:dependent-startup]
command=python3 -m supervisord_dependent_startup
autostart=true
autorestart=unexpected
startretries=0
exitcodes=0,3
events=PROCESS_STATE
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083) To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged: ``` Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46 ``` This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10. This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802). I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-03-27 23:14:24 -05:00
buffer_size=1024
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
[supervisord] Monitoring the critical processes with supervisord. (#6242) - Why I did it Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance. Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by Supervisord, we can only focus on the logic of monitoring. - How I did it We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take following steps if it was notified one of critical processes exited unexpectedly: The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted. If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog. - How to verify it First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not. Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute. - Which release branch to backport (provide reason below if selected) 201811 201911 [x ] 202006
2021-01-21 14:57:49 -06:00
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083) To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged: ``` Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46 ``` This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10. This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802). I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-03-27 23:14:24 -05:00
buffer_size=1024
[program:rsyslogd]
command=/usr/sbin/rsyslogd -n -iNONE
priority=1
autostart=false
autorestart=unexpected
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
[program:zebra]
command=/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M fpm -M snmp
priority=4
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=rsyslogd:running
[program:zsocket]
command=/usr/bin/zsocket.sh
priority=4
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=zebra:running
[program:staticd]
command=/usr/lib/frr/staticd -A 127.0.0.1
priority=4
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=zsocket:exited
{% if DEVICE_METADATA.localhost.frr_mgmt_framework_config is defined and DEVICE_METADATA.localhost.frr_mgmt_framework_config == "true" %}
[program:bfdd]
command=/usr/lib/frr/bfdd -A 127.0.0.1
priority=4
stopsignal=KILL
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=zebra:running
{% endif %}
[program:bgpd]
command=/usr/lib/frr/bgpd -A 127.0.0.1 -M snmp
priority=5
stopsignal=KILL
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=zsocket:exited
{% if DEVICE_METADATA.localhost.frr_mgmt_framework_config is defined and DEVICE_METADATA.localhost.frr_mgmt_framework_config == "true" %}
[program:ospfd]
command=/usr/lib/frr/ospfd -A 127.0.0.1 -M snmp
priority=5
stopsignal=KILL
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=zebra:running
[program:pimd]
command=/usr/lib/frr/pimd -A 127.0.0.1
priority=5
stopsignal=KILL
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=zebra:running
{% endif %}
[program:fpmsyncd]
command=fpmsyncd
priority=6
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=bgpd:running
{% if DEVICE_METADATA.localhost.frr_mgmt_framework_config is defined and DEVICE_METADATA.localhost.frr_mgmt_framework_config == "true" %}
[program:frrcfgd]
command=/usr/local/bin/frrcfgd
{% else %}
[program:bgpcfgd]
command=/usr/local/bin/bgpcfgd
{% endif %}
priority=6
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=bgpd:running
[program:bgpmon]
command=/usr/local/bin/bgpmon
priority=6
autostart=false
autorestart=true
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=bgpd:running
{% if DEVICE_METADATA.localhost.docker_routing_config_mode is defined and DEVICE_METADATA.localhost.docker_routing_config_mode == "unified" %}
[program:vtysh_b]
command=/usr/bin/vtysh -b
priority=6
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
[docker-fpm-frr]: Start bgpd after zebra was started (#5038) fixes https://github.com/Azure/sonic-buildimage/issues/5026 Explanation: In the log from the issue I found: ``` I see following in the log Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed ``` Analyzing source code I found that the error message could be issues only when `zclient_send_rnh()` return less than 0. ``` ret = zclient_send_rnh(zclient, command, p, exact_match, bnc->bgp->vrf_id); /* TBD: handle the failure */ if (ret < 0) flog_warn(EC_BGP_ZEBRA_SEND, "sendmsg_nexthop: zclient_send_message() failed"); ``` I checked [zclient_send_rnh()](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L654) and found that this function will return the exit code which the function gets from [zclient_send_message()](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L266) But the latter function could return not 0 in two cases: 1. bgpd didn’t connect to the zclient socket yet [code](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L269) 2. The socket was closed. But in this case we would receive the error message in the log. (And I can find the message in the log when we reboot sonic) [code](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L277) Also I see from the logs that client connection was set later we had the issue in bgpd. Bgpd.log ``` Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed ``` Vs Zebra.log ``` Jul 22 21:13:12.713249 vlab-01 NOTICE bgp#zebra[48]: client 25 says hello and bids fair to announce only static routes vrf=0 Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 30 says hello and bids fair to announce only bgp routes vrf=0 Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 33 says hello and bids fair to announce only vnc routes vrf=0 ``` So in our case we should start zebra first. Wait until it is started and then start bgpd and other daemons. **- How I did it** I changed a graph to start daemons in the following order: 1. First start zebra 2. Then starts staticd and bgpd 3. Then starts vtysh -b and bgpeoi after bgpd is started.
2020-07-25 05:48:47 -05:00
dependent_startup_wait_for=bgpd:running
{% endif %}
{% if WARM_RESTART is defined and WARM_RESTART.bgp is defined and WARM_RESTART.bgp.bgp_eoiu is defined and WARM_RESTART.bgp.bgp_eoiu == "true" %}
[program:bgp_eoiu_marker]
command=/usr/bin/bgp_eoiu_marker.py
priority=7
autostart=false
autorestart=false
startsecs=0
startretries=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
[docker-fpm-frr]: Start bgpd after zebra was started (#5038) fixes https://github.com/Azure/sonic-buildimage/issues/5026 Explanation: In the log from the issue I found: ``` I see following in the log Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed ``` Analyzing source code I found that the error message could be issues only when `zclient_send_rnh()` return less than 0. ``` ret = zclient_send_rnh(zclient, command, p, exact_match, bnc->bgp->vrf_id); /* TBD: handle the failure */ if (ret < 0) flog_warn(EC_BGP_ZEBRA_SEND, "sendmsg_nexthop: zclient_send_message() failed"); ``` I checked [zclient_send_rnh()](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L654) and found that this function will return the exit code which the function gets from [zclient_send_message()](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L266) But the latter function could return not 0 in two cases: 1. bgpd didn’t connect to the zclient socket yet [code](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L269) 2. The socket was closed. But in this case we would receive the error message in the log. (And I can find the message in the log when we reboot sonic) [code](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L277) Also I see from the logs that client connection was set later we had the issue in bgpd. Bgpd.log ``` Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed ``` Vs Zebra.log ``` Jul 22 21:13:12.713249 vlab-01 NOTICE bgp#zebra[48]: client 25 says hello and bids fair to announce only static routes vrf=0 Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 30 says hello and bids fair to announce only bgp routes vrf=0 Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 33 says hello and bids fair to announce only vnc routes vrf=0 ``` So in our case we should start zebra first. Wait until it is started and then start bgpd and other daemons. **- How I did it** I changed a graph to start daemons in the following order: 1. First start zebra 2. Then starts staticd and bgpd 3. Then starts vtysh -b and bgpeoi after bgpd is started.
2020-07-25 05:48:47 -05:00
dependent_startup_wait_for=bgpd:running
{% endif %}