[supervisord] Monitoring the critical processes with supervisord. (#6242)

- Why I did it
Initially, we used Monit to monitor the critical processes in each container. If one of the critical processes was not
running or had crashed for some reason, Monit would periodically write an alerting message into syslog. If we added a new
process to a container, the corresponding Monit configuration file also had to be updated, which made maintenance harder.

We now use the event listener mechanism of Supervisord to do this monitoring. Since the processes in each container are
already managed by Supervisord, we only need to focus on the monitoring logic.

- How I did it
We used the event listener of Supervisord to monitor critical processes in containers. The event listener takes the
following steps when it is notified that one of the critical processes exited unexpectedly:

The event listener first checks whether the auto-restart mechanism is enabled for this container. If it is enabled, the event listener kills the Supervisord process, which should cause the container to exit and subsequently be restarted.

If the auto-restart mechanism is not enabled for this container, the event listener enters a loop that first sleeps for 1 minute and then checks whether the process is running. If it is, the event listener exits. If it is not, an alerting message is written into syslog.
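The two branches above can be sketched in plain Python. This is a minimal illustration, not the listener itself: the helper names `on_critical_exit`, `on_running` and `due_alerts` are hypothetical, and the process name `orchagent` used in the usage note is only an example. The sketch assumes the bookkeeping approach visible in the diff of this commit, where the exit time of each critical process is stored in a dictionary and compared against `ALERTING_INTERVAL_SECS`, rather than a literal sleep.

```python
ALERTING_INTERVAL_SECS = 60  # same value as the constant added in this commit


def on_critical_exit(process_name, auto_restart_state, process_under_alerting, now):
    """Handle an unexpected exit of a critical process.

    Returns True when supervisord should be terminated (auto-restart is not
    disabled); otherwise records the exit time so alerts can be raised later.
    """
    if auto_restart_state != "disabled":
        return True
    process_under_alerting[process_name] = now
    return False


def on_running(process_name, process_under_alerting):
    """Stop alerting for a process that is running again."""
    process_under_alerting.pop(process_name, None)


def due_alerts(process_under_alerting, now, interval=ALERTING_INTERVAL_SECS):
    """Return the processes whose alert is due and restart their timers."""
    due = []
    for name in list(process_under_alerting):
        if now - process_under_alerting[name] >= interval:
            process_under_alerting[name] = now
            due.append(name)
    return due
```

For example, after `on_critical_exit("orchagent", "disabled", tracked, 0.0)`, a call to `due_alerts(tracked, 60.0)` reports `orchagent` once and resets its timer, so the next alert is not due until another interval has elapsed.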

- How to verify it
First, we need to check whether the auto-restart mechanism of a container is enabled by running the command 'show feature status'. If it is enabled, one critical process should be selected and killed manually, and then we need to check whether the container is restarted.

Second, if the auto-restart mechanism was enabled at step 1, we can disable it by running the command 'sudo config feature autorestart <container_name> disabled'. Then one critical process should be selected and killed. After that, the alerting message should appear in syslog every 1 minute.
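When checking the second step, it can help to filter syslog for the alerting message. The sketch below is only an aid under stated assumptions: the regular expression mirrors the message format used by `generate_alerting_message()` in this commit, while the helper name `find_alerts` and the sample log lines in the usage note (timestamps, host name, process names) are hypothetical.

```python
import re

# Mirrors the text written by generate_alerting_message() in this commit.
ALERT_RE = re.compile(
    r"Process '(?P<process>[^']+)' is not running in namespace '(?P<namespace>[^']+)'\."
)


def find_alerts(lines):
    """Return (process, namespace) pairs for every alerting line found."""
    alerts = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            alerts.append((match.group("process"), match.group("namespace")))
    return alerts
```

Feeding it lines such as `"Jan 21 12:00:00 sonic ERR Process 'orchagent' is not running in namespace 'host'."` yields one pair per alert, so counting the pairs over a few minutes gives a rough check that one message is logged per interval.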

- Which release branch to backport (provide reason below if selected)

[ ] 201811
[ ] 201911
[x] 202006
yozhao101 2021-01-21 12:57:49 -08:00 committed by GitHub
parent 5c31f6d8cc
commit be3c036794
30 changed files with 122 additions and 66 deletions

View File

@ -5,7 +5,7 @@ nodaemon=true
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name database command=/usr/bin/supervisor-proc-exit-listener --container-name database
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -5,7 +5,7 @@ nodaemon=true
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -5,7 +5,7 @@ nodaemon=true
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bgp command=/usr/bin/supervisor-proc-exit-listener --container-name bgp
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name lldp command=/usr/bin/supervisor-proc-exit-listener --container-name lldp
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name nat command=/usr/bin/supervisor-proc-exit-listener --container-name nat
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=100
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name swss command=/usr/bin/supervisor-proc-exit-listener --container-name swss
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=100
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name pmon command=/usr/bin/supervisor-proc-exit-listener --container-name pmon
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-script] [eventlistener:supervisor-proc-exit-script]
command=/usr/bin/supervisor-proc-exit-listener --container-name radv command=/usr/bin/supervisor-proc-exit-listener --container-name radv
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name sflow command=/usr/bin/supervisor-proc-exit-listener --container-name sflow
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name snmp command=/usr/bin/supervisor-proc-exit-listener --container-name snmp
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name restapi command=/usr/bin/supervisor-proc-exit-listener --container-name restapi
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=false autorestart=false

View File

@ -14,7 +14,7 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name telemetry command=/usr/bin/supervisor-proc-exit-listener --container-name telemetry
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=false autorestart=false

View File

@ -14,7 +14,7 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name teamd command=/usr/bin/supervisor-proc-exit-listener --container-name teamd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -2,11 +2,14 @@
import getopt import getopt
import os import os
import select
import signal import signal
import sys import sys
import syslog import syslog
import time
import swsssdk import swsssdk
from supervisor import childutils from supervisor import childutils
# Each line of this file should specify either one critical process or one # Each line of this file should specify either one critical process or one
@ -20,10 +23,18 @@ CRITICAL_PROCESSES_FILE = '/etc/supervisor/critical_processes'
# The FEATURE table in config db contains auto-restart field # The FEATURE table in config db contains auto-restart field
FEATURE_TABLE_NAME = 'FEATURE' FEATURE_TABLE_NAME = 'FEATURE'
# Read the critical processes/group names from CRITICAL_PROCESSES_FILE # Value of parameter 'timeout' in select(...) method
SELECT_TIMEOUT_SECS = 1.0
# Alerting message will be written into syslog in the following interval
ALERTING_INTERVAL_SECS = 60
def get_critical_group_and_process_list(): def get_critical_group_and_process_list():
"""
@summary: Read the critical processes/group names from CRITICAL_PROCESSES_FILE.
@return: Two lists which contain critical processes and group names respectively.
"""
critical_group_list = [] critical_group_list = []
critical_process_list = [] critical_process_list = []
@ -49,6 +60,47 @@ def get_critical_group_and_process_list():
return critical_group_list, critical_process_list return critical_group_list, critical_process_list
def generate_alerting_message(process_name):
"""
@summary: If a critical process was not running, this function will determine it resides in host
or in a specific namespace. Then an alerting message will be written into syslog.
"""
namespace_prefix = os.environ.get("NAMESPACE_PREFIX")
namespace_id = os.environ.get("NAMESPACE_ID")
if not namespace_prefix or not namespace_id:
namespace = "host"
else:
namespace = namespace_prefix + namespace_id
syslog.syslog(syslog.LOG_ERR, "Process '{}' is not running in namespace '{}'.".format(process_name, namespace))
def get_autorestart_state(container_name):
"""
@summary: Read the status of auto-restart feature from Config_DB.
@return: Return the status of auto-restart feature.
"""
config_db = swsssdk.ConfigDBConnector()
config_db.connect()
features_table = config_db.get_table(FEATURE_TABLE_NAME)
if not features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
sys.exit(2)
if container_name not in features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
sys.exit(3)
is_auto_restart = features_table[container_name].get('auto_restart')
if not is_auto_restart:
syslog.syslog(
syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
sys.exit(4)
return is_auto_restart
def main(argv): def main(argv):
container_name = None container_name = None
opts, args = getopt.getopt(argv, "c:", ["container-name="]) opts, args = getopt.getopt(argv, "c:", ["container-name="])
@ -62,51 +114,55 @@ def main(argv):
critical_group_list, critical_process_list = get_critical_group_and_process_list() critical_group_list, critical_process_list = get_critical_group_and_process_list()
while True: process_under_alerting = {}
# Transition from ACKNOWLEDGED to READY # Transition from ACKNOWLEDGED to READY
childutils.listener.ready() childutils.listener.ready()
line = sys.stdin.readline() while True:
file_descriptor_list = select.select([sys.stdin], [], [], SELECT_TIMEOUT_SECS)[0]
if len(file_descriptor_list) > 0:
line = file_descriptor_list[0].readline()
headers = childutils.get_headers(line) headers = childutils.get_headers(line)
payload = sys.stdin.read(int(headers['len'])) payload = sys.stdin.read(int(headers['len']))
# Transition from READY to ACKNOWLEDGED # Handle the PROCESS_STATE_EXITED event
childutils.listener.ok()
# We only care about PROCESS_STATE_EXITED events
if headers['eventname'] == 'PROCESS_STATE_EXITED': if headers['eventname'] == 'PROCESS_STATE_EXITED':
payload_headers, payload_data = childutils.eventdata(payload + '\n') payload_headers, payload_data = childutils.eventdata(payload + '\n')
expected = int(payload_headers['expected']) expected = int(payload_headers['expected'])
processname = payload_headers['processname'] process_name = payload_headers['processname']
groupname = payload_headers['groupname'] group_name = payload_headers['groupname']
# Read the status of auto-restart feature from Config_DB. if (process_name in critical_process_list or group_name in critical_group_list) and expected == 0:
config_db = swsssdk.ConfigDBConnector() is_auto_restart = get_autorestart_state(container_name)
config_db.connect() if is_auto_restart != "disabled":
features_table = config_db.get_table(FEATURE_TABLE_NAME) MSG_FORMAT_STR = "Process '{}' exited unexpectedly. Terminating supervisor '{}'"
if not features_table: msg = MSG_FORMAT_STR.format(payload_headers['processname'], container_name)
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve features table from Config DB. Exiting...")
sys.exit(2)
if container_name not in features_table:
syslog.syslog(syslog.LOG_ERR, "Unable to retrieve feature '{}'. Exiting...".format(container_name))
sys.exit(3)
restart_feature = features_table[container_name].get('auto_restart')
if not restart_feature:
syslog.syslog(
syslog.LOG_ERR, "Unable to determine auto-restart feature status for '{}'. Exiting...".format(container_name))
sys.exit(4)
# If auto-restart feature is not disabled and at the same time
# a critical process exited unexpectedly, terminate supervisor
if (restart_feature != 'disabled' and expected == 0 and
(processname in critical_process_list or groupname in critical_group_list)):
MSG_FORMAT_STR = "Process {} exited unxepectedly. Terminating supervisor..."
msg = MSG_FORMAT_STR.format(payload_headers['processname'])
syslog.syslog(syslog.LOG_INFO, msg) syslog.syslog(syslog.LOG_INFO, msg)
os.kill(os.getppid(), signal.SIGTERM) os.kill(os.getppid(), signal.SIGTERM)
else:
process_under_alerting[process_name] = time.time()
# Handle the PROCESS_STATE_RUNNING event
elif headers['eventname'] == 'PROCESS_STATE_RUNNING':
payload_headers, payload_data = childutils.eventdata(payload + '\n')
process_name = payload_headers['processname']
if process_name in process_under_alerting:
process_under_alerting.pop(process_name)
# Transition from BUSY to ACKNOWLEDGED
childutils.listener.ok()
# Transition from ACKNOWLEDGED to READY
childutils.listener.ready()
# Check whether we need write alerting messages into syslog
for process in process_under_alerting.keys():
epoch_time = time.time()
if epoch_time - process_under_alerting[process] >= ALERTING_INTERVAL_SECS:
process_under_alerting[process] = epoch_time
generate_alerting_message(process)
if __name__ == "__main__": if __name__ == "__main__":

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -5,7 +5,7 @@ nodaemon=true
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -13,7 +13,7 @@ events=PROCESS_STATE
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd command=python2 /usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -13,7 +13,7 @@ events=PROCESS_STATE
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name gbsyncd command=/usr/bin/supervisor-proc-exit-listener --container-name gbsyncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=25
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name syncd command=/usr/bin/supervisor-proc-exit-listener --container-name syncd
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected

View File

@ -14,7 +14,7 @@ buffer_size=50
[eventlistener:supervisor-proc-exit-listener] [eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay command=/usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay
events=PROCESS_STATE_EXITED events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true autostart=true
autorestart=unexpected autorestart=unexpected