sonic-buildimage/dockers/docker-sonic-telemetry/supervisord.conf

61 lines
1.2 KiB
Plaintext
Raw Permalink Normal View History

[supervisord]
logfile_maxbytes=1MB
logfile_backups=2
nodaemon=true
[eventlistener:dependent-startup]
command=python3 -m supervisord_dependent_startup
autostart=true
autorestart=unexpected
startretries=0
exitcodes=0,3
events=PROCESS_STATE
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083) To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged: ``` Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46 ``` This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10. This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802). I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-03-27 23:14:24 -05:00
buffer_size=1024
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name telemetry
[supervisord] Monitoring the critical processes with supervisord. (#6242) - Why I did it Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance. Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by Supervisord, we can only focus on the logic of monitoring. - How I did it We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take following steps if it was notified one of critical processes exited unexpectedly: The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted. If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog. - How to verify it First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not. Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute. - Which release branch to backport (provide reason below if selected) 201811 201911 [x ] 202006
2021-01-21 14:57:49 -06:00
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
SONiC Management Framework Release 1.0 (#3488) * Added sonic-mgmt-framework as submodule / docker * fix build issues * update sonic-mgmt-framework submodule branch to master * Merged changes 70007e6d2ba3a4c0b371cd693ccc63e0a8906e77..00d4fcfed6a759e40d7b92120ea0ee1f08300fc6 00d4fcfed6a759e40d7b92120ea0ee1f08300fc6 Modified environemnt variables * Changes to build sonic-mgmt-framework docker * bumped up sonic-mgmt-framework commit-id * version bump for sonic-mgmt-framework commit-it * bumped up sonic-mgmt-framework commit-id * Add python packages to docker * Build fix for docker with python packages * added libyang as dependent package * Allow building images on NFS-mounted clones Prior to this change, `build_debian.sh` would generate a Debian filesystem in `./fsroot`. This needs root permissions, and one of the tests that is performed is whether the user can create a character special file in the filesystem (using mknod). On most NFS deployments, `root` is the least privileged user, and cannot run mknod. Also, attempting to run commands like rm or mv as root would fail due to permission errors, since the root user gets mapped to an unprivileged user like `nobody`. This commit changes the location of the Debian filesystem to `/fsroot`, which is a tmpfs mount within the slave Docker. The default squashfs, docker tarball and zip files are also created within /tmp, before being copied back to /sonic as the regular user. The side effect of this change is that the contents of `/fsroot` are no longer available once the slave container exits, however they are available within the squashfs image. Signed-off-by: Nirenjan Krishnan <Nirenjan.Krishnan@dell.com> * bumped up sonc-mgmt-framework commit to include PR #18 * REST Server startup script is enahnced to read the settings from ConfigDB. Below table provides mapping of db field to command line argument name. ============================================================ ConfigDB entry key Field name REST Server argument ============================================================ REST_SERVER|default port -port REST_SERVER|default client_auth -client_auth REST_SERVER|default log_level -v DEVICE_METADATA|x509 server_crt -cert DEVICE_METADATA|x509 server_key -key DEVICE_METADATA|x509 ca_crt -cacert ============================================================ * Replace src/telemetry as submodule to sonic-telemetry * Update telemetry commit HEAD * Update sonic-telemetry commit HEAD * libyang env path update * Add libyang dependency to telemetry * Add scripts to create JSON files for CLI backend Scripts to create /var/platform/syseeprom and /var/platform/system, which are back-end files for CLI, for system EEPROM and system information. Signed-off-by: Howard Persh <Howard_Persh@dell.com> * In startup script, create directory where CLI back-end files live Signed-off-by: Howard Persh <Howard_Persh@dell.com> * build dependency pkgs added to docker for build failure fix * Changes to fix build issue for mgmt framework * Fix exec path issue with telemetry * s5232[device] PSU detecttion and default led state support * Processing of first boot in rc.local should not have premature exit Signed-off-by: Howard Persh <Howard_Persh@dell.com> * docker mount options added for platform, system features * bumped up sonic-mgmt-framework commit id to pick 23rd July 2019 changes * Added mount options for telemetry docker to get access for system and platform info. * Update commit for sonic-utilities * [dell]: Corrected dport map and renamed config files for S5232F * Fix telemetry submodule commit * added support for sonic-cli console * [Dell S5232F, Z9264F] Harden FPGA driver kernel module For Dell S5232F and Z9264F platforms, be more strict when checking state in ISR of FPGA driver, to harden against spurious interrupts. Signed-off-by: Howard Persh <Howard_Persh@dell.com> * update mgmt-framework submodule to 27th Aug commit. * remove changes not related to mgmt-framework and sonic-telemetry * Revert "Replace src/telemetry as submodule to sonic-telemetry" This reverts commit 11c31929759a17122782d4944066a6ac8453b78d. * Revert "Replace src/telemetry as submodule to sonic-telemetry" This reverts commit 11c31929759a17122782d4944066a6ac8453b78d. * make submodule changes and remove a change not related to PR * more changes * Update .gitmodules * Update Dockerfile.j2 * Update .gitmodules * Update .gitmodules * Update .gitmodules reverting experimental change * Removed syspoll for release_1.0 Signed-off-by: Jeff Yin <29264773+jeff-yin@users.noreply.github.com> * Update docker-sonic-mgmt-framework.mk * Update sonic-mgmt-framework.mk * Update sonic-mgmt-framework.mk * Update docker-sonic-mgmt-framework.mk * Update docker-sonic-mgmt-framework.mk * Revert "Processing of first boot in rc.local should not have premature exit" This reverts commit e99a91ffc28a0fd13f4ad458719d2511c3665431. * Remove old telemetry directory * Update docker-sonic-mgmt-framework.mk * Resolving merge conflict with Azure * Reverting the wrong merge * Use CVL_SCHEMA_PATH instead of changing directory for telemetry startup * Add missing export * Add python mmh3 to slave dockerfile * Remove sonic-mgmt-framework build dep for telemetry, fix dialout startup issues * Provided flag to disable compiling mgmt-framework * Update sonic-utilites point latest commit id * Point sonic-utilities to Azure accepted SHA * Updating mgmt framework to right sha * Add sonic-telemetry submodule * Update the mgmt-framework commit id Co-authored-by: jghalam <joe.ghalam@gmail.com> Co-authored-by: Partha Dutta <51353699+dutta-partha@users.noreply.github.com> Co-authored-by: srideepDell <srideep_devireddy@dell.com> Co-authored-by: nirenjan <nirenjan@users.noreply.github.com> Co-authored-by: Sachin Holla <51310506+sachinholla@users.noreply.github.com> Co-authored-by: Eric Seifert <seiferteric@gmail.com> Co-authored-by: Howard Persh <hpersh@yahoo.com> Co-authored-by: Jeff Yin <29264773+jeff-yin@users.noreply.github.com> Co-authored-by: Arunsundar Kannan <31632515+arunsundark@users.noreply.github.com> Co-authored-by: rvasanthm <51932293+rvasanthm@users.noreply.github.com> Co-authored-by: Ashok Daparthi-Dell <Ashok_Daparthi@Dell.com> Co-authored-by: anand-kumar-subramanian <51383315+anand-kumar-subramanian@users.noreply.github.com>
2019-12-23 23:47:16 -06:00
autostart=true
autorestart=false
[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083) To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged: ``` Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46 ``` This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10. This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802). I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-03-27 23:14:24 -05:00
buffer_size=1024
[program:rsyslogd]
command=/usr/sbin/rsyslogd -n -iNONE
priority=1
autostart=false
autorestart=true
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
[program:start]
command=/usr/bin/start.sh
priority=2
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=rsyslogd:running
[program:telemetry]
command=/usr/bin/telemetry.sh
priority=3
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=start:exited
[program:dialout]
command=/usr/bin/dialout.sh
priority=4
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=telemetry:running