Put a flag for fast-reboot to the db using EXPIRE feature. Using this flag in other part of SONiC to start in Fast-reboot mode. If we reload a config, the state in the db will be removed.
The sflow service should not start unless the swss service is started. However, if this service is not started, the sflow service should not attempt to start them, instead it should simply fail to start. Using Requisite=, we will achieve this behavior, whereas using Requires= will cause the required service to be started.
ASIC reset events are captured by hw-mgmt and hw-mgmt calls chipup/chipdown internally without OS iteraction
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
* Updates per review comments
1) core_uploader service waits for syslog.service
2) core_uploader service enabled for restart on failure
3) Use mtime instead of file size + ample time to be robust.
* Avoid reloading already uploaded file, by marking the names with a prefix.
* Updated failing path.
1) If rc file is missing or required data missing, it periodically logs error in forever loop.
2) If upload fails, retry every hour with a error log, forever.
* Fix few bugs
* The binary update_json.py will come from sonic-utilities.
If we need to stop swss during fast-reboot procedure on the boot up path,
it means that something went wrong, like syncd/orchagent crashed already,
we are stopping and restarting swss/syncd to re-initialize. In this case,
we should proceed as if it is a cold reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
In place editing (sed -i) seems having some issues with filesystem
interaction. It could leave 0 size file or corrupted file behind.
It would be safer to sed the file contents into a new file and switch
new file with the old file.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Corefile uploader service
1) A service is added to watch /var/core and upload to Azure storage
2) The service is disabled on boot. One may enable explicitly.
3) The .rc file to be updated with acct credentials and http proxy to use.
4) If service is enabled with no credentials, it would sleep, with periodic log messages
5) For any update in .rc, the service has to be restarted to take effect.
* Remove rw permission for .rc file for group & others.
* Changes per review comments.
Re-ordered .rc file per JSON.dump order.
Added a script to enable partial update of .rc, which HWProxy would use to add acct key.
* Azure storage upload requires python module futures, hence added it to install list.
* Removed trailing spaces.
* A mistake in name corrected.
Copy the .rc updater script to /usr/bin.
* [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot
1. check whether /proc/cmdline indicates warm/fast reboot.
if yes the software reboot cause file will be treated as the reboot cause.
finish
2. check whether platform api returns a reboot cause.
if yes it is treated as the reboot cause.
finish.
3. check whether /hosts/reboot-cause contains a cause.
if yes it is treated as the cause otherwise return unknown.
* [process-reboot-cause]Fix review comments
* [process-reboot-cause]address comments
1. use "with" statement
2. update fast/warm reboot BOOT_ARG
* [process-reboot-cause]address comments
* refactor the code flow
* Remove escape
* Remove extra ':'
This PR is to handle the issue 3527.
When device boots up, NTP throws a traceback as explained in the issue 3527.
- Traceback will be seen when MGMT_VRF_CONFIG does not exist in the database. Traceback is coming from the script “/etc/init.d/ntp”.
- Traceback does not affect the NTP functionality with/without management VRF. When MGMT_VRF_CONFIG does not exist or when MGMT_VRF_CONFIG’s mgmtVrfEnabled is configured to “false”, “NTP” will be started in the “default VRF” context, which is working fine even with this traceback.
- This traceback error will be hidden by redirecting the error to /dev/null without affecting functionality.
Add the same mechanism I developed for the SwSS service in #2845 to the syncd service. However, in order to cause the SwSS service to also exit and restart in this situation, I developed a docker-wait-any program which the SwSS service uses to wait for either the swss or syncd containers to exit.
* In the event of a kernel crash, we need to gather as much information
as possible to understand and identify the root cause of the crash.
Currently, the kernel does not provide much information, which make
kernel crash investigation difficult and time consuming.
Fortunately, there is a way in the kernel to provide more information
in the case of a kernel crash. kdump is a feature of the Linux kernel
that creates crash dumps in the event of a kernel crash. This PR
will add kermel kdump support.
An extension to the CLI utilities config and show is provided to
configure and manage kdump:
- enable / disable kdump functionality
- configure kdump (how many kernel crash logs can be saved, memory
allocated for capture kernel)
- view kernel crash logs
* Rename asn/deployment_id_asn_map.yaml to constants/constants.yaml
* Fix bgp templates
* Add community for loopback when bgpd is isolated
* Use correct community value
We noticed in tests/production that there is a low probability failure
where /etc/hosts could have some garbage characters before the entry for
local host name. The consequence is that all sudo command would be very
slow. In extreme cases it would prevent some services from starting
properly.
I suspect that the /etc/hosts file might be opened by some process causing
the issue. Editing contents with new file level and replace the whole file
should be safer.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
While doing CLI changes for SNMP configuration, few changes are made in backend to handle the modified CLI.
** Changes**
- "community" for "snmp trap" is also made as "configurable". snmpd_conf.j2 is modified to handle the same.
- Changed the snmp.yml file generation from postStartAction to preStartAction in docker_image_ctl.j2 specific to SNMP docker, to ensure that the snmp.yml is generated before sonic-cfggen generates the snmpd.conf.
- Changed to make the code common for management vrf and default vrf. Users can configure snmp trap and snmp listening IP for both management vrf and default vrf.
- after reloading minigraph, write latest version string in the DB.
- if old config_db.json file exists, use it and migrate to latest version.
- only reload minigraph when config_db.json doesn't exist and minigraph
exists.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Issue Overview
shutdown flow
For any shutdown flow, which means all dockers are stopped in order, pmon docker stops after syncd docker has stopped, causing pmon docker fail to release sx_core resources and leaving sx_core in a bad state. The related logs are like the following:
INFO syncd.sh[23597]: modprobe: FATAL: Module sx_core is in use.
INFO syncd.sh[23597]: Unloading sx_core[FAILED]
INFO syncd.sh[23597]: rmmod: ERROR: Module sx_core is in use
config reload & service swss.restart
In the flows like "config reload" and "service swss restart", the failure cause further consequences:
sx_core initialization error with error message like "sx_core: create EMAD sdq 0 failed. err: -16"
syncd fails to execute the create switch api with error message "syncd_main: Runtime error: :- processEvent: failed to execute api: create, key: SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000, status: SAI_STATUS_FAILURE"
swss fails to call SAI API "SAI_SWITCH_ATTR_INIT_SWITCH", which causes orchagent to restart. This will introduce an extra 1 or 2 minutes for the system to be available, failing related test cases.
reboot, warm-reboot & fast-reboot
In the reboot flows including "reboot", "fast-reboot" and "warm-reboot" this failure doesn't have further negative effects since the system has already rebooted. In addition, "warm-reboot" requires the system to be shutdown as soon as possible to meet the GR time restriction of both BGP and LACP. "fast-reboot" also requires to meet the GR time restriction of BGP which is longer than LACP. In this sense, any unnecessary steps should be avoided. It's better to keep those flows untouched.
summary
To summarize, we have to come up with a way to ensure:
shutdown pmon docker ahead of syncd for "config reload" or "service swss restart" flow;
don't shutdown pmon docker ahead of syncd for "fast-reboot" or "warm-reboot" flow in order to save time.
for "reboot" flow, either order is acceptable.
Solution
To solve the issue, pmon shoud be stopped ahead of syncd stopped for all flows except for the warm-reboot.
- How I did it
To stop pmon ahead of syncd stopped. This is done in /usr/local/bin/syncd.sh::stop() and for all shutdown sequence.
Now pmon stops ahead of syncd so there must be a way in which pmon can start after syncd started. Another point that should be taken consideration is that pmon starting should be deferred so that services which have the logic of graceful restart in fast-reboot and warm-reboot have sufficient CPU cycles to meet their deadline.
This is done by add "syncd.service" as "After" to pmon.service and startin /usr/local/bin/syncd.sh::wait()
To start pmon automatically after syncd started.
slave.mk: add SONIC_PLATFORM_API_PY2 as dependency of host
sonic_debian_extension.j2: install sonic_daemon_base and Mellanox-specific sonic_platform on host
mlnx-platform-api.mk: export mlnx_platform_api_py2_wheel_path for sonic_debian_extension.j2
sonic-daemon-base.mk: export daemon_base_py2_wheel_path for sonic_debian_extension.j2
daemon_base.py: hind unnecessary dependency of swss_common on host
* [SNMP] management VRF SNMP support
This commit adds SNMP support for Management VRF using l3mdev.
The patch included provides VRF support, there is no single
"listendevice" configuration, rather multiple agentaddress
config options can each have their own "interface" to bind to
using "ip%interface". The snmpd.conf file is accordingly
generated using the snmp.yml file and redis database info.
Adding below the comments of SNMP patch 1376
--------------------------------------------
Since the Linux kernel added support for Virtual Routing
and Forwarding (VRF) in version 4.3
(Note: these won't compile on non-linux platforms)
https://www.kernel.org/doc/Documentation/networking/vrf.txt
Linux users could not use snmpd in its current form to
bind specific listening IP addresses to specific VRF
devices. A simplified description of a VRF inteface
is an interface that is a master (a container of sorts)
that collects a set of physicalinterfaces to form a
routing table.
This set of two patches (one for V5-7-patches and one
for V5-8-patches branches) is almost identical to patch
single "listendevice" configuration. Rather, multiple
agentAddress config options can each have their own
"interface" to bind to using the <ip>%<interface>
syntax.</interface></ip>
-------------------------------------------
Signed-off-by: Harish Venkatraman <harish_venkatraman@dell.com>