We noticed in tests/production that there is a low probability failure
where /etc/hosts could have some garbage characters before the entry for
local host name. The consequence is that all sudo command would be very
slow. In extreme cases it would prevent some services from starting
properly.
I suspect that the /etc/hosts file might be opened by some process causing
the issue. Editing contents with new file level and replace the whole file
should be safer.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
While doing CLI changes for SNMP configuration, few changes are made in backend to handle the modified CLI.
** Changes**
- "community" for "snmp trap" is also made as "configurable". snmpd_conf.j2 is modified to handle the same.
- Changed the snmp.yml file generation from postStartAction to preStartAction in docker_image_ctl.j2 specific to SNMP docker, to ensure that the snmp.yml is generated before sonic-cfggen generates the snmpd.conf.
- Changed to make the code common for management vrf and default vrf. Users can configure snmp trap and snmp listening IP for both management vrf and default vrf.
- after reloading minigraph, write latest version string in the DB.
- if old config_db.json file exists, use it and migrate to latest version.
- only reload minigraph when config_db.json doesn't exist and minigraph
exists.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Issue Overview
shutdown flow
For any shutdown flow, which means all dockers are stopped in order, pmon docker stops after syncd docker has stopped, causing pmon docker fail to release sx_core resources and leaving sx_core in a bad state. The related logs are like the following:
INFO syncd.sh[23597]: modprobe: FATAL: Module sx_core is in use.
INFO syncd.sh[23597]: Unloading sx_core[FAILED]
INFO syncd.sh[23597]: rmmod: ERROR: Module sx_core is in use
config reload & service swss.restart
In the flows like "config reload" and "service swss restart", the failure cause further consequences:
sx_core initialization error with error message like "sx_core: create EMAD sdq 0 failed. err: -16"
syncd fails to execute the create switch api with error message "syncd_main: Runtime error: :- processEvent: failed to execute api: create, key: SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000, status: SAI_STATUS_FAILURE"
swss fails to call SAI API "SAI_SWITCH_ATTR_INIT_SWITCH", which causes orchagent to restart. This will introduce an extra 1 or 2 minutes for the system to be available, failing related test cases.
reboot, warm-reboot & fast-reboot
In the reboot flows including "reboot", "fast-reboot" and "warm-reboot" this failure doesn't have further negative effects since the system has already rebooted. In addition, "warm-reboot" requires the system to be shutdown as soon as possible to meet the GR time restriction of both BGP and LACP. "fast-reboot" also requires to meet the GR time restriction of BGP which is longer than LACP. In this sense, any unnecessary steps should be avoided. It's better to keep those flows untouched.
summary
To summarize, we have to come up with a way to ensure:
shutdown pmon docker ahead of syncd for "config reload" or "service swss restart" flow;
don't shutdown pmon docker ahead of syncd for "fast-reboot" or "warm-reboot" flow in order to save time.
for "reboot" flow, either order is acceptable.
Solution
To solve the issue, pmon shoud be stopped ahead of syncd stopped for all flows except for the warm-reboot.
- How I did it
To stop pmon ahead of syncd stopped. This is done in /usr/local/bin/syncd.sh::stop() and for all shutdown sequence.
Now pmon stops ahead of syncd so there must be a way in which pmon can start after syncd started. Another point that should be taken consideration is that pmon starting should be deferred so that services which have the logic of graceful restart in fast-reboot and warm-reboot have sufficient CPU cycles to meet their deadline.
This is done by add "syncd.service" as "After" to pmon.service and startin /usr/local/bin/syncd.sh::wait()
To start pmon automatically after syncd started.
slave.mk: add SONIC_PLATFORM_API_PY2 as dependency of host
sonic_debian_extension.j2: install sonic_daemon_base and Mellanox-specific sonic_platform on host
mlnx-platform-api.mk: export mlnx_platform_api_py2_wheel_path for sonic_debian_extension.j2
sonic-daemon-base.mk: export daemon_base_py2_wheel_path for sonic_debian_extension.j2
daemon_base.py: hind unnecessary dependency of swss_common on host
* [SNMP] management VRF SNMP support
This commit adds SNMP support for Management VRF using l3mdev.
The patch included provides VRF support, there is no single
"listendevice" configuration, rather multiple agentaddress
config options can each have their own "interface" to bind to
using "ip%interface". The snmpd.conf file is accordingly
generated using the snmp.yml file and redis database info.
Adding below the comments of SNMP patch 1376
--------------------------------------------
Since the Linux kernel added support for Virtual Routing
and Forwarding (VRF) in version 4.3
(Note: these won't compile on non-linux platforms)
https://www.kernel.org/doc/Documentation/networking/vrf.txt
Linux users could not use snmpd in its current form to
bind specific listening IP addresses to specific VRF
devices. A simplified description of a VRF inteface
is an interface that is a master (a container of sorts)
that collects a set of physicalinterfaces to form a
routing table.
This set of two patches (one for V5-7-patches and one
for V5-8-patches branches) is almost identical to patch
single "listendevice" configuration. Rather, multiple
agentAddress config options can each have their own
"interface" to bind to using the <ip>%<interface>
syntax.</interface></ip>
-------------------------------------------
Signed-off-by: Harish Venkatraman <harish_venkatraman@dell.com>
This commit adds NTP support for management VRF using L3mdev. Config vrf add
mgmt will enable management VRF, enslave the eth0 device to the master device
mgmt, stop ntp service in default, restart interfaces-configs and restart ntp
service in mgmt-vrf context. Requirement and design are covered in mgmt vrf
design document.
Signed-off-by: Harish Venkatraman <harish_venkatraman@dell.com>
Introduce a new "sflow" container (if ENABLE_SFLOW is set). The new docker will include:
hsflowd : host-sflow based daemon is the sFlow agent
psample : Built from libpsample repository. Useful in debugging sampled packets/groups.
sflowtool : Locally dump sflow samples (e.g. with a in-unit collector)
In case of SONiC-VS, enable psample & act_sample kernel modules.
VS' syncd needs iproute2=4.20.0-2~bpo9+1 & libcap2-bin=1:2.25-1 to support tc-sample
tc-syncd is provided as a convenience tool for debugging (e.g. tc-syncd filter show ...)
* Use dot1p to tc mapping for backend switches
Signed-off-by: Wenda Ni <wenni@microsoft.com>
* Do not write DSCP to TC mapping into CONFIG_DB or config_db.json for
storage switches
Signed-off-by: Wenda Ni <wenni@microsoft.com>
* [cron.d] Create cron job to periodically clean-up core files
* Create script to scan /var/core and clean-up older core files
* Create cron job to run clean-up script
Signed-off-by: Danny Allen <daall@microsoft.com>
* Update interval for running cron job
* Respond to feedback
* Change syslog id
- monit config broke by one monit upgrade
- abandon sed approach since it is suspestible to monit config changes
- use unixsocket instead of httpd due to a bug in 5.20.0
* Use dot1p to tc mapping for backend switches
Signed-off-by: Wenda Ni <wenni@microsoft.com>
* Do not write DSCP to TC mapping into CONFIG_DB or config_db.json for
storage switches
Signed-off-by: Wenda Ni <wenni@microsoft.com>
[build_debian] Generate checksum of ASIC config files
* Adds script to generate checksums for ASIC config files
* Adds step to build_debian that copies ASIC config checksum into SONiC filesystem
Signed-off-by: Danny Allen daall@microsoft.com
this is the first step to moving different databases tables into different database instances
in this PR, only handle multiple database instances creation based on user configuration at /etc/sonic/database_config.json
we keep current method to create single database instance if no extra/new DATABASE configuration exist in database_config.json file.
if user try to configure more db instances at database_config.json , we create those new db instances along with the original db instance existing today.
The configuration is as below, later we can add more db related information if needed:
{
...
"DATABASE": {
"redis-db-01" : {
"port" : "6380",
"database": ["APPL_DB", "STATE_DB"]
},
"redis-db-02" : {
"port" : "6381",
"database":["ASIC_DB"]
},
}
...
}
The detail description is at design doc at Azure/SONiC#271
The main idea is : when database.sh started, we check the configuration and generate corresponding scripts.
rc.local service handle old_config copy when loading new images, there is no dependency between rc.local and database service today, for safety and make sure the copy operation are done before database try to read it, we make database service run after rc.local
Then database docker started, we check the configuration and generate corresponding scripts/.conf in database docker as well.
based on those conf, we create databases instances as required.
at last, we ping_pong check database are up and continue
Signed-off-by: Dong Zhang d.zhang@alibaba-inc.com
radv should be left alone during warm restart of swss. Otherwise it will
announce departure and cause hosts to lose default gateway.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Present: Servers are listed in the same order as in redis-db
Fix: Save the sort o/p, hence use sorted list to write into pam.d's conf.
As well convert priority to integer for use by sort.
* [service dependent] describe non-warm-reboot dependency outside systemctl
When dependency was described with systemctl, it will kick in all the time,
including under warm reboot/restart scenarios. This is not what we always
want. For components that are capable of warm reboot/start, they need to
describe dependency in service files.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [service] teamd service should not require swss service
Adding require swss will cause teamd to be killed by systemctl when swss
stops. This is not what we want in warm reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* refactoring code
* rename functions to match other functions in the file
when device disk is small, do not unzip dockerfs.tar.gz on disk.
keep the tar file on the disk, unzip to tmpfs in the initrd phase.
enabled this for 7050-qx32
Signed-off-by: Guohan Lu <gulv@microsoft.com>
- What I did
Move the enabling of Systemd services from sonic_debian_extension to a new systemd generator
- How I did it
Create a new systemd generator to manually create symlinks to enable systemd services
Add rules/Makefile to build generator
Add services to be enabled to /etc/sonic/generated_services.conf to be read by the generator at boot time
Signed-off-by: Lawrence Lee <t-lale@microsoft.com>
ARM Architecture support in SONIC
make configure platform=[ASIC_VENDOR_ARCH] PLATFORM_ARCH=[ARM_ARCH]
SONIC_ARCH: default amd64
armhf - arm32bit
arm64 - arm64bit
Signed-off-by: Antony Rheneus <arheneus@marvell.com>
This commit adds support for New feature management VRF using L3mdev. Added
commands to enable/disable management VRF. Config vrf add mgmt will enable
management VRF, enslave the eth0 device to the master device mgmt and restart
interfaces-configs in mgmt-vrf context.
management interface (eth0) can be configured using config interface eth0 ip
add command and removed using config interface eth0 ip remove command.
Requirement and design are covered in mgmt vrf design document. Currently show
command displays linux command output; will update show command display in next
PR after concluding what would be the output for the show commands. Added
metric for default routes in dhcp and static, any changes for metric will be
addressed subsequently after discussing.
Signed-off-by: Harish Venkatraman <harish_venkatraman@dell.com>
* [warm reboot] save configuration after warm reboot
After warm reboot, save a copy of in memory database to config_db.json,
upgrade procedure might have removed config_db.json to force new image
to reload minigraph. However, reload minigraph is skipped during warm
reboot. Missing config_db.json would cause device to fault in next
non-upgrading cold/fast reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Update finalize-warmboot.sh
* Upgrade ifupdown2 to version 1.2.8
Required by ZTP to support ZTP over IPv6 transport
Signed-off-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
In case of going from previous iteration of SONiC, and the last reboot
was hardware, REBOOT_CAUSE_FILE may not be present and the service may
throw an error.
- Make sure that migrated DB contents persisted for next boot
- Make sure that db saved after warm reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>