* [Monit] Change the monitoring period of monit from 120 seconds to 60
seconds and also at the same time double the interval for existing sonic monit config file in
host.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
* [201811][monit] address build issue: hard code ARCH to amd64
- also hard code the debian package path as in 201811 branch.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Updates per review comments
1) core_uploader service waits for syslog.service
2) core_uploader service enabled for restart on failure
3) Use mtime instead of file size + ample time to be robust.
* Avoid reloading already uploaded file, by marking the names with a prefix.
* Updated failing path.
1) If rc file is missing or required data missing, it periodically logs error in forever loop.
2) If upload fails, retry every hour with a error log, forever.
* Fix few bugs
* The binary update_json.py will come from sonic-utilities.
* Corefile uploader service
1) A service is added to watch /var/core and upload to Azure storage
2) The service is disabled on boot. One may enable explicitly.
3) The .rc file to be updated with acct credentials and http proxy to use.
4) If service is enabled with no credentials, it would sleep, with periodic log messages
5) For any update in .rc, the service has to be restarted to take effect.
* Remove rw permission for .rc file for group & others.
* Changes per review comments.
Re-ordered .rc file per JSON.dump order.
Added a script to enable partial update of .rc, which HWProxy would use to add acct key.
* Azure storage upload requires python module futures, hence added it to install list.
* Removed trailing spaces.
* A mistake in name corrected.
Copy the .rc updater script to /usr/bin.
Rely on platform= and sid= on the command line to detect the platform rather than the eeprom
The platform will now properly initialize even if the system eeprom died or is unreachable.
Add support for the 7260CX3-64E
This is a variant of the 7260CX3-64 with no real difference for software.
If we need to stop swss during fast-reboot procedure on the boot up path,
it means that something went wrong, like syncd/orchagent crashed already,
we are stopping and restarting swss/syncd to re-initialize. In this case,
we should proceed as if it is a cold reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot
1. check whether /proc/cmdline indicates warm/fast reboot.
if yes the software reboot cause file will be treated as the reboot cause.
finish
2. check whether platform api returns a reboot cause.
if yes it is treated as the reboot cause.
finish.
3. check whether /hosts/reboot-cause contains a cause.
if yes it is treated as the cause otherwise return unknown.
* [process-reboot-cause]Fix review comments
* [process-reboot-cause]address comments
1. use "with" statement
2. update fast/warm reboot BOOT_ARG
* [process-reboot-cause]address comments
* refactor the code flow
* Remove escape
* Remove extra ':'
In place editing (sed -i) seems having some issues with filesystem
interaction. It could leave 0 size file or corrupted file behind.
It would be safer to sed the file contents into a new file and switch
new file with the old file.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Add watchdog-control service to disable watchdog during bootup
Disable only if it's applicable and the watchdog is enabled.
* Address the review comment
* Correct the watchdog start script name
* Change to call common watchdog api instead of platform specific
* Start watchdog control service after swss starts
* advance sonic-utility submodule
Revert "Configure buffer profile to all ports (#3561)" (#3628)
Configure buffer profile to all ports (#3561)
This reverts commit 8861cbe98e.
Signed-off-by: Wenda Ni <wenni@microsoft.com>
interfaces-config service configures lo address. If bgp service
starts before lo address is configured, then following config
in zebra will not be applied.
route-map RM_SET_SRC permit 10
set src 10.1.0.32
The adds a few seconds delay in bgp service start
We noticed in tests/production that there is a low probability failure
where /etc/hosts could have some garbage characters before the entry for
local host name. The consequence is that all sudo command would be very
slow. In extreme cases it would prevent some services from starting
properly.
I suspect that the /etc/hosts file might be opened by some process causing
the issue. Editing contents with new file level and replace the whole file
should be safer.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
- What I did
This fix removes the possibility of 'localhost' entry getting removed from /etc/hosts file by hostname-config service.
Without this change, whenever we change the hostname from 'localhost' to any other name on the config_db.json and reload the config, /etc/hosts file will only have the new hostname on it. But there are multiple sonic utilities (eg: swssconfig) which relies on the hard coded 'localhost' name and they tend to stop working.
- How I did it
Added a new check on hostname-config.sh script to avid blindly deleting the line containing the old hostname from /etc/hosts file. Now it will delete the old hostname only if its not localhost or when the hostname is not changing.
- How to verify it
Bring up SONiC on a device with hostname as localhost
Edit /etc/sonic/config_db.json to update the 'hostname' filed under DEVICE_METADATA from "hostname" : "localhost" --> "hostname" : "sonic"
run config reload -y to reflect the hostname change done on config_db.json file.
cat /etc/hosts and check whether both 127.0.0.1 localhost and 127.0.0.1 sonic entry are present on the file.
ping localhost should work fine.
- Description for the changelog
Make hostname-config service more robust in handling SONiC hostname change from localhost to anything else.
- after reloading minigraph, write latest version string in the DB.
- if old config_db.json file exists, use it and migrate to latest version.
- only reload minigraph when config_db.json doesn't exist and minigraph
exists.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Add debug docker for SNMP.
* Removed a redundant install of debug packages.
Propagate the debug flag to template file to mount /dbg & /src to debug containers.
* Revert the last change to retain the original
* [cron.d] Create cron job to periodically clean-up core files
* Create script to scan /var/core and clean-up older core files
* Create cron job to run clean-up script
Signed-off-by: Danny Allen <daall@microsoft.com>
* Update interval for running cron job
* Respond to feedback
* Change syslog id
- monit config broke by one monit upgrade
- abandon sed approach since it is suspestible to monit config changes
- use unixsocket instead of httpd due to a bug in 5.20.0
[build_debian] Generate checksum of ASIC config files
* Adds script to generate checksums for ASIC config files
* Adds step to build_debian that copies ASIC config checksum into SONiC filesystem
Signed-off-by: Danny Allen daall@microsoft.com
radv should be left alone during warm restart of swss. Otherwise it will
announce departure and cause hosts to lose default gateway.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Present: Servers are listed in the same order as in redis-db
Fix: Save the sort o/p, hence use sorted list to write into pam.d's conf.
As well convert priority to integer for use by sort.
* [service dependent] describe non-warm-reboot dependency outside systemctl
When dependency was described with systemctl, it will kick in all the time,
including under warm reboot/restart scenarios. This is not what we always
want. For components that are capable of warm reboot/start, they need to
describe dependency in service files.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [service] teamd service should not require swss service
Adding require swss will cause teamd to be killed by systemctl when swss
stops. This is not what we want in warm reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* refactoring code
* rename functions to match other functions in the file
when device disk is small, do not unzip dockerfs.tar.gz on disk.
keep the tar file on the disk, unzip to tmpfs in the initrd phase.
enabled this for 7050-qx32
Signed-off-by: Guohan Lu <gulv@microsoft.com>
* [warm reboot] save configuration after warm reboot
After warm reboot, save a copy of in memory database to config_db.json,
upgrade procedure might have removed config_db.json to force new image
to reload minigraph. However, reload minigraph is skipped during warm
reboot. Missing config_db.json would cause device to fault in next
non-upgrading cold/fast reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Update finalize-warmboot.sh
* backport new platform api to 201811, reboot cause part
* install new platform api on host
* 1. remove chassis's dependency on sonic_platform_daemon.
2. add some mellanox-specific hardware reboot causes.
3. fix typo in files/image_config/process-reboot-cause/process-reboot-cause.
* 1. add dependency of sonic_platform for base image
2. handle the case of reboot cause file not found
* adjust log message.
In case of going from previous iteration of SONiC, and the last reboot
was hardware, REBOOT_CAUSE_FILE may not be present and the service may
throw an error.
- Make sure that migrated DB contents persisted for next boot
- Make sure that db saved after warm reboot.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* fix fast reboot compatibility
We should handle both cases for backward-compatible with 201803:
- fast-reboot
- SONIC_BOOT_TYPE=fast-reboot
* handle review comments
* add a comment that getBootType code snippet is shared between two files
* [submodule] update sonic-linux-kernel (#2985)
* Fix many version strings
* Update minor version
* Update arista-drivers submodule (#9)
* Rebuild SDK on new kernel (#10)
* [logrotate] Decrease frequency to every 10 minutes; kill any lingering logrotate processes
* [logrotate] Delete all *.1.gz files as firstaction; Remove note about init-system-helpers < 1.47 workaround
However, continue to send SIGHUP directly to rsyslogd process
because 'service rsyslog rotate' still doesn't work properly with
init-system-helpers version 1.48
* Switch the nss look up order as "compat" followed by "tacplus".
This helps use the legacy passwd file for user info and go to tacacs only if not found.
This means, we never contact tacacs for local users like "admin".
This isolates local users from any issues with tacacs servers.
W/o this fix, the sudo commands by local users could take <count of servers> * <tacacs timeout> seconds, if the tacacs servers are unreachable.
* Skip tacacs server access for local non-tacacs users.
Revert the order of 'compat tacplus' to original 'tacplus compat' as tacplus
access is required for all tacacs users, who also get created locally.
- Add ebtables package, and install some filter rules:
1. ebtables -A FORWARD -d BGA -j DROP
2. ebtables -A FORWARD -p ARP -j DROP
Basically, we let the ARP packets in the VLAN being forwarded by the ASIC,
kernel gets a copy of these ARP packets and the forwarding from Kenerl gets
dropped. So there is always only one copy of ARP/response in the VLAN.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
SWSS clears DB tables, if teamd is not started after swss, there is a
race condition that swss might clear vital teamd information.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
After warm reboot is done, we need to disable warm reboot flag and
tear down anything setup for warm reboot and persisted across.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Lossy traffic does not need to be mapped to different ingress PGs. They can all share the same ingress PG.
Signed-off-by: Wenda Ni <wenni@microsoft.com>
start() is called by service startPre method, which is blocking. Starting
syncd service here is causing deadlock.
attach() is called by service start method, which is non-blocking.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [services] Ensure swss and syncd services start before dependent services
* Add 'attach' functions to scripts which get installed to /usr/local/bin so that services only reference the one script each
* Add 'After=swss.service' to syncd.service
need to flush asic db in swss.sh instead of syncd.sh
orchagent might already started in swss.sh and put commands
into asic db before asic db is flushed in syncd.sh. This
causes race condition such as INIT_VIEW not passing to syncd.
Signed-off-by: Guohan Lu <gulv@microsoft.com>
* Add a log message for each notification of add/del TACACS server.
Signed-off-by: Renuka Manavalan <remanava@microsoft.com>
* Moved another syslog message from DEBUG to INFO to be able to see those notifications.
All these changes are to help with a one-time-seen-bug, that hostcfgd did not act upon changes to redis for TACACS servers. We could not repro the bug.
Signed-off-by: Renuka Manavalan <remanava@microsoft.com>
* [updategraph] After system upgrade, restore files/directories with
original attributes etc.
Restore a few more files that was missed before.
Restore FRR configuration directory if exists on old system
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
* Removed deployment_id_asn_map.yml from copy list
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
* QoS config change: 1) DSCP mapping; 2) link pg/queue 6 to lossy buffer;
3) redistribute scheduler
Signed-off-by: Wenda <wenni@microsoft.com>
* Add scheduling weight to queue 2
Signed-off-by: Wenda <wenni@microsoft.com>
* Link pg/queue 2 to lossy buffer
Signed-off-by: Wenda <wenni@microsoft.com>
* Update the pg headroom for a7060-D48C8 50G
Signed-off-by: Wenda <wenni@microsoft.com>
* Update config gen test for qos
Signed-off-by: Wenda <wenni@microsoft.com>
* Update pg headroom size, and update egress lossy pool size accordingly
Signed-off-by: Wenda <wenni@microsoft.com>
* Update headroom pool size; Update ingress service pool and egress lossy
pool sizes accordingly;
Signed-off-by: Wenda <wenni@microsoft.com>
* a7260: update headroom pool size; Update ingress service pool and egress lossy pool sizes accordingly;
Signed-off-by: Wenda <wenni@microsoft.com>
* Update config gen test for buffer
Signed-off-by: Wenda <wenni@microsoft.com>
* [reboot cause] Move reboot-cause files to /host directory so they persist across SONiC upgrades
* [sonic-utilities] Update submodule to include related changes