182 Commits

Author SHA1 Message Date
Renuka Manavalan
da7db51259 corefile uploader: Updates per review comments offline (#3915)
* Updates per review comments
1) core_uploader service waits for syslog.service
2) core_uploader service enabled for restart on failure
3) Use mtime instead of file size + ample time to be robust.
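A minimal sketch of the mtime check in item 3, assuming a core file counts as complete once it has not been modified for a settle period. The function name `is_settled` and the 60-second threshold are hypothetical and are not taken from the uploader itself.

```python
import os
import time

SETTLE_SECONDS = 60  # hypothetical threshold, not from the actual service


def is_settled(path, now=None, settle_seconds=SETTLE_SECONDS):
    """Return True if the file was last modified at least settle_seconds ago."""
    if now is None:
        now = time.time()
    # mtime stops advancing once the writer is done, whereas comparing sizes
    # can be fooled by a writer that is merely slow.
    return now - os.path.getmtime(path) >= settle_seconds
```

A watcher could call `is_settled()` on each file in /var/core and upload only the files for which it returns True.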

* Avoid reloading already uploaded file, by marking the names with a prefix.

* Updated failing path.
1) If the rc file or required data is missing, the service periodically logs an error in an endless loop.
2) If an upload fails, retry every hour, logging an error each time, forever.

* Fix a few bugs

* The binary update_json.py will come from sonic-utilities.
2020-01-06 21:03:40 +00:00
Renuka Manavalan
6db0c76a06 Corefile uploader service (#3887)
* Corefile uploader service

1) A service is added to watch /var/core and upload to Azure storage
2) The service is disabled on boot. One may enable explicitly.
3) The .rc file to be updated with acct credentials and http proxy to use.
4) If service is enabled with no credentials, it would sleep, with periodic log messages
5) For any update in .rc, the service has to be restarted to take effect.

* Remove rw permission for .rc file for group & others.

* Changes per review comments.
Re-ordered .rc file per JSON.dump order.
Added a script to enable partial update of .rc, which HWProxy would use to add acct key.

* Azure storage upload requires python module futures, hence added it to install list.

* Removed trailing spaces.

* A mistake in name corrected.
Copy the .rc updater script to /usr/bin.
2020-01-06 21:02:14 +00:00
Joe LeVeque
9ee8eba77c [monit] Build from source and patch to use MemAvailable value if available on system (#3875) 2020-01-06 20:59:32 +00:00
Stephen Sun
49869aa6fa [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot (#3880)
* [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot
1. check whether /proc/cmdline indicates warm/fast reboot.
   if yes the software reboot cause file will be treated as the reboot cause.
   finish
2. check whether platform api returns a reboot cause.
   if yes it is treated as the reboot cause.
   finish.
3. check whether /host/reboot-cause contains a cause.
   if yes it is treated as the cause otherwise return unknown.
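The three checks above can be sketched as a single function. This is an illustration only: the boot-argument markers in `WARM_FAST_MARKERS`, the function name, and the choice to return "Unknown" when a warm/fast reboot has no software cause recorded are assumptions, not the actual process-reboot-cause code.

```python
UNKNOWN = "Unknown"
# Hypothetical kernel command-line markers; the real BOOT_ARG strings may differ.
WARM_FAST_MARKERS = ("SONIC_BOOT_TYPE=warm", "SONIC_BOOT_TYPE=fast")


def determine_reboot_cause(cmdline, software_cause, hardware_cause):
    """Pick a reboot cause following the three checks in order.

    cmdline        -- contents of /proc/cmdline
    software_cause -- contents of the software reboot-cause file, or None
    hardware_cause -- cause reported by the platform API, or None
    """
    args = cmdline.split()
    # 1. Warm/fast reboot: the software cause wins, even if hardware reports one.
    if any(marker in args for marker in WARM_FAST_MARKERS):
        return software_cause or UNKNOWN  # fallback is an assumption
    # 2. Otherwise a cause reported by the platform API is used.
    if hardware_cause:
        return hardware_cause
    # 3. Otherwise fall back to the software cause file, else unknown.
    return software_cause or UNKNOWN
```

Checking the command line first is what prevents a stale hardware cause from masking a later warm reboot.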

* [process-reboot-cause]Fix review comments

* [process-reboot-cause]address comments
1. use "with" statement
2. update fast/warm reboot BOOT_ARG

* [process-reboot-cause]address comments

* refactor the code flow

* Remove escape

* Remove extra ':'
2019-12-14 17:44:02 +00:00
Sujin Kang
0510fc7258 Correct the watchdog-control service to call the right script (#3906)
* Correct the watchdog-control service to call the right script

* make watchdog-control.sh executable (chmod +x)
2019-12-14 09:42:36 -08:00
Ying Xie
ca1c5bc0c4 [hostcfgd] avoid in place editing config file contents (#3904)
In place editing (sed -i) seems to have some issues with filesystem
interaction. It could leave a zero-size or corrupted file behind.

It would be safer to sed the file contents into a new file and switch
new file with the old file.
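A minimal Python sketch of the write-then-swap idea, assuming a regular-expression substitution stands in for the sed expression. The helper name `rewrite_file` is hypothetical; the commit itself describes doing this with sed and a new file, not with this code.

```python
import os
import re
import shutil
import tempfile


def rewrite_file(path, pattern, replacement):
    """Apply a substitution by writing a new file and swapping it into place."""
    with open(path) as src:
        new_contents = re.sub(pattern, replacement, src.read())

    # Create the new file in the same directory so the final rename stays
    # on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)  # swap the new file in for the old one
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Readers of the file see either the old contents or the new contents, never a partially written file.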

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-12-14 03:27:39 +00:00
Sujin Kang
aea18165a8 Add watchdog-control service to disable watchdog during bootup (#3877)
* Add watchdog-control service to disable watchdog during bootup

Disable only if it's applicable and the watchdog is enabled.

* Address the review comment

* Correct the watchdog start script name

* Change to call common watchdog api instead of platform specific

* Start watchdog control service after swss starts

* advance sonic-utility submodule
2019-12-13 12:44:11 -08:00
Neetha John
6d23e4c8d7 [pfcwd]: Do not start pfc watchdog on Management Tor (#3719)
Signed-off-by: Neetha John <nejo@microsoft.com>
2019-11-07 21:41:32 +00:00
lguohan
9167f9da46 [aboot]: preserve snmp.yml and acl.json for eos to sonic fast reboot (#3716) 2019-11-07 21:40:20 +00:00
Ying Xie
f764a167ac [hostname-config] improve hostname-config process (#3676)
We noticed in tests/production that there is a low probability failure
where /etc/hosts could have some garbage characters before the entry for
local host name. The consequence is that all sudo commands would be very
slow. In extreme cases it would prevent some services from starting
properly.

I suspect that the /etc/hosts file might be opened by some process, causing
the issue. Writing the edited contents to a new file and then replacing the
whole file should be safer.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-10-29 15:42:23 +00:00
Prabhu Sreenivasan
ff137a8e56 [baseimage]: Avoid removing localhost entry from /etc/hosts file (#2452)
- What I did
This fix removes the possibility of 'localhost' entry getting removed from /etc/hosts file by hostname-config service.

Without this change, whenever we change the hostname from 'localhost' to any other name in config_db.json and reload the config, the /etc/hosts file will only have the new hostname in it. But there are multiple sonic utilities (e.g. swssconfig) which rely on the hard-coded 'localhost' name, and they tend to stop working.

- How I did it
Added a new check to the hostname-config.sh script to avoid blindly deleting the line containing the old hostname from the /etc/hosts file. Now it deletes the old hostname only if it is not localhost or when the hostname is not changing.
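The check can be illustrated with a short Python sketch that operates on the text of /etc/hosts. The function name `update_hosts` is hypothetical, and appending a `127.0.0.1 <new hostname>` line is an assumption inferred from the verification steps; the actual change lives in the hostname-config.sh shell script.

```python
def update_hosts(hosts_text, old_name, new_name):
    """Return new /etc/hosts contents after a hostname change."""
    lines = hosts_text.splitlines()
    # Drop the old hostname's line only if the old name is not 'localhost',
    # or if the hostname is not changing (the line is re-added below).
    if old_name != "localhost" or old_name == new_name:
        lines = [
            line for line in lines
            if not (line.split() and line.split()[-1] == old_name)
        ]
    # Assumption: the new hostname is always mapped to the loopback address.
    lines.append("127.0.0.1 " + new_name)
    return "\n".join(lines) + "\n"
```

With this rule, changing the hostname from localhost to sonic leaves both entries in the file, while a later change from sonic to another name removes only the sonic entry.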

- How to verify it

1. Bring up SONiC on a device with the hostname localhost.
2. Edit /etc/sonic/config_db.json to update the 'hostname' field under DEVICE_METADATA from "hostname" : "localhost" to "hostname" : "sonic".
3. Run config reload -y to apply the hostname change made in config_db.json.
4. Run cat /etc/hosts and check that both the 127.0.0.1 localhost and 127.0.0.1 sonic entries are present in the file.
5. ping localhost should work fine.
- Description for the changelog
Make hostname-config service more robust in handling SONiC hostname change from localhost to anything else.
2019-10-29 15:42:04 +00:00
Danny Allen
818ab7fdaa [core_cleanup] Fix issue where core_cleanup job runs too frequently (#3659)
Signed-off-by: Danny Allen <daall@microsoft.com>
2019-10-24 17:04:16 +00:00
Ying Xie
c7a096b6b9 [201811][ntp] removed undefined filter (#3594)
pfx_filter is not defined in 201811 branch.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-10-11 19:46:14 -07:00
pavel-shirshov
53ec9124bc [ntp]: Use loopback address when we don't have MGMT interface (#3566)
Added configuration to use Loopback ip if a switch doesn't have MGMT_PORT.
2019-10-07 16:56:00 +00:00
Ying Xie
37b78826ee [updategraph] enhance update graph handling (#3549)
- after reloading minigraph, write latest version string in the DB.
- if old config_db.json file exists, use it and migrate to latest version.
- only reload minigraph when config_db.json doesn't exist and minigraph
  exists.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-10-02 21:04:39 +00:00
Ying Xie
e4f8a3946c [first boot] sync file system after moving/copying files (#3550)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-10-02 21:04:39 +00:00
Prince Sunny
4ef5ce74e4 Install Iptables rules to set TCPMSS for 'lo' interface (#3452)
* Install Iptables rules to set TCPMSS for lo interface
* Moved implementation to hostcfgd to maintain at one place
2019-09-19 01:08:44 +00:00
Danny Allen
ba77de12ac [cron.d] Add cron job to periodically clean-up core files (#3449)
* [cron.d] Create cron job to periodically clean-up core files
* Create script to scan /var/core and clean-up older core files
* Create cron job to run clean-up script

Signed-off-by: Danny Allen <daall@microsoft.com>

* Update interval for running cron job

* Respond to feedback

* Change syslog id
2019-09-13 17:52:10 +00:00
lguohan
87cb1e307e [baseimage]: fix monit configuration (#3448)
- monit config was broken by a monit upgrade
- abandon the sed approach since it is susceptible to monit config changes
- use unixsocket instead of httpd due to a bug in 5.20.0
2019-09-13 06:08:30 +00:00
sridhar-ravindran
d4758afdde [DELL] S6100 Add PowerCycle Support for Last Reset Reason (#3402)
* [DELL] S6100 Add PowerCycle Support for Last Reset Reason

* handle first time boot properly

* S6000 Last Reboot Reason Fix
2019-09-09 22:33:32 -07:00
Joe LeVeque
aee7d86fc9 [201811] Log message containing SONiC version to syslog at boot (#3417) 2019-09-08 12:33:08 -07:00
Ying Xie
2b8eca5ebb [control plane assistant] stop control plane assistant after warm reboot (#3337)
Delay saving configuration so that the control assistant configurations
won't be persisted.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-08-15 20:28:42 +00:00
Renuka Manavalan
b80d60c277 Fix to ensure that tacacs servers are ordered (reverse) by priority in pam.d's config. (#3322)
Present: Servers are listed in the same order as in redis-db
Fix: Save the sort output and use the sorted list to write into pam.d's conf.
     Also convert priority to an integer for use by sort.
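A minimal sketch of the ordering fix, assuming the server table is available as a dict mapping each server address to its attributes with the priority stored as a string. The function name `order_servers` and the dict layout are hypothetical and not taken from hostcfgd.

```python
def order_servers(servers):
    """Return server entries sorted by priority, highest first.

    servers -- dict mapping address -> dict with a string 'priority' field
    """
    entries = [
        # Convert priority to int so that "10" sorts above "9".
        dict(attrs, ip=addr, priority=int(attrs["priority"]))
        for addr, attrs in servers.items()
    ]
    # sorted() returns a new list; its result must be kept and used.
    return sorted(entries, key=lambda entry: entry["priority"], reverse=True)
```

Comparing priorities as strings would rank "9" above "10", which is why the integer conversion matters.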
2019-08-14 21:20:01 +00:00
Joe LeVeque
da57e8db36 Revert back to 'import sonic_platform' (#3249) 2019-07-31 16:44:17 -07:00
Joe LeVeque
29bbd86862 [services] Restart SwSS service upon unexpected critical process exit (#2845) (#2852) 2019-07-29 18:10:26 -07:00
Ying Xie
7cf90ec441 [warm reboot] save configuration after warm reboot (#3200)
* [warm reboot] save configuration after warm reboot

After warm reboot, save a copy of the in-memory database to config_db.json.
The upgrade procedure might have removed config_db.json to force the new image
to reload minigraph. However, reloading minigraph is skipped during warm
reboot. A missing config_db.json would cause the device to fault in the next
non-upgrading cold/fast reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* Update finalize-warmboot.sh
2019-07-24 17:45:07 +00:00
Stephen Sun
7a9d04ee73 [Mellanox] Backporting reboot cause to 201811 (#3198)
* backport new platform api to 201811, reboot cause part

* install new platform api on host

* 1. remove chassis's dependency on sonic_platform_daemon.
2. add some mellanox-specific hardware reboot causes.
3. fix typo in files/image_config/process-reboot-cause/process-reboot-cause.

* 1. add dependency of sonic_platform for base image
2. handle the case of reboot cause file not found

* adjust log message.
2019-07-23 07:05:35 -07:00
zzhiyuan
0869fd3925 [baseimage]: Fix process-reboot-cause possibly throwing OSError (#3159)
When upgrading from a previous version of SONiC where the last reboot
was hardware-caused, REBOOT_CAUSE_FILE may not be present and the service may
throw an error.
2019-07-16 21:38:46 +00:00
Joe LeVeque
c3932e501b [process-reboot-cause] Handle case if platform does not yet have sonic_platform implementation (#3126) 2019-07-10 23:06:43 +00:00
Joe LeVeque
1115c8431d [reboot-cause]: Move reboot cause processing to its own service, 'process-reboot-cause' (#3102) 2019-07-10 23:02:57 +00:00
Joe LeVeque
02fc1306b0 [baseimage]: Increase TMOUT for serial port connections to 15 minutes (#3032)
Increase TMOUT value in order to close inactive serial console connections after 900 seconds (15 minutes) of inactivity
2019-06-19 19:07:36 +00:00
Joe LeVeque
8ae67c4c5d [logrotate] Enhance robustness (#2942)
* [logrotate] Decrease frequency to every 10 minutes; kill any lingering logrotate processes

* [logrotate] Delete all *.1.gz files as firstaction; Remove note about init-system-helpers < 1.47 workaround

However, continue to send SIGHUP directly to rsyslogd process
because 'service rsyslog rotate' still doesn't work properly with
init-system-helpers version 1.48
2019-05-29 00:53:13 +00:00
Ying Xie
5975a9c25b [updategraph] set DB version after minigraph reload (#2917)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-05-20 19:05:29 +00:00
Renuka Manavalan
238db1e06a [tacacs]: skip accessing tacacs servers for local non-tacacs users (#2843)
* Switch the nss look up order as "compat" followed by "tacplus".
This helps use the legacy passwd file for user info and go to tacacs only if not found.
This means, we never contact tacacs for local users like "admin".
This isolates local users from any issues with tacacs servers.
W/o this fix, the sudo commands by local users could take <count of servers> * <tacacs timeout> seconds, if the tacacs servers are unreachable.

* Skip tacacs server access for local non-tacacs users.
Revert the order of 'compat tacplus' to original 'tacplus compat' as tacplus
access is required for all tacacs users, who also get created locally.
2019-05-20 18:59:26 +00:00
Ying Xie
dc2fb747a5 [ebtables] install ebtables in base image and install filter rules
- Add ebtables package, and install some filter rules:
  1. ebtables -A FORWARD -d BGA -j DROP
  2. ebtables -A FORWARD -p ARP -j DROP

Basically, we let the ARP packets in the VLAN be forwarded by the ASIC.
The kernel gets a copy of these ARP packets, and the forwarding from the kernel gets
dropped. So there is always only one copy of each ARP request/response in the VLAN.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-05-06 22:13:03 +00:00
Joe LeVeque
cc90d7f5ee [sudoers] Add /usr/bin/teamshow to READ_ONLY_CMDS (#2846) 2019-05-01 15:51:13 +00:00
Renuka Manavalan
6c1a0ce58c [hostcfgd] -- Fix the default for failthrough as false.
This implies that by default, if TACACS is configured properly and it reported auth_err, then don't try fail through to traditional unix authentication through /etc/passwd.

If this failthrough is intended, make it explicit through "sudo config aaa authentication failthrough enable"

Removed an unused variable "aaa.fallback"

Tested manually. Note the presence of 'auth_err=die' in all cases except when failthrough is explicitly enabled.

admin@str-s6000-acs-13:~$ sudo config aaa authentication failthrough default; date
Wed Apr  3 23:05:18 UTC 2019
admin@str-s6000-acs-13:~$ ls -lrt /etc/pam.d/common-auth-sonic ; grep 123 /etc/pam.d/common-auth-sonic
-rw-r--r-- 1 root root 1316 Apr  3 23:05 /etc/pam.d/common-auth-sonic
auth    [success=done new_authtok_reqd=done default=ignore auth_err=die]        pam_tacplus.so server=100.127.20.22:49 secret=testing123 login=login timeout=5 try_first_pass
auth    [success=done new_authtok_reqd=done default=ignore auth_err=die]        pam_tacplus.so server=100.127.20.21:49 secret=testing123 login=login timeout=5 try_first_pass

admin@str-s6000-acs-13:~$ sudo config aaa authentication failthrough enable; date ; h4 "AAA|authentication"
Wed Apr  3 23:06:37 UTC 2019
admin@str-s6000-acs-13:~$ ls -lrt /etc/pam.d/common-auth-sonic ; grep 123 /etc/pam.d/common-auth-sonic
-rw-r--r-- 1 root root 1294 Apr  3 23:06 /etc/pam.d/common-auth-sonic
auth    [success=done new_authtok_reqd=done default=ignore]     pam_tacplus.so server=100.127.20.22:49 secret=testing123 login=login timeout=5 try_first_pass
auth    [success=done new_authtok_reqd=done default=ignore]     pam_tacplus.so server=100.127.20.21:49 secret=testing123 login=login timeout=5 try_first_pass

admin@str-s6000-acs-13:~$ sudo config aaa authentication failthrough disable; date ; h4 "AAA|authentication"
Wed Apr  3 23:07:09 UTC 2019
admin@str-s6000-acs-13:~$ ls -lrt /etc/pam.d/common-auth-sonic ; grep 123 /etc/pam.d/common-auth-sonic
-rw-r--r-- 1 root root 1321 Apr  3 23:07 /etc/pam.d/common-auth-sonic
auth    [success=done new_authtok_reqd=done default=ignore auth_err=die]        pam_tacplus.so server=100.127.20.22:49 secret=testing123 login=login timeout=5 try_first_pass
auth    [success=done new_authtok_reqd=done default=ignore auth_err=die]        pam_tacplus.so server=100.127.20.21:49 secret=testing123 login=login timeout=5 try_first_pass
2019-04-08 23:41:51 +00:00
Ying Xie
681e34a2b1 [service] add warmboot finalizer service (#2725)
After warm reboot is done, we need to disable warm reboot flag and
tear down anything setup for warm reboot and persisted across.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-04-01 14:16:31 -07:00
Renuka Manavalan
def2780f18 [hostcfgd]: Promote logs for update-notifications-from-DB from DEBUG to INFO (#2576)
* Add a log message for each notification of add/del TACACS server.

Signed-off-by: Renuka Manavalan <remanava@microsoft.com>

* Moved another syslog message from DEBUG to INFO to be able to see those notifications.

All these changes are to help with a one-time-seen-bug, that hostcfgd did not act upon changes to redis for TACACS servers. We could not repro the bug.

Signed-off-by: Renuka Manavalan <remanava@microsoft.com>
2019-02-21 18:14:04 +00:00
Ying Xie
4faa5f2f92 [warm boot] cherry-pick PR #2538 and advance related sub-modules (#2569)
PR#2538 cannot merge due to master branch status. It has been tested
against 201811 branch.

Submodule src/sonic-sairedis 21f4a49..d57222a:
  > Add more specific logic for ingress ACL and buffer profile (#421)
  > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE (#418)
  > Add support for vlan tagged frames in virtual switch (#417)

Submodule src/sonic-swss 1590030..584490c:
  > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE (#786)
  > [vstest]: Potential fix for timing issue in warm_reboot's routing UT (#788)

Submodule src/sonic-swss-common 594f4e8..286ef34:
  > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE (#260)

Submodule src/sonic-utilities c6666e2..b44b462:
  > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABL… (#458)
  > [aclshow] output only counters per table/rule (#442)

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

[PR 2538] Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
2019-02-14 12:12:55 -08:00
zhenggen-xu
4a24103206 [updategraph] After system upgrade, restore files/directories with original attributes etc. (#2368)
* [updategraph] After system upgrade, restore files/directories with
original attributes etc.
Restore a few more files that were missed before.
Restore the FRR configuration directory if it exists on the old system

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>

* Removed deployment_id_asn_map.yml from copy list

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
2019-02-02 20:54:58 +00:00
Joe LeVeque
38c08dfac9 [reboot cause] Move reboot-cause files to /host directory so they persist across SONiC upgrades (#2490)
* [reboot cause] Move reboot-cause files to /host directory so they persist across SONiC upgrades

* [sonic-utilities] Update submodule to include related changes
2019-02-02 19:29:52 +00:00
Joe LeVeque
2acfac712c [caclmgrd] Don't crash if we find empty/null rule_props (#2475)
* [caclmgrd] Don't crash if we find empty/null rule_props
2019-01-25 21:10:52 +00:00
Ying Xie
6ba93acd9c [update graph] adapt to warm reboot scenario (#2353)
* [update graph] adapt to warm reboot scenario

When migrating configuration, always copy config files from old_config
to /etc/sonic. But if warm reboot is detected, then skip configuration
operations.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* log file copies and misses
2018-12-06 10:24:50 -08:00
kannankvs
a9a7ce1091 tacacs management vrf changes (#2217) 2018-12-04 10:22:48 -08:00
Joe LeVeque
298d2ad8f4 [boot] Refactor: All services which start Docker containers start before ntp-config service (#2335) 2018-12-03 16:01:44 -08:00
Ying Xie
84bde1511a [sonic boot] disable dhcp during boot up, until updategraph service is running (#2316)
* [sonic] disable management port eth0 during boot up

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [updategraph] enable dhcp client on management port eth0

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2018-11-29 08:34:22 -08:00
Joe LeVeque
d1c9b0cb77 [boot] Start ntp-config service after all Docker containers are started (#2303) 2018-11-28 00:12:03 -08:00
Ying Xie
873df9d8e8 [bde driver] black list linux_kernel_bde driver (#2284)
This driver should be loaded by sonic service. If kernel tries to load
it, the driver would be loaded with default parameters, which is not
right for sonic.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2018-11-21 08:08:37 -08:00
Joe LeVeque
f126000cc9 [sudoers] Add 'SONIC_CLI_IFACE_MODE' to env_keep to ensure variable is made available to sudo calls (#2249) 2018-11-15 15:16:06 -08:00