432 Commits

Author SHA1 Message Date
Joe LeVeque
d9b8bed916 [caclmgrd] Don't limit connection tracking to TCP (#4796)
Don't limit iptables connection tracking to TCP protocol; allow connection tracking for all protocols. This allows services like NTP, which is UDP-based, to receive replies from an NTP server even if the port is blocked, as long as it is in reply to a request sent from the device itself.
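As a rough illustration of the change (not the actual caclmgrd code), the sketch below builds a connection-tracking ACCEPT rule with and without a protocol restriction. The helper name `conntrack_accept_rule` is hypothetical, and the commit message does not say which iptables match caclmgrd uses; the `-m conntrack --ctstate` form shown here is one standard way to express it.

```python
def conntrack_accept_rule(binary="iptables", protocol=None):
    """Return an argv list that accepts packets belonging to known connections.

    protocol=None adds no "-p" match, so the rule applies to every protocol.
    (Hypothetical helper for illustration only.)
    """
    cmd = [binary, "-A", "INPUT"]
    if protocol is not None:
        cmd += ["-p", protocol]
    cmd += ["-m", "conntrack", "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"]
    return cmd


# Before the change: replies are tracked for TCP only.
tcp_only = conntrack_accept_rule(protocol="tcp")
# After the change: replies for any protocol (e.g. UDP-based NTP) are accepted.
all_protocols = conntrack_accept_rule()
all_protocols_v6 = conntrack_accept_rule(binary="ip6tables")
```

Dropping the `-p tcp` match is what lets a UDP reply through when it answers a request the device sent itself.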
2020-06-19 04:33:50 +00:00
Ying Xie
4cd54ed58c [ntp] disable ntp long jump (#4748)
Found another syncd timing issue related to clock going backwards.
To be safe disable the ntp long jump.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2020-06-11 22:03:22 +00:00
Joe LeVeque
7ae30d7898 [caclmgrd] Get first VLAN host IP address via next() (#4685)
I found that with IPv4Network types, calling list(ip_ntwrk.hosts()) is reliable. However, when doing the same with an IPv6Network, I found that the conversion to a list can hang indefinitely. This appears to me to be a bug in the ipaddress.IPv6Network implementation. However, I could not find any other reports on the web.

This patch changes the behavior to call next() on the ip_ntwrk.hosts() generator instead, which returns the IP address of the first host.
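A minimal sketch of the approach described above, using only the standard `ipaddress` module. The function name `first_host` and the example prefixes are illustrative, not taken from caclmgrd. One plausible reason the list conversion never finishes is size: a /64 network contains 2^64 addresses, so materializing every host is impractical, whereas `next()` only produces the first one.

```python
import ipaddress


def first_host(prefix):
    """Return the first host address of a network without building a list."""
    network = ipaddress.ip_network(prefix, strict=False)
    # iter() is a no-op on generators and also covers Python versions where
    # hosts() returns a list for very small networks.
    return next(iter(network.hosts()))


print(first_host("192.168.0.0/24"))   # 192.168.0.1
print(first_host("fc02:1000::/64"))   # fc02:1000::1
```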
2020-06-09 16:30:45 +00:00
Joe LeVeque
494701a0ee [caclmgrd] Allow more ICMP types (#4625) 2020-06-09 16:07:51 +00:00
yozhao101
aa949cdc74 [docker-syncd] Add timeout to force stop syncd container (#4617)
**- Why I did it**
When I tested the auto-restart feature of the swss container by manually killing one of its critical processes, swss was stopped. The syncd container, as the peer container, should then also be stopped. However, I found that the syncd container sometimes stops and sometimes does not. When it does not, the process that executes the stop() function (/usr/local/bin/syncd.sh stop) is stuck in the loop at lines 164-167. Systemd waits for 90 seconds and then kills this process.

164 # wait until syncd quit gracefully
165 while docker top syncd$DEV | grep -q /usr/bin/syncd; do
166 sleep 0.1
167 done

The first thing I did was profile how long this while loop spins when the syncd container stops normally after the swss container is stopped. The result is 5 or 6 seconds. When the syncd container stops normally, two messages are written to syslog:

str-a7050-acs-3 NOTICE syncd#dsserve: child /usr/bin/syncd exited status: 134
str-a7050-acs-3 INFO syncd#supervisord: syncd [5] child /usr/bin/syncd exited status: 134

The second thing I did was add a timer to the condition of the while loop so that the loop is forced to exit after 20 seconds.

After that, testing showed that the syncd container stops normally if swss is stopped first. One more thing worth mentioning: if the syncd container stops within 5 or 6 seconds, the two log messages still appear in syslog. However, if the while loop runs longer than 20 seconds and is forced to exit, the syncd container is stopped but I did not see these two messages in syslog. Further, although I observed that the auto-restart feature of the swss container now works correctly, I cannot be sure that the issue of the syncd container failing to stop will not occur again in the future.

**- How I did it**
I added a timer around the while loop in the stop() function. The loop exits after spinning for 20 seconds.
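The actual change lives in the syncd.sh shell script and is not reproduced in this log. As a language-neutral illustration of the same idea, here is a small Python sketch of a polling loop bounded by a deadline. The function name `wait_until_stopped` and its parameters are hypothetical; only the 20 second limit and the 0.1 second poll interval come from the text above.

```python
import time


def wait_until_stopped(is_running, timeout_s=20.0, poll_s=0.1):
    """Poll is_running() until it returns False or timeout_s elapses.

    Returns True if the process stopped on its own, False if we gave up.
    """
    deadline = time.monotonic() + timeout_s
    while is_running():
        if time.monotonic() >= deadline:
            return False  # forced exit: stop waiting and let the caller proceed
        time.sleep(poll_s)
    return True
```

In practice `is_running` would wrap whatever check the script performs (the original loop greps the output of `docker top`). Using a monotonic clock keeps the timeout immune to wall-clock jumps.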

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2020-06-09 16:07:24 +00:00
Joe LeVeque
7da0c15af5 [caclmgrd] Ignore keys in interface-related tables if no IP prefix is present (#4581)
Since the introduction of VRF, interface-related tables in ConfigDB will have multiple entries, one of which only contains the interface name and no IP prefix. Thus, when iterating over the keys in the tables, we need to ignore the entries which do not contain IP prefixes.
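A sketch of the filtering described above. The key shapes are an assumption made for illustration: name-only entries appear as a plain string or a one-element tuple, and entries carrying a prefix appear as an `(interface, prefix)` tuple. The helper name `keys_with_ip_prefix` is hypothetical.

```python
def keys_with_ip_prefix(table_keys):
    """Yield (interface, prefix) only for keys that include an IP prefix.

    Assumed key shapes (not confirmed by the commit message):
      "Ethernet0" or ("Ethernet0",)    -> name only, skipped
      ("Ethernet0", "10.0.0.0/31")     -> has a prefix, kept
    """
    for key in table_keys:
        if isinstance(key, tuple) and len(key) == 2:
            yield key


keys = [
    "Ethernet0",
    ("Ethernet0", "10.0.0.0/31"),
    ("Loopback0",),
    ("Loopback0", "10.1.0.32/32"),
]
print(list(keys_with_ip_prefix(keys)))
```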
2020-06-09 16:05:40 +00:00
Joe LeVeque
3ee9c5d1e3 [caclmgrd] Add some default ACCEPT rules and lastly drop all incoming packets (#4412)
Modified caclmgrd behavior to enhance control plane security as follows:

Upon starting or receiving notification of ACL table/rule changes in Config DB:
1. Add iptables/ip6tables commands to allow all incoming packets from established TCP sessions or new TCP sessions which are related to established TCP sessions
2. Add iptables/ip6tables commands to allow bidirectional ICMPv4 ping and traceroute
3. Add iptables/ip6tables commands to allow bidirectional ICMPv6 ping and traceroute
4. Add iptables/ip6tables commands to allow all incoming Neighbor Discovery Protocol (NDP) NS/NA/RS/RA messages
5. Add iptables/ip6tables commands to allow all incoming IPv4 DHCP packets
6. Add iptables/ip6tables commands to allow all incoming IPv6 DHCP packets
7. Add iptables/ip6tables commands to allow all incoming BGP traffic
8. Add iptables/ip6tables commands for all ACL rules for recognized services (currently SSH, SNMP, NTP)
9. For all services which we did not find configured ACL rules, add iptables/ip6tables commands to allow all incoming packets for those services (allows the device to accept SSH connections before the device is configured)
10. Add iptables rules to drop all packets destined for loopback interface IP addresses
11. Add iptables rules to drop all packets destined for management interface IP addresses
12. Add iptables rules to drop all packets destined for point-to-point interface IP addresses
13. Add iptables rules to drop all packets destined for our VLAN interface gateway IP addresses
14. Add iptables/ip6tables commands to allow all incoming packets with TTL of 0 or 1 (This allows the device to respond to tools like tcptraceroute)
15. If we found control plane ACLs in the configuration and applied them, we lastly add iptables/ip6tables commands to drop all other incoming packets
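The ordering matters here: the ACCEPT rules come first, and the final catch-all DROP is appended only when control plane ACLs were actually found and applied. A minimal sketch of that last step follows; the function name `order_rules` and the placeholder rule strings are hypothetical, and only the ordering and the conditional DROP are taken from the list above.

```python
def order_rules(default_accept_rules, acl_rules):
    """Return rules in application order: defaults, ACL rules, optional final drop."""
    rules = list(default_accept_rules) + list(acl_rules)
    if acl_rules:
        # Only lock the device down once control plane ACLs were applied;
        # an unconfigured device must stay reachable (e.g. over SSH).
        rules.append("iptables -A INPUT -j DROP")
    return rules


unconfigured = order_rules(["<default accept rules>"], [])
configured = order_rules(["<default accept rules>"], ["<ssh acl rule>"])
```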
2020-06-09 04:21:27 +00:00
lguohan
8e014bb7e7 [baseimage]: pin down package version for azure-storage, watchdog and futures (#4575)
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-05-13 05:05:29 +00:00
Ying Xie
f52e59a032 [ntp] enable/disable NTP long jump according to reboot type (#4582)
- Enable NTP long jump after cold reboot.
- Disable NTP long jump after warm/fast reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2020-05-12 12:23:47 -07:00
Neetha John
3d41c271a4 [qos]: Alpha and ECN settings change for Th (#4564)
Dynamic threshold setting changed to 0 and WRED profile green min threshold set to 250000 for Tomahawk devices

Changed the dynamic threshold settings in pg_profile_lookup.ini
Added a macro for WRED profiles in qos.json.j2 for Tomahawk devices
Necessary changes made in qos.config.j2 to use the macro if present

Signed-off-by: Neetha John <nejo@microsoft.com>
2020-05-09 18:25:17 +00:00
Joe LeVeque
ceb878414d [process-reboot-cause] If software reboot cause is unknown add note if first boot into new image (#4538) 2020-05-08 20:37:22 +00:00
Nazarii Hnydyn
096a0e1e18 [mellanox]: Add SSD FW update tool (#4352)
* [mellanox]: Add SSD FW update tool.

Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>

* [mellanox]: Update SSD tool.

Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
2020-04-13 18:12:16 +03:00
SuvarnaMeenakshi
fba321ae6c [ntp]: Add "tinker panic 0" in ntp.conf to avoid ntpd from panic (#4263)
- What I did
Add configuration to prevent ntpd from panicking and exiting when the drift between the new time and the current system time is large.

- How I did it
Added "tinker panic 0" in ntp.conf file.

- How to verify it
[this assumes that there is a valid NTP server IP in config_db/ntp.conf]

1. Change the current system time to a bad time with a large drift from the time on the NTP server; the drift should be greater than 1000s.
2. Reboot the device.

Before the fix:
3. Upon reboot, the ntp-config service comes up fine, and the ntp service goes to the active (exited) state without any error message. This is because the offset between the new time (from the NTP server) and the current system time is very large, so ntpd goes into panic mode and exits. The system continues to show the bad time.

After the fix:
3. Upon reboot, ntp-config comes up fine, and the ntp service comes up fine and stays in the active (running) state. The system clock gets synced with the NTP server time.
2020-04-03 19:42:17 +00:00
Joe LeVeque
cbf7c7d80d [rsyslog] Suppress duplicate messages from base image and all Docker containers (#2497) 2020-04-02 21:42:01 +00:00
Stepan Blyshchak
a4dd0aa09f [mellanox] add hardware watchdog script (#4274)
admin@sonic:~$ sudo hw-management-wd.sh
Usage: hw-management-wd.sh start [timeout] | stop | tleft | check_reset | help
start - start watchdog
        timeout is optional. Default value will be used in case if it's omitted
        timeout provided in seconds
stop - stop watchdog
tleft - check watchdog timeout left
check_reset - check if previous reset was caused by watchdog
        Prints only in case of watchdog reset
help -this help

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2020-03-31 20:34:55 -07:00
yozhao101
1cc6141a93 [Monit] Delay start of monitoring for 5 minutes (#4281) 2020-03-19 22:49:04 +00:00
zhenggen-xu
19c1ad36a5 [201811] [interfaces-config.sh] Flush the loopback interface addresses (#4234)
* [interfaces-config.sh] Flush the loopback interface before configure it

Without this, you may end up with more and more ip addresses
on loopback interface after you change the loopback ip and do config reload

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
2020-03-09 16:14:59 -07:00
Prince Sunny
320dcf2008 Sleep done before mismatch handler (#4165)
* Sleep done before mismatch handler
2020-02-25 16:39:33 +00:00
byu343
50db98e2b3 [arista]: Fix convertfs condition for booting from EOS (#4139)
Fix the issue of incorrectly skipping the convertfs hook when fast-reboot from EOS, by adding an extra kernel cmdline param "prev_os" to differentiate fast-reboot from EOS and from SONiC.

This is needed because a fast reboot from EOS to SONiC still requires disk conversion, such as formatting the disk.
2020-02-25 16:38:56 +00:00
Stephen Sun
726fecaf8b [process-reboot-cause]Clean up the process-reboot-cause as required in issue 3927 (#4128) 2020-02-14 19:37:30 +00:00
Joe LeVeque
4af3e5066d [interfaces-config.sh] Force lo interface down (#4149)
Force "lo" interface down in interfaces-config.sh to prevent interface-config.service from failing with the following error:

```
-- The result is failed.
systemd[1]: networking.service: Unit entered failed state.
systemd[1]: networking.service: Failed with result 'exit-code'.
interfaces-config.sh[29232]: Job for networking.service failed because the control process exited with error code.
interfaces-config.sh[29232]: See "systemctl status networking.service" and "journalctl -xe" for details.
interfaces-config.sh[29232]: ifdown: interface lo not configured
interfaces-config.sh[29232]: RTNETLINK answers: File exists
interfaces-config.sh[29232]: ifup: failed to bring up lo
systemd[1]: interfaces-config.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Update interfaces configuration.
-- Subject: Unit interfaces-config.service has failed
```

Failure to bring down the interface will result in a failure to subsequently bring the interface back up.
2020-02-13 22:38:21 -08:00
Prince Sunny
53a2934fc5 Added timeout to ping command (#4123) 2020-02-06 17:41:38 -08:00
Prince Sunny
c53f09684a Update arp_update to refresh neighbor entries from APP_DB (#4102)
* Update arp_update to refresh neighbor entries from APP_DB
2020-02-05 15:42:15 -08:00
Joe LeVeque
2e43e6bc6c [caclmgrd] Fix application of IPv6 service ACL rules (part 2) (#4036) 2020-01-18 01:44:42 +00:00
Sujin Kang
956b8fd7c7 [reboot cause]: Delay process-reboot-cause service until network connection is stable (#4003) 2020-01-11 01:09:08 +00:00
yozhao101
27a2e0692b [Monit] Change the monitoring period from 120 seconds to 60 seconds. (#3974)
* [Monit] Change the monitoring period of monit from 120 seconds to 60 seconds, and at the same time double the interval for the existing sonic monit config file on the host.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2020-01-11 01:01:34 +00:00
Joe LeVeque
0eab6a4c25 [201811][apt] Instruct apt-get to NOT check the "Valid Until" date in Release files (#3975) 2020-01-08 08:34:45 -08:00
Ying Xie
5ea7372dbe [201811][monit] address build issue: hard code ARCH to amd64 (#3982)
* [201811][monit] address build issue: hard code ARCH to amd64

- also hard code the debian package path as in 201811 branch.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2020-01-07 07:41:40 -08:00
Joe LeVeque
640023ec57 [caclmgrd] Fix application of IPv6 service ACL rules (#3917) 2020-01-06 21:04:52 +00:00
Renuka Manavalan
da7db51259 corefile uploader: Updates per review comments offline (#3915)
* Updates per review comments
1) core_uploader service waits for syslog.service
2) core_uploader service enabled for restart on failure
3) Use mtime instead of file size + ample time to be robust.

* Avoid reloading already uploaded file, by marking the names with a prefix.

* Updated failing path.
1) If rc file is missing or required data missing, it periodically logs error in forever loop.
2) If upload fails, retry every hour with an error log, forever.

* Fix few bugs

* The binary update_json.py will come from sonic-utilities.
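Two ideas from the bullets above, checking mtime to decide whether a core file is fully written and marking uploaded files by name, can be sketched as follows. This is not the uploader's code: the prefix string, the quiet period and both function names are hypothetical values chosen for illustration.

```python
import os
import time

UPLOADED_PREFIX = "uploaded_"  # hypothetical marker, not taken from the commit
MIN_AGE_S = 60                 # hypothetical quiet period before a file counts as complete


def pending_core_files(core_dir, now=None, min_age_s=MIN_AGE_S):
    """List core files that are not marked as uploaded and look fully written."""
    now = time.time() if now is None else now
    pending = []
    for name in sorted(os.listdir(core_dir)):
        path = os.path.join(core_dir, name)
        if name.startswith(UPLOADED_PREFIX) or not os.path.isfile(path):
            continue
        # A file whose mtime is still recent may be mid-write; skip it for now.
        if now - os.path.getmtime(path) >= min_age_s:
            pending.append(path)
    return pending


def mark_uploaded(path):
    """Rename the file so that later scans skip it."""
    directory, name = os.path.split(path)
    new_path = os.path.join(directory, UPLOADED_PREFIX + name)
    os.rename(path, new_path)
    return new_path
```

Relying on mtime rather than file size avoids treating a file as finished just because its size happened not to change between two checks.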
2020-01-06 21:03:40 +00:00
Renuka Manavalan
6db0c76a06 Corefile uploader service (#3887)
* Corefile uploader service

1) A service is added to watch /var/core and upload to Azure storage
2) The service is disabled on boot. One may enable explicitly.
3) The .rc file to be updated with acct credentials and http proxy to use.
4) If service is enabled with no credentials, it would sleep, with periodic log messages
5) For any update in .rc, the service has to be restarted to take effect.

* Remove rw permission for .rc file for group & others.

* Changes per review comments.
Re-ordered .rc file per JSON.dump order.
Added a script to enable partial update of .rc, which HWProxy would use to add acct key.

* Azure storage upload requires python module futures, hence added it to install list.

* Removed trailing spaces.

* A mistake in name corrected.
Copy the .rc updater script to /usr/bin.
2020-01-06 21:02:14 +00:00
Joe LeVeque
9ee8eba77c [monit] Build from source and patch to use MemAvailable value if available on system (#3875) 2020-01-06 20:59:32 +00:00
Samuel Angebault
e9e6bc58a7 [arista] Improve platform detection mechanism (#3921)
Rely on platform= and sid= on the command line to detect the platform rather than the eeprom
The platform will now properly initialize even if the system eeprom died or is unreachable.
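A minimal sketch of reading `platform=` and `sid=` from a command line string, assuming the command line in question is the kernel command line (for example the contents of /proc/cmdline). The function names and the sample values are made up for illustration and are not the Arista driver code.

```python
def parse_cmdline(cmdline):
    """Split a command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_platform_ids(cmdline):
    """Return (platform, sid); either is None when the parameter is absent."""
    params = parse_cmdline(cmdline)
    return params.get("platform"), params.get("sid")


# Made-up values, purely for illustration.
print(read_platform_ids("quiet platform=examplePlatform sid=exampleSid"))
```

Because this path never touches the EEPROM, detection keeps working when the EEPROM is dead or unreachable, which is the point of the change.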

Add support for the 7260CX3-64E
This is a variant of the 7260CX3-64 with no real difference for software.
2019-12-18 22:46:26 -08:00
Ying Xie
9583a74b47 [swss service] flush fast-reboot enabled flag upon swss stopping (#3908)
If we need to stop swss during fast-reboot procedure on the boot up path,
it means that something went wrong, like syncd/orchagent crashed already,
we are stopping and restarting swss/syncd to re-initialize. In this case,
we should proceed as if it is a cold reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-12-16 16:04:10 +00:00
Stephen Sun
49869aa6fa [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot (#3880)
* [process-reboot-cause]Address the issue: Incorrect reboot cause returned when warm reboot follows a hardware caused reboot
1. check whether /proc/cmdline indicates warm/fast reboot.
   if yes the software reboot cause file will be treated as the reboot cause.
   finish
2. check whether platform api returns a reboot cause.
   if yes it is treated as the reboot cause.
   finish.
3. check whether /hosts/reboot-cause contains a cause.
   if yes it is treated as the cause otherwise return unknown.

* [process-reboot-cause]Fix review comments

* [process-reboot-cause]address comments
1. use "with" statement
2. update fast/warm reboot BOOT_ARG

* [process-reboot-cause]address comments

* refactor the code flow

* Remove escape

* Remove extra ':'
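The three-step check order from the first bullet can be sketched as a small decision function. This is an illustration under assumptions, not the script itself: the function and parameter names are hypothetical, the warm/fast detection is passed in as a boolean because the exact command-line marker is not given here, and falling back to "Unknown" when the software cause is empty in step 1 is a guess.

```python
UNKNOWN = "Unknown"


def determine_reboot_cause(warm_or_fast_boot, platform_cause, software_cause):
    """Pick the reboot cause using the three-step order described above.

    warm_or_fast_boot: True if the kernel command line indicates warm/fast reboot.
    platform_cause:    cause reported by the platform API, or None.
    software_cause:    contents of the software reboot-cause file, or None.
    """
    # Step 1: warm/fast reboot always trusts the software reboot cause.
    if warm_or_fast_boot:
        return software_cause or UNKNOWN  # fallback here is an assumption
    # Step 2: otherwise a hardware cause from the platform API wins.
    if platform_cause:
        return platform_cause
    # Step 3: fall back to the software cause, else unknown.
    return software_cause or UNKNOWN
```

Checking the boot type first is what fixes the reported issue: a stale hardware cause can no longer mask a later warm reboot.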
2019-12-14 17:44:02 +00:00
Sujin Kang
0510fc7258 Correct the watch-control service to call the right script (#3906)
* Correct the watch-control service to call the right script

* make watchdog-control.sh executable (chmod +x)
2019-12-14 09:42:36 -08:00
Ying Xie
ca1c5bc0c4 [hostcfgd] avoid in place editing config file contents (#3904)
In-place editing (sed -i) seems to have some issues with filesystem
interaction. It could leave a zero-size or corrupted file behind.

It would be safer to sed the file contents into a new file and switch
new file with the old file.
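One common way to implement "write a new file, then switch" is sketched below. It is an assumption that hostcfgd does anything like this internally; the function name `rewrite_file` is hypothetical, and the sketch uses a regex substitution in place of the sed expression.

```python
import os
import re
import shutil
import tempfile


def rewrite_file(path, pattern, replacement):
    """Apply a regex substitution by writing a new file and swapping it in."""
    with open(path) as src:
        new_contents = re.sub(pattern, replacement, src.read(), flags=re.MULTILINE)
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as dst:
            dst.write(new_contents)
            dst.flush()
            os.fsync(dst.fileno())  # make sure the data is on disk before the swap
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # swap the new file in place of the old one
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The reader of the file sees either the complete old contents or the complete new contents, which addresses the zero-size and corrupted-file cases mentioned above.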

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-12-14 03:27:39 +00:00
Sujin Kang
aea18165a8 Add watchdog-control service to disable watchdog during bootup (#3877)
* Add watchdog-control service to disable watchdog during bootup

Disable only if it's applicable and the watchdog is enabled.

* Address the review comment

* Correct the watchdog start script name

* Change to call common watchdog api instead of platform specific

* Start watchdog control service after swss starts

* advance sonic-utility submodule
2019-12-13 12:44:11 -08:00
pavel-shirshov
b28dd1db7b [fast-reboot]: Save fast-reboot state into the db [Nov] (#3892)
- Port changes #3741
2019-12-13 06:07:13 -08:00
Ying Xie
ba88f9c0ae Revert "[swss.sh] When starting, call 'systemctl restart' on dependents, not (#3807)" (#3835)
This reverts commit 351410ea8c.
2019-12-02 23:56:04 +00:00
Joe LeVeque
3920ac2368 [services] Remove explicit dependencies from dhcp_relay service file, control in swss.sh (#3823) 2019-11-27 02:21:00 +00:00
Joe LeVeque
8e86a157ff [swss.sh] When starting, call 'systemctl restart' on dependents, not (#3807)
'systemctl start'
2019-11-24 03:26:03 +00:00
Wenda Ni
8788f4f783 cherry-picking diff between #3628 and #3561
Revert "Configure buffer profile to all ports (#3561)" (#3628)
Configure buffer profile to all ports (#3561)

This reverts commit 8861cbe98e.

Signed-off-by: Wenda Ni <wenni@microsoft.com>
2019-11-08 03:12:59 +00:00
Neetha John
6d23e4c8d7 [pfcwd]: Do not start pfc watchdog on Management Tor (#3719)
Signed-off-by: Neetha John <nejo@microsoft.com>
2019-11-07 21:41:32 +00:00
lguohan
9167f9da46 [aboot]: preserve snmp.yml and acl.json for eos to sonic fast reboot (#3716) 2019-11-07 21:40:20 +00:00
Wenda Ni
0ea82d8735 Fix syntax error for qos_config template (#3619)
Signed-off-by: Wenda Ni <wenni@microsoft.com>
2019-11-07 00:22:50 +00:00
Wenda Ni
f616cec7f4 Adopt per-port buffer and qos profile (#3542)
Signed-off-by: Wenda Ni <wenni@microsoft.com>
2019-11-07 00:21:52 +00:00
lguohan
d16dbbb1d3 [bgp]: start bgp service after interfaces-config service (#3702)
interfaces-config service configures lo address. If bgp service
starts before lo address is configured, then following config
in zebra will not be applied.

route-map RM_SET_SRC permit 10
 set src 10.1.0.32

This adds a few seconds of delay to the bgp service start.
2019-11-04 22:09:00 -08:00
Ying Xie
f764a167ac [hostname-config] improve hostname-config process (#3676)
We noticed in tests/production that there is a low-probability failure
where /etc/hosts could have some garbage characters before the entry for
the local host name. The consequence is that all sudo commands would be very
slow. In extreme cases it would prevent some services from starting
properly.

I suspect that the /etc/hosts file might be opened by some process, causing
the issue. Editing the contents in a new file and then replacing the whole file
should be safer.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-10-29 15:42:23 +00:00
Prabhu Sreenivasan
ff137a8e56 [baseimage]: Avoid removing localhost entry from /etc/hosts file (#2452)
- What I did
This fix removes the possibility of 'localhost' entry getting removed from /etc/hosts file by hostname-config service.

Without this change, whenever we change the hostname from 'localhost' to any other name in config_db.json and reload the config, the /etc/hosts file will only contain the new hostname. But there are multiple SONiC utilities (e.g. swssconfig) which rely on the hard-coded 'localhost' name, and they tend to stop working.

- How I did it
Added a new check to the hostname-config.sh script to avoid blindly deleting the line containing the old hostname from the /etc/hosts file. Now it will delete the old hostname only if it's not localhost or when the hostname is not changing.
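The real check is in the hostname-config.sh shell script, which is not shown in this log. As an illustration of the intended behaviour only, here is a Python sketch that drops the stale hostname entry unless it is the one providing 'localhost'. The function name `update_hosts` and the simple line format are assumptions.

```python
def update_hosts(lines, old_hostname, new_hostname, ip="127.0.0.1"):
    """Return /etc/hosts lines with new_hostname present and 'localhost' preserved."""

    def names_on(line):
        # Ignore comments, then skip the leading address field.
        return line.split("#", 1)[0].split()[1:]

    result = []
    for line in lines:
        names = names_on(line)
        # Drop the stale entry, but never a line that provides 'localhost'.
        if old_hostname != "localhost" and old_hostname in names and "localhost" not in names:
            continue
        result.append(line)
    if not any(new_hostname in names_on(line) for line in result):
        result.append(f"{ip} {new_hostname}")
    return result


print(update_hosts(["127.0.0.1 localhost"], "localhost", "sonic"))
```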

- How to verify it

1. Bring up SONiC on a device with the hostname set to localhost.
2. Edit /etc/sonic/config_db.json to update the 'hostname' field under DEVICE_METADATA from "hostname" : "localhost" --> "hostname" : "sonic".
3. Run config reload -y to apply the hostname change made in config_db.json.
4. cat /etc/hosts and check that both the 127.0.0.1 localhost and 127.0.0.1 sonic entries are present in the file.
5. ping localhost should work fine.
- Description for the changelog
Make hostname-config service more robust in handling SONiC hostname change from localhost to anything else.
2019-10-29 15:42:04 +00:00