
631 Commits

Author SHA1 Message Date
judyjoseph
ce86621399 [multi-ASIC] BGP internal neighbor table support (#5520)
* Initial commit for BGP internal neighbor table support.
  > Add new template named "internal" for the internal BGP sessions
  > Add a new table in database "BGP_INTERNAL_NEIGHBOR"
  > The internal BGP sessions will be stored in this new table "BGP_INTERNAL_NEIGHBOR"

* Update the template generation tests for the new internal neighbor template files.
2020-11-10 12:52:58 -08:00
abdosi
65cc37cadf [multi-asic] teamdctl support for multi-asic (#5851)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2020-11-09 12:33:41 -08:00
Junchao-Mellanox
1070d024bc [thermalctld] Enlarge startretries value so thermalctld can restart during regression tests (#5633)
Increase the startretries value from the default of 10 to 50 to prevent supervisord from placing thermalctld in the FATAL state during regression testing. This also makes supervisord try hard to get thermalctld running in production, because thermalctld is critical to preventing the device from overheating.
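As a minimal sketch, the resulting supervisord program section can be rendered with Python's standard `configparser`. Only `startretries=50` comes from this commit. The `command` path is a placeholder assumed for illustration.

```python
import configparser
import io


def render_thermalctld_program(start_retries: int = 50) -> str:
    """Render a supervisord [program:thermalctld] section as INI text."""
    cfg = configparser.ConfigParser()
    cfg["program:thermalctld"] = {
        # Placeholder path, assumed for illustration only.
        "command": "/usr/bin/thermalctld",
        # Number of consecutive failed start attempts supervisord tolerates
        # before giving up and putting the program into the FATAL state.
        "startretries": str(start_retries),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


print(render_thermalctld_program())
```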
2020-11-03 08:19:19 -08:00
abdosi
0fad6bdc7f [monit] Add patch to enhance syslog error message generation for the monit alert action when status is failed. (#5720)
Why/How I did it:

Make sure the first error syslog message is triggered based on the FAULT TOLERANCE condition.

Added support for the repeat clause with the alert action. It is used as a trigger
for generating periodic syslog error messages while the error persists.

Updated the monit conf files with `repeat every x cycles` for the alert action.
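The following minimal sketch models when error messages would be emitted. It assumes "fault tolerance" means N failed cycles in a row, with the first alert at the Nth failure and further alerts every `repeat` cycles while the failure lasts. The rule string follows monit's `for N times within N cycles` syntax plus the `repeat every X cycles` clause described above. The helper names and the exact model are illustrative assumptions, not the patch's code.

```python
from typing import List, Sequence


def monit_rule(tolerance: int, repeat: int) -> str:
    """Build a 'check program' condition with fault tolerance and a repeat clause."""
    return (f"if status != 0 for {tolerance} times within {tolerance} cycles "
            f"then alert repeat every {repeat} cycles")


def alert_cycles(failed: Sequence[bool], tolerance: int, repeat: int) -> List[int]:
    """Return the cycle indices at which an error syslog message would be emitted.

    Simplified model: the first alert fires once the failure has persisted for
    `tolerance` consecutive cycles; while it keeps failing, another alert fires
    every `repeat` cycles. A passing cycle resets the streak.
    """
    alerts = []
    streak = 0
    for cycle, is_failed in enumerate(failed):
        streak = streak + 1 if is_failed else 0
        if streak >= tolerance and (streak - tolerance) % repeat == 0:
            alerts.append(cycle)
    return alerts


print(monit_rule(5, 1))
print(alert_cycles([False, True, True, True, True, True, True, False], tolerance=3, repeat=2))  # [3, 5]
```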
2020-11-01 10:27:10 -08:00
shlomibitton
97f2cafe0b [LLDP] Fix for LLDP advertisements being sent with wrong information. (#5493)
* Fix for LLDP advertisements being sent with wrong information.
Since lldpd starts before lldpmgrd, some advertisement packets might be sent with default values, such as the MAC address as the Port ID.
This fix holds back packets from lldpd until all interfaces are fully configured by lldpmgrd.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>

* Fix comments

* Fix unit-test output that caused a failure during build

* Add 'run_cmd' function and use it

* Resume lldpd even if port init timeout reached
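The hold-then-resume behaviour can be sketched as follows. The sketch assumes lldpd has already been paused; `lldpcli` does provide `pause` and `resume` commands. The function names, the `port_is_configured` check and the 30-second default timeout are hypothetical, and this `run_cmd` is a stand-in rather than the helper from the PR. The clock and sleep are injectable so the logic can be exercised without waiting.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def run_cmd(cmd: List[str]) -> int:
    """Run a command and return its exit status (stand-in helper)."""
    return subprocess.run(cmd, check=False).returncode


def resume_lldpd_when_ready(ports: Iterable[str],
                            port_is_configured: Callable[[str], bool],
                            timeout_s: float = 30.0,
                            poll_s: float = 1.0,
                            runner: Callable[[List[str]], int] = run_cmd,
                            clock: Callable[[], float] = time.monotonic,
                            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait for every port to be configured, then resume lldpd.

    Returns True if all ports were ready before the timeout. lldpd is resumed
    in either case, so advertisements are never held back forever.
    """
    pending = set(ports)
    deadline = clock() + timeout_s
    while pending and clock() < deadline:
        pending = {p for p in pending if not port_is_configured(p)}
        if pending:
            sleep(poll_s)
    runner(["lldpcli", "resume"])
    return not pending
```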
2020-10-30 09:06:23 -07:00
pavel-shirshov
2eec3b3254 [bgpcfgd]: Dynamic BBR support (#5626)
**- Why I did it**
To introduce dynamic support of BBR functionality into bgpcfgd.
BBR adds `neighbor PEER_GROUP allowas-in 1` to all BGP peer-groups that point to T0s.
Now we can add and remove this configuration based on a CONFIG_DB entry.

**- How I did it**
I introduced a new CONFIG_DB entry:
 - table name: "BGP_BBR"
 - key value: "all". Currently only "all" is supported, which means that all peer-groups that point to T0s will be updated
 - data value: a dictionary: {"status": "status_value"}, where status_value could be either "enabled" or "disabled"

Initially, when bgpcfgd starts, it reads initial BBR status values from the [constants.yml](https://github.com/Azure/sonic-buildimage/pull/5626/files#diff-e6f2fe13a6c276dc2f3b27a5bef79886f9c103194be4fcb28ce57375edf2c23cR34). Then you can control BBR status by changing "BGP_BBR" table in the CONFIG_DB (see examples below).

bgpcfgd knows which peer-groups to change from [constants.yml](https://github.com/Azure/sonic-buildimage/pull/5626/files#diff-e6f2fe13a6c276dc2f3b27a5bef79886f9c103194be4fcb28ce57375edf2c23cR39). The dictionary has peer-group names as keys and lists of address families as values. So when bgpcfgd gets a request to change the BBR state, it changes the state only for the peer-groups listed in the constants.yml dictionary (and only for the address families listed for each peer-group).
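The following minimal sketch mirrors the sequence visible in the log output further down: build an FRR config snippet (applied there with `vtysh -f`), then soft-clear each peer-group inbound. The ASN, the address-family names in the dictionary and the helper names are assumptions for illustration. This is not bgpcfgd's actual code.

```python
from typing import Dict, List


def bbr_config_lines(bgp_asn: str, status: str, peer_groups: Dict[str, List[str]]) -> List[str]:
    """Build FRR config lines that turn BBR on or off for the listed peer-groups.

    `peer_groups` mirrors the constants.yml dictionary: peer-group name ->
    address families the change applies to.
    """
    if status not in ("enabled", "disabled"):
        raise ValueError(f"unsupported BBR status: {status!r}")
    negate = "" if status == "enabled" else "no "
    lines = [f"router bgp {bgp_asn}"]
    for pg, families in sorted(peer_groups.items()):
        for af in families:
            lines.append(f" address-family {af}")
            lines.append(f"  {negate}neighbor {pg} allowas-in 1")
            lines.append(" exit-address-family")
    return lines


def soft_clear_commands(peer_groups: Dict[str, List[str]]) -> List[List[str]]:
    """One inbound soft reset per peer-group so the new policy takes effect."""
    return [["vtysh", "-c", f"clear bgp peer-group {pg} soft in"]
            for pg in sorted(peer_groups)]


groups = {"PEER_V4": ["ipv4"], "PEER_V6": ["ipv6"]}
print("\n".join(bbr_config_lines("65100", "disabled", groups)))
print(soft_clear_commands(groups))
```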

**- How to verify it**
Initially, when we start SONiC, FRR has BBR enabled for PEER_V4 and PEER_V6:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
  neighbor PEER_V4 allowas-in 1
  neighbor PEER_V6 allowas-in 1
```

Then we apply the following configuration to the DB:
```
admin@str-s6100-acs-1:~$ cat disable.json                
{
        "BGP_BBR": {
            "all": {
                "status": "disabled"
            }
        }
}


admin@str-s6100-acs-1:~$ sonic-cfggen -j disable.json -w 
```
The log output is:
```
Oct 14 18:40:22.450322 str-s6100-acs-1 DEBUG bgp#bgpcfgd: Received message : '('all', 'SET', (('status', 'disabled'),))'
Oct 14 18:40:22.450620 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-f', '/tmp/tmpmWTiuq']'.
Oct 14 18:40:22.681084 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V4 soft in']'.
Oct 14 18:40:22.904626 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V6 soft in']'.
```

Check the FRR configuration and see that no allowas parameters are there:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas' 
admin@str-s6100-acs-1:~$
```

Then we apply the enabling configuration again:
```
admin@str-s6100-acs-1:~$ cat enable.json 
{
        "BGP_BBR": {
            "all": {
                "status": "enabled"
            }
        }
}

admin@str-s6100-acs-1:~$ sonic-cfggen -j enable.json -w 
```
The log output:
```
Oct 14 18:40:41.074720 str-s6100-acs-1 DEBUG bgp#bgpcfgd: Received message : '('all', 'SET', (('status', 'enabled'),))'
Oct 14 18:40:41.074720 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-f', '/tmp/tmpDD6SKv']'.
Oct 14 18:40:41.587257 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V4 soft in']'.
Oct 14 18:40:42.042967 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V6 soft in']'.
```


Check the FRR configuration and see that the BBR configuration is back:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
  neighbor PEER_V4 allowas-in 1
  neighbor PEER_V6 allowas-in 1
```

**- Test coverage**
Below is the test coverage:
```
---------- coverage: platform linux2, python 2.7.12-final-0 ----------
Name                             Stmts   Miss  Cover
----------------------------------------------------
bgpcfgd/__init__.py                  0      0   100%
bgpcfgd/__main__.py                  3      3     0%
bgpcfgd/config.py                   78     41    47%
bgpcfgd/directory.py                63     34    46%
bgpcfgd/log.py                      15      3    80%
bgpcfgd/main.py                     51     51     0%
bgpcfgd/manager.py                  41     23    44%
bgpcfgd/managers_allow_list.py     385     21    95%
bgpcfgd/managers_bbr.py             76      0   100%
bgpcfgd/managers_bgp.py            193    193     0%
bgpcfgd/managers_db.py               9      9     0%
bgpcfgd/managers_intf.py            33     33     0%
bgpcfgd/managers_setsrc.py          45     45     0%
bgpcfgd/runner.py                   39     39     0%
bgpcfgd/template.py                 64     11    83%
bgpcfgd/utils.py                    32     24    25%
bgpcfgd/vars.py                      1      0   100%
----------------------------------------------------
TOTAL                             1128    530    53%
```

**- Which release branch to backport (provide reason below if selected)**

- [ ] 201811
- [x] 201911
- [x] 202006
2020-10-30 08:58:27 -07:00
pavel-shirshov
84405ab953 [bgp]: Enable next-hop-tracking through default (#5600)
**- Why I did it**
FRR introduced [next hop tracking](http://docs.frrouting.org/projects/dev-guide/en/latest/next-hop-tracking.html) functionality.
That functionality requires BGP neighbors to be resolvable before a BGP connection is set up (or an explicit ebgp-multihop command). Sometimes (BGP MONITORS) our neighbors are not directly connected and the sessions are IBGP. In this case the current configuration prevents FRR from establishing BGP connections, and the reason shown is "waiting for NHT". To fix that, we need to either add a static route for each IBGP neighbor that is not directly connected, or enable the command `ip nht resolve-via-default`.

**- How I did it**
Put `ip nht resolve-via-default` into the config
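A simplified model of why the knob matters can be sketched as follows. Without it, only a non-default route may resolve a neighbor's address. With it, the default route also counts. This illustrates the concept only and is not FRR's next-hop-tracking code. The route list is hypothetical.

```python
import ipaddress
from typing import Iterable


def nht_resolves(neighbor: str, routes: Iterable[str], resolve_via_default: bool) -> bool:
    """Return True if some route resolves `neighbor` under the simplified NHT rule.

    A covering route with prefix length > 0 always resolves the neighbor; the
    default route (0.0.0.0/0 or ::/0) counts only when resolve-via-default is on.
    """
    addr = ipaddress.ip_address(neighbor)
    for route in routes:
        net = ipaddress.ip_network(route)
        if net.version != addr.version or addr not in net:
            continue
        if net.prefixlen > 0 or resolve_via_default:
            return True
    return False


routes = ["10.0.0.0/31", "0.0.0.0/0"]
print(nht_resolves("10.20.30.40", routes, resolve_via_default=False))  # False
print(nht_resolves("10.20.30.40", routes, resolve_via_default=True))   # True
```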

**- How to verify it**
Build an image, enable a BGP_MONITOR entry, and check that the session is Established or Connecting in FRR.

Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2020-10-13 22:42:29 -07:00
abdosi
9202b1c7eb Fix monit complaining about snmp on the 201911 branch. (#5612)
There is a difference between master and 201911
in how sonic_ax_impl is started.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2020-10-13 17:17:43 -07:00
Mahesh Maddikayala
f354a20d94 [ECMP][Multi-ASIC] Have different ECMP seed value on each ASIC (#5357)
* Calculate ECMP hash seed based on ASIC ID on multi ASIC platform. Each ASIC will have a unique ECMP hash seed value.
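One way to derive a distinct seed per ASIC is sketched below. The sketch assumes ASIC namespaces are named `asicN` and that the seed is a base value offset by the ASIC ID. Both the base value and the offset scheme are hypothetical and may differ from the actual template.

```python
import re
from typing import Optional


def asic_id_from_namespace(namespace: str) -> Optional[int]:
    """Map a namespace name such as 'asic3' to its ASIC ID; return None otherwise."""
    match = re.fullmatch(r"asic(\d+)", namespace)
    return int(match.group(1)) if match else None


def ecmp_hash_seed(namespace: str, base_seed: int = 0) -> int:
    """Offset a base seed by the ASIC ID so each ASIC gets a distinct value."""
    asic_id = asic_id_from_namespace(namespace)
    return base_seed if asic_id is None else base_seed + asic_id


print([ecmp_hash_seed(f"asic{i}", base_seed=10) for i in range(4)])  # [10, 11, 12, 13]
```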

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2020-10-13 09:48:57 -07:00
pavel-shirshov
437ad95646 [bgp] Add 'allow list' manager feature (#5513)
Implements a new feature: "BGP Allow list".

This feature allows us to control which IP prefixes are advertised via EBGP from the routes received from EBGP neighbors.
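The filtering idea can be sketched in plain Python. The sketch assumes a received prefix is allowed if it falls inside at least one allow-list prefix, similar in spirit to an FRR prefix-list entry with `le 32` / `le 128`. The function name and the containment rule are assumptions. The feature's real implementation and CONFIG_DB schema are not shown here.

```python
import ipaddress
from typing import Iterable, List


def filter_allowed(received: Iterable[str], allow_list: Iterable[str]) -> List[str]:
    """Keep received prefixes that fall inside at least one allow-list prefix."""
    allowed = [ipaddress.ip_network(p) for p in allow_list]
    kept = []
    for prefix in received:
        net = ipaddress.ip_network(prefix)
        # Compare only same-family networks; subnet_of() rejects mixed families.
        if any(net.version == a.version and net.subnet_of(a) for a in allowed):
            kept.append(prefix)
    return kept


print(filter_allowed(["10.1.0.0/24", "192.168.0.0/24", "fc00::/64"],
                     ["10.0.0.0/8", "fc00::/7"]))  # ['10.1.0.0/24', 'fc00::/64']
```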
2020-10-06 11:15:19 -07:00
abdosi
3a29249e04 [Multi-asic] Fixed Default Route to be BGP Learned and not docker default route for multi-asic platforms (#5548)

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2020-10-06 06:04:31 +00:00
Nazarii Hnydyn
f456f1fd03 [monit]: Fix process checker. (#5480)
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2020-09-30 00:25:37 +00:00
Stephen Sun
e9c2fdbf4a [watermark] Fix error: BUFFER_POOL_WATERMARK isn't enabled by default (#4882) (#5455)
* Fix error: watermarkstat -t buffer_pool doesn't work

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2020-09-29 13:59:26 -07:00
arlakshm
c8f92232ef Vtysh support for multi asic (#5479)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2020-09-29 19:40:37 +00:00
Abhishek Dosi
04725bc030 Revert "[bgp] Add 'allow list' manager feature (#5309)"
This reverts commit b5d33b39de.
2020-09-29 15:39:04 +00:00
judyjoseph
4dbe391b9a [multi-Asic] Add support for multi-asic to swssloglevel (#5316)
* Support for multi-asic platform for swssloglevel command

admin@str-acs-1:~$ swssloglevel 
Usage: /usr/bin/swssloglevel -n [0 to 3] [OPTION]... 

* Update to use the env file to get the PLATFORM string.
2020-09-28 21:15:44 +00:00
Tamer Ahmed
2cc98b4bac [platform] Add Support For Environment Variable File (#5010)
* [platform] Add Support For Environment Variable

This PR adds the ability to read an environment file from /etc/sonic.
The file contains immutable SONiC config attributes such as platform,
hwsku, version, and device_type. The aim is to minimize calls made
into sonic-cfggen during boot time.
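This commit does not name the file or show its format. The following minimal sketch assumes a shell-style `KEY=VALUE` file. The sample keys and values are hypothetical.

```python
from typing import Dict


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse shell-style KEY=VALUE lines into a dict.

    Blank lines and '#' comments are skipped; double quotes around a value
    are stripped.
    """
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


sample = 'PLATFORM=x86_64-example-r0\nHWSKU="Example-HWSKU"\n# comment\n'
print(parse_env_file(sample))  # {'PLATFORM': 'x86_64-example-r0', 'HWSKU': 'Example-HWSKU'}
```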

Signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-09-28 21:14:39 +00:00
pavel-shirshov
b5d33b39de [bgp] Add 'allow list' manager feature (#5309)
Implements a new feature: "BGP Allow list".

This feature allows us to control which IP prefixes are advertised via EBGP from the routes received from EBGP neighbors.
2020-09-28 16:20:27 +00:00
Sumukha Tumkur Vani
d6856aa424 Update conf DB with CA cert & rename ca_crt field (#5448)
2020-09-28 16:19:27 +00:00
Tamer Ahmed
dd87bf7f7c [swss] Start Restore Neighbor After SWSS Config (#5451)
The SWSS config script restores ARP/FDB/routes. The restore neighbor script
uses the ARP information in config DB to restore ARP entries, so it needs
to be started after swssconfig exits.
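This ordering can be expressed with the supervisord dependent-startup plugin options (`dependent_startup`, `dependent_startup_wait_for`) that other commits in this log adopt. The following minimal sketch renders such a program section. The program names and the command path are assumptions for illustration.

```python
import configparser
import io


def restore_neighbors_program() -> str:
    """Render a supervisord program section that starts only after swssconfig exits."""
    cfg = configparser.ConfigParser()
    cfg["program:restore_neighbors"] = {
        "command": "/usr/bin/restore_neighbors.py",  # hypothetical path
        # supervisord does not autostart it; the dependent-startup event
        # listener starts it once the dependency below is satisfied.
        "autostart": "false",
        "dependent_startup": "true",
        # Wait until the swssconfig program has reached the EXITED state.
        "dependent_startup_wait_for": "swssconfig:exited",
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


print(restore_neighbors_program())
```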

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-09-28 16:15:19 +00:00
Joe LeVeque
b70c6f72b2 [dockers][supervisor] Increase event buffer size for dependent-startup (#5247)
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:

```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```

This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
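The sizing rule above can be sketched as follows, together with the `buffer_size` option of the dependent-startup event listener section. Only the keys relevant here are rendered. Which containers count as having "more programs or templated" configs is left to the caller, and the `bgp` example below is a hypothetical classification.

```python
import configparser
import io


def dependent_startup_buffer_size(container: str, many_or_templated_programs: bool) -> int:
    """Buffer size per container, following the rules described above."""
    if container == "swss":
        return 100  # swss was observed filling the buffer to ~50
    return 50 if many_or_templated_programs else 25


def render_listener(buffer_size: int) -> str:
    """Render only the dependent-startup event listener keys relevant here."""
    cfg = configparser.ConfigParser()
    cfg["eventlistener:dependent-startup"] = {
        "events": "PROCESS_STATE",
        # Size of the listener's event queue; supervisord's default is 10.
        "buffer_size": str(buffer_size),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


print(render_listener(dependent_startup_buffer_size("bgp", many_or_templated_programs=True)))
```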

Resolves https://github.com/Azure/sonic-buildimage/issues/5241
2020-09-28 16:12:53 +00:00
yozhao101
7580c846ad [201911][Monit] Unmonitor processes in disabled containers (#5462)
We want Monit to unmonitor the processes in containers that are disabled in the `FEATURE` table, so that
Monit does not write false alert messages to the syslog.
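The following minimal sketch shows the idea. For every feature whose status is `disabled`, it issues `monit unmonitor <service>` for that feature's processes. The feature-to-service mapping and the service names are hypothetical, and reading the `FEATURE` table itself is not shown.

```python
from typing import Dict, List


def unmonitor_commands(feature_status: Dict[str, str],
                       monit_services: Dict[str, List[str]]) -> List[List[str]]:
    """Build `monit unmonitor <service>` for every process of a disabled feature."""
    cmds: List[List[str]] = []
    for feature, status in sorted(feature_status.items()):
        if status != "disabled":
            continue
        for service in monit_services.get(feature, []):
            cmds.append(["monit", "unmonitor", service])
    return cmds


features = {"sflow": "disabled", "telemetry": "enabled"}
services = {"sflow": ["sflowmgrd"], "telemetry": ["telemetry"]}
print(unmonitor_commands(features, services))  # [['monit', 'unmonitor', 'sflowmgrd']]
```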

- Backport of https://github.com/Azure/sonic-buildimage/pull/5153 to the 201911 branch

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2020-09-25 00:30:41 -07:00
gechiang
6ae77f87cc Renamed sonic-bgpcfgd/bgpmon_proj directory to sonic-bgpcfgd/bgpmon so it is in sync with the master branch naming change. Also enabled auto restart for bgpmon (#5453)
Sync up the changes from the master branch, where bgpmon_proj was renamed to bgpmon.
Enabled autorestart for bgpmon in supervisord.
2020-09-24 08:57:55 -07:00
gechiang
7168fc8c07 Add bgpmon under sonic-bgpcfgd to be started as a new daemon under BGP docker (#5426)
This is to port the same set of changes from master branch to 201911 branch for the bgpmon daemon running under bgp docker.
2020-09-22 12:13:37 -07:00
lguohan
13d28f9d19 [docker-base-stretch]: install rsyslog from stretch-backports (#5411)
Install a newer version of rsyslog from stretch-backports to support -iNONE.

A previous backport from master uses the -iNONE option, which is only
available after v8.32.0.
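A version gate for the option could look like the sketch below. Treating 8.32.0 itself as the first supporting version is an assumption, because "after v8.32.0" is ambiguous. The `minimum` parameter makes that easy to adjust. Debian epochs (e.g. `1:`) are not handled.

```python
from typing import Tuple


def parse_version(text: str, width: int = 3) -> Tuple[int, ...]:
    """Parse the leading numeric dotted part of a version string.

    '8.24.0-1' -> (8, 24, 0). The result is zero-padded to `width` fields so
    '8.32' compares equal to '8.32.0'.
    """
    fields = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        fields.append(int(digits))
        if len(digits) < len(piece):
            break  # a suffix such as '-1' ends the numeric part
    fields += [0] * (width - len(fields))
    return tuple(fields)


def supports_inone(rsyslog_version: str, minimum: str = "8.32.0") -> bool:
    """True if the given rsyslog version is at least `minimum`."""
    return parse_version(rsyslog_version) >= parse_version(minimum)


print(supports_inone("8.24.0-1"), supports_inone("8.32.0"))  # False True
```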

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-09-21 02:08:30 -07:00
Prince Sunny
20f627044f Add new DB for Restapi to database config (#5350)
2020-09-19 14:08:36 -07:00
Tamer Ahmed
56cab18501 [dhcpmon] Print Both Snapshot And Current Counters (#5374)
Printing both the snapshot and current counter sets makes it easier to pinpoint
which message types are not being relayed. This PR prints both counter sets.
Also, this PR sets gnu11 as the C standard to compile with, to avoid
making code changes when porting to the 201811 branch.

Signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-09-19 14:06:25 -07:00
Tamer Ahmed
b27ba0630c [dhcpmon] Monitor Mgmt Interface For DHCP Packets (#5317)
When BGP routes are missing, DHCP packets get relayed over the mgmt
interface. This results in dhcpmon alerting that DHCP packets are
not being relayed. This PR includes the mgmt interface as an uplink
device, so if a DHCP packet gets relayed over the mgmt interface,
the regular dhcpmon alert will not be issued. Instead, dhcpmon will
check the mgmt interface counters and issue a separate alert about
packets travelling through the mgmt network.

In addition, this PR includes the following enhancements:
1. Add a SIGUSR1 handler that prints out current packet counts
2. Increase the alert grace window from 2 minutes to 3 minutes
3. Time is now computed more accurately
4. Print vlan name before counters
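dhcpmon itself is written in C. The following simplified Python model shows the alert decision described above. It compares a counter snapshot with the current counters once the 3-minute grace window has passed, and it tells "not relayed at all" apart from "relayed only over mgmt". The counter names and messages are hypothetical, and the real daemon tracks counters per DHCP message type.

```python
from dataclasses import dataclass
from typing import Optional

GRACE_WINDOW_S = 3 * 60  # alert grace window from the commit message


@dataclass
class Counters:
    """Hypothetical per-VLAN DHCP counters (names are illustrative)."""
    vlan_rx: int    # DHCP packets received from clients on the VLAN
    uplink_tx: int  # packets relayed out of front-panel uplinks
    mgmt_tx: int    # packets relayed out of the mgmt interface


def check(snapshot: Counters, current: Counters, elapsed_s: float) -> Optional[str]:
    """Return an alert message, or None if no alert is due."""
    if elapsed_s < GRACE_WINDOW_S:
        return None
    received = current.vlan_rx - snapshot.vlan_rx
    via_uplink = current.uplink_tx - snapshot.uplink_tx
    via_mgmt = current.mgmt_tx - snapshot.mgmt_tx
    if received == 0 or via_uplink > 0:
        return None  # nothing to relay, or relaying over uplinks works
    if via_mgmt > 0:
        return "DHCP packets are being relayed over the mgmt network"
    return "DHCP packets are not being relayed"
```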

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-09-19 14:05:49 -07:00
Joe LeVeque
c3117bc35e [lldpmgrd] Inherit DaemonBase class from sonic-py-common package (#5370)
Eliminate duplicate logging and signal handling code by inheriting from DaemonBase class in sonic-py-common package.
2020-09-19 13:59:01 -07:00
Tamer Ahmed
4f7c346c53 [swss] Start Arp Update Process (#5391)
The ARP update process was not being started because the supervisor
config directory name had an extra 'd', as in
'/etc/supervisord/conf.d/arp_update.conf'.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-09-19 13:52:18 -07:00
Joe LeVeque
1ee4fa5a40 [docker-radv] Fix startup issues (#5230)
**- Why I did it**

PR https://github.com/Azure/sonic-buildimage/pull/4599 introduced two bugs in the startup of the router advertiser container:

1. References to the `wait_for_intf.sh` script were changed to `wait_for_link.sh`, but the actual script was not renamed
2. The `ipv6_found` Jinja2 variable added to the supervisor config file goes out of scope before it is read.

**- How I did it**
1. Rename the `wait_for_intf.sh` script to `wait_for_link.sh`
2. Use the Jinja2 "namespace" construct to fix the scope issue
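The scoping issue and the `namespace` fix can be reproduced with Jinja2 (2.10 or later, which introduced `namespace()`). A flag set inside a `for` loop does not survive the loop, but an attribute of a namespace object does. The `vlan_ips` input is hypothetical. Only the `ipv6_found` name comes from the commit.

```python
from jinja2 import Template

# Flag set inside a for loop: the assignment is local to the loop body.
BROKEN = (
    "{% set ipv6_found = false %}"
    "{% for ip in vlan_ips %}{% if ':' in ip %}{% set ipv6_found = true %}{% endif %}{% endfor %}"
    "{{ ipv6_found }}"
)

# Same logic using namespace(), whose attributes can be updated inside the loop.
FIXED = (
    "{% set ns = namespace(ipv6_found=false) %}"
    "{% for ip in vlan_ips %}{% if ':' in ip %}{% set ns.ipv6_found = true %}{% endif %}{% endfor %}"
    "{{ ns.ipv6_found }}"
)

vlan_ips = ["192.168.0.1/21", "fc02:1000::1/64"]
print(Template(BROKEN).render(vlan_ips=vlan_ips))  # False
print(Template(FIXED).render(vlan_ips=vlan_ips))   # True
```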

**- How to verify it**

Ensure all processes in the radv container start properly under the correct conditions (i.e., whether or not there is at least one VLAN with an IPv6 address assigned).
2020-09-04 21:20:08 +00:00
abdosi
e564142df2 Fix the issue as reported in (#5315)
https://github.com/Azure/sonic-buildimage/issues/5255

Root Cause: Waiting on Restore count != 0 can lead to race condition
between orchagent process and swssconfig.sh.

The check for Restore count != 0 is not needed: the State DB
cannot have been flushed, because if it had been, Warm Restart or
swss-restart would not be true either.
2020-09-04 21:10:39 +00:00
Prince Sunny
b1acfb60a7 Skip vnet-vxlan interfaces from generating networks (#5251)
* Skip Vnet interface from generating networks
2020-09-03 15:49:59 -07:00
arlakshm
15a2195236 [Multi-ASIC]:Update the template to add ipinip entry for Loopback4096 (#5235)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
The following changes are made:
- Multi-ASIC platforms have 2 Loopback interfaces, Loopback0 and Loopback4096. IPinIP decap entries need to be added for both of them. Update the ipinip.json.j2 template to add decap entries for Loopback4096.
- Add the corresponding unit test
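The following minimal sketch collects decap destination addresses from both loopbacks. It assumes `LOOPBACK_INTERFACE`-style keys of the form `Name|ip/prefixlen`. The output entry's table and field names are modeled loosely on an APP DB decap entry and should be treated as assumptions. A single combined tunnel entry is a simplification: the real template may separate IPv4 and IPv6 and set further tunnel attributes.

```python
from typing import Dict, Iterable, List


def decap_dst_ips(loopback_keys: Iterable[str],
                  loopbacks=("Loopback0", "Loopback4096")) -> List[str]:
    """Collect decap destination IPs from 'Name|ip/prefixlen' keys of the given loopbacks."""
    ips = []
    for key in loopback_keys:
        name, sep, prefix = key.partition("|")
        if sep and name in loopbacks:
            ips.append(prefix.split("/")[0])
    return ips


def ipinip_decap_entry(dst_ips: List[str]) -> Dict[str, Dict[str, str]]:
    """Build one decap tunnel entry covering every loopback address."""
    return {
        "TUNNEL_DECAP_TABLE:IPINIP_TUNNEL": {
            "tunnel_type": "IPINIP",
            "dst_ip": ",".join(dst_ips),
        }
    }


keys = ["Loopback0", "Loopback0|10.1.0.32/32", "Loopback4096|8.0.0.5/32", "Loopback1|1.1.1.1/32"]
print(ipinip_decap_entry(decap_dst_ips(keys)))
```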
2020-09-03 15:48:39 -07:00
zhenggen-xu
a949cf004e [Build] pin down setuptools for build issues (#5281)
Pin down setuptools version to fix build issues. See: https://github.com/Azure/sonic-buildimage/issues/5279

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
2020-08-31 20:44:39 -07:00
pra-moh
c43a994486 [docker-ptf] add gnmi python client (#4928)
For the telemetry regression test we need a gNMI client on the PTF docker. The gNMI server will be present on the SONiC DUT. We can then call gnmi_get from the PTF docker inside pytest to verify whether the gNMI server streams data successfully.
2020-08-27 08:05:41 -07:00
Mykola F
c243b8a9f5 [201911] Update SAI-Implementation submodule and enable port in/out dropped pkts stats (#5093)
- Enable port buffer drops by default
- Update SAI submodule

Signed-off-by: Mykola Faryma <mykolaf@mellanox.com>
2020-08-25 08:20:05 -07:00
RayWang910012
4810db8447 [monit]: monit_telemetry reports an error when telemetry is in secure mode (#4286)
When telemetry is in secure mode, monit logs an error because its match string includes "--insecure". This change makes the check compatible with both insecure and secure modes.
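The difference between a mode-specific and a mode-agnostic process match can be sketched with regular expressions. The binary path `/usr/sbin/telemetry` and both command lines are assumptions for illustration. The actual monit configuration is not shown in this commit.

```python
import re

# Old-style pattern: matches only when the server runs with --insecure.
INSECURE_ONLY = re.compile(r"/usr/sbin/telemetry\b.*--insecure")
# Mode-agnostic pattern: matches the binary path regardless of TLS flags.
ANY_MODE = re.compile(r"/usr/sbin/telemetry\b")

# Hypothetical command lines for the two modes.
insecure_cmd = "/usr/sbin/telemetry -logtostderr --insecure --port 8080"
secure_cmd = "/usr/sbin/telemetry -logtostderr --port 50051 --server_crt server.cer --server_key server.key"

print(bool(INSECURE_ONLY.search(secure_cmd)), bool(ANY_MODE.search(secure_cmd)))  # False True
```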

Co-authored-by: Ubuntu <ubuntu@ip-10-5-1-21.ap-south-1.compute.internal>
2020-08-24 10:22:25 -07:00
Tamer Ahmed
9514932ed5 [telemetry] Fix telemetry vars template path (#4938)
The template is referenced by a relative path, which could
result in errors if the script is run from root. Add an explicit
path to the template file name.
Also, move the telemetry_var template to the template dir,
and remove the double quotes around the JSON dict.
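A bare relative file name is resolved against the current working directory, not the script's location. The following Python sketch shows one common way to anchor the path. The commit may instead use a fixed absolute path. The script path, `templates` subdirectory and `telemetry_vars.j2` file name are hypothetical.

```python
import os


def template_path(script_file: str, name: str, subdir: str = "templates") -> str:
    """Resolve a template relative to the script's own directory instead of the CWD."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, subdir, name)


# A bare name like "telemetry_vars.j2" would be looked up in the current
# working directory, which changes when the script is run from "/".
print(template_path("/usr/bin/example_script.py", "telemetry_vars.j2"))
```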

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-08-19 16:59:32 -07:00
lguohan
92270544c5 [docker-orchagent]: start portsyncd before orchagent (#4845)
When portsyncd starts, it first enumerates all front panel ports
and marks them as old interfaces. Then, for new front panel ports,
it checks whether their indexes exist in the previous set. If so, it
treats them as old interfaces and ignores them.

We have this check because the Broadcom SAI only removes
front panel ports after SAI switch init.

So, if portsyncd starts after orchagent, new interfaces could be
created before portsyncd starts and be treated as old interfaces.
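The ordering hazard can be modeled in a few lines. Ports enumerated at startup are recorded as old by ifindex, and later ports whose ifindex is already recorded are ignored. The port names and ifindex values are hypothetical, and recreated ports receiving new ifindexes is an assumption of the model.

```python
from typing import Dict, Iterable, List, Tuple


def ports_treated_as_new(enumerated_at_start: Dict[str, int],
                         later_ports: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the ports portsyncd would process as new.

    enumerated_at_start: name -> ifindex of front-panel ports present when
    portsyncd starts; all are recorded as old.
    later_ports: (name, ifindex) of ports that appear afterwards; any whose
    ifindex is already recorded is ignored as an old interface.
    """
    old_indexes = set(enumerated_at_start.values())
    return [name for name, ifindex in later_ports if ifindex not in old_indexes]


# portsyncd first: the port recreated after SAI switch init is seen as new.
print(ports_treated_as_new({"Ethernet0": 5}, [("Ethernet0", 9)]))  # ['Ethernet0']
# orchagent first: the recreated port already exists at enumeration time,
# so it is recorded as old and never processed as new.
print(ports_treated_as_new({"Ethernet0": 9}, []))  # []
```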

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-08-16 08:25:36 -07:00
Joe LeVeque
802e77c3f1 [docker-pmon] Fix copy of fancontrol config file (#5037)
Copy proper fancontrol config file to the proper destination. Also some minor refactoring for code reuse to help prevent issues like this in the future.

Fixes a bug introduced by #4599
2020-08-15 22:35:02 -07:00
Guohan Lu
42f9be1de3 [docker-database]: do not generate pidfile for rsyslogd
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-08-15 22:32:25 -07:00
Guohan Lu
569766f698 [docker-snmp-sv2]: use service dependency in supervisord to start services
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-08-15 22:32:19 -07:00
Guohan Lu
b378b4d249 [docker-dhcp-relay]: use service dependency in supervisord to start services
2020-08-15 22:25:52 -07:00
Guohan Lu
7158ccd30d [docker-teamd]: use service dependency in supervisord to start services
2020-08-15 22:25:46 -07:00
Guohan Lu
1b6b6055e7 [docker-mgmt-framework]: use service dependency in supervisord to start services
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-08-15 22:25:38 -07:00
Guohan Lu
4d2f9d1245 [docker-telemetry]: use service dependency in supervisord to start services
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-08-15 22:25:32 -07:00
Guohan Lu
9f5c5c7a4a [docker-restapi]: use service dependency in supervisord to start services
2020-08-15 22:25:24 -07:00
Guohan Lu
763673993e [docker-pmon]: use service dependency in supervisord to start services
2020-08-15 22:23:50 -07:00
Guohan Lu
aa0b875b03 [docker-sflow]: use service dependency in supervisord to start services
2020-08-15 22:22:00 -07:00