* Initial commit for BGP internal neighbor table support.
> Add new template named "internal" for the internal BGP sessions
> Add a new table in database "BGP_INTERNAL_NEIGHBOR"
> The internal BGP sessions will be stored in this new table "BGP_INTERNAL_NEIGHBOR"
* Changes in template generation tests with the introduction of internal neighbor template files.
Increase startretires value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
Why/How I did:
Make sure first error syslog is triggered based on FAULT TOLERANCE condition.
Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent
Updated the monit conf files with repeat every x cycles for the alert action
* Fix for LLDP advertisments being sent with wrong information.
Since lldpd is starting before lldpmgr, some advertisment packets might sent with default value, mac address as Port ID.
This fix hold the packets from being sent by the lldpd until all interfaces are well configured by the lldpmgrd.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
* Fix comments
* Fix unit-test output caused a failure during build
* Add 'run_cmd' function and use it
* Resume lldpd even if port init timeout reached
**- Why I did it**
To introduce dynamic support of BBR functionality into bgpcfgd.
BBR is adding `neighbor PEER_GROUP allowas-in 1' for all BGP peer-groups which points to T0
Now we can add and remove this configuration based on CONFIG_DB entry
**- How I did it**
I introduced a new CONFIG_DB entry:
- table name: "BGP_BBR"
- key value: "all". Currently only "all" is supported, which means that all peer-groups which points to T0s will be updated
- data value: a dictionary: {"status": "status_value"}, where status_value could be either "enabled" or "disabled"
Initially, when bgpcfgd starts, it reads initial BBR status values from the [constants.yml](https://github.com/Azure/sonic-buildimage/pull/5626/files#diff-e6f2fe13a6c276dc2f3b27a5bef79886f9c103194be4fcb28ce57375edf2c23cR34). Then you can control BBR status by changing "BGP_BBR" table in the CONFIG_DB (see examples below).
bgpcfgd knows what peer-groups to change fron [constants.yml](https://github.com/Azure/sonic-buildimage/pull/5626/files#diff-e6f2fe13a6c276dc2f3b27a5bef79886f9c103194be4fcb28ce57375edf2c23cR39). The dictionary contains peer-group names as keys, and a list of address-families as values. So when bgpcfgd got a request to change the BBR state, it changes the state only for peer-groups listed in the constants.yml dictionary (and only for address families from the peer-group value).
**- How to verify it**
Initially, when we start SONiC FRR has BBR enabled for PEER_V4 and PEER_V6:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
neighbor PEER_V4 allowas-in 1
neighbor PEER_V6 allowas-in 1
```
Then we apply following configuration to the db:
```
admin@str-s6100-acs-1:~$ cat disable.json
{
"BGP_BBR": {
"all": {
"status": "disabled"
}
}
}
admin@str-s6100-acs-1:~$ sonic-cfggen -j disable.json -w
```
The log output are:
```
Oct 14 18:40:22.450322 str-s6100-acs-1 DEBUG bgp#bgpcfgd: Received message : '('all', 'SET', (('status', 'disabled'),))'
Oct 14 18:40:22.450620 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-f', '/tmp/tmpmWTiuq']'.
Oct 14 18:40:22.681084 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V4 soft in']'.
Oct 14 18:40:22.904626 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V6 soft in']'.
```
Check FRR configuraiton and see that no allowas parameters are there:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
admin@str-s6100-acs-1:~$
```
Then we apply enabling configuration back:
```
admin@str-s6100-acs-1:~$ cat enable.json
{
"BGP_BBR": {
"all": {
"status": "enabled"
}
}
}
admin@str-s6100-acs-1:~$ sonic-cfggen -j enable.json -w
```
The log output:
```
Oct 14 18:40:41.074720 str-s6100-acs-1 DEBUG bgp#bgpcfgd: Received message : '('all', 'SET', (('status', 'enabled'),))'
Oct 14 18:40:41.074720 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-f', '/tmp/tmpDD6SKv']'.
Oct 14 18:40:41.587257 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V4 soft in']'.
Oct 14 18:40:42.042967 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V6 soft in']'.
```
Check FRR configuraiton and see that the BBR configuration is back:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
neighbor PEER_V4 allowas-in 1
neighbor PEER_V6 allowas-in 1
```
*** The test coverage ***
Below is the test coverage
```
---------- coverage: platform linux2, python 2.7.12-final-0 ----------
Name Stmts Miss Cover
----------------------------------------------------
bgpcfgd/__init__.py 0 0 100%
bgpcfgd/__main__.py 3 3 0%
bgpcfgd/config.py 78 41 47%
bgpcfgd/directory.py 63 34 46%
bgpcfgd/log.py 15 3 80%
bgpcfgd/main.py 51 51 0%
bgpcfgd/manager.py 41 23 44%
bgpcfgd/managers_allow_list.py 385 21 95%
bgpcfgd/managers_bbr.py 76 0 100%
bgpcfgd/managers_bgp.py 193 193 0%
bgpcfgd/managers_db.py 9 9 0%
bgpcfgd/managers_intf.py 33 33 0%
bgpcfgd/managers_setsrc.py 45 45 0%
bgpcfgd/runner.py 39 39 0%
bgpcfgd/template.py 64 11 83%
bgpcfgd/utils.py 32 24 25%
bgpcfgd/vars.py 1 0 100%
----------------------------------------------------
TOTAL 1128 530 53%
```
**- Which release branch to backport (provide reason below if selected)**
- [ ] 201811
- [x] 201911
- [x] 202006
**- Why I did it**
FRR introduced [next hop tracking](http://docs.frrouting.org/projects/dev-guide/en/latest/next-hop-tracking.html) functionality.
That functionality requires resolving BGP neighbors before setting BGP connection (or explicit ebgp-multihop command). Sometimes (BGP MONITORS) our neighbors are not directly connected and sessions are IBGP. In this case current configuration prevents FRR to establish BGP connections. Reason would be "waiting for NHT". To fix that we need either add static routes for each not-directly connected ibgp neighbor, or enable command `ip nht resolve-via-default`
**- How I did it**
Put `ip nht resolve-via-default` into the config
**- How to verify it**
Build an image. Enable BGP_MONITOR entry and check that entry is Established or Connecting in FRR
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Calculate ECMP hash seed based on ASIC ID on multi ASIC platform. Each ASIC will have a unique ECMP hash seed value.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
implements a new feature: "BGP Allow list."
This feature allows us to control which IP prefixes are going to be advertised via ebgp from the routes received from EBGP neighbors.
* Support for multi-asic platform for swssloglevel command
admin@str-acs-1:~$ swssloglevel
Usage: /usr/bin/swssloglevel -n [0 to 3] [OPTION]...
* Update to use the env file to get the PLATFORM string.
* [platform] Add Support For Environment Variable
This PR adds the ability to read environment file from /etc/sonic.
the file contains immutable SONiC config attributes such as platform,
hwsku, version, device_type. The aim is to minimize calls being made
into sonic-cfggen during boot time.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
implements a new feature: "BGP Allow list."
This feature allows us to control which IP prefixes are going to be advertised via ebgp from the routes received from EBGP neighbors.
SWSS config script restore ARP/FDB/Routes. Restore neighbor script
uses config DB ARP information to restore ARP entries and so needs
to be started after swssconfig exits.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that
Monit will not generate false alerting messages into the syslog.
- Backport of https://github.com/Azure/sonic-buildimage/pull/5153 to the 201911 branch
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Install a newer version of rsyslog from stretch-backports to support -iNONE
Previous backport from master use -iNONE option which is only
available after v8.32.0
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Printing both snapshot and current counter sets will make it easier to pinpoint
which message type(s) is/are not being relayed. This PR prints both counter sets.
Also, this PR defines gnu11 as a C standard to compile with in order to avoid
making changes when porting to 201811 branch.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
When BGP routes are missing, DHCP packets get relayed over mgmt
interface. This results in dhcpmon alerting that DHCP packets are
not being relayed. This is PR include mgmt interface as uplink
device, and so, if DHCP packet gets relayed over mgmt interface,
regular dhcpmon alert will not be issues. Instead, dhcpmon will
check the mgmt interface counts and issue a separate alert regarding
packets travelling through mgmt network.
In addition, this PR includes the following enhancements:
1. Add SIGUSR1 handler that prints out current packet counts
2. Increase alert grace window to 3 minutes from currently 2 minutes
3. Time is now computed more accurately
4. Print vlan name before counters
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Arp update process was not being started due to an issue with
the directory name having an extra 'd' in supervisor as in
'/etc/supervisord/conf.d/arp_update.conf'.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
PR https://github.com/Azure/sonic-buildimage/pull/4599 introduced two bugs in the startup of the router advertiser container:
1. References to the `wait_for_intf.sh` script were changed to `wait_for_link.sh`, but the actual script was not renamed
2. The `ipv6_found` Jinja2 variable added to the supervisor config file goes out of scope before it is read.
**- How I did it**
1. Rename the `wait_for_intf.sh` script to `wait_for_link.sh`
2. Use the Jinja2 "namespace" construct to fix the scope issue
**- How to verify it**
Ensure all processes in the radv container start properly under the correct conditions (i.e., whether or not there is at least one VLAN with an IPv6 address assigned).
https://github.com/Azure/sonic-buildimage/issues/5255
Root Cause: Waiting on Restore count != 0 can lead to race condition
between orchagent process and swssconfig.sh.
Ideally check of Restore count != 0 is not needed as the State DB
cannot be flushed as if it was flushed then Warm Restart or swss-restart
should not be true also.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
The following changes are done.
- Multi asic platform have 2 Loopback interfaces, Loopback0 and Loopback4096. IPinIP decap entries need to be added for both of them. Update the ipinip.json.j2 template to add decap entries for Loopback4096.
- Add corressponding unit test
For telemetry regression test we need gnmi client to be present on ptfdocker. Gnmi-server will be present on SONiC DuT. Further, we can access gnmi_get from ptfdocker inside pytest to verify gnmi server streaming data successfully or not.
When telemetry is in secure mode ,the monitor will have error log of the match string "--insecure". So I modify to be compatiable with insecure mode and secure mode.
Co-authored-by: Ubuntu <ubuntu@ip-10-5-1-21.ap-south-1.compute.internal>
The template is referenced relative to the script path and this could
results in errors in case script is run from root. Add explicit
path to the template file name.
Also, moving telemetry_var template to template dir.
And remove double quotes from around json dict.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
when portsyncd starts, it first enumerates all front panel ports
and marks them as old interfaces. Then, for new front panel ports
it checks if their indexes exist in previous sets. If yes, it will
treats them as old interfaces and ignore them.
The reason we have this check is because broadcom SAI only removes
front panel ports after sai switch init.
So, if portsyncd starts after orchagent, new interfaces could be
created before portsyncd and treated as old interface.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Copy proper fancontrol config file to the proper destination. Also some minor refactoring for code reuse to help prevent issues like this in the future.
Fixes a bug introduced by #4599