In the multi asic platforms all the ASIC are advertising the same IPv6 /64 network from Loopback4096.
Therefore, the IPv6 loopback address of backend asic is not learnt on the frontend asic.
Change this to advertise the Loopback4096 address as /128
* Support readonly vtysh for sudoers (#7383)
Why I did it
Support readonly version of the command vtysh
How I did it
Check if the command starting with "show", and verify only contains single command in script.
* Fix the type issue in rvtysh
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.
#### Why I did it
MSN4700 A1/A0 used different sensor chip but keep the existing platform name *x86_64-mlnx_msn4700-r0*, this is a workaround to replace the sensor conf on MSN4700 A1/A0
#### How I did it
Use a shell script to get the sensor conf path and copy that files to /etc/sensors.d/sensors.conf
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to monitor the critical processes in PMon container by Monit in 201911 branch.
How I did it
I created a template configuration file of Monit and it will be rendered to generate Monit configuration file of PMon container
by a service generate_monit_config.service.
How to verify it
I verified this on a Mellanox device str-msn2700-03 and an Arista device str-a7050-acs-1.
Which release branch to backport (provide reason below if selected)
201811
[x ] 201911
202006
202012
Issue is get_pip.py is moved to pip 21.1 (https://github.com/pypa/get-pip/commits/main) which is not compatible with 3.6.
Issue of pip itself is fixed as part of 21.1.1 in pip community (pypa/pip#9835).
However get-pip.py is still not updated to latest pip. Also get.pip.py does not support python 3.6 version explicitly (pypa/get-pip#88)
Step 15/29 : RUN curl https://bootstrap.pypa.io/get-pip.py | python3.6
---> Running in bece31f49267
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 1891k 100 1891k 0 0 9564k 0 --:--:-- --:--:-- --:--:-- 9600k
Traceback (most recent call last):
File "<stdin>", line 24298, in <module>
File "<stdin>", line 139, in main
File "<stdin>", line 115, in bootstrap
File "<stdin>", line 96, in monkeypatch_for_cert
File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/commands/__init__.py", line 9, in <module>
File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/cli/base_command.py", line 12, in <module>
File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/cli/cmdoptions.py", line 30, in <module>
File "/tmp/tmp5fnxrz0a/pip.zip/pip/_internal/utils/hashes.py", line 2, in <module>
ImportError: cannot import name 'NoReturn'
The command '/bin/sh -c curl https://bootstrap.pypa.io/get-pip.py | python3.6' returned a non-zero code: 1
How I did:
Got the file from https://github.com/pypa/get-pip/tree/21.0 and added to the buildimage
pin pip to the previous release 21.0.1. (Similar is done in other public repos eg: grpc/grpc-java#8115)
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
#### Why I did it
Since we will have multiple `dhcrelay` processes if there exists different VLANs in the table `VLAN_INTERFACE` of `CONIFG_DB`,
we should use unique service name for each `dhcrelay` process in Monit configuration file. Otherwise, Monit service will fail to work.
#### How I did it
I append the VLAN name to the end of each service name such that they are unique.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to monitor critical processes in router advertiser and dhcp_relay containers by Monit.
How I did it
Router advertiser container only ran on T0 device and the T0 device should have at least one VLAN interface
which was configured an IPv6 address. At the same time, router advertiser container will not run on devices of which
the deployment type is 8.
As such, I created a service which will dynamically generate Monit configuration file of router advertiser from a
template.
Similarly Monit configuration file of dhcp_relay was also generated from a template since the number of dhcrelay process in dhcp_relay container is depended on number of VLANs.
How to verify it
I verified this implementation on a DuT.
With the latest 201911 image, the following error was seen on staging devices with TSB command ( for both single asic, multi asic ). Though this err message doesn't affect the TSB functionality, it is good to fix.
admin@STG01-0101-0102-01T1:~$ TSB
BGP0 : % Could not find route-map entry TO_TIER0_V4 20
line 1: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 permit 20
% Could not find route-map entry TO_TIER0_V4 30
line 2: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 deny 30
In addition, in this PR I am fixing the message displayed to user when there are no BGP neighbors configured on that BGP instance. In multi-asic device there could be case where there are no BGP neighbors configured on a particular ASIC.
Why I did it
It was observed that on a multi-asic DUT bootup, the BGP internal sessions between ASIC's was taking more time to get ESTABLISHED than external BGP sessions. The internal sessions was coming up almost exactly 120 secs later.
In multi-asic platform the bgp dockers ( which is per ASIC ) on switch start are bring brought up around the same time and they try to make the bgp sessions with neighbors (in peer ASIC's) which may be not be completely up. This results in BGP connect fail and the retry happens after 120sec which is the default Connect Retry Timer
How I did it
Add the command to set the bgp neighboring session retry timer to 10sec for internal bgp neighbors.
It is possible to have DHCP relay configuration with no servers/
helpers which result in DHCP container to crash. This PR fixes this
issue by not starting DHCP relay for vlans with no DHCP helpers.
resolves: #6931closes: #6931
Do not add program group for dhcp relay with not dhcp helpers
Unit test
This PR is cherry-pick of master
https://github.com/Azure/sonic-buildimage/pull/6920
Why I did it
Add support for BGP Monitors on multi asic SONiC platforms.
How I did it
On multi ASIC SONiC platforms, BGP monitor session will be established from Backend ASIC.
To achieve this following changes are done
Add BGP monitor configuration on the backend ASIC.
The BGP monitor configuration is present in the DPG of the device in minigraph.xml of multi-ASIC device, so this configuration will be added to the config_db of the host, when the minigraph is loaded.
To add configuration for this in the Backend ASIC, a new class MultiAsicBgpMonCfg is added to the hostcfgd service to update the config_db of the backend ASIC when the BGP_MONITOR table of the host config_db is updated.
This way incremental BGP_MONITOR configuration can also be handled.
Changes to establish BGP session with bgp monitor.
Add route in host main routing table to go to one of pre-define backend asic
Add IP table rule on front asic to mark the BGP packets with destination as IPv4 Loopback.
Add IP rule in front asic namespace to match mark BGP packet and lookup default table
Program the default route in FrontEnd asic name space docker default table as part of start.sh of the BGP container.
It need to be done as part of start.sh otherwise FRR default route will get over-written.
How to verify it
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Co-authored-by: Arvind <arlakshm@microsoft.com>
1. Made the command next-hop-self force only applicable on back-end asic bgp. This is done so that BGPL iBGP session running on backend can send e-BGP learn nexthop. Back end asic FRR is able to recursively resolve the eBGP nexthop in its routing table since it knows about all the connected routes advertise from front end asic.
2. Made all front-end asic bgp use global loopback ip (Loopback0) as router id and back end asic bgp use Loopbacl4096 as ruter-id and originator id for Route-Reflector. This is done so that routes learnt by external peer do not see Loopback4096 as router id in show ip bgp <route-prerfix> output.
3. To handle above change need to pass Loopback4096 from BGP manager for jinja2 template generation. This was missing and this change/fix is needed for this also https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-fpm-frr/frr/bgpd/templates/dynamic/instance.conf.j2#L27
4. Enhancement to add mult_asic specific bgpd template generation unit test cases.
Enable BBR config allowas-in 1 for internal peers
Why I did:
To advertise BBR routes learnt via e-BGP peer in one asic/namespace to another iBGP asic/namespace via Route Reflector.
- Introduced TS common file in docker as well and moved common functions.
- TSA/B/C scripts run only in BGP instances for front end ASICs.
In addition skip enforcing it on route maps used between internal BGP sessions.
admin@str--acs-1:~$ sudo /usr/bin/TSA
System Mode: Normal -> Maintenance
and in case of Multi-ASIC
admin@str--acs-1:~$ sudo /usr/bin/TSA
BGP0 : System Mode: Normal -> Maintenance
BGP1 : System Mode: Normal -> Maintenance
BGP2 : System Mode: Normal -> Maintenance
Fix#5812
LLDP conf Jinja2 Template does not verify IPv4 address and can use IPv6 version. This issue does not effect control LLDP daemon. Issue can be reproduced via `test_snmp_lldp` test. LLDP conf Jinja2 Template selects first item from the list of mgmt interfaces.
TESTBED_1 LLDP conf
```
configure ports eth0 lldp portidsubtype local eth0
configure system ip management pattern FC00:3::32
configure system hostname dut-1
```
TESTBED_2 LLDP conf
```
configure ports eth0 lldp portidsubtype local eth0
configure system ip management pattern 10.22.24.61
configure system hostname dut-2
```
TESTBED_1 MGMT_INTERFACE
```
$ redis-cli -n 4 keys "*" | grep MGMT_INTERFACE
MGMT_INTERFACE|eth0|10.22.24.53/23
MGMT_INTERFACE|eth0|FC00:3::32/64
```
TESTBED_2 MGMT_INTERFACE
```
$ redis-cli -n 4 keys "*" | grep MGMT_INTERFACE
MGMT_INTERFACE|eth0|FC00:3::32/64
MGMT_INTERFACE|eth0|10.22.24.61/23
```
Signed-off-by: Petro Bratash <petrox.bratash@intel.com>
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com
- Why I did it
This PR has the changes to support having different swss.rec and sairedis.rec for each asic.
The logrotate script is updated as well
- How I did it
Update the orchagent.sh script to use the logfile name options in these PRs(Azure/sonic-swss#1546 and Azure/sonic-sairedis#747)
In multi asic platforms the record files will be different for each asic, with the format swss.asic{x}.rec and sairedis.asic{x}.rec
Update the logrotate script for multiasic platform .
Recent changes brought l2 vlan concept which do not have DHCP
clients behind them and so DHCP relay is not required. Also,
dhcpmon fails to launch on those vlans as their interfaces
lack IP addresses. This PR limit launch of both DHCP relay
and dhcpmon to L3 vlans only.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
Earlier today we found a bug in the SONiC TSA implementation.
TSC shows incorrect output (see below) in case we have a route-map which contains TSA route-map as a prefix.
```
admin@str-s6100-acs-1:~$ TSC
Traffic Shift Check:
System Mode: Not consistent
```
The reason is that TSC implementation has too loose regexps in TSA utilities, which match wrong route-map entries:
For example, current TSC matches following
```
route-map TO_BGP_PEER_V4 permit 200
route-map TO_BGP_PEER_V6 permit 200
```
But it should match only
```
route-map TO_BGP_PEER_V4 permit 20
route-map TO_BGP_PEER_V4 deny 30
route-map TO_BGP_PEER_V6 permit 20
route-map TO_BGP_PEER_V6 deny 30
```
**- How I did it**
I fixed it by using egrep with `^` and `$` regexp markers which match begin and end of the line.
**- How to verify it**
1. Add follwing entry to FRR config:
```
str-s6100-acs-1#
str-s6100-acs-1# conf t
str-s6100-acs-1(config)# route-map TO_BGP_PEER_V4 permit 200
str-s6100-acs-1(config-route-map)# end
```
2. Use the TSC command and check output. It should show normal.
```
admin@str-s6100-acs-1:~$ TSC
Traffic Shift Check:
System Mode: Normal```
* Use 20 and 30 route-map entries instead of 2 and 3 for TSA
* Added support for dynamic "Allow list" default action.
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
Optimizing number of calls made to sonic-cfggen during service
start up as it adds to total system boot up time.
***-Test 1***
there is an average saving of 1 to 1.5 sec between old script and new script
```
root@str-s6000-acs-14:/# time /usr/bin/rest-server-old.sh
Generating temporary TLS server certificate ...
2020/07/09 19:03:33 wrote cert.pem
2020/07/09 19:03:33 wrote key.pem
REST_SERVER_ARGS = -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
/usr/sbin/rest_server -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
real 0m8.790s
user 0m7.993s
sys 0m0.584s
root@str-s6000-acs-14:/# time /usr/bin/rest-server-new.sh
Generating temporary TLS server certificate ...
2020/07/09 19:03:45 wrote cert.pem
2020/07/09 19:03:45 wrote key.pem
REST_SERVER_ARGS = -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
/usr/sbin/rest_server -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
real 0m6.940s
user 0m5.670s
sys 0m0.386s
```
***-Test 2***
Built an image with this change and rest server is running with params as described in test 1 above
```
admin@str-s6000-acs-14:~$ ps -ef | grep rest_server
root 3301 2866 2 02:09 pts/0 00:00:10 /usr/sbin/rest_server -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
```
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to one call during startup when starting swss service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to two calls during startup when starting frr service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to one call during startup when starting radv service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to one call during startup when starting dhcp-relay
service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to once calle during snmp startup
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
This PR limited the number of calls to sonic-cfggen to one call
per iteration instead of current 3 calls per iteration.
The PR also installs jq on host for future scripts if needed.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Treat devices that are ToRRouters (ToRRouters and BackEndToRRouters) the same when rendering templates
Except for BackEndToRRouters belonging to a storage cluster, since these devices have extra sub-interfaces created
Treat devices that are LeafRouters (LeafRouters and BackEndLeafRouters) the same when rendering templates
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Fixed TSA bugs:
1. TSA didn't advertise Loopback ipv6 address
2. TSA and TSB changed BGP dynamic and BGP monitors sessions
**- How to verify it**
Build an image and run on your DUT.
```
admin@str-s6100-acs-1:~$ TSA
System Mode: Normal -> Maintenance
admin@str-s6100-acs-1:~$ vtysh -c 'show bgp ipv4 neighbors 10.0.0.1 advertised-routes'
BGP table version is 6, local router ID is 10.1.0.32, vrf id 0
Default local pref 100, local AS 64601
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 10.1.0.32/32 0.0.0.0 0 32768 i
Total number of prefixes 1
admin@str-s6100-acs-1:~$ vtysh -c 'show bgp ipv6 neighbors fc00::a advertised-routes'
BGP table version is 6, local router ID is 10.1.0.32, vrf id 0
Default local pref 100, local AS 64601
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> fc00:1::/64 :: 0 32768 i
Total number of prefixes 1
admin@str-s6100-acs-1:~$ TSB
System Mode: Maintenance -> Normal
```
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
- Why I did it
Update the routine is_bgp_session_internal() by checking the BGP_INTERNAL_NEIGHBOR table.
Additionally to address the review comment #5520 (comment)
Add timer settings as will in the internal session templates and keep it minimal as these sessions which will always be up.
Updates to the internal tests data + add all of it to template tests.
- How I did it
Updated the APIs and the template files.
- How to verify it
Verified the internal BGP sessions are displayed correctly with show commands with this API is_bgp_session_internal()
* Initial commit for BGP internal neighbor table support.
> Add new template named "internal" for the internal BGP sessions
> Add a new table in database "BGP_INTERNAL_NEIGHBOR"
> The internal BGP sessions will be stored in this new table "BGP_INTERNAL_NEIGHBOR"
* Changes in template generation tests with the introduction of internal neighbor template files.
Increase startretires value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
Why/How I did:
Make sure first error syslog is triggered based on FAULT TOLERANCE condition.
Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent
Updated the monit conf files with repeat every x cycles for the alert action