Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
In the multi asic platforms all the ASIC are advertising the same IPv6 /64 network from Loopback4096.
Therefore, the IPv6 loopback address of backend asic is not learnt on the frontend asic.
Change the bgpd.conf.main.conf.j2 template file to advertise the Loopback4096 ipv6 address as /128
Why I did it
To enable test support for BFD-related features, the PTF docker needs to have the proper support for BFD. This PR aims to add BFD support in ptf docker.
How I did it
Clone and build OpenBFDD for PTF docker.
How to verify it
Build locally and verify BFD is supported.
This is to save about 40MB of disk space, since 5 containers
individually install this package.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
(cherry picked from commit bd479cad29)
#### Why I did it
resolves https://github.com/Azure/sonic-buildimage/issues/8779
snmpd writes the below error message in syslog :
snmp#snmpd[27]: truncating integer value > 32 bits
This message is written in syslog when the hrSystemUptime(1.3.6.1.2.1.25.1.1.0 / system uptime) or sysUpTime(1.3.6.1.2.1.1.3 network management portion or snmpd uptime) is queried when either of these counters overflow beyond 32 bit value. This happens the device uptime or snmpd uptime is more than 497 days.
#### How I did it
Reference: https://access.redhat.com/solutions/367093 and https://linux.die.net/man/1/snmpcmd
To avoid seeing this message if the counter grows, the snmpd error log level is changed to display LOG_EMERG, LOG_ALERT, LOG_CRIT, and LOG_DEBUG.
Without this change, LOG_ERR and LOG_WARNING would also be logged in syslog.
#### How to verify it
On a device which is up for more than 497 days, modify supervisord.conf with the change and restart snmp.
Query 1.3.6.1.2.1.1.3 and verify that log message is not seen.
Why I did it
There are scenarios that End-of-RIB comes from a part of the peers arrives after reconciliation. In such scenarios, if the route selection deferral timer has the default value of 360 seconds, FRR would not set up routes and all routes would be removed after reconciliation. This PR reduces the route selection deferral timer so that at least routes to parts of the peers get restored at the point of reconciliation.
Fix#7488
How I did it
Reduce route selection deferral timer for bgp graceful restart to 15 seconds.
- Create a script in the orchagent docker container which listens for these encapsulated packets which are trapped to CPU (indicating that they cannot be routed/no neighbor info exists for the inner packet). When such a packet is received, the script will issue a ping command to the packet's inner destination IP to start the neighbor learning process.
- This script is also resilient to portchannel status changes (i.e. interface going up or down). An interface going down does not affect traffic sniffing on interfaces which are still up. When an interface comes back up, we restart the sniffer to start capturing traffic on that interface again.
**- Why I did it**
I'm updating the jinja2 template to support getting SNMP information from the redis configdb.
I'm using the format approved here:
https://github.com/Azure/SONiC/pull/718
This will pave the way for us to decrement using the snmp.yml in the future.
Right now we will still be using both the snmp.yml and configdb to get variable information in order to create the snmpd.conf via the sonic-cfggen tool.
**- How I did it**
I first updated the SNMP Schema in PR #718 to get that approved as a standardized format.
Then I verified I could add snmp configs to the configdb using this standard schema. Once the configs were added to the configdb then I updated the snmpd.conf.j2 file to support the updates via the configdb while still using the variables in the snmp.yml file in parallel. This way we will have backward compatibility until we can fully migrate to the configdb only.
By updating the snmpd.conf.j2 template and running the sonic-cfggen tool the snmpd.conf gets generated with using the values in both the configdb and snmp.yml file.
Co-authored-by: trvanduy <trvanduy@microsoft.com>
Why I did it
resolves#8979 and #9055
How I did it
Remove the file static.conf.j2,which adds the default route on eth0 from bgp docker
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
- Why I did it
This is to update the common sonic-buildimage infra for reclaiming buffer.
- How I did it
Render zero_profiles.j2 to zero_profiles.json for vendors that support reclaiming buffer
The zero profiles will be referenced in PR [Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles #8768 on Mellanox platforms and there will be test cases to verify the behavior there.
Rendering is done here for passing azure pipeline.
Load zero_profiles.json when the dynamic buffer manager starts
Generate inactive port list to reclaim buffer
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Linkmgrd monitors link status, mux status, and link state. Has
the link becomes unhealthy, linkmgrd will trigger mux switchover
on a standby ToR ensuring uninterrupted service to servers/blades.
This PR is initial implementation of linkmgrd.
Also, docker-mux container hold packages related to maintaining and managing
mux cable. It currently runs linkmgrd binary that monitor and switches
the mux if needed.
This PR also introduces mux-container and starts linkmgrd as startup when
build is configured with INCLUDE_MUX=y
Edit: linkmgrd PR will follow.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Related work items: #2315, #3146150
Why I did it
During swss container startup, if ndppd starts up before/with vlanmgrd, ndppd will be pinned at nearly 100% CPU usage.
How I did it
Only start ndppd after vlanmgrd is running. Also, call ndppd directly instead of through bash for improved logging and to prevent orphaned processes.
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Fix the check used to wait for interfaces to come up. The group name in
the supervisor config files has changed from isc-dhcp-relay to
dhcp-relay.
Also, in the wait script, wait 10 additional seconds after the vlans,
port channels, and any interfaces are up. This is because dhcrelay
listens on all interfaces (in addition to port channels and vlans), and
to ensure that it stays in a clean state during runtime, wait some extra
time to make sure that those interfaces are created as well.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
Pcied running by python 2.
How I did it
dropped python2 support and add python3 support for pcied in file docker-pmon.supervisord.conf.j2
How to verify it
docker exec pmon supervisorctl status
Why I did it
Support to build armhf/arm64 platforms on arm based system without qemu simulator.
When building the armhf/arm64 on arm based system, it is not necessary to use qemu simulator.
How I did it
Build armhf on armhf system, or build arm64 on arm64 system, by default, qemu simulator will not be used.
When building armhf on arm64, and you have enabled armhf docker, then it will build images without simulator automatically. It is based how the docker service is run.
Docker base image change:
For amd64, change from debian:to amd64/debian:
For arm64, change from multiarch/debian-debootstrap:arm64- to arm64v8/debian:
For armhf, change from multiarch/debian-debootstrap:armhf- to arm32v7/debian:
See https://github.com/docker-library/official-images#architectures-other-than-amd64
The mapping relations:
arm32v6 --- armel
arm32v7 --- armhf
arm64v8 --- arm64
Docker image armhf deprecated info: https://hub.docker.com/r/armhf/debian, using arm32v7 instead.
Enable Autorestart of the daemons in PMON for unexpected exit
Remove the daemon list from the critical_process which prevent the PMON
from restarting when the individual daemon crashes.
#### Why I did it
The process of config generation (sonic-cfggen) fails, but the services continue to run with invalid config
#### How I did it
* add exit with error on errors in start.sh script (because supervisord relies on start.sh return code).
* fix jinja template. Jinja use common python expressions under the hood and `has_key` method was removed from dict in py3, so use check by `in` operator as it is supported by both py2 and py3.
#### How to verify it
* compile sonic with enabled iccp.
* add mclag config to CONFIG_DB.
```
'MC_LAG|1' => {
"local_ip": "10.0.0.2",
"peer_ip": "10.0.0.3",
"peer_link": "Ethernet8",
"mclag_interface": "Ethernet12"
}
* unmaks, enable and start swss and iccpd services in sonic.
* log in into the iccpd container and check the config file `/etc/iccpd/iccpd.conf`
* expected config:
```
mclag_id:1
local_ip:10.0.0.2
peer_ip:10.0.0.3
peer_link:Ethernet8
mclag_interface:Ethernet12
system_mac:YOUR_SYSTEM_MAC
#### Description for the changelog
Fixed initial iccpd startup configuration.
#### Why I did it
ethtool can be used to query and change settings such as speed, auto- negotiation and checksum offload on many network devices, especially Ethernet devices.
#### How I did it
add package extension to docker-platform-monitor/Dockerfile.j2
#### Why I did it
The libpci library provides portable access to configuration registers of devices connected to the PCI bus.
#### How I did it
update dockers/docker-platform-monitor/Dockerfile.j2
A recent version of contextlib2 (https://pypi.org/project/contextlib2/21.6.0/#history) has broken Python2 compatibility,
so the version picked up by netaddr when using Python2 must be specified, or else builds fail
Co-authored-by: Tom Zhu <tom.zhu@metaswitch.com>
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.
How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.
How to verify it
I verified this on the device str-7260cx3-acs-1.
#### Why I did it
To avoid the following error
```
Traceback (most recent call last):
File "/usr/local/bin/flush_unused_database", line 10, in <module>
if 'PONG' in output:
TypeError: a bytes-like object is required, not 'str'
```
`communicate` method returns the strings if streams were opened in text mode; otherwise, bytes.
In our case text arg in Popen is not true and that means that `communicate` return the bytes
#### How I did it
Set `text=True` to get strings instead of bytes
#### How to verify it
run `/usr/local/bin/flush_unused_database` inside database container
Why I did it
ndppd by default reads /proc/net/ipv6_route ever 30 seconds. Since T1s advertise so many routes to ToRs, this file is extremely large, and reading it causes ndppd's CPU usage to spike every 30 seconds
How I did it
Increase the delay for reading this file to the maximum possible value (max integer value), which will result in CPU spikes every ~24 days instead of every 30 seconds
How to verify it
Start ndppd with the new config file, confirm that no CPU spikes are seen except at startup
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.
How I did it
I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container.
How to verify it
I verified this implementation on device str-7260cx3-acs-1.
#### Why I did it
To avoid the following logs
```
Mar 15 15:52:04.599302 igk-dut-04 INFO database#/supervisord: flushdb /bin/bash: /usr/local/bin/flush_unused_database: /usr/bin/python: bad interpreter: No such file or directory
Mar 15 15:52:04.599947 igk-dut-04 INFO database#supervisord 2021-03-15 15:52:04,599 INFO exited: flushdb (exit status 126; not expected)
```
#### How I did it
Fix shebang
#### How to verify it
Check the logs
1. Made the command next-hop-self force only applicable on back-end asic bgp. This is done so that BGPL iBGP session running on backend can send e-BGP learn nexthop. Back end asic FRR is able to recursively resolve the eBGP nexthop in its routing table since it knows about all the connected routes advertise from front end asic.
2. Made all front-end asic bgp use global loopback ip (Loopback0) as router id and back end asic bgp use Loopbacl4096 as ruter-id and originator id for Route-Reflector. This is done so that routes learnt by external peer do not see Loopback4096 as router id in show ip bgp <route-prerfix> output.
3. To handle above change need to pass Loopback4096 from BGP manager for jinja2 template generation. This was missing and this change/fix is needed for this also https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-fpm-frr/frr/bgpd/templates/dynamic/instance.conf.j2#L27
4. Enhancement to add mult_asic specific bgpd template generation unit test cases.