* [Build] Fix the PyYang python package installation issue (#15890)
Why I did it
Fix the armhf build failure.
How to reproduce the issue:
docker run -it debain:bullseye bash
apt-get update && apt-get install -y python3-pip
pip3 install PyYAML==5.4.1
Error message:
Collecting PyYAML==5.4.1
Installing build dependencies ... done
Getting requirements to build wheel ... error
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 /tmp/tmp6xabslgb_in_process.py get_requires_for_build_wheel /tmp/tmp_er01ztl
....
raise AttributeError(attr)
AttributeError: cython_sources
----------------------------------------
WARNING: Discarding d63f2d7597/PyYAML-5.4.1.tar.gz (sha256)=607774cbba28732bfa802b54baa7484215f530991055bb562efbed5b2f20a45e (from https://pypi.org/simple/pyyaml/) (requires-python:>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, !=3.5.*). Command errored out with exit status 1: /usr/bin/python3 /tmp/tmp6xabslgb_in_process.py get_requires_for_build_wheel /tmp/tmp_er01ztl Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement PyYAML==5.4.1
ERROR: No matching distribution found for PyYAML==5.4.1
root@fa2fa92edcfd:/#
But if adding the option --no-build-isolation, then it is good, see fix.
install "PyYAML==5.4.1" --no-build-isolation
The same error can be found in the multiple builds.
Work item tracking
Microsoft ADO (number only): 24567457
How I did it
Add a build option --no-build-isolation.
* Fix docker-platform-monitor python2 issue
* Fix wheel dependency issue
* Fix wheel dependency issue
* Fix the pip2 PyYAML in host image
#### Why I did it
Cherry-pick commits from master to support the snapshot based mirror, and fix the code conflicts. And add the last commit to fix the build broken issue according to the mirror change.
ad162ae0e [Build] Optimize the version control for Debian packages (https://github.com/sonic-net/sonic-buildimage/pull/14557)
38c5d7fce [Build] Support j2 template for debian sources for docker ptf (https://github.com/sonic-net/sonic-buildimage/pull/13198)
5e4826ebf [Ci] Support to use the same snapshot for all platform builds (#13913)
820692563 [Build] Change the default mirror version config file (#13786)
5e4a866e3 [Build] Support Debian snapshot mirror to improve build stability (#13097)
ac5d89c6a [Build] Support j2 template for debian sources (#12557)
#### Why I did it
Bug in script that was passing in null as log level value if missing from config_db
#### How I did it
Added more robust conditional statement
#### How to verify it
1) Remove log_level from config db
2) config reload -y
3) telemetry should not crash
#### Why I did it
This is part of a corresponding change to the pcie daemon that enables it to verify PCI peripherals on a platform against a preconfigured YAML file, and enables the pcied daemon to call the system commands needed for PCI peripheral verification
#### How I did it
Adding aforementioned libraries to the Dockerfile.j2 file
#### How to verify it
run 'which setpci' from the pmon docker - would show the path of the binary
#### Description for the changelog
Modified pmon's Dockerfile.j2 to include pciutils and libpci libraries.
**cherry-pick of SHA: 7de04504c9518d68aa00c304b7376fdff4e1d318**
#### Why I did it
Remove dialout as critical process as it is no longer used in prod. As part of future work, can remove dialout completely
#### How I did it
Remove from critical process list
Why I did it
Fix the docker-base-stretch build issue in nephos platform.
Collecting supervisord-dependent-startup==1.4.0
Downloading d386c3d2cf/supervisord_dependent_startup-1.4.0-py2.py3-none-any.whl
Collecting toposort>=1.5 (from supervisord-dependent-startup==1.4.0)
Downloading 44e51b4216/toposort-1.9.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '/tmp/pip-build-LnROQE/toposort/setup.py'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-LnROQE/toposort/
The other platforms have been upgraded to docker-base-buster, not impacted.
How I did it
Pin the toposort version to 1.7, the package supervisord-dependent-startup has dependency on it.
The toposort>=1.8 only for python3, is not applicable to python2.
What I did:
Updated Jinja Template to enable BGP Graceful Restart based on device role. By default it will be enable only if the device role type is TorRouter.
Why I did:-
By default FRR is configured in Graceful Helper mode. Graceful Restart is needed on T0/TorRouter only since the device can go for warm-reboot. For T1/LeafRouter it need to be in Helper mode only
Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
Why I did it
Arista 7060 platform has a rare and unreproduceable PCIe timeout that could possibly be solved with increasing the switch PCIe timeout value. To do this we'll call a script for this platform to increase the PCIe timeout on boot-up.
No issues would be expected from the setpci command. From the PCIe spec:
"Software is permitted to change the value in this field at any
time. For Requests already pending when the Completion
Timeout Value is changed, hardware is permitted to use either
the new or the old value for the outstanding Requests, and is
permitted to base the start time for each Request either on when
this value was changed or on when each request was issued. "
How I did it
Add "platform-init" support in swss docker similar to how "hwsku-init" is called, only this would be for any device belonging to a platform. Then the script would reside in device data folder.
Additionally, add pciutils dependency to docker-orchagent so it can run the setpci commands.
How to verify it
On bootup of an Arista 7060, can execute:
lspci -vv -s 01:00.0 | grep -i "devctl2"
In order to check that the timeout has changed.
- Skip the interface status check if the interface does not exist. In the future, when the interface is created/comes up this check will be triggered again.
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
- Why I did it
Fixes#11431
- How I did it
dhcp6relay binds to ipv6 addresses configured on these vlan interfaces
Thus check if they are ready before launching dhcp6relay
- How to verify it
Unit Tests
Tested on a live device
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
At SWSS docker init time, check the device subtype and enable tunnel packet handler only if it is dualtor
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.
Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.
The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.
Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:
Program 'container_memory_telemetry'
status Status ok
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 0
last output -
data collected Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:
Program 'container_memory_telemetry'
status Status failed
monitoring status Monitored
monitoring mode active
on reboot start
last exit value 0
last output -
data collected Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.
How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.
How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
Why I did it
Running warm-reboot in a loop for 500 times leads to this error on 318-th iteration:
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors Traceback (most recent call last):
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors File "/usr/bin/restore_neighbors.py", line 24, in <module>
Apr 2 15:56:27.346747 sonic INFO swss#/supervisord: restore_neighbors from scapy.all import conf, in6_getnsma, inet_pton, inet_ntop, in6_getnsmac, get_if_hwaddr, Ether, ARP, IPv6, ICMPv6ND_NS, ICMPv6NDOptSrcLLAddr
Apr 2 15:56:27.346795 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/all.py", line 25, in <module>
Apr 2 15:56:27.346956 sonic INFO swss#/supervisord: restore_neighbors from scapy.route import *
Apr 2 15:56:27.346995 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/route.py", line 205, in <module>
Apr 2 15:56:27.347089 sonic INFO swss#/supervisord: restore_neighbors conf.iface = get_working_if()
Apr 2 15:56:27.347129 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/arch/linux.py", line 128, in get_working_if
Apr 2 15:56:27.347213 sonic INFO swss#/supervisord: restore_neighbors ifflags = struct.unpack("16xH14x", get_if(i, SIOCGIFFLAGS))[0]
Apr 2 15:56:27.347250 sonic INFO swss#/supervisord: restore_neighbors File "/usr/local/lib/python3.7/dist-packages/scapy/arch/common.py", line 31, in get_if
Apr 2 15:56:27.347345 sonic INFO swss#/supervisord: restore_neighbors return ioctl(sck, cmd, struct.pack("16s16x", iff.encode("utf8")))
Apr 2 15:56:27.347365 sonic INFO swss#/supervisord: restore_neighbors OSError: [Errno 19] No such device
The issue was reported to scapy devs secdev/scapy#3369, the fix is secdev/scapy#3371, however there is no released scapy version with this fix right now, thus decided to build scapy v2.4.5 from sources and apply the fix in a form of a patch.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Reduce the disk space taken up during bootup and runtime.
1. Remove python package cache from the base image and from the containers.
2. During bootup, if logs are to be stored in memory, then don't create the `var-log.ext4` file just to delete it later during bootup.
3. For the partition containing `/host`, don't reserve any blocks for just the root user. This just makes sure all disk space is available for all users, if needed during upgrades (for example).
* Remove pip2 and pip3 caches from some containers
Only containers which appeared to have a significant pip cache size are
included here.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Don't create var-log.ext4 if we're storing logs in memory
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Run tune2fs on the device containing /host to not reserve any blocks for just the root user
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
(cherry picked from commit 5617b1ae3e)
Python 2 isn't installed by default in Buster and Bullseye containers,
and the scripts/modules can be used with Python 3, so make sure Python 3
is used.
Why I did it
After the Buster and Bullseye upgrade for the restapi container, processes will no longer start because supervisord is trying to call python and python2, both of which are unavailable.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Why I did it
Radvd.conf.j2 template creates two copies of the vlan interface when there are more than one ipv6 address assigned to a single vlan interface. Changed the format to add prefixes under the same vlan interface block.
How I did it
Modifies radvd.conf.j2 and added unit tests
How to verify it
Configure multiple ipv6 address to the same vlan, start radvd
Unit test will check if radvd.conf with multiple ipv6 addresses is formed correctly
Why I did it
Add TSA/B/C dualtor support
Signed-off-by: Longxiang Lyu lolv@microsoft.com
How I did it
For TSA, toggle all the mux to standby if the device type is dualtor and there are active mux ports.
For TSC, add mux status output.
How to verify it
Run TSA/B/C on a dualtor setup
202012 version of #10119
Why I did it
If the VLAN interface is not up when ndppd starts, it will fail to enable allmulti mode on the interface and be unable to process received NDP packets
The following logs are seen:
/var/log/syslog.33.gz:Feb 18 10:33:12.825406 sonic INFO swss#/supervisord: ndppd (error) Failed to set allmulti: No such device
How I did it
Use the wait_for_link script currently used by radv to delay ndppd startup until the vlan interface is ready
How to verify it
Apply the changes to a device. config reload the device and confirm that the above error logs are not observed when ndppd starts. Run the arp/test_arp_dualtor.py::test_proxy_arp test case and verify it passes.
* [PTF-SAIv2]Add ptf dockre for sai-ptf (saiv2)
Base on current ptf docker create a new docker for sai-ptf(saiv2)
upgrade related package
use the latest ptf and install it
test done:
NOJESSIE=1 NOSTRETCH=1 NOBULLSEYE=1 ENABLE_SYNCD_RPC=y make target/docker-ptf-sai.gz
BLDENV=buster make -f Makefile.work target/docker-ptf-sai.gz
* upgrade the thrift to 014
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
In the multi asic platforms all the ASIC are advertising the same IPv6 /64 network from Loopback4096.
Therefore, the IPv6 loopback address of backend asic is not learnt on the frontend asic.
Change the bgpd.conf.main.conf.j2 template file to advertise the Loopback4096 ipv6 address as /128
Why I did it
To enable test support for BFD-related features, the PTF docker needs to have the proper support for BFD. This PR aims to add BFD support in ptf docker.
How I did it
Clone and build OpenBFDD for PTF docker.
How to verify it
Build locally and verify BFD is supported.
This is to save about 40MB of disk space, since 5 containers
individually install this package.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
(cherry picked from commit bd479cad29)
#### Why I did it
resolves https://github.com/Azure/sonic-buildimage/issues/8779
snmpd writes the below error message in syslog :
snmp#snmpd[27]: truncating integer value > 32 bits
This message is written in syslog when the hrSystemUptime(1.3.6.1.2.1.25.1.1.0 / system uptime) or sysUpTime(1.3.6.1.2.1.1.3 network management portion or snmpd uptime) is queried when either of these counters overflow beyond 32 bit value. This happens the device uptime or snmpd uptime is more than 497 days.
#### How I did it
Reference: https://access.redhat.com/solutions/367093 and https://linux.die.net/man/1/snmpcmd
To avoid seeing this message if the counter grows, the snmpd error log level is changed to display LOG_EMERG, LOG_ALERT, LOG_CRIT, and LOG_DEBUG.
Without this change, LOG_ERR and LOG_WARNING would also be logged in syslog.
#### How to verify it
On a device which is up for more than 497 days, modify supervisord.conf with the change and restart snmp.
Query 1.3.6.1.2.1.1.3 and verify that log message is not seen.
Why I did it
There are scenarios that End-of-RIB comes from a part of the peers arrives after reconciliation. In such scenarios, if the route selection deferral timer has the default value of 360 seconds, FRR would not set up routes and all routes would be removed after reconciliation. This PR reduces the route selection deferral timer so that at least routes to parts of the peers get restored at the point of reconciliation.
Fix#7488
How I did it
Reduce route selection deferral timer for bgp graceful restart to 15 seconds.
- Create a script in the orchagent docker container which listens for these encapsulated packets which are trapped to CPU (indicating that they cannot be routed/no neighbor info exists for the inner packet). When such a packet is received, the script will issue a ping command to the packet's inner destination IP to start the neighbor learning process.
- This script is also resilient to portchannel status changes (i.e. interface going up or down). An interface going down does not affect traffic sniffing on interfaces which are still up. When an interface comes back up, we restart the sniffer to start capturing traffic on that interface again.
**- Why I did it**
I'm updating the jinja2 template to support getting SNMP information from the redis configdb.
I'm using the format approved here:
https://github.com/Azure/SONiC/pull/718
This will pave the way for us to decrement using the snmp.yml in the future.
Right now we will still be using both the snmp.yml and configdb to get variable information in order to create the snmpd.conf via the sonic-cfggen tool.
**- How I did it**
I first updated the SNMP Schema in PR #718 to get that approved as a standardized format.
Then I verified I could add snmp configs to the configdb using this standard schema. Once the configs were added to the configdb then I updated the snmpd.conf.j2 file to support the updates via the configdb while still using the variables in the snmp.yml file in parallel. This way we will have backward compatibility until we can fully migrate to the configdb only.
By updating the snmpd.conf.j2 template and running the sonic-cfggen tool the snmpd.conf gets generated with using the values in both the configdb and snmp.yml file.
Co-authored-by: trvanduy <trvanduy@microsoft.com>
Why I did it
resolves#8979 and #9055
How I did it
Remove the file static.conf.j2,which adds the default route on eth0 from bgp docker
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>