Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`subprocess` is used with `shell=True`, which is very dangerous for shell injection.
`os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content
#### How I did it
remove `shell=True`, use `shell=False`
Replace `os` by `subprocess`
#### Why I did it
Currently at the Azure build system, the P4RT container is disabled by default at the build time. Here the goal is to include the P4RT container at the build time while disabling it at the runtime. The user can enable/disable the p4rt app through the config based on the preference.
#### How I did it
Changed the config in rules/config and init-cfg.json.j2
* Fix to improve hostname handling
If config_db.json is missing hostname entry, hostname-config.sh ends
up deleting existing entry too and hostname changes to default 'localhost'
* default hostname to 'sonic` if missing in config file
Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>
What I did
Adding the dynamic headroom calculation support for Barefoot platforms.
Why I did it
Enabling dynamic mode for barefoot case.
How I verified it
The community tests are adjusted and pass.
* Add smartmontools to pmon docker
* Set smartmontools to install version 7.2-1 in pmon to match host; clean up smartmontools build files
* Add comments on smartmontools version for both host and pmon
Why I did it
BGP service has always been starting after interface-config. However, recently we discovered an issue where some BGP sessions are unable to establish due to BGP daemon not able to read the interface IP.
This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.
How I did it
Delaying starting BGP seems to be a workaround for this issue.
However, caution is that this delay might impact warm reboot timing and other timing sequences.
This workaround is reducing the probability of hitting the issue by close to 100X. However, this workaround is not bulletproof as test shows. It is still preferrable to have a proper FRR fix and revert this change in the future.
How to verify it
Continuously issuing config reload and check BGP session status afterwards.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Remove swsssdk from sonic OS image and docker image
#### Why I did it
swsssdk is deprecated, so need remove from image.
#### How I did it
Update config file to remove swsssdk from image.
#### How to verify it
Pass all test case.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Remove swsssdk from sonic OS image and docker image
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
There's an odd crash that intermittently happens after the teamd container
exits, and a signal is raised to the main thread to exit. This thread (watching
teamd) continues execution because it's in a `while True`. The subsequent wait
call on the teamd container very likely returns immediately, and it calls
`is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these
cases, sometimes, there is a crash in the transition from C code to Python code
(after the function gets executed). Python sees that this thread got a signal
to exit, because the main thread is exiting, and tells pthread to exit the
thread. However, during the stack unwinding, _something_ is telling the
unwinder to call `std::terminate`. The reason is unknown.
This then results in a python3 SIGABRT, and systemd then doesn't call the stop
script to actually stop the container (possibly because the main process exited
with a SIGABRT, so it's a hard crash). This means that the container doesn't
actually get stopped or restarted, resulting in an inconsistent state
afterwards.
The workaround appears to be that if we know the main thread needs to exit,
just return here, and don't continue execution. This at least tries to avoid it
from getting into the problematic code path. However, it's still feasible to
get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals
the main thread to exit, and then syncd exits, and syncd calls one of the two C
functions, potentially hitting the issue).
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
interfaces-config service restarts networking service, during the restart loopback interface address is being removed and reassigned back, leaving loopback without an ipv4 address for a while.
On SONiC startup and config reload interfaces-config and bgp services start in parallel and sometimes
fpmsyncd in bgp attempts bind to loopback while it does not have an address, fails with the log
Exception "Cannot assign requested address" had been thrown in daemon
and exits with rc 0.
root@sonic:/# supervisorctl status
fpmsyncd EXITED Jul 20 05:04 AM
zebra RUNNING pid 35, uptime 6:15:05
zsocket EXITED Jul 20 05:04 AM
docker logs bgp
INFO exited: fpmsyncd (exit status 0; expected)
With fpmsyncd dead, configured routes do not appear in the database.
- How I did it
Added ordering dependency on interfaces-config service into bgp.config
- How to verify it
Itself the issue reproduces quite rarely, but one can gain the time interval between networking down and networking up in interfaces-config.sh like this:
diff --git a/files/image_config/interfaces/interfaces-config.sh b/files/image_config/interfaces/interfaces-config.sh
index f6aa4147a..87caceeff 100755
--- a/files/image_config/interfaces/interfaces-config.sh
+++ b/files/image_config/interfaces/interfaces-config.sh
@@ -63,7 +63,11 @@ done
# Read sysctl conf files again
sysctl -p /etc/sysctl.d/90-dhcp6-systcl.conf
-systemctl restart networking
+# systemctl restart networking
+
+systemctl start networking
+sleep 10
+systemctl stop networking
# Clean-up created files
rm -f /tmp/ztp_input.json /tmp/ztp_port_data.json
with this change the issue reproduces on every config reload.
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
* Make client indentity by AME cert
* Join k8s cluster by ipv6
* Change join test cases
* Test case bug fix
* Improve read node label func
* Configure kubelet and change test cases
* For kubernetes version 1.22.2
* Fix undefine issue
Signed-off-by: Yun Li <yunli1@microsoft.com>
Multi-asic Docker instances are created behind Docker's default bridge
which doesn't allow talking to other Docker instances that are in the
host network (like database-chassis).
On linecards, we configure midplane interfaces to let per-asic docker
containers talk to CHASSIS_DB on the supervisor through internal chassis
network.
On the supervisor we don't need to use chassis internal network, but we
still need a similar setup in order to allow fabric containers to talk
to database-chassis
The timer execution may fail if triggered during a config reload
(when the sonic.target is stopped). This might happen in a rare
situation if config reload is executed after reboot in a small
time slot (for 0 to 30 seconds) before the tacacs-config timer
is triggered. To ensure that timer execution will be resumed after
a config reload the WantedBy section of the systemd service is updated
to describe relation to sonic.target.
Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
Why I did it
If the SWSS services was restarted, the MACsec service should also be restarted. Otherwise the data in wpa_supplicant and orchagent will not be consistent.
How I did it
Add dependency in docker-macsec.mk.
How to verify it
Manually check by 'sudo service swss restart'.
The MACsec container should be started after swss, the syslog will look like
Sep 8 14:36:29.562953 sonic INFO swss.sh[9661]: Starting existing swss container with HWSKU Force10-S6000
Sep 8 14:36:30.024399 sonic DEBUG container: container_start: BEGIN
...
Sep 8 14:36:33.391706 sonic INFO systemd[1]: Starting macsec container...
Sep 8 14:36:33.392925 sonic INFO systemd[1]: Starting Management Framework container...
Signed-off-by: Ze Gan <ganze718@gmail.com>
It could happen that a container has already crashed but docker-wait-any
will wait forever till it starts. It should, however, immediately exit
to make the serivce restart.
#### Why I did it
It is observed in some circumstances that the auto-restart mechanism does not work. Specifically for ```swss.service```, ```orchagent``` had crashed before ```docker-wait-any``` started in ```swss.sh```. This led ```docker-wait-any``` wait forever for ```swss``` to be in ```"Running"``` state and it results in:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1abef1ecebff bcbca2b74df6 "/usr/local/bin/supe…" 22 hours ago Up 22 hours what-just-happened
3c924d405cd5 docker-lldp:latest "/usr/bin/docker-lld…" 22 hours ago Up 22 hours lldp
eb2b12a98c13 docker-router-advertiser:latest "/usr/bin/docker-ini…" 22 hours ago Up 22 hours radv
d6aac4a46974 docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours mgmt-framework
d880fd07aab9 docker-platform-monitor:latest "/usr/bin/docker_ini…" 22 hours ago Up 22 hours pmon
75f9e22d4fdd docker-snmp:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours snmp
76d570a4bd1c docker-sonic-telemetry:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours telemetry
ee49f50344b3 docker-syncd-mlnx:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours syncd
1f0b0bab3687 docker-teamd:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours teamd
917aeeaf9722 docker-orchagent:latest "/usr/bin/docker-ini…" 22 hours ago Exited (0) 22 hours ago swss
81a4d3e820e8 docker-fpm-frr:latest "/usr/bin/docker_ini…" 22 hours ago Up 22 hours bgp
f6eee8be282c docker-database:latest "/usr/local/bin/dock…" 22 hours ago Up 22 hours database
```
The check for ```"Running"``` state is not needed because for cold boot case we do ```start_peer_and_dependent_services``` and for warm boot case the loop will retry to wait for container if this container is doing warm boot:
d01a91a569/files/image_config/misc/docker-wait-any (L56)
#### How I did it
Removed the check for ```"Running"```.
#### How to verify it
Kill swss before ```docker-wait-any``` is reached and verify auto restart will restart swss serivce.
With this PR in, you flap BGP and use events_tool to see the published events.
With telemetry PR #111 in and corresponding submodule update done in buildimage, one could run gnmi_cli to capture BGP flap events.
* [mux] skip mux operations during warm shutdown
- Enhance write_standby.py script to skip actions during warm shutdown.
- Expand the support to BGP service.
- MuX support was added by a previous PR.
- don't skip action during warm recovery
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
After pinging any failed IPv6 neighbor entries, set the remaining failed/incomplete entries to a permanent INCOMPLETE state. This manual setting to INCOMPLETE prevents these entries from automatically transitioning to FAILED state, and since they are now incomplete any subsequent NA messages for these neighbors is able to resolve the entry in the cache.
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
As part of PR #11754
Change was added to use variable SONIC_DB_NS_CLI for
namespace but that will not work since ./files/scripts/syncd_common.sh
uses SONIC_DB_CLI. So revert back to use SONIC_DB_CLI and define new
variable for SONIC_GLOBAL_DB_CLI for global/host db cli access
Also fixed DB_CLI not working for namespace.
#### Why I did it
To deprecate swsssdk, remove all dependency to it.
#### How I did it
Remove swsssdk from rules and build image scripts.
#### How to verify it
Pass all UT and E2E test case
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Remove swsssdk from rules and build image scripts.
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.
The monit 'container_checker' fails in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).
Why I did:
In case of multi-asic platforms gbsyncd is not getting added to Feature Table of Host Config DB. Without this container_checker complains of not needed gbsyncd container's are running.
How I did:
Update Both Host and Namespace config db when gbsyncd docker is starting.
How I verify:
Verified on Multi-asic platforms.
Change `sxdkernel start` to `sxdkernel restart`. If `syncd` service crashes in `ExecStartPre` systemd will not call `ExecStop` and thus will not call `sxdkernel stop`. Use of `sxdkernel restart` is more robust in terms of guarantees to restore the system after unexpected crashes.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
* Add k8s master feature
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Update kubernetes version mistake and make variable passing clear
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Add CRI-dockerd package
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Update version variable passing logic
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Upgrade the worker kubernetes version
Signed-off-by: Yun Li <yunli1@microsoft.com>
* Install xml file parse tool
Signed-off-by: Yun Li <yunli1@microsoft.com>
Signed-off-by: Yun Li <yunli1@microsoft.com>
bgp should be a per-asic service, and runs for each namespace on
multi-asic platforms. However, putting bgp in MULTI_INST_DEPENDENT
causes swss to be restarted as well as bgp. this is causing issues after #11000
Issue: #11653
This fix:
removes bgp from dependents list
adds a conditional that either adds bgp, or bgp@$DEV to separate
between single and multi-asic platforms
When using trap on SIGTERM the script will not react to the SIGTERM signal sent while a child is executing.
I.e, the following script does not react on SIGTERM sent to it if it is
waiting for sleep to finish:
```
trap "echo Handled SIGTERM" 0 2 3 15
echo "Before sleep"
sleep inf
echo "After sleep"
```
Instead, trap only on EXIT which covers also a scenario with exit on
SIGINT, SIGTERM.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
In arp_update, check for FAILED or INCOMPLETE kernel neighbor entries and manually ping them to try and resolve the neighbor
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Why I did it
The initial value has to be present for the state machines to work. In active-standby dual-tor scenario, or any hardware mux scenario, the value will be updtaed eventually with a delay.
However, in active-active dual-tor scenario, there is no other mechanism to initialize the value and get state machines started.
So this script will have to write something at start up time.
For active-active dualtor, 'active' is a more preferred initial value, the state machine will switch the state to standby soon if
link prober found link not in good state.
How I did it
Update the script to always provide initial values.
How to verify it
Tested on active-active dual-tor testbed.
Signed-off-by: Ying Xie ying.xie@microsoft.com
*Preventing ebtables rules to be applied on KVM image. The ebtables rules in SONiC are added to prevent ARP as well as L2 forwarding to be blocked in linux kernel since the hardware will take care of the actual L2 forward. However this is not the case with KVM where linux needs to forward even L2 packets
What I did:
Added bgp as a dependent of swss
Why I did it:
bgp container was not restarting on swss crash. When swss crashes, linkmgrd
doesn't initate a switchover because it cannot access the default route from
orchagent. Bringing down bgp with swss will isolate the ToR, causing linkmgrd
to initiate a switchover to the peer ToR avoiding significant packet loss.
How I did it:
Added bgp to DEPENDENT
Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
Spanning from sonic-net/sonic-linkmgrd#76, this PR is to update warm restart finalizer to wait for linkmgrd to be reconciled.
sign-off: Jing Zhang zhangjing@microsoft.com
Why I did it
To make sure finalizer save config after linkmgrd's reconciliation.
How I did it
Add linkmgrd to the reconciliation wait list of warmboot finalizer.
How to verify it
Verified on lab device, linkmgrd reconciled as expected.
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.
- Why I did it
To fix#9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.
- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.
- How to verify it
Run togather with Azure/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit).
This PR fixes issue #11472.
Signed-off-by: liora liora@nvidia.com
Why I did it
In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check.
How I did it
Use systemctl is-active to check if Docker engine is still running.
How to verify it
Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log.
Which release branch to backport (provide reason below if selected)
The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
What I did:
Following changes done for packet based chassis:-
1> Run arp_update on LC's to resolve static route nexthops over backend
port-channel interfaces.
2> On Supervisor make sure arp_update exit gracefully
* Ported Marvell armhf build on x86 for debian buster to use cross-compilation instead of qemu emulation
Current armhf Sonic build on amd64 host uses qemu emulation. Due to the
nature of the emulation it takes a very long time, about 22-24 hours to
complete the build. The change I did to reduce the building time by
porting Sonic armhf build on amd64 host for Marvell platform for debian
buster to use cross-compilation on arm64 host for armhf target. The
overall Sonic armhf building time using cross-compilation reduced to
about 6 hours.
Signed-off-by: marvell <marvell@cpss-build3.marvell.com>
* Fixed final Sonic image build with dockers inside
* Update Dockerfile.j2
Fixed qemu-user-static:x86_64-aarch64-5.0.0-2 .
* Update cross-build-arm-python-reqirements.sh
Added support for both armhf and arm64 cross-build platform using $PY_PLAT environment variable.
* Update Makefile
Added TARGET=<cross-target> for armhf/arm64 cross-compilation.
* Reviewer's @qiluo-msft requests done
Signed-off-by: marvell <marvell@cpss-build3.marvell.com>
* Added new radius/pam patch for arm64 support
* Update slave.mk
Added missing back tick.
* Added libgtest-dev: libgmock-dev: to the buster Dockerfile.j2. Fixed arm perl version to be generic
* Added missing armhf/arm64 entries in /etc/apt/sources.list
* fix libc-bin core dump issue from xumia:fix-libc-bin-install-issue commit
* Removed unnecessary 'apt-get update' from sonic-slave-buster/Dockerfile.j2
* Fixed saiarcot895 reviewer's requests
* Fixed README and replaced 'sed/awk' with patches
* Fixed ntp build to use openssl
* Unuse sonic-slave-buster/cross-build-arm-python-reqirements.sh script (put all prebuilt python packages cross-compilation/install inside Dockerfile.j2). Fixed src/snmpd/Makefile to use -j1 in all cases
* Clean armhf cross-compilation build fixes
* Ported cross-compilation armhf build to bullseye
* Additional change for bullseye
* Set CROSS_BUILD_ENVIRON default value n
* Removed python2 references
* Fixes after merge with the upstream
* Deleted unused sonic-slave-buster/cross-build-arm-python-reqirements.sh file
* Fixed 2 @saiarcot895 requests
* Fixed @saiarcot895 reviewer's requests
* Removed use of prebuilt python wheels
* Incorporated saiarcot895 CC/CXX and other simplification/generalization changes
Signed-off-by: marvell <marvell@cpss-build3.marvell.com>
* Fixed saiarcot895 reviewer's additional requests
* src/libyang/patch/debian-packaging-files.patch
* Removed --no-deps option when installing wheels. Removed unnecessary lazy_object_proxy arm python3 package instalation
Co-authored-by: marvell <marvell@cpss-build3.marvell.com>
Co-authored-by: marvell <marvell@cpss-build2.marvell.com>
- Why I did it
To implement Syslog Source IP feature
In order to include the following commit: 8e5d478 [ssip]: Add CLI (#2191)
- How I did it
Updated syslog config template
Advanced submodule sonic-utilities
ea11b22 [sonic-bootchart] add sonic-bootchart (#2195)
8e5d478 [ssip]: Add CLI (#2191)
1dacb7f Replace pyswsssdk with swsscommon (#2251)
- How to verify it
make configure PLATFORM=mellanox
make target/sonic-mellanox.bin
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
- Why I did it
Support Mellanox-SN4600C-C64 as T1 switch in dual-ToR scenario
This is to port #11032 and #11299 from 202012 to master.
Support additional queue and PG in buffer templates, including both traditional and dynamic model
Support mapping DSCP 2/6 to lossless traffic in the QoS template.
Add macros to generate additional lossless PG in the dynamic model
Adjust the order in which the generic/dedicated (with additional lossless queues) macros are checked and called to generate buffer tables in common template buffers_config.j2
Buffer tables are rendered via using macros.
Both generic and dedicated macros are defined on our platform. Currently, the generic one is called as long as it is defined, which causes the generic one always being called on our platform. To avoid it, the dedicated macrio is checked and called first and then the generic ones.
Support MAP_PFC_PRIORITY_TO_PRIORITY_GROUP on ports with additional lossless queues.
On Mellanox-SN4600C-C64, buffer configuration for t1 is calculated as:
40 * 100G downlink ports with 4 lossless PGs/queues, 1 lossy PG, and 3 lossy queues
16 * 100G uplink ports with 2 lossless PGs/queues, 1 lossy PG, and 5 lossy queues
Signed-off-by: Stephen Sun <stephens@nvidia.com>