Commit Graph

1055 Commits

Author SHA1 Message Date
mssonicbld
abc92c6248
[ci/build]: Upgrade SONiC package versions (#12452) 2022-10-20 03:23:45 +08:00
mssonicbld
5d2db5068c
[ci/build]: Upgrade SONiC package versions (#12437) 2022-10-18 22:19:35 +08:00
mssonicbld
cfc9af71ef
[ci/build]: Upgrade SONiC package versions (#12418) 2022-10-16 22:24:10 +08:00
mssonicbld
b4e6a06d1a
[ci/build]: Upgrade SONiC package versions (#12409) 2022-10-14 23:51:03 +08:00
Ying Xie
a1365b44c3 [BGP] starting BGP service after swss (#12381)
Why I did it
BGP service has always been starting after interface-config. However, recently we discovered an issue where some BGP sessions are unable to establish due to BGP daemon not able to read the interface IP.

This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.

How I did it
Delaying starting BGP seems to be a workaround for this issue.

However, caution is that this delay might impact warm reboot timing and other timing sequences.

This workaround is reducing the probability of hitting the issue by close to 100X. However, this workaround is not bulletproof as test shows. It is still preferrable to have a proper FRR fix and revert this change in the future.

How to verify it
Continuously issuing config reload and check BGP session status afterwards.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-10-13 16:34:10 +00:00
mssonicbld
3435a8a305
[ci/build]: Upgrade SONiC package versions (#12372) 2022-10-13 02:58:26 +08:00
mssonicbld
1b5d61246a
[ci/build]: Upgrade SONiC package versions (#12324) 2022-10-09 21:44:14 +08:00
Stepan Blyshchak
06f8b1f98a
[auto-ts] add memory check (#10433) (#12291)
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to https://github.com/Azure/SONiC/pull/939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on https://github.com/Azure/sonic-utilities/pull/2116
2022-10-06 08:06:46 -07:00
Prince George
fab37239dd Disable brackted-paste mode off by default (#12285)
* Disable brackted-paste mode off by default

* address review comment
2022-10-06 14:58:46 +00:00
Saikrishna Arcot
ac19e2a8ba [docker-wait-any]: Exit worker thread if main thread is expected to exit (#12255)
There's an odd crash that intermittently happens after the teamd container
exits, and a signal is raised to the main thread to exit. This thread (watching
teamd) continues execution because it's in a `while True`. The subsequent wait
call on the teamd container very likely returns immediately, and it calls
`is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these
cases, sometimes, there is a crash in the transition from C code to Python code
(after the function gets executed).  Python sees that this thread got a signal
to exit, because the main thread is exiting, and tells pthread to exit the
thread.  However, during the stack unwinding, _something_ is telling the
unwinder to call `std::terminate`.  The reason is unknown.

This then results in a python3 SIGABRT, and systemd then doesn't call the stop
script to actually stop the container (possibly because the main process exited
with a SIGABRT, so it's a hard crash). This means that the container doesn't
actually get stopped or restarted, resulting in an inconsistent state
afterwards.

The workaround appears to be that if we know the main thread needs to exit,
just return here, and don't continue execution. This at least tries to avoid it
from getting into the problematic code path. However, it's still feasible to
get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals
the main thread to exit, and then syncd exits, and syncd calls one of the two C
functions, potentially hitting the issue).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-10-06 14:57:53 +00:00
mssonicbld
204cf58221
[ci/build]: Upgrade SONiC package versions (#12278) 2022-10-05 20:38:20 +08:00
Ying Xie
76f7d7fa53 Revert "[auto-ts] add memory check (#10433)"
This reverts commit a2cd0f5d4c.
2022-10-04 21:53:45 +00:00
mssonicbld
1a08069d40
[ci/build]: Upgrade SONiC package versions (#12268) 2022-10-04 21:09:24 +08:00
Stepan Blyshchak
a2cd0f5d4c [auto-ts] add memory check (#10433)
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to https://github.com/Azure/SONiC/pull/939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on https://github.com/Azure/sonic-utilities/pull/2116
2022-10-03 18:58:38 +00:00
mssonicbld
89643d4717
[ci/build]: Upgrade SONiC package versions (#12245) 2022-10-02 21:13:07 +08:00
mssonicbld
a7d088c47c
[ci/build]: Upgrade SONiC package versions (#12191) 2022-09-28 23:25:55 +08:00
mssonicbld
1c5abca0a6
[ci/build]: Upgrade SONiC package versions (#12187) 2022-09-27 08:41:31 +08:00
mssonicbld
99f9c53d19
[ci/build]: Upgrade SONiC package versions (#12142) 2022-09-25 21:57:18 +08:00
Volodymyr Boiko
3d620370f7 [bgp][service] Start bgp service after interfaces-config service (#11827)
- Why I did it
interfaces-config service restarts networking service, during the restart loopback interface address is being removed and reassigned back, leaving loopback without an ipv4 address for a while.
On SONiC startup and config reload interfaces-config and bgp services start in parallel and sometimes
fpmsyncd in bgp attempts bind to loopback while it does not have an address, fails with the log
Exception "Cannot assign requested address" had been thrown in daemon
and exits with rc 0.

root@sonic:/# supervisorctl status
fpmsyncd                         EXITED    Jul 20 05:04 AM
zebra                            RUNNING   pid 35, uptime 6:15:05
zsocket                          EXITED    Jul 20 05:04 AM
docker logs bgp
INFO exited: fpmsyncd (exit status 0; expected)
With fpmsyncd dead, configured routes do not appear in the database.

- How I did it
Added ordering dependency on interfaces-config service into bgp.config

- How to verify it
Itself the issue reproduces quite rarely, but one can gain the time interval between networking down and networking up in interfaces-config.sh like this:

diff --git a/files/image_config/interfaces/interfaces-config.sh b/files/image_config/interfaces/interfaces-config.sh
index f6aa4147a..87caceeff 100755
--- a/files/image_config/interfaces/interfaces-config.sh
+++ b/files/image_config/interfaces/interfaces-config.sh
@@ -63,7 +63,11 @@ done
 # Read sysctl conf files again
 sysctl -p /etc/sysctl.d/90-dhcp6-systcl.conf

-systemctl restart networking
+# systemctl restart networking
+
+systemctl start networking
+sleep 10
+systemctl stop networking

 # Clean-up created files
 rm -f /tmp/ztp_input.json /tmp/ztp_port_data.json
with this change the issue reproduces on every config reload.

Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
2022-09-21 21:15:08 +00:00
Maxime Lorrillere
458b12b4af [Chassis][Voq]Configure midplane network on supervisor (#11725)
Multi-asic Docker instances are created behind Docker's default bridge
which doesn't allow talking to other Docker instances that are in the
host network (like database-chassis).

On linecards, we configure midplane interfaces to let per-asic docker
containers talk to CHASSIS_DB on the supervisor through internal chassis
network.

On the supervisor we don't need to use chassis internal network, but we
still need a similar setup in order to allow fabric containers to talk
to database-chassis
2022-09-21 21:12:40 +00:00
mssonicbld
77b469d7c8
[ci/build]: Upgrade SONiC package versions (#12121) 2022-09-20 21:24:25 +08:00
Oleksandr Ivantsiv
c9ba827773
[202205] [services] Update "WantedBy=" section for tacacs-config.timer. (#11893) (#12080)
Manually cherry-picking #11893

- Why I did it
The timer execution may fail if triggered during a config reload (when the sonic.target is stopped). This might happen in a rare situation if config reload is executed after reboot in a small time slot (for 0 to 30 seconds) before the tacacs-config timer is triggered:

systemctl status tacacs-config.timer
tacacs-config.timer - Delays tacacs apply until SONiC has started
Loaded: loaded (/lib/systemd/system/tacacs-config.timer; enabled-runtime; vendor preset: enabled)
Active: failed (Result: resources) since Mon 2022-08-29 15:53:03 IDT; 1min 28s ago
Trigger: n/a
Triggers: tacacs-config.service

Aug 29 15:47:53 r-boxer-sw01 systemd[1]: Started Delays tacacs apply until SONiC has started.
Aug 29 15:53:03 r-boxer-sw01 systemd[1]: tacacs-config.timer: Failed to queue unit startup job: Transaction for tacacs-config.service/start is destructive (mgmt-framework.timer has 's>
Aug 29 15:53:03 r-boxer-sw01 systemd[1]: tacacs-config.timer: Failed with result 'resources'.

- How I did it
To ensure that timer execution will be resumed after a config reload the WantedBy section of the systemd service is updated to describe relation to sonic.target.

- How to verify it
Reboot the system
After reboot monitor tacacs-config.timer status. 30 seconds before timer activation run "config reload -y" command.
Check system status.

Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
2022-09-19 09:20:10 +03:00
mssonicbld
f361c029c5
[ci/build]: Upgrade SONiC package versions (#11980) 2022-09-19 12:31:16 +08:00
Aryeh Feigin
b8c6e2a45d
Use warm-boot infrastructure for fast-boot (#12026) 2022-09-14 21:23:34 +03:00
Saikrishna Arcot
f1243bad1b
Pin version of bazelisk to v1.13.0 (#12027)
* Pin version of bazelisk to v1.13.0

This tries to avoid builds failures due to the latest version of
bazelisk changing and causing hash mismatches.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-09-08 21:15:35 -07:00
Ying Xie
ee40402ab7 Revert "[build] Fix version of bazelist which is lost acccidently (#12012)"
This reverts commit 36c5787daf.
2022-09-09 04:14:59 +00:00
Liu Shilong
36c5787daf
[build] Fix version of bazelist which is lost acccidently (#12012)
Why I did it
bazelisk package with hash value 1227b24db77557d552701f6add122edc is deleted from github release.
Reproducible build only cached hash value. Package file didn't be cached. Because they are in different pipelines.
Using latest package hash instead.
2022-09-09 07:24:44 +08:00
Ze Gan
0a54c46a0d [docker-macsec]: Add dependencies of MACsec (#11770)
Why I did it
If the SWSS services was restarted, the MACsec service should also be restarted. Otherwise the data in wpa_supplicant and orchagent will not be consistent.

How I did it
Add dependency in docker-macsec.mk.

How to verify it
Manually check by 'sudo service swss restart'.

The MACsec container should be started after swss, the syslog will look like


Sep  8 14:36:29.562953 sonic INFO swss.sh[9661]: Starting existing swss container with HWSKU Force10-S6000
Sep  8 14:36:30.024399 sonic DEBUG container: container_start: BEGIN
...
Sep  8 14:36:33.391706 sonic INFO systemd[1]: Starting macsec container...
Sep  8 14:36:33.392925 sonic INFO systemd[1]: Starting Management Framework container...


Signed-off-by: Ze Gan <ganze718@gmail.com>
2022-09-08 15:50:06 +00:00
Ying Xie
b4bf4aca3f [mux] skip mux operations during warm shutdown (#11937)
* [mux] skip mux operations during warm shutdown

- Enhance write_standby.py script to skip actions during warm shutdown.
- Expand the support to BGP service.
- MuX support was added by a previous PR.
- don't skip action during warm recovery

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-09-08 15:48:56 +00:00
Lawrence Lee
12e6b89d80 [arp_update]: Set failed IPv6 neighbors to incomplete (#11919)
After pinging any failed IPv6 neighbor entries, set the remaining failed/incomplete entries to a permanent INCOMPLETE state. This manual setting to INCOMPLETE prevents these entries from automatically transitioning to FAILED state, and since they are now incomplete any subsequent NA messages for these neighbors is able to resolve the entry in the cache.

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-09-08 15:48:05 +00:00
Stepan Blyshchak
8431d3ab36 [docker-wait-any] immediately start to wait (#11595)
It could happen that a container has already crashed but docker-wait-any
will wait forever till it starts. It should, however, immediately exit
to make the serivce restart.

#### Why I did it

It is observed in some circumstances that the auto-restart mechanism does not work. Specifically for ```swss.service```, ```orchagent``` had crashed before ```docker-wait-any``` started in ```swss.sh```. This led ```docker-wait-any``` wait forever for ```swss``` to be in ```"Running"``` state and it results in:

```
CONTAINER ID   IMAGE                                COMMAND                  CREATED        STATUS                    PORTS     NAMES
1abef1ecebff   bcbca2b74df6                         "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         what-just-happened
3c924d405cd5   docker-lldp:latest                   "/usr/bin/docker-lld…"   22 hours ago   Up 22 hours                         lldp
eb2b12a98c13   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   22 hours ago   Up 22 hours                         radv
d6aac4a46974   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         mgmt-framework
d880fd07aab9   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         pmon
75f9e22d4fdd   docker-snmp:latest                   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         snmp
76d570a4bd1c   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         telemetry
ee49f50344b3   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         syncd
1f0b0bab3687   docker-teamd:latest                  "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         teamd
917aeeaf9722   docker-orchagent:latest              "/usr/bin/docker-ini…"   22 hours ago   Exited (0) 22 hours ago             swss
81a4d3e820e8   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         bgp
f6eee8be282c   docker-database:latest               "/usr/local/bin/dock…"   22 hours ago   Up 22 hours                         database
```

The check for ```"Running"``` state is not needed because for cold boot case we do ```start_peer_and_dependent_services``` and for warm boot case the loop will retry to wait for container if this container is doing warm boot:
d01a91a569/files/image_config/misc/docker-wait-any (L56)

#### How I did it

Removed the check for ```"Running"```.

#### How to verify it

Kill swss before ```docker-wait-any``` is reached and verify auto restart will restart swss serivce.
2022-09-08 15:47:27 +00:00
mssonicbld
dc987ebd2c
[ci/build]: Upgrade SONiC package versions (#11951) 2022-09-05 14:42:32 +08:00
mssonicbld
613d3431d1
[ci/build]: Upgrade SONiC package versions (#11913)
Upgrade SONiC Versions
2022-09-01 15:47:48 +08:00
abdosi
72852cdd02 Address Review Comment to define SONIC_GLOBAL_DB_CLI in gbsyncd.sh (#11857)
As part of PR #11754

    Change was added to use variable SONIC_DB_NS_CLI for
    namespace but that will not work since ./files/scripts/syncd_common.sh
    uses SONIC_DB_CLI. So revert back to use SONIC_DB_CLI and define new
    variable for SONIC_GLOBAL_DB_CLI for global/host db cli access

   Also fixed DB_CLI not working for namespace.
2022-09-01 00:12:56 +00:00
Longxiang Lyu
d7f049ebf0 [mux] Exit to write standby state to active-active ports (#11821)
[mux] Exit to write standby state to `active-active` ports

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2022-09-01 00:11:09 +00:00
andywongarista
0adfd724e6
[202205][Arista] Add initial support for 720DT-48S (#10656) (#11860)
Added initial set of config files to allow for booting and partial traffic testing in SONiC on the 720DT-48S.

How to verify it
- Switch boots
- show interfaces status shows links up on interfaces Ethernet24-51
- Traffic flows with no errors on interfaces Ethernet24-51
2022-08-30 12:39:26 +08:00
Stepan Blyshchak
c60d78dd1f [syncd.sh] 'sxdkernel start' => 'sxdkernel restart' (#11718)
Change `sxdkernel start` to `sxdkernel restart`. If `syncd` service crashes in `ExecStartPre` systemd will not call `ExecStop` and thus will not call `sxdkernel stop`. Use of `sxdkernel restart` is more robust in terms of guarantees to restore the system after unexpected crashes.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-27 16:16:17 +00:00
anamehra
a2bed2ae4a container_checker on supervisor should check containers based on asic presence (#11442)
Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

The monit 'container_checker' fails in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).
2022-08-26 20:50:24 +00:00
Saikrishna Arcot
91e9db005a
[202205]: Update package versions (#11801)
This was done manually, to try to get past a build error due to changing
package versions in Debian.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-08-21 15:23:44 -07:00
abdosi
0355caf20b Added support to add gbsyncd in Feature Table of Host Config DB (#11754)
Why I did:

In case of multi-asic platforms gbsyncd is not getting added to Feature Table of Host Config DB. Without this container_checker complains of not needed gbsyncd container's are running.

How I did:
Update Both Host and Namespace config db when gbsyncd docker is starting.

How I verify:
Verified on Multi-asic platforms.
2022-08-19 15:22:12 +00:00
Nikola Dancejic
f63dc738f9 [swss] Adding conditional for bgp when on multi ASIC platform (#11691)
bgp should be a per-asic service, and runs for each namespace on
multi-asic platforms. However, putting bgp in MULTI_INST_DEPENDENT
causes swss to be restarted as well as bgp. this is causing issues after #11000

Issue: #11653

This fix:

removes bgp from dependents list
adds a conditional that either adds bgp, or bgp@$DEV to separate
between single and multi-asic platforms
2022-08-17 17:10:29 +00:00
Hua Liu
6a2c540cba
[swsscommon] Add c++ version sonic-db-cli from sonic-swss-common (#10825) (#11713)
Cherry pick PR https://github.com/sonic-net/sonic-buildimage/pull/10825 to 202205 branch

#### Why I did it
    Fix sonic-db-cli high CPU usage on SONiC startup issue: https://github.com/sonic-net/sonic-buildimage/issues/10218
    ETA of this issue will be 2022/05/31

#### How I did it
    Re-write sonic-cli with c++ in sonic-swss-common: https://github.com/sonic-net/sonic-swss-common/pull/607
    Modify swss-common rules and slave.mk to install c++ version sonic-db-cli.
    

#### How to verify it
    Pass all E2E test scenario.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111

#### Description for the changelog
    Build and install c++ version sonic-db-cli from swss-common.

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/SONiC/wiki/Configuration.
-->

#### A picture of a cute animal (not mandatory but encouraged)
2022-08-17 15:35:00 +08:00
mssonicbld
5c306cc2e5
[ci/build]: Upgrade SONiC package versions (#11679) 2022-08-15 05:50:59 +00:00
Lawrence Lee
15c80b207c [arp_update]: Resolve failed neighbors on dualtor (#11615)
In arp_update, check for FAILED or INCOMPLETE kernel neighbor entries and manually ping them to try and resolve the neighbor

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-08-11 16:19:25 +00:00
Stepan Blyshchak
3201dc93f6 [swss.sh/syncd.sh] Trap only on EXIT (#11590)
When using trap on SIGTERM the script will not react to the SIGTERM signal sent while a child is executing.
I.e, the following script does not react on SIGTERM sent to it if it is
waiting for sleep to finish:

```

trap "echo Handled SIGTERM" 0 2 3 15

echo "Before sleep"
sleep inf
echo "After sleep"
```

Instead, trap only on EXIT which covers also a scenario with exit on
SIGINT, SIGTERM.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-11 16:18:00 +00:00
Ying Xie
094745f06f [write_standby] update write_standby.py script (#11650)
Why I did it
The initial value has to be present for the state machines to work. In active-standby dual-tor scenario, or any hardware mux scenario, the value will be updtaed eventually with a delay.

However, in active-active dual-tor scenario, there is no other mechanism to initialize the value and get state machines started.
So this script will have to write something at start up time.

For active-active dualtor, 'active' is a more preferred initial value, the state machine will switch the state to standby soon if
link prober found link not in good state.

How I did it
Update the script to always provide initial values.

How to verify it
Tested on active-active dual-tor testbed.

Signed-off-by: Ying Xie ying.xie@microsoft.com
2022-08-09 23:02:09 +00:00
Sudharsan Dhamal Gopalarathnam
871a1c51d8 [vs]Preventing ebtables cfg to be applied on vs (#11585)
*Preventing ebtables rules to be applied on KVM image. The ebtables rules in SONiC are added to prevent ARP as well as L2 forwarding to be blocked in linux kernel since the hardware will take care of the actual L2 forward. However this is not the case with KVM where linux needs to forward even L2 packets
2022-08-08 20:45:28 +00:00
bingwang-ms
fda1290926 Support different DSCP_TO_TC_MAP for T1 in dualtor deployment (#11569)
* Support different DSCP_TO_TC_MAP for T1 in dualtor deployment
2022-08-08 20:44:32 +00:00
Stepan Blyshchak
29d29b9491 [swss.sh] clear counters cache folder on swss cold/fast reload (#11244)
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.

- Why I did it
To fix #9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.

- How to verify it
Run togather with Azure/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-08 20:42:54 +00:00
Nikola Dancejic
32fb4c7772 [swss] Adding bgp container as dependent of swss (#11000)
What I did:
Added bgp as a dependent of swss

Why I did it:
bgp container was not restarting on swss crash. When swss crashes, linkmgrd
doesn't initate a switchover because it cannot access the default route from
orchagent. Bringing down bgp with swss will isolate the ToR, causing linkmgrd
to initiate a switchover to the peer ToR avoiding significant packet loss.

How I did it:
Added bgp to DEPENDENT

Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
2022-08-08 20:40:35 +00:00