Commit Graph

1105 Commits

Author SHA1 Message Date
mssonicbld
abc92c6248
[ci/build]: Upgrade SONiC package versions (#12452) 2022-10-20 03:23:45 +08:00
mssonicbld
5d2db5068c
[ci/build]: Upgrade SONiC package versions (#12437) 2022-10-18 22:19:35 +08:00
mssonicbld
cfc9af71ef
[ci/build]: Upgrade SONiC package versions (#12418) 2022-10-16 22:24:10 +08:00
mssonicbld
b4e6a06d1a
[ci/build]: Upgrade SONiC package versions (#12409) 2022-10-14 23:51:03 +08:00
Ying Xie
a1365b44c3 [BGP] starting BGP service after swss (#12381)
Why I did it
BGP service has always been starting after interface-config. However, recently we discovered an issue where some BGP sessions are unable to establish because the BGP daemon is not able to read the interface IP.

This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.

How I did it
Delaying the start of BGP seems to be a workaround for this issue.

However, a caveat is that this delay might impact warm reboot timing and other timing sequences.

This workaround reduces the probability of hitting the issue by close to 100X. However, as testing shows, it is not bulletproof. It is still preferable to have a proper FRR fix and revert this change in the future.
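As a rough illustration of how such a start-up delay could be expressed, here is a minimal sketch using a systemd drop-in; the unit names and drop-in path are assumptions, not the verbatim change:

```
# Sketch only: order bgp.service after swss.service via a drop-in.
# Unit names and the drop-in path are assumptions for illustration.
mkdir -p /etc/systemd/system/bgp.service.d
cat > /etc/systemd/system/bgp.service.d/10-after-swss.conf <<'EOF'
[Unit]
After=swss.service
EOF
systemctl daemon-reload
```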

How to verify it
Continuously issue config reload and check BGP session status afterwards.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-10-13 16:34:10 +00:00
mssonicbld
3435a8a305
[ci/build]: Upgrade SONiC package versions (#12372) 2022-10-13 02:58:26 +08:00
mssonicbld
1b5d61246a
[ci/build]: Upgrade SONiC package versions (#12324) 2022-10-09 21:44:14 +08:00
Stepan Blyshchak
06f8b1f98a
[auto-ts] add memory check (#10433) (#12291)
#### Why I did it

To support automatic techsupport invocation in case memory usage is too high.

#### How I did it

Implemented according to https://github.com/Azure/SONiC/pull/939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on https://github.com/Azure/sonic-utilities/pull/2116
2022-10-06 08:06:46 -07:00
Prince George
fab37239dd Disable bracketed-paste mode by default (#12285)
* Disable bracketed-paste mode by default

* address review comment
2022-10-06 14:58:46 +00:00
Saikrishna Arcot
ac19e2a8ba [docker-wait-any]: Exit worker thread if main thread is expected to exit (#12255)
There's an odd crash that intermittently happens after the teamd container
exits, and a signal is raised to the main thread to exit. This thread (watching
teamd) continues execution because it's in a `while True`. The subsequent wait
call on the teamd container very likely returns immediately, and it calls
`is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these
cases, sometimes, there is a crash in the transition from C code to Python code
(after the function gets executed).  Python sees that this thread got a signal
to exit, because the main thread is exiting, and tells pthread to exit the
thread.  However, during the stack unwinding, _something_ is telling the
unwinder to call `std::terminate`.  The reason is unknown.

This then results in a python3 SIGABRT, and systemd then doesn't call the stop
script to actually stop the container (possibly because the main process exited
with a SIGABRT, so it's a hard crash). This means that the container doesn't
actually get stopped or restarted, resulting in an inconsistent state
afterwards.

The workaround appears to be that if we know the main thread needs to exit,
just return here, and don't continue execution. This at least tries to keep it
from getting into the problematic code path. However, it's still feasible to
get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals
the main thread to exit, and then syncd exits, and syncd calls one of the two C
functions, potentially hitting the issue).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-10-06 14:57:53 +00:00
mssonicbld
204cf58221
[ci/build]: Upgrade SONiC package versions (#12278) 2022-10-05 20:38:20 +08:00
Ying Xie
76f7d7fa53 Revert "[auto-ts] add memory check (#10433)"
This reverts commit a2cd0f5d4c.
2022-10-04 21:53:45 +00:00
mssonicbld
1a08069d40
[ci/build]: Upgrade SONiC package versions (#12268) 2022-10-04 21:09:24 +08:00
Stepan Blyshchak
a2cd0f5d4c [auto-ts] add memory check (#10433)
#### Why I did it

To support automatic techsupport invocation in case memory usage is too high.

#### How I did it

Implemented according to https://github.com/Azure/SONiC/pull/939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on https://github.com/Azure/sonic-utilities/pull/2116
2022-10-03 18:58:38 +00:00
mssonicbld
89643d4717
[ci/build]: Upgrade SONiC package versions (#12245) 2022-10-02 21:13:07 +08:00
mssonicbld
a7d088c47c
[ci/build]: Upgrade SONiC package versions (#12191) 2022-09-28 23:25:55 +08:00
mssonicbld
1c5abca0a6
[ci/build]: Upgrade SONiC package versions (#12187) 2022-09-27 08:41:31 +08:00
mssonicbld
99f9c53d19
[ci/build]: Upgrade SONiC package versions (#12142) 2022-09-25 21:57:18 +08:00
Volodymyr Boiko
3d620370f7 [bgp][service] Start bgp service after interfaces-config service (#11827)
- Why I did it
interfaces-config service restarts networking service, during the restart loopback interface address is being removed and reassigned back, leaving loopback without an ipv4 address for a while.
On SONiC startup and config reload, the interfaces-config and bgp services start in parallel, and sometimes
fpmsyncd in bgp attempts to bind to loopback while it does not have an address, fails with the log
Exception "Cannot assign requested address" had been thrown in daemon
and exits with rc 0.

root@sonic:/# supervisorctl status
fpmsyncd                         EXITED    Jul 20 05:04 AM
zebra                            RUNNING   pid 35, uptime 6:15:05
zsocket                          EXITED    Jul 20 05:04 AM
docker logs bgp
INFO exited: fpmsyncd (exit status 0; expected)
With fpmsyncd dead, configured routes do not appear in the database.

- How I did it
Added ordering dependency on interfaces-config service into bgp.config

- How to verify it
The issue itself reproduces quite rarely, but one can widen the time interval between networking going down and coming back up in interfaces-config.sh like this:

diff --git a/files/image_config/interfaces/interfaces-config.sh b/files/image_config/interfaces/interfaces-config.sh
index f6aa4147a..87caceeff 100755
--- a/files/image_config/interfaces/interfaces-config.sh
+++ b/files/image_config/interfaces/interfaces-config.sh
@@ -63,7 +63,11 @@ done
 # Read sysctl conf files again
 sysctl -p /etc/sysctl.d/90-dhcp6-systcl.conf

-systemctl restart networking
+# systemctl restart networking
+
+systemctl start networking
+sleep 10
+systemctl stop networking

 # Clean-up created files
 rm -f /tmp/ztp_input.json /tmp/ztp_port_data.json
With this change, the issue reproduces on every config reload.

Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
2022-09-21 21:15:08 +00:00
Maxime Lorrillere
458b12b4af [Chassis][Voq]Configure midplane network on supervisor (#11725)
Multi-asic Docker instances are created behind Docker's default bridge
which doesn't allow talking to other Docker instances that are in the
host network (like database-chassis).

On linecards, we configure midplane interfaces to let per-asic docker
containers talk to CHASSIS_DB on the supervisor through internal chassis
network.

On the supervisor we don't need to use chassis internal network, but we
still need a similar setup in order to allow fabric containers to talk
to database-chassis
2022-09-21 21:12:40 +00:00
mssonicbld
77b469d7c8
[ci/build]: Upgrade SONiC package versions (#12121) 2022-09-20 21:24:25 +08:00
Oleksandr Ivantsiv
c9ba827773
[202205] [services] Update "WantedBy=" section for tacacs-config.timer. (#11893) (#12080)
Manually cherry-picking #11893

- Why I did it
The timer execution may fail if triggered during a config reload (when the sonic.target is stopped). This might happen in a rare situation if config reload is executed after reboot within a small time slot (from 0 to 30 seconds) before the tacacs-config timer is triggered:

systemctl status tacacs-config.timer
tacacs-config.timer - Delays tacacs apply until SONiC has started
Loaded: loaded (/lib/systemd/system/tacacs-config.timer; enabled-runtime; vendor preset: enabled)
Active: failed (Result: resources) since Mon 2022-08-29 15:53:03 IDT; 1min 28s ago
Trigger: n/a
Triggers: tacacs-config.service

Aug 29 15:47:53 r-boxer-sw01 systemd[1]: Started Delays tacacs apply until SONiC has started.
Aug 29 15:53:03 r-boxer-sw01 systemd[1]: tacacs-config.timer: Failed to queue unit startup job: Transaction for tacacs-config.service/start is destructive (mgmt-framework.timer has 's>
Aug 29 15:53:03 r-boxer-sw01 systemd[1]: tacacs-config.timer: Failed with result 'resources'.

- How I did it
To ensure that timer execution will be resumed after a config reload, the WantedBy section of the systemd service is updated to describe its relation to sonic.target.

- How to verify it
Reboot the system
After reboot monitor tacacs-config.timer status. 30 seconds before timer activation run "config reload -y" command.
Check system status.

Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
2022-09-19 09:20:10 +03:00
mssonicbld
f361c029c5
[ci/build]: Upgrade SONiC package versions (#11980) 2022-09-19 12:31:16 +08:00
Aryeh Feigin
b8c6e2a45d
Use warm-boot infrastructure for fast-boot (#12026) 2022-09-14 21:23:34 +03:00
Saikrishna Arcot
f1243bad1b
Pin version of bazelisk to v1.13.0 (#12027)
* Pin version of bazelisk to v1.13.0

This tries to avoid builds failures due to the latest version of
bazelisk changing and causing hash mismatches.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-09-08 21:15:35 -07:00
Ying Xie
ee40402ab7 Revert "[build] Fix version of bazelisk which was lost accidentally (#12012)"
This reverts commit 36c5787daf.
2022-09-09 04:14:59 +00:00
Liu Shilong
36c5787daf
[build] Fix version of bazelisk which was lost accidentally (#12012)
Why I did it
The bazelisk package with hash value 1227b24db77557d552701f6add122edc was deleted from the GitHub release.
The reproducible build only cached the hash value; the package file itself was not cached, because they are in different pipelines.
Use the latest package hash instead.
2022-09-09 07:24:44 +08:00
Ze Gan
0a54c46a0d [docker-macsec]: Add dependencies of MACsec (#11770)
Why I did it
If the SWSS services was restarted, the MACsec service should also be restarted. Otherwise the data in wpa_supplicant and orchagent will not be consistent.

How I did it
Add dependency in docker-macsec.mk.

How to verify it
Manually check by 'sudo service swss restart'.

The MACsec container should be started after swss; the syslog will look like:


Sep  8 14:36:29.562953 sonic INFO swss.sh[9661]: Starting existing swss container with HWSKU Force10-S6000
Sep  8 14:36:30.024399 sonic DEBUG container: container_start: BEGIN
...
Sep  8 14:36:33.391706 sonic INFO systemd[1]: Starting macsec container...
Sep  8 14:36:33.392925 sonic INFO systemd[1]: Starting Management Framework container...


Signed-off-by: Ze Gan <ganze718@gmail.com>
2022-09-08 15:50:06 +00:00
Ying Xie
b4bf4aca3f [mux] skip mux operations during warm shutdown (#11937)
* [mux] skip mux operations during warm shutdown

- Enhance write_standby.py script to skip actions during warm shutdown.
- Expand the support to the BGP service.
- Mux support was added by a previous PR.
- Don't skip actions during warm recovery.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-09-08 15:48:56 +00:00
Lawrence Lee
12e6b89d80 [arp_update]: Set failed IPv6 neighbors to incomplete (#11919)
After pinging any failed IPv6 neighbor entries, set the remaining failed/incomplete entries to a permanent INCOMPLETE state. This manual setting to INCOMPLETE prevents these entries from automatically transitioning to the FAILED state, and since they are now incomplete, any subsequent NA messages for these neighbors are able to resolve the entry in the cache.
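A hedged sketch of the kind of command that pins an unresolved neighbor to INCOMPLETE (address and interface are placeholders; the actual arp_update logic may differ):

```
# Sketch only: mark an unresolved IPv6 neighbor as incomplete so that later
# NA messages can still resolve it (placeholder address/interface).
ip -6 neigh replace fc00::72 dev Vlan1000 nud incomplete
```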

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-09-08 15:48:05 +00:00
Stepan Blyshchak
8431d3ab36 [docker-wait-any] immediately start to wait (#11595)
It could happen that a container has already crashed but docker-wait-any
will wait forever till it starts. It should, however, immediately exit
to make the service restart.

#### Why I did it

It is observed in some circumstances that the auto-restart mechanism does not work. Specifically for ```swss.service```, ```orchagent``` had crashed before ```docker-wait-any``` started in ```swss.sh```. This led ```docker-wait-any``` to wait forever for ```swss``` to be in ```"Running"``` state, and it results in:

```
CONTAINER ID   IMAGE                                COMMAND                  CREATED        STATUS                    PORTS     NAMES
1abef1ecebff   bcbca2b74df6                         "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         what-just-happened
3c924d405cd5   docker-lldp:latest                   "/usr/bin/docker-lld…"   22 hours ago   Up 22 hours                         lldp
eb2b12a98c13   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   22 hours ago   Up 22 hours                         radv
d6aac4a46974   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         mgmt-framework
d880fd07aab9   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         pmon
75f9e22d4fdd   docker-snmp:latest                   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         snmp
76d570a4bd1c   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         telemetry
ee49f50344b3   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         syncd
1f0b0bab3687   docker-teamd:latest                  "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         teamd
917aeeaf9722   docker-orchagent:latest              "/usr/bin/docker-ini…"   22 hours ago   Exited (0) 22 hours ago             swss
81a4d3e820e8   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         bgp
f6eee8be282c   docker-database:latest               "/usr/local/bin/dock…"   22 hours ago   Up 22 hours                         database
```

The check for ```"Running"``` state is not needed because for the cold boot case we do ```start_peer_and_dependent_services```, and for the warm boot case the loop will retry waiting for the container if this container is doing a warm boot:
d01a91a569/files/image_config/misc/docker-wait-any (L56)

#### How I did it

Removed the check for ```"Running"```.

#### How to verify it

Kill swss before ```docker-wait-any``` is reached and verify that auto restart will restart the swss service.
2022-09-08 15:47:27 +00:00
mssonicbld
dc987ebd2c
[ci/build]: Upgrade SONiC package versions (#11951) 2022-09-05 14:42:32 +08:00
mssonicbld
613d3431d1
[ci/build]: Upgrade SONiC package versions (#11913)
Upgrade SONiC Versions
2022-09-01 15:47:48 +08:00
abdosi
72852cdd02 Address Review Comment to define SONIC_GLOBAL_DB_CLI in gbsyncd.sh (#11857)
As part of PR #11754

    A change was added to use the variable SONIC_DB_NS_CLI for the
    namespace, but that will not work since ./files/scripts/syncd_common.sh
    uses SONIC_DB_CLI. So revert back to using SONIC_DB_CLI and define a new
    variable SONIC_GLOBAL_DB_CLI for global/host DB CLI access.

   Also fixed DB_CLI not working for the namespace.
2022-09-01 00:12:56 +00:00
Longxiang Lyu
d7f049ebf0 [mux] Exit to write standby state to active-active ports (#11821)
[mux] Exit to write standby state to `active-active` ports

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2022-09-01 00:11:09 +00:00
andywongarista
0adfd724e6
[202205][Arista] Add initial support for 720DT-48S (#10656) (#11860)
Added initial set of config files to allow for booting and partial traffic testing in SONiC on the 720DT-48S.

How to verify it
- Switch boots
- show interfaces status shows links up on interfaces Ethernet24-51
- Traffic flows with no errors on interfaces Ethernet24-51
2022-08-30 12:39:26 +08:00
Stepan Blyshchak
c60d78dd1f [syncd.sh] 'sxdkernel start' => 'sxdkernel restart' (#11718)
Change `sxdkernel start` to `sxdkernel restart`. If `syncd` service crashes in `ExecStartPre` systemd will not call `ExecStop` and thus will not call `sxdkernel stop`. Use of `sxdkernel restart` is more robust in terms of guarantees to restore the system after unexpected crashes.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-27 16:16:17 +00:00
anamehra
a2bed2ae4a container_checker on supervisor should check containers based on asic presence (#11442)
Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

The monit 'container_checker' fails in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).
2022-08-26 20:50:24 +00:00
Saikrishna Arcot
91e9db005a
[202205]: Update package versions (#11801)
This was done manually, to try to get past a build error due to changing
package versions in Debian.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-08-21 15:23:44 -07:00
abdosi
0355caf20b Added support to add gbsyncd in Feature Table of Host Config DB (#11754)
Why I did:

In case of multi-asic platforms, gbsyncd is not getting added to the Feature Table of the Host Config DB. Without this, container_checker complains that gbsyncd containers that are not needed are running.

How I did:
Update Both Host and Namespace config db when gbsyncd docker is starting.
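A hedged sketch of updating the FEATURE table in both scopes (the field and value names are assumptions for illustration):

```
# Sketch only: register gbsyncd in the FEATURE table of the host CONFIG_DB
# and of the per-asic namespace CONFIG_DB (namespace variable is illustrative).
sonic-db-cli CONFIG_DB hset 'FEATURE|gbsyncd' state enabled
sonic-db-cli -n "$NAMESPACE_ID" CONFIG_DB hset 'FEATURE|gbsyncd' state enabled
```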

How I verify:
Verified on Multi-asic platforms.
2022-08-19 15:22:12 +00:00
Nikola Dancejic
f63dc738f9 [swss] Adding conditional for bgp when on multi ASIC platform (#11691)
bgp should be a per-asic service, and runs for each namespace on
multi-asic platforms. However, putting bgp in MULTI_INST_DEPENDENT
causes swss to be restarted as well as bgp. This is causing issues after #11000

Issue: #11653

This fix:

removes bgp from the dependents list
adds a conditional that either adds bgp, or bgp@$DEV, to separate
between single and multi-asic platforms (a sketch follows below)
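A minimal sketch of what such a conditional could look like in swss.sh (variable names are assumptions based on the description, not the verbatim change):

```
# Sketch only: add bgp (single-asic) or bgp@$DEV (multi-asic) to the
# dependent list instead of listing bgp in MULTI_INST_DEPENDENT.
if [ -n "$DEV" ]; then
    DEPENDENT+=" bgp@$DEV"
else
    DEPENDENT+=" bgp"
fi
```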
2022-08-17 17:10:29 +00:00
Hua Liu
6a2c540cba
[swsscommon] Add c++ version sonic-db-cli from sonic-swss-common (#10825) (#11713)
Cherry pick PR https://github.com/sonic-net/sonic-buildimage/pull/10825 to 202205 branch

#### Why I did it
    Fix sonic-db-cli high CPU usage on SONiC startup issue: https://github.com/sonic-net/sonic-buildimage/issues/10218
    ETA of this issue will be 2022/05/31

#### How I did it
    Re-write sonic-cli with c++ in sonic-swss-common: https://github.com/sonic-net/sonic-swss-common/pull/607
    Modify swss-common rules and slave.mk to install c++ version sonic-db-cli.
    

#### How to verify it
    Pass all E2E test scenario.

#### Description for the changelog
    Build and install c++ version sonic-db-cli from swss-common.

2022-08-17 15:35:00 +08:00
mssonicbld
5c306cc2e5
[ci/build]: Upgrade SONiC package versions (#11679) 2022-08-15 05:50:59 +00:00
Lawrence Lee
15c80b207c [arp_update]: Resolve failed neighbors on dualtor (#11615)
In arp_update, check for FAILED or INCOMPLETE kernel neighbor entries and manually ping them to try and resolve the neighbor
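A hedged sketch of the pinging approach (placeholder parsing; the real arp_update handles interfaces and address families in more detail):

```
# Sketch only: ping each FAILED/INCOMPLETE IPv6 neighbor once to trigger
# neighbor resolution; failures are ignored since the ping is only a probe.
ip -6 neigh show | awk '/FAILED|INCOMPLETE/ {print $1}' | while read -r nbr; do
    ping -6 -c1 -W1 "$nbr" >/dev/null 2>&1 || true
done
```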

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-08-11 16:19:25 +00:00
Stepan Blyshchak
3201dc93f6 [swss.sh/syncd.sh] Trap only on EXIT (#11590)
When using trap on SIGTERM, the script will not react to the SIGTERM signal sent while a child is executing.
I.e., the following script does not react to SIGTERM sent to it if it is
waiting for sleep to finish:

```

trap "echo Handled SIGTERM" 0 2 3 15

echo "Before sleep"
sleep inf
echo "After sleep"
```

Instead, trap only on EXIT, which also covers the scenario of exiting on
SIGINT or SIGTERM.
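For comparison, a sketch of the same snippet with the handler registered only on EXIT, per the change described:

```
trap "echo Handled exit" EXIT

echo "Before sleep"
sleep inf
echo "After sleep"
```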

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-11 16:18:00 +00:00
Ying Xie
094745f06f [write_standby] update write_standby.py script (#11650)
Why I did it
The initial value has to be present for the state machines to work. In an active-standby dual-ToR scenario, or any hardware mux scenario, the value will eventually be updated, with a delay.

However, in an active-active dual-ToR scenario, there is no other mechanism to initialize the value and get the state machines started.
So this script will have to write something at start-up time.

For active-active dual-ToR, 'active' is the preferred initial value; the state machine will switch the state to standby soon if
the link prober finds the link is not in a good state.

How I did it
Update the script to always provide initial values.

How to verify it
Tested on active-active dual-tor testbed.

Signed-off-by: Ying Xie ying.xie@microsoft.com
2022-08-09 23:02:09 +00:00
Sudharsan Dhamal Gopalarathnam
871a1c51d8 [vs]Preventing ebtables cfg to be applied on vs (#11585)
*Preventing ebtables rules from being applied on the KVM image. The ebtables rules in SONiC are added to prevent ARP as well as L2 forwarding in the Linux kernel, since the hardware will take care of the actual L2 forwarding. However, this is not the case with KVM, where Linux needs to forward even L2 packets.
2022-08-08 20:45:28 +00:00
bingwang-ms
fda1290926 Support different DSCP_TO_TC_MAP for T1 in dualtor deployment (#11569)
* Support different DSCP_TO_TC_MAP for T1 in dualtor deployment
2022-08-08 20:44:32 +00:00
Stepan Blyshchak
29d29b9491 [swss.sh] clear counters cache folder on swss cold/fast reload (#11244)
A change in sonic-utilities makes all cache files be saved into
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use the new location.

- Why I did it
To fix #9817. Clear the cache directory in swss.sh except for warm start.
Also, adapted the finalize-warmboot script to use the new cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use the new location.
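A hedged sketch of the cleanup (the warm-start detection variable is an assumption; the cache path is the /tmp/cache mentioned above):

```
# Sketch only: drop the counters cache unless this is a warm start.
CACHE_DIR=/tmp/cache
if [ "$WARM_BOOT" != "true" ]; then
    rm -rf "$CACHE_DIR"
fi
```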

- How to verify it
Run together with Azure/sonic-utilities#2232. Verify the counters cache is removed on config reload, cold/fast reboots, and swss restart.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-08 20:42:54 +00:00
Nikola Dancejic
32fb4c7772 [swss] Adding bgp container as dependent of swss (#11000)
What I did:
Added bgp as a dependent of swss

Why I did it:
bgp container was not restarting on swss crash. When swss crashes, linkmgrd
doesn't initiate a switchover because it cannot access the default route from
orchagent. Bringing down bgp with swss will isolate the ToR, causing linkmgrd
to initiate a switchover to the peer ToR, avoiding significant packet loss.

How I did it:
Added bgp to DEPENDENT

Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
2022-08-08 20:40:35 +00:00
mssonicbld
f30e85358e
[ci/build]: Upgrade SONiC package versions (#11438)
Upgrade SONiC Versions
2022-08-07 11:29:11 +08:00
Jing Zhang
a71d5db05e Update WARM START FINALIZER to wait for linkmgrd to reconcile (#11477)
Following sonic-net/sonic-linkmgrd#76, this PR updates the warm restart finalizer to wait for linkmgrd to be reconciled.

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
To make sure finalizer save config after linkmgrd's reconciliation.

How I did it
Add linkmgrd to the reconciliation wait list of warmboot finalizer.

How to verify it
Verified on lab device, linkmgrd reconciled as expected.
2022-07-28 20:42:07 +00:00
Lior Avramov
ff3ad9ddd1 [memory_checker] Do not check memory usage of containers if docker daemon is not running (#11476)
Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit).
This PR fixes issue #11472.

Signed-off-by: liora liora@nvidia.com

Why I did it
In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check.

How I did it
Use systemctl is-active to check if Docker engine is still running.
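memory_checker is a Python plugin; a hedged shell equivalent of the guard described above:

```
# Sketch only: skip the container check when the Docker engine is not active.
if ! systemctl is-active --quiet docker; then
    echo "Docker engine is not running, skip checking container memory usage"
    exit 0
fi
```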

How to verify it
Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log.

Which release branch to backport (provide reason below if selected)
The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
2022-07-28 20:37:22 +00:00
abdosi
eb56dc8b90 Enable ARP Update Script for Packet based chassis. (#11465)
What I did:

    The following changes were done for packet-based chassis:
    1> Run arp_update on LCs to resolve static route nexthops over backend
    port-channel interfaces.
    2> On the Supervisor, make sure arp_update exits gracefully.
2022-07-28 20:36:54 +00:00
tjchadaga
0c7f0aa9b7 Add load_minigraph option to include traffic-shift-away during config migration (#11403) 2022-07-28 20:34:39 +00:00
Stephen Sun
b4d8ee3fec [Mellanox] Support Mellanox-SN4600C-C64 as T1 switch in dual-ToR scenario (#11261)
- Why I did it
Support Mellanox-SN4600C-C64 as T1 switch in dual-ToR scenario
This is to port #11032 and #11299 from 202012 to master.

Support additional queue and PG in buffer templates, including both traditional and dynamic model
Support mapping DSCP 2/6 to lossless traffic in the QoS template.
Add macros to generate additional lossless PG in the dynamic model
Adjust the order in which the generic/dedicated (with additional lossless queues) macros are checked and called to generate buffer tables in the common template buffers_config.j2
Buffer tables are rendered via macros.
Both generic and dedicated macros are defined on our platform. Currently, the generic one is called as long as it is defined, which causes the generic one to always be called on our platform. To avoid this, the dedicated macro is checked and called first, and then the generic one.
Support MAP_PFC_PRIORITY_TO_PRIORITY_GROUP on ports with additional lossless queues.
On Mellanox-SN4600C-C64, the buffer configuration for T1 is calculated as:

40 * 100G downlink ports with 4 lossless PGs/queues, 1 lossy PG, and 3 lossy queues
16 * 100G uplink ports with 2 lossless PGs/queues, 1 lossy PG, and 5 lossy queues

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-07-28 20:30:00 +00:00
tjchadaga
fc93871881 Changes to persist TSA/B state across reloads (#11257) 2022-07-28 20:29:45 +00:00
andywongarista
f377636747 Add gbsyncd container for broncos (#11154)
* Add docker-gbsyncd-broncos support
* Address review comments
* Add socket to gbsyncd
* Upgrade gbsyncd-broncos to bullseye
2022-07-28 20:27:21 +00:00
bingwang-ms
f7cc66ad4c Add flag to control the generation of PORT_QOS_MAP|global entry (#11448)
Why I did it
This PR is to add a flag to control whether to generate the PORT_QOS_MAP|global entry or not.
This is because for some HWSKUs, such as BackEndToRRouter and BackEndLeafRouter, there is no DSCP_TO_TC_MAP defined.
Hence, if the PORT_QOS_MAP|global entry is generated, OA will report an error because the DSCP_TO_TC_MAP map AZURE cannot be found.

Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- saiObjectTypeQuery: invalid object id oid:0x7fddb43605d0
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- meta_generic_validation_objlist: SAI_SWITCH_ATTR_QOS_DSCP_TO_TC_MAP:SAI_ATTR_VALUE_TYPE_OBJECT_ID object on list [0] oid 0x7fddb43605d0 is not valid, returned null object id
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- applyDscpToTcMapToSwitch: Failed to apply DSCP_TO_TC QoS map to switch rv:-5
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- doTask: Failed to process QOS task, drop it
This PR is to address the issue.

How I did it
Add a flag require_global_dscp_to_tc_map to control whether to generate the PORT_QOS_MAP|global entry. The default value for require_global_dscp_to_tc_map is true. If the device type is storage backend, the value is changed to false. Then the PORT_QOS_MAP|global entry is not generated.

How to verify it
Update the current test_qos_dscp_remapping_render_template to cover storage backend.
2022-07-17 03:20:20 +00:00
mssonicbld
63a3631d98
[ci/build]: Upgrade SONiC package versions (#11425)
Upgrade SONiC Versions
2022-07-13 07:08:33 +08:00
mssonicbld
0fb00ff90c
[ci/build]: Upgrade SONiC package versions (#11396) 2022-07-08 14:27:03 +00:00
Neetha John
73abb5c58a Add backend acl template (#11220)
Why I did it
Storage backend has all VLAN members tagged. If untagged packets are received on those links, they are counted as RX_DROPS, which can lead to false alarms in monitoring tools. Use this ACL to hide these drops.

How I did it
Created an ACL template which will be loaded during minigraph load for backend devices. This template will allow tagged VLAN packets and drop untagged ones.

How to verify it
Unit tests

Signed-off-by: Neetha John <nejo@microsoft.com>
2022-07-07 21:19:57 +00:00
mssonicbld
1d4cba69a4
[ci/build]: Upgrade SONiC package versions (#11334)
Upgrade SONiC Versions
2022-07-06 10:59:08 +08:00
judyjoseph
4657d8138e Revert "Update include_macsec flag if type is SpineRouter (#11141)" (#11306)
This reverts commit c9f36957db.
2022-07-05 16:12:17 +00:00
mssonicbld
0acc47ecc9
[ci/build]: Upgrade SONiC package versions (#11189)
Upgrade SONiC Versions
2022-07-04 20:19:43 +08:00
davidpil2002
f17d55dc67 Add support for Password Hardening (#10323)
- Why I did it
New security feature for enforcing strong passwords when logging in or when changing passwords of existing users on the switch.

- How I did it
Mainly by using the Linux package pam-cracklib, which supports the enforcement of user password policies; the hostcfgd daemon will support adding/modifying password policies that enforce and strengthen user passwords.

- How to verify it
Manually Verification-
1. Enable the feature, using the new sonic-cli command passw-hardening or manually add the password hardening table like shown in HLD by using redis-cli command

2. Change password policies manually like in step 1.
Notes:
password hardening CLI can be found in sonic-utilities repo-
P.R: Add support for Password Hardening sonic-utilities#2121
code config path: config/plugins/sonic-passwh_yang.py
code show path: show/plugins/sonic-passwh_yang.py

3. Create a new user (using adduser command) or modify an existing password by using passwd command in the terminal. And it will now request a strong password instead of default linux policies.

Automatic Verification - Unit tests:
This PR contains unit tests that cover:
1. test default init values of the feature in PAM files
2. test all the types of classes policies supported by the feature in PAM files
3. test aging policy configuration in PAM files
2022-06-30 05:25:58 +00:00
Jing Zhang
f25a84c148 Avoid write_standby in warm restart context (#11283)
Avoid write_standby in warm restart context.

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
In warm restart context, we should avoid mux state change.

How I did it
Check warm restart flag before applying changes to app db.
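write_standby.py is a Python script; as a hedged shell-flavored sketch of the guard, assuming the system-wide warm restart flag lives in STATE_DB under WARM_RESTART_ENABLE_TABLE|system (an assumption for illustration only):

```
# Sketch only: take no action if a system-level warm restart is in progress.
warm="$(sonic-db-cli STATE_DB hget 'WARM_RESTART_ENABLE_TABLE|system' enable)"
if [ "$warm" = "true" ]; then
    logger -p user.warning "write_standby: Taking no action due to ongoing warmrestart."
    exit 0
fi
```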

How to verify it
Ran write_standby in table missing, key missing, field missing scenarios.
Did a warm restart, app db changes were skipped. Saw this in syslog:
WARNING write_standby: Taking no action due to ongoing warmrestart.
2022-06-30 05:15:41 +00:00
bingwang-ms
d9cd1a1355 Add extra lossy PG profile for ports between T1 and T2 (#11157)
Signed-off-by: bingwang <wang.bing@microsoft.com>

Why I did it
This PR brings two changes

Add lossy PG profile for PG2 and PG6 on T1 for ports between T1 and T2.
After PR "Update qos config to clear queues for bounced back traffic" #10176, the DSCP_TO_TC_MAP and TC_TO_PG_MAP are updated when remapping is enabled.

DSCP_TO_TC_MAP

| Before | After | Why do this change |
|---|---|---|
| "2" : "1" | "2" : "2" | Only change for leaf router to map DSCP 2 to TC 2, as TC 2 will be used for lossless TC |
| "6" : "1" | "6" : "6" | Only change for leaf router to map DSCP 6 to TC 6, as TC 6 will be used for lossless TC |

TC_TO_PRIORITY_GROUP_MAP

| Before | After | Why do this change |
|---|---|---|
| "2" : "0" | "2" : "2" | Only change for leaf router to map TC 2 to PG 2, as PG 2 will be used for lossless PG |
| "6" : "0" | "6" : "6" | Only change for leaf router to map TC 6 to PG 6, as PG 6 will be used for lossless PG |

So, we have two new lossy PGs (2 and 6) for the T2-facing ports on T1, and two new lossless PGs (2 and 6) for the T0-facing ports on T1.
However, there is no lossy PG profile for the T2-facing ports on T1. The lossless PGs for ports between T1 and T0 have been handled by buffermgrd. Therefore, we need to add lossy PG profiles for the T2-facing ports on T1.

We don't have this issue on T0 because PG 2 and PG 6 are lossless PGs, and there is no lossy traffic mapped to PG 2 and PG 6

Map port level TC7 to PG0
Before the PCBB change, DSCP48 -> TC 6 -> PG 0.
After the PCBB change, DSCP48 -> TC 7 -> PG 7
Actually, we can map TC7 to PG0 to save a lossy PG.

How I did it
Update the qos and buffer template.

How to verify it
Verified by UT.
2022-06-30 05:15:41 +00:00
judyjoseph
d7db8a285d Update include_macsec flag if type is SpineRouter (#11141)
Add the support to enable macsec when type is SpineRouter
2022-06-28 16:07:24 +00:00
xumia
e2bee174e1 [Build]: Support to use symbol links for lazy installation targets to reduce the image size (#10923)
Why I did it
Support using symbolic links in the platform folder to reduce the image size.
The current solution is to copy each lazy installation target (xxx.deb files) to each of the folders in the platform folder. The size will keep growing as more and more packages are added to the platform folder. For cisco-8000 as an example, the size will be up to 2G, while most of it is duplicate packages in the platform folder.

How I did it
Create a new folder in platform/common; all the deb packages are copied to that folder, and any other folders that use the packages contain symbolic links to the common folder.

Why platform.tar?
We have implemented a patch for it, see #10775, but the problem is that ONIE uses a really old unzip version which cannot support symbolic links.
The current solution is similar to PR #10775, but makes the platform folder into a tar package, which can be supported by ONIE. During the installation, the tar package will be extracted to the original folder and removed.
2022-06-28 16:03:16 +00:00
Ying Xie
8249d0da80
[202205] add release tag file (#11222)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-06-22 21:11:40 -07:00
Sudharsan Dhamal Gopalarathnam
379d77af42 [lldp]Fix lldp spawned after reboot when disabled (#11080)
- Why I did it
When LLDP is disabled through feature command, it gets spawned after reboot.

- How I did it
In syncd.sh, check if the service is enabled before spawning it automatically during cold reboot.
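One way such a check could be expressed, as a hedged sketch (the real syncd.sh may read the feature state differently):

```
# Sketch only: only spawn lldp on cold reboot if the service is enabled.
if [ "$(systemctl is-enabled lldp.service 2>/dev/null)" = "enabled" ]; then
    systemctl start lldp.service
fi
```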

- How to verify it
Disable the lldp feature. Perform a cold reboot and verify it is not spawned.
2022-06-22 23:08:05 +00:00
bingwang-ms
6f713419ba Add two extra lossless queues for bounced back traffic (#10496)
Signed-off-by: bingwang <bingwang@microsoft.com>

Why I did it
This PR is to add two extra lossless queues for bounced back traffic.
HLD sonic-net/SONiC#950

SKUs include
Arista-7050CX3-32S-C32
Arista-7050CX3-32S-D48C8
Arista-7260CX3-D108C8
Arista-7260CX3-C64
Arista-7260CX3-Q64

How I did it
Update the buffers.json.j2 template and buffers_config.j2 template to generate new BUFFER_QUEUE table.

For T1 devices, queue 2 and queue 6 are set as lossless queues on T0 facing ports.
For T0 devices, queue 2 and queue 6 are set as lossless queues on T1 facing ports.
Queue 7 is added as a new lossy queue as DSCP 48 is mapped to TC 7, and then mapped into Queue 7

How to verify it
Verified by UT
Verified by copying the new template and generating the buffer config with sonic-cfggen
2022-06-22 23:05:14 +00:00
yozhao101
d63d16ba58 [memory_checker] Do not check memory usage of containers which are not created (#11129)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to fix an issue (#10088) by enhancing the script memory_checker.

Specifically, if a container is not created successfully while the device is booted/rebooted, then memory_checker does not need to check its memory usage.

How I did it
In the script memory_checker, a function is added to get the names of running containers. If the specified container name is not in the current running container list, then this script will exit without checking its memory usage.
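memory_checker is a Python script; a hedged shell equivalent of the running-container check it describes:

```
# Sketch only: exit quietly when the target container is not running.
container="telemetry"
if ! docker ps --format '{{.Names}}' | grep -qx "$container"; then
    echo "Container '$container' is not running, skip checking its memory usage"
    exit 0
fi
```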

How to verify it
I tested on a lab device by following the steps:

Stops telemetry container with command sudo systemctl stop telemetry.service

Removes telemetry container with command docker rm telemetry

Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
2022-06-19 08:01:18 +00:00
bingwang-ms
255d77e610 Generate switch level dscp_to_tc_map entry from qos_config template (#11087)
* Generate switch level dscp_to_tc_map

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-06-17 03:31:32 +00:00
shlomibitton
323aa791ec [Mellanox] [pmon] Fix for PMON service not starting when restarting SWSS service after fast/warm reboot (#10901)
- Why I did it
A recent change to delay the PMON service in case of fast/warm reboot introduced an issue when restarting only the SWSS service after a fast/warm reboot on the Nvidia platform.
Since the timer is triggered only when the system boots, in a scenario where the system is after a fast/warm reboot and the user restarts the SWSS service, as part of the syncd.sh script the PMON service will stop, but the timer will not start again.

- How I did it
In the syncd.sh script, in case of a fast/warm indication, check if pmon.timer is running.
If it is running, it means we are at the first boot, and we continue normally.
If it is not running, meaning the service was restarted, start the timer to keep the system behavior consistent.
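A hedged sketch of this check (the unit name comes from the description; the surrounding syncd.sh logic is omitted):

```
# Sketch only: if pmon.timer is not running, the service was restarted after
# fast/warm boot, so start the timer to keep the behavior consistent.
if ! systemctl is-active --quiet pmon.timer; then
    systemctl start pmon.timer
fi
```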

- How to verify it
Run fast/warm reboot.
service swss restart.
Observe PMON service starting.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2022-06-17 03:31:18 +00:00
mssonicbld
1817c325d3
[ci/build]: Upgrade SONiC package versions (#11060)
Co-authored-by: mssonicbld <vsts@fv-az125-175.rkccfo2qup5e5ofdktzmdhpvwd.jx.internal.cloudapp.net>
2022-06-16 23:33:23 +08:00
judyjoseph
8fc5c9b31f Cleanup macsec stateDB tables on restart (#11066)
Clean macsec tables in STATE_DB on start
2022-06-16 02:12:59 +00:00
mssonicbld
1c2e361080
[ci/build]: Upgrade SONiC package versions (#11048)
Upgrade SONiC Versions
Co-authored-by: mssonicbld <vsts@fv-az113-110.2axxbwkg0v3e1hk3nyhxwcxvsf.bx.internal.cloudapp.net>
2022-06-07 10:01:24 +08:00
bingwang-ms
76502c821e Update qos template to support SYSTEM_DEFAULT table (#10936)
* Update qos template to support SYSTEM_DEFAULT table

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-06-05 15:21:10 +00:00
xumia
043656dfe8 Support symcrypt fips config for aboot/uboot (#10729)
Why I did it
Support symcrypt fips config for aboot/uboot
2022-06-05 15:20:20 +00:00
mssonicbld
aecbf4718f
[ci/build]: Upgrade SONiC package versions (#11013)
Co-authored-by: mssonicbld <vsts@fv-az95-899.pq21ngt4mckezax5v03dvw0kka.ex.internal.cloudapp.net>
2022-06-03 09:08:13 +08:00
Lukas Stockner
c9b27cde71
[swss] Clear VXLAN tunnel table from State DB on startup (#10822)
* When reloading config after crashes, VTEP interfaces are sometimes not created since the tunnel still exists in the STATE_DB.
* Adding VXLAN_TUNNEL_TABLE to the list of tables to be cleaned in swss.sh fixes the problem.
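A hedged sketch of this kind of STATE_DB cleanup (sonic-db-cli usage shown only for illustration; the real swss.sh iterates over a list of tables):

```
# Sketch only: remove stale VXLAN tunnel state before swss starts.
for key in $(sonic-db-cli STATE_DB keys 'VXLAN_TUNNEL_TABLE|*'); do
    sonic-db-cli STATE_DB del "$key"
done
```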
2022-05-31 08:54:31 -07:00
davidpil2002
ab0930313b
[YANG] Add support for Password Hardening (#10322)
- Why I did it
YANG model for the password hardening feature; the SONiC CLI of this feature was autogenerated from this YANG model.

- How I did it
Create new Yang model in src/sonic-yang-models/yang-models/sonic-passwh.yang.

- How to verify it
There are unit tests (YANG tests) in this PR covering all the password policies with good and bad value cases.
It is also possible to verify manually using the config/show password commands that were autogenerated from this YANG model. (This CLI code was added in sonic-utilities.)
2022-05-29 13:54:51 +03:00
xumia
f0dfd398a6
Revert "Reduce image size for lazy installation packages (#10775)" (#10916)
This reverts commit 15cf9b0d70.
Why I did it
Revert PR #10775, for it has an impact on ONIE installation.
It is caused by symbolic links not being supported by some ONIE unzip versions.
We will re-enable it after fixing the issue, see #10914.
2022-05-26 09:39:48 +08:00
abdosi
0285bfe42e
[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. (#10500)
What/Why I did:

Issue 1: By setting up the ipvlan interface in interface-config.sh, we are not tolerant to failures, because interface-config.service is one-shot and does not have restart capability.

Scenario: For example, if the database service goes into a failed state, then the interfaces-config service also fails because of the dependency check; later the database service gets restarted, but the interface service remains stuck and the ipvlan interface never gets created.

Solution: Moved all the logic into the database service from the interface-config service, which is also more logically aligned, since the namespace is created there and all the network settings (sysctl) happen there. With this, if database starts we recreate the interface.

Issue 2: Use of IPVLAN vs MACVLAN

Currently we are using ipvlan mode. However, the above failure scenario is not handled correctly by ipvlan mode. Once the ipvlan interface is created and an IP address is assigned to it, if we restart the interface-config or database (new PR) service, the Linux kernel gives the error "Error: Address already assigned to an ipvlan device." based on this: https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978. The reason is that if we do not clean up the IP address assignment (which needs to be unique for IPVLAN), it remains in the kernel database and never goes back to the free pool, even though the namespace is deleted.

Solution: Considering this hard dependency on a unique IP, macvlan mode is better for us, since everything is managed by the Linux kernel and there is no dependency on the user-configured IP address.

Issue 3: The namespace database service does not check reachability to the Supervisor Redis Chassis Server.

Currently there is no explicit check, as we never do a Redis PING from the namespace to the Supervisor Redis Chassis Server. Without this check it is possible that we will start database and all the other dockers even though there is no connectivity, and we will hit the error/failure late in the cycle.

Solution: Added explicit PING from namespace that will check this reachability.

Issue 4: flushdb gives an exception when trying to access the Chassis Server DB over a Unix socket.

Solution: Handle gracefully via try..except and log the message.
2022-05-24 16:54:12 -07:00
Maxime Lorrillere
392899682f
[Arista] Add support for Wolverine linecards (#8887)
Add support for WolverineQCpu, WolverineQCpuMs, WolverineQCpuBk, WolverineQCpuBkMs

Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
2022-05-20 14:11:06 -07:00
Senthil Kumar Guruswamy
f37dd770cd
System Ready (#10479)
Why I did it
At present, there is no mechanism in an event-driven model to know that the system is up with all the essential SONiC services and that all the docker apps are ready, along with port ready status, to start carrying network traffic. With the asynchronous architecture of SONiC, we will not be able to verify whether the config has been applied all the way down to the hardware, but we can get the closest up status of each app and arrive at the system readiness.

How I did it
A new python-based system monitor tool is introduced under the system-health framework to monitor all the essential system host services, including docker wrapper services, on an event-based model and declare that the system is ready. This framework gives docker apps a provision to notify their closest up status. CLIs are provided to fetch the current system status as well as the service running status and its app-ready status, along with the failure reason, if any.

How to verify it
"show system-health sysready-status" click CLI
Syslogs for system ready
2022-05-20 13:25:11 -07:00
Arun Saravanan Balachandran
f4b22f67a4
[initramfs]: SSD firmware upgrade in initramfs (#10748)
Why I did it
To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration.

How I did it
New option 'ssd-upgrader-part’ is introduced in grub command line, to indicate the partition and its filesystem type in which the SSD firmware updater is present. ‘ssd-upgrader-part’ syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4

A new initramfs script ‘ssd-upgrade’ is included in init-premount and it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part'

How to verify it
In SONiC, the SSD firmware updater is copied to “/host/” directory.
Fast-reboot is to be initiated with the ‘-u’ option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150)
After reboot, while booting into SONiC the SSD firmware updater will be executed in initramfs.
2022-05-12 08:11:02 -07:00
Marty Y. Lok
23f9126f59
[VoQ][config] Multiasic Supervisor card fails to load config_db#.json in chassis when system is reboot (#10106)
Supervisor card fails to load config_db#.json in the chassis when the system reboots.
This is an intermittent issue, fixes #10105
2022-05-09 11:06:11 -07:00
xumia
15cf9b0d70
Reduce image size for lazy installation packages (#10775)
Why I did it
The image size is too large when there are multiple lazy packages and multiple platforms. It is not necessary to keep multiple copies of the lazy installation packages.
For the Cisco image, the image size is reduced from 3.5G to 1.7G.

How I did it
Use symbolic links to keep only one copy of each lazy package:
Make a new folder, fsroot/platform/common.
Copy the lazy packages into the folder.
When using a package in each of the platforms, such as x86_64-grub, x86_64-8800_rp-r0, x86_64-8201_on-r0, etc., only make a symbolic link to the package in the common folder.
2022-05-09 08:26:09 -07:00
xumia
8ec8900d31
Support SONiC OpenSSL FIPS 140-3 based on SymCrypt engine (#9573)
Why I did it
Support OpenSSL FIPS 140-3, see design doc: https://github.com/Azure/SONiC/blob/master/doc/fips/SONiC-OpenSSL-FIPS-140-3.md.

How I did it
Install the fips packages.
To build the fips packages, see https://github.com/Azure/sonic-fips
Azure pipelines: https://dev.azure.com/mssonic/build/_build?definitionId=412

How to verify it
Validate the SymCrypt engine:

admin@sonic:~$ dpkg-query -W | grep openssl
openssl 1.1.1k-1+deb11u1+fips
symcrypt-openssl        0.1

admin@sonic:~$ openssl engine -v | grep -i symcrypt
(symcrypt) SCOSSL (SymCrypt engine for OpenSSL)
admin@sonic:~$
2022-05-06 07:21:30 +08:00
Junchao-Mellanox
681c24878b
Fix race condition between networking service and interface-config service (#10573)
Why I did it
The PR is aimed at fixing a bug where the mgmt port eth0 may lose its IP even if the user configured a static IP for eth0. This is not an always-reproducible issue; the reproducing flow is like:

Systemd starts the networking service, which runs a DHCP-based configuration and assigns an IP from DHCP.
Systemd starts the interface-config service, which depends on the networking service.
Interface-config service runs the command "ifdown --force eth0", but the networking service is still running, so this command fails with the error "error: Another instance of this program is already running.". This error is printed by the ifupdown2 library, which is the main process of the networking service. So ifdown actually does not work here, and the IP of eth0 is not brought down.
Interface-config service updates /etc/networking/interface to the static configuration.
Interface-config service runs the command "systemctl restart networking". This command kills the previous networking-related processes (log: networking.service: Main process exited, code=killed, status=15/TERM) and tries to reconfigure the IP address with the static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the IP in the kernel. Hence, the IP is still the one obtained from DHCP. (This could be a bug in ifupdown2: the previous IP is from DHCP, the new IP is a static IP, and it treats them as the same instead of re-configuring the IP.)
When the lease of the IP expires, the IP of eth0 is removed by the kernel and the issue reproduces.
The issue is not always reproducible because the networking service usually runs fast enough that step 3 is not hit.

How I did it
Check the networking service state before running "ifdown --force eth0", and wait for it to finish if it is activating.
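A hedged shell sketch of that wait (the exact logic in the script may differ):

```
# Sketch only: wait until the networking service is no longer activating
# before forcing eth0 down.
while [ "$(systemctl is-active networking)" = "activating" ]; do
    sleep 1
done
ifdown --force eth0
```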

How to verify it
Manual test.
2022-05-05 15:21:44 -07:00
shlomibitton
4ec3af86af
[Fastboot] Delay PMON service for better fastboot performance (#10567)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see a few python scripts running at the same time.
This parallel execution consumes CPU time, and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, PMON is delayed by 90 seconds until the system finishes the init flow after fastboot.

- How I did it
Add a timer for PMON service.
Exclude for MLNX platform the start trigger of PMON when SYNCD starts in case of fastboot.
Copy the timer file to the host bin image.
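A minimal sketch of what such a delay timer could look like, written as a shell heredoc; the unit contents are assumptions, and only the 90-second delay comes from the description:

```
# Sketch only: a systemd timer that starts pmon.service 90 seconds after boot.
cat > /usr/lib/systemd/system/pmon.timer <<'EOF'
[Unit]
Description=Delays pmon until SONiC has initialized after fast-reboot

[Timer]
OnBootSec=1min 30s
Unit=pmon.service

[Install]
WantedBy=timers.target
EOF
```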

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-02 10:44:17 +03:00
shlomibitton
1d84e0d7df
[Fastboot] Delay LLDP service for better fastboot performance (#10568)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see a few python scripts running at the same time.
This parallel execution consumes CPU time, and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, LLDP is delayed by 90 seconds until the system finishes the init flow after fastboot.

- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
This PR is dependent on PR: #10567
2022-04-28 10:35:14 +03:00
ganglv
9d7387a18e
[sonic-host-services]: Fix import and invalid path (#10660)
Why I did it
Cannot start sonic-hostservice

How I did it
Install python3-dbus and systemd-python, and replace invalid path

How to verify it
Start the service with below commands:
sudo systemctl start sonic-hostservice
sudo systemctl status sonic-hostservice

Signed-off-by: Gang Lv ganglv@microsoft.com
2022-04-27 07:14:51 +08:00
Saikrishna Arcot
64187a1b15
Remove SSH host keys after installing the custom version of sshd (#10633)
* Remove SSH host keys after installing the custom version of sshd

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Use an override for sshd instead of overwriting the service file

Don't overwrite upstream's .service file, and instead use an override
file for making sure the host key(s) are generated.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-25 10:38:52 -07:00
bingwang-ms
3fc3259a35
Define qos map AZURE_TUNNEL for QoS remapping of tunnel traffic (#10565)
* Add AZURE_TUNNEL map

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-04-25 15:06:10 +08:00
yozhao101
e24fe9bc60
[Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically, the Monit configuration related to monitoring the memory usage of the telemetry container is as follows:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is that Monit can't reset its counter to count again; Monit can reset its counter if and only if the status of the monitored service changes from Status failed to Status ok. However, during this 1-hour sliding window, the status of the monitored service was not changed from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After the telemetry container was restarted, we found the memory usage of telemetry increased rapidly from around 100MB to more than 400MB within 1 minute, and the status of the monitored service did not have a chance to be changed from Status failed to Status ok.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format, repeat every <n> cycles, related to exec. This new syntax format enables Monit to repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
2022-04-20 18:08:06 -07:00
Samuel Angebault
fb147764b5
[Arista] Fix arista-net initramfs hook (#10624)
The interface renaming logic fails if one interface is missing.
Because of the `set -e` the whole initramfs hook would abort early on
error.
This change fixes the current behavior to make sure missing interfaces
are properly skipped and ensures existing interfaces are renamed.
2022-04-20 10:03:05 -07:00