Commit Graph

495 Commits

Author SHA1 Message Date
Oleksandr Ivantsiv
c7ecd92c54
Clear DNS configuration received from DHCP during networking reconfiguration in Linux. (#13516)
- Why I did it
fixes #12907

When the management interface IP address configuration changes from dynamic to static the DNS configuration (retrieved from the DHCP server) in /etc/resolv.conf remains uncleared. This leads to a DNS configuration pointing to the wrong nameserver. To make the behavior clear DNS configuration received from DHCP should be cleared.

- How I did it
Use resolvconf package for managing DNS configuration. It is capable of tracking the source of DNS configuration and puts the configuration retrieved from the DHCP servers into a separate file. This allows the implementation of DNS configuration cleanup retrieved from DHCP during networking reconfiguration.

- How to verify it
Ensure that the management interface has no static configuration.
Check that /etc/resolv.conf has DNS configuration.
Configure a static IP address on the management interface.
Verify that /etc/resolv.conf has no DNS configuration.
Remove the static IP address from the management interface.
Verify that /etc/resolv.conf has DNS configuration retrieved form DHCP server.
2023-01-30 22:13:10 +02:00
Devesh Pathak
c93716a142
rsyslog to start after interfaces-config (#13503)
Fixes #12408

Why I did it
We are running into #12408 very frequently.
This results in no syslogs from any containers as rsyslog server could not start.
some of the sonic-mgmt scripts look for log statements and error out if log is not present.

Interfaces-config service configures the loopback interface along with other interfaces. rsyslog-config reads ip address of loopback interface and generates /etc/rsyslog.conf. When this race condition happens, lo interface ip is not yet programmed and rsyslog-config ends up writing UDP server as null in /etc/rsyslog.conf.

How I did it
rsyslog-config service is started after interfaces-config service.

How to verify it
Did multiple reboots and verified that $UDPServerAddress is valid.
2023-01-26 20:39:13 -08:00
Jing Zhang
dabb31c5f6
[sudoers] add /usr/local/bin/storyteller to READ_ONLY_CMDS (#13422)
Adding /usr/local/bin/storyteller to READ_ONLY_CMDS. So no write access or prompt for password is needed to run storyteller.

Tested on 202205 clusters, user who didn't request write access was able to grep log using storyteller.

sign-off: Jing Zhang zhangjing@microsoft.com
2023-01-26 20:38:29 -08:00
Zain Budhwani
c9a33cb00e
Fix segfault issue inside memory_checker (#13066)
#### Why I did it

Segfault was occuring when running memory_checker

#### How I did it

Deinit publisher immediately after publishing

#### How to verify it

Manual testing
2023-01-24 15:30:41 -08:00
xumia
e6a01ca5eb
[Bug] Fix SONiC installation failure caused by pip/pip3 not found (#13284)
The main issue is the pip/pip3 command cannot be found when the package is being installed by apt-get.
When using the dpkg install, the searching path is PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When using the apt-get install, the searching path is PATH=/usr/sbin:/usr/bin:/sbin:/bin
But the pip/pip3 default path is at /usr/local/bin, so dpkg works, but apt-get not work.

How I did it
Export the path /usr/local/bin for pip/pip3.
Make the deb packages can be installed by apt-get.
2023-01-11 08:54:24 -08:00
centecqianj
4b933bd566
[Centec arm64] Solve the abnormal console speed of centec-arm64 switch board (#13126)
The console of the centec-arm64 board is ttyAMA0.The current regular expression cannot be correctly parsed.

Signed-off-by: centecqianj <qianj@centec.com>
2023-01-07 21:10:03 -08:00
Junchao-Mellanox
2126def04e
[infra] Support syslog rate limit configuration (#12490)
- Why I did it
Support syslog rate limit configuration feature

- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration

- How to verify it
Manual test
New sonic-mgmt regression cases
2022-12-20 10:53:58 +02:00
Saikrishna Arcot
00b11ec4e2
Replace logrotate cron file with (adapted) systemd timer file (#12921)
Debian is shipping a systemd timer unit for logrotate, but we're also
packaging in a cron job, which means both of them will run, potentially
at the same time. Remove our cron file, and add an override to the
shipped timer file to have it be run every 10 minutes.

Fixes #12392.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-12-08 14:13:11 -08:00
Junchao-Mellanox
3b3837a636
[containercfgd] Add containercfgd and syslog rate limit configuration support (#12489)
* [containercfgd] Add containercfgd and syslog rate limit configuration support

* Fix build issue

* Fix checker issue

* Fix review comment

* Fix review comment

* Update containercfgd.py
2022-12-08 08:58:35 -08:00
Zain Budhwani
4b001e5115
Change value type of params in memory_checker (#12797)
Fix error when calling events API, required value is string, passing float
2022-11-23 17:37:28 -08:00
Lorne Long
7e525d96b3
[Build] Use apt-get to predictably support dependency ordered configuration of lazy packages (#12164)
Why I did it
The current lazy installer relies on a filename sort for both unpack and configuration steps. When systemd services are configured [started] by multiple packages the order is by filename not by the declared package dependencies. This can cause the start order of services to differ between first-boot and subsequent boots. Declared systemd service dependencies further exacerbate the issue (e.g. blocking the first-boot script).

The current installer leaves packages un-configured if the package dependency order does not match the filename order.

This also fixes a trivial bug in [Build]: Support to use symbol links for lazy installation targets to reduce the image size #10923 where externally downloaded dependencies are duplicated across lazy package device directories.

How I did it
Changed the staging and first-boot scripts to use apt-get:

dpkg -i /host/image-$SONIC_VERSION/platform/$platform/*.deb

becomes

apt-get -y install /host/image-$SONIC_VERSION/platform/$platform/*.deb

when dependencies are detected during image staging.

How to verify it
Apt-get critical rules

Add a Depends= to the control information of a package. Grep the syslog for rc.local between images and observe the configuration order of packages change.
2022-11-17 11:20:42 +08:00
Devesh Pathak
0ea4f4d00e
Clear /etc/resolv.conf before building image (#12592)
Why I did it
nameserver and domain entries from build system fsroot gets into sonic image.

How I did it
Clear /etc/resolv.conf before building image

How to verify it
Built image with it and verified with install that /etc/resolv.conf is empty
2022-11-09 16:54:56 -08:00
Sudharsan Dhamal Gopalarathnam
e6a0fba9ea
[logrotate]Fix logrotate firstaction script to reflect correct size (#12599)
- Why I did it
Fix logrotate firstaction script to reflect correct size. The size was modified to change dynamically based on disk size. However this variable was not updated
#9504

- How I did it
Updated the variable based on disk size

- How to verify it
Verify in the generated rsyslog file if the variable is correctly generated from jinja template
2022-11-08 13:38:14 +02:00
Zain Budhwani
8f48773fd1
Publish additional events (#12563)
Add event_publish code or regex for rsyslog plugin for additional events
2022-11-07 09:57:57 -08:00
Mai Bui
61a085e55e
Replace os.system and remove subprocess with shell=True (#12177)
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`subprocess` is used with `shell=True`, which is very dangerous for shell injection.
`os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content
#### How I did it
remove `shell=True`, use `shell=False`
Replace `os` by `subprocess`
2022-11-04 10:48:51 -04:00
Devesh Pathak
85e3a81f47
Fix to improve hostname handling (#12064)
* Fix to improve hostname handling
If config_db.json is missing hostname entry, hostname-config.sh ends
up deleting existing entry too and hostname changes to default 'localhost'

* default hostname to 'sonic` if missing in config file
2022-10-25 14:51:02 -07:00
Zain Budhwani
09fe3f467f
Add Structured Events w/ YANG Models (#12270)
Add events for dhcp-relay, bgp, syncd, & kernel.
2022-10-09 20:23:31 -07:00
Prince George
ac1d392d4c
Disable brackted-paste mode off by default (#12285)
* Disable brackted-paste mode off by default

* address review comment
2022-10-06 07:55:09 -07:00
Saikrishna Arcot
9251d4ba8b
[docker-wait-any]: Exit worker thread if main thread is expected to exit (#12255)
There's an odd crash that intermittently happens after the teamd container
exits, and a signal is raised to the main thread to exit. This thread (watching
teamd) continues execution because it's in a `while True`. The subsequent wait
call on the teamd container very likely returns immediately, and it calls
`is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these
cases, sometimes, there is a crash in the transition from C code to Python code
(after the function gets executed).  Python sees that this thread got a signal
to exit, because the main thread is exiting, and tells pthread to exit the
thread.  However, during the stack unwinding, _something_ is telling the
unwinder to call `std::terminate`.  The reason is unknown.

This then results in a python3 SIGABRT, and systemd then doesn't call the stop
script to actually stop the container (possibly because the main process exited
with a SIGABRT, so it's a hard crash). This means that the container doesn't
actually get stopped or restarted, resulting in an inconsistent state
afterwards.

The workaround appears to be that if we know the main thread needs to exit,
just return here, and don't continue execution. This at least tries to avoid it
from getting into the problematic code path. However, it's still feasible to
get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals
the main thread to exit, and then syncd exits, and syncd calls one of the two C
functions, potentially hitting the issue).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-10-05 18:14:10 -07:00
Muhammad Danish
8c10851c2a
Update azure.github.io links to sonic-net.github.io (#12209)
Why I did it
azure.github.io/SONiC/ no longer works and returns 404 Not Found. Updated it to the correct sonic-net.github.io/SONiC/
2022-10-02 14:02:10 +08:00
Zain Budhwani
fd6a1b0ce2
Add events to host and create rsyslog_plugin deb pkg (#12059)
Why I did it

Create rsyslog plugin deb for other containers/host to install
Add events for bgp and host events
2022-09-21 09:20:53 -07:00
Stepan Blyshchak
a8b2a538a5
[docker-wait-any] immediately start to wait (#11595)
It could happen that a container has already crashed but docker-wait-any
will wait forever till it starts. It should, however, immediately exit
to make the serivce restart.

#### Why I did it

It is observed in some circumstances that the auto-restart mechanism does not work. Specifically for ```swss.service```, ```orchagent``` had crashed before ```docker-wait-any``` started in ```swss.sh```. This led ```docker-wait-any``` wait forever for ```swss``` to be in ```"Running"``` state and it results in:

```
CONTAINER ID   IMAGE                                COMMAND                  CREATED        STATUS                    PORTS     NAMES
1abef1ecebff   bcbca2b74df6                         "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         what-just-happened
3c924d405cd5   docker-lldp:latest                   "/usr/bin/docker-lld…"   22 hours ago   Up 22 hours                         lldp
eb2b12a98c13   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   22 hours ago   Up 22 hours                         radv
d6aac4a46974   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         mgmt-framework
d880fd07aab9   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         pmon
75f9e22d4fdd   docker-snmp:latest                   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         snmp
76d570a4bd1c   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         telemetry
ee49f50344b3   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         syncd
1f0b0bab3687   docker-teamd:latest                  "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         teamd
917aeeaf9722   docker-orchagent:latest              "/usr/bin/docker-ini…"   22 hours ago   Exited (0) 22 hours ago             swss
81a4d3e820e8   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         bgp
f6eee8be282c   docker-database:latest               "/usr/local/bin/dock…"   22 hours ago   Up 22 hours                         database
```

The check for ```"Running"``` state is not needed because for cold boot case we do ```start_peer_and_dependent_services``` and for warm boot case the loop will retry to wait for container if this container is doing warm boot:
d01a91a569/files/image_config/misc/docker-wait-any (L56)

#### How I did it

Removed the check for ```"Running"```.

#### How to verify it

Kill swss before ```docker-wait-any``` is reached and verify auto restart will restart swss serivce.
2022-09-06 09:26:54 -07:00
Hua Liu
214e394ac0
Remove swsssdk from rules and image. (#11469)
#### Why I did it
To deprecate swsssdk, remove all dependency to it. 

#### How I did it
Remove swsssdk from rules and build image scripts.

#### How to verify it
Pass all UT and E2E test case

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205

#### Description for the changelog
Remove swsssdk from rules and build image scripts.

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
2022-08-25 08:35:51 +08:00
anamehra
f404ce60e0
container_checker on supervisor should check containers based on asic presence (#11442)
Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

The monit 'container_checker' fails in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).
2022-08-22 10:08:29 -07:00
lixiaoyuner
8d6431e754
Add k8s master feature (#11637)
* Add k8s master feature

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Update kubernetes version mistake and make variable passing clear

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Add CRI-dockerd package

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Update version variable passing logic

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Upgrade the worker kubernetes version

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Install xml file parse tool

Signed-off-by: Yun Li <yunli1@microsoft.com>

Signed-off-by: Yun Li <yunli1@microsoft.com>
2022-08-13 23:01:35 +08:00
Jing Zhang
626919e250
Update WARM START FINALIZER to wait for linkmgrd to reconcile (#11477)
Spanning from sonic-net/sonic-linkmgrd#76, this PR is to update warm restart finalizer to wait for linkmgrd to be reconciled.

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
To make sure finalizer save config after linkmgrd's reconciliation.

How I did it
Add linkmgrd to the reconciliation wait list of warmboot finalizer.

How to verify it
Verified on lab device, linkmgrd reconciled as expected.
2022-07-28 09:08:53 -07:00
Stepan Blyshchak
925a393e3d
[swss.sh] clear counters cache folder on swss cold/fast reload (#11244)
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.

- Why I did it
To fix #9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.

- How to verify it
Run togather with Azure/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-07-28 12:03:22 +03:00
Lior Avramov
069b3a4669
[memory_checker] Do not check memory usage of containers if docker daemon is not running (#11476)
Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit).
This PR fixes issue #11472.

Signed-off-by: liora liora@nvidia.com

Why I did it
In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check.

How I did it
Use systemctl is-active to check if Docker engine is still running.

How to verify it
Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log.

Which release branch to backport (provide reason below if selected)
The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
2022-07-27 16:18:36 -07:00
Nazarii Hnydyn
e4e3adcbc2
[ssip]: Update config generator (#10991)
- Why I did it

To implement Syslog Source IP feature
In order to include the following commit: 8e5d478 [ssip]: Add CLI (#2191)

- How I did it
Updated syslog config template
Advanced submodule sonic-utilities

ea11b22 [sonic-bootchart] add sonic-bootchart (#2195)
8e5d478 [ssip]: Add CLI (#2191)
1dacb7f Replace pyswsssdk with swsscommon (#2251)

- How to verify it
make configure PLATFORM=mellanox
make target/sonic-mellanox.bin

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2022-07-20 10:05:13 +03:00
Alexander Allen
429254cb2d
[arm] Refactor installer and build to allow arm builds targeted at grub platforms (#11341)
Refactors the SONiC Installer to support greater flexibility in building for a given architecture and bootloader. 

#### Why I did it

Currently the SONiC installer assumes that if a platform is ARM based that it uses the `uboot` bootloader and uses the `grub` bootloader otherwise. This is not a correct assumption to make as ARM is not strictly tied to uboot and x86 is not strictly tied to grub. 

#### How I did it

To implement this I introduce the following changes:
* Remove the different arch folders from the `installer/` directory
* Merge the generic components of the ARM and x86 installer into `installer/installer.sh`
* Refactor x86 + grub specific functions into `installer/default_platform.conf` 
* Modify installer to call `default_platform.conf` file and also call `platform/[platform]/patform.conf` file as well to override as needed
* Update references to the installer in the `build_image.sh` script
* Add `TARGET_BOOTLOADER` variable that is by default `uboot` for ARM devices and `grub` for x86 unless overridden in `platform/[platform]/rules.mk`
* Update bootloader logic in `build_debian.sh` to be based on `TARGET_BOOTLOADER` instead of `TARGET_ARCH` and to reference the grub package in a generic manner

#### How to verify it

This has been tested on a ARM test platform as well as on Mellanox amd64 switches as well to ensure there was no impact. 

#### Description for the changelog
[arm] Refactor installer and build to allow arm builds targeted at grub platforms

#### Link to config_db schema for YANG module changes

N/A
2022-07-12 15:00:57 -07:00
geogchen
5171589f4d
Add support to generate /e/n/i when there are multiple MGMT_INTERFACE (#11368)
Why I did it
Currently interfaces.j2 hardcodes to eth0 even when there are multiple interfaces in MGMT_INTERFACE. This change adds support to generate /e/n/i when there are multiple interfaces in MGMT_INTERFACE.

How I did it
By removing hardcoded eth0 when looping through MGMT_INTERFACE.

How to verify it
Verified through unit test.

Which release branch to backport (provide reason below if selected)
 201811
 201911
 202006
 202012
 202106
 202111
 202205
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)
2022-07-12 14:55:15 -07:00
tjchadaga
4f95974669
Add load_minigraph option to include traffic-shift-away during config migration (#11403) 2022-07-12 10:08:58 -07:00
geogchen
6a9c058a92
Revert "Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204)" (#11241)
This reverts commit 90a849ea85.

#### Why I did it
The interfaces unit test did not cover some of the conditions in interfaces.j2 that was changed in #11204.  Therefore reverting the change and add the tests before making the change to interfaces.j2.

#### How I did it
Git revert.

#### How to verify it

#### Which release branch to backport (provide reason below if selected)


- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205

#### Description for the changelog

#### Link to config_db schema for YANG module changes

#### A picture of a cute animal (not mandatory but encouraged)
2022-06-24 06:30:33 -07:00
geogchen
90a849ea85
Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204)
* [Interfaces] Modify template to support multiple management interfaces

* Modify minigraph to process interfaces in sorted order

Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>

* Add UT minigraph

Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>

* make case insensitve comparison

Signed-off-by: George Chen <gechen@microsoft.com>

* Use natural sort

Signed-off-by: George Chen <gechen@microsoft.com>

Co-authored-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>
2022-06-21 10:16:10 -07:00
jingwenxie
fdc65d7600
Remove minigraph loading in updategraph script (#11146)
Why I did it
Minigraph will be deprecated in the future. So minigraph related reload should be deleted.

How I did it
Remove unused load_minigraph
2022-06-21 08:57:57 +08:00
Stepan Blyshchak
42576d2664
[auto-ts] add memory check (#10433)
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to https://github.com/Azure/SONiC/pull/939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on https://github.com/Azure/sonic-utilities/pull/2116
2022-06-20 09:39:05 -07:00
yozhao101
241f4454b4
[memory_checker] Do not check memory usage of containers which are not created (#11129)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to fix an issue (#10088) by enhancing the script memory_checker.

Specifically, if container is not created successfully during device is booted/rebooted, then memory_checker do not need check its memory usage.

How I did it
In the script memory_checker, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage.

How to verify it
I tested on a lab device by following the steps:

Stops telemetry container with command sudo systemctl stop telemetry.service

Removes telemetry container with command docker rm telemetry

Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
2022-06-17 12:13:18 -07:00
jingwenxie
cca3b5be5b
Reduce logic in updategraph (#11010)
Why I did it
The dhcp_graph_url used by internal service is always set as "N/A". So we can make the updategraph logic short.

How I did it
Shorten 'if statement' logic for /tmp/dhcp_graph_url
2022-06-14 22:18:47 +08:00
abdosi
0285bfe42e
[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. (#10500)
What/Why I did:

Issue1: By setting up of ipvlan interface in interface-config.sh we are not tolerant to failures. Reason being interface-config.service is one-shot and do not have restart capability. 

Scenario: For example if let's say database service goes in fail state  then interface-services also gets failed because of dependency check but later database service gets restart but interface service will remain in stuck state and the ipvlan interface nevers get created.

Solution: Moved all the logic in database service from interface-config service which looks more align logically also since the namespace is created here and all the network setting (sysctl) are happening here.With this if database starts we recreate the interface.

Issue 2: Use of IPVLAN vs MACVLAN

Currently we are using ipvlan mode.  However above failure scenario is not handle correctly by ipvlan mode. Once the ipvlan interface is created and ip address assign to it and if we restart interface-config or database (new PR) service Linux Kernel gives error "Error: Address already assigned to an ipvlan device."  based on this:https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978Reason being if we do not do cleanup of ip address assignment (need to be unique for IPVLAN)  it remains in Kernel Database and never goes to free pool even though namespace is deleted. 

Solution: Considering this hard dependency of unique ip macvlan mode is better for us and since everything is managed by Linux Kernel and no dependency for on user configured IP address.

Issue3: Namespace database Service do not check reachability to Supervisor Redis Chassis   Server.

Currently there is no explicit check as we never do Redis PING from namespace to Supervisor Redis Chassis  Server. With this check it's possible we will start database and all other docker even though there is no connectivity and will hit the error/failure late in cycle

Solution: Added explicit PING from namespace that will check this reachability.

Issue 4:flushdb give exception when trying to accces Chassis Server DB over Unix Sokcet.

Solution: Handle gracefully via try..except and log the message.
2022-05-24 16:54:12 -07:00
Arun Saravanan Balachandran
f4b22f67a4
[initramfs]: SSD firmware upgrade in initramfs (#10748)
Why I did it
To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration.

How I did it
New option 'ssd-upgrader-part’ is introduced in grub command line, to indicate the partition and its filesystem type in which the SSD firmware updater is present. ‘ssd-upgrader-part’ syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4

A new initramfs script ‘ssd-upgrade’ is included in init-premount and it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part'

How to verify it
In SONiC, the SSD firmware updater is copied to “/host/” directory.
Fast-reboot is to be initiated with the ‘-u’ option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150)
After reboot, while booting into SONiC the SSD firmware updater will be executed in initramfs.
2022-05-12 08:11:02 -07:00
Junchao-Mellanox
681c24878b
Fix race condition between networking service and interface-config service (#10573)
Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it
Manual test.
2022-05-05 15:21:44 -07:00
yozhao101
e24fe9bc60
[Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
2022-04-20 18:08:06 -07:00
Vivek R
ed14eb5263
[interfaces-config] "main exception: cannot find interfaces: eth0" error log avoided (#10463)
- Why I did it
Fixes #9628
During bootup, this error log is seen

Dec 22 04:26:29 sonic interfaces-config.sh[2546]: error: main exception: cannot find interfaces: eth0 (interface was probably never up ?)
This is of non-functional nature and doesn't affect the flow.

- How I did it
Dont take the ifdown if not needed

- How to verify it
Verified during reboot. Log did not appear and IP was acquired on eth0 as expected

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2022-04-06 16:59:47 +03:00
Santhosh Kumar T
e2502edefd
Refactoring DELL platform init to reduce rc.local processing time porting changes in master (#10318)
Why I did it
To reduce the processing time of rc.local, refactoring s6100 platform initialization.
Porting changes from 202012 branch [202012] Refactoring DELL platform init to reduce rc.local processing time #10171
2022-03-24 11:14:37 -07:00
xumia
1017ee6002
[Build]: Use one debian mirror config (#10274)
Why I did it
Use one debian mirror config.
The empty config in https://github.com/Azure/sonic-buildimage/blob/master/files/image_config/apt/sources.list overrides the file https://github.com/Azure/sonic-buildimage/blob/master/files/apt/sources.list.amd64 (armhf/arm64), it does not make sense.
All the content in files/image_config/apt is no use, any one wants to add mirror config, please add in files/apt.

How I did it
Remove files/image_config/apt and the reference.
2022-03-21 16:47:20 +08:00
Stepan Blyshchak
2919b4820f
[hostcfgd] record feature state in STATE DB (#9842)
- Why I did it
To implement blocking feature state change.

- How I did it
Record the actual feature state in STATE DB from hostcfg.

- How to verify it
UT + verification by running on the switch and checking STATE DB.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-03-14 13:45:27 +02:00
Samuel Angebault
8d419ca2c5
[Arista] Remove arista.log from rsyslog default logrotate (#9731)
Why I did it
In parallel of this change Arista added a custom logrotate configuration as part of its driver library.
Having 2 logrotate configuration for the same log file triggers an issue.

Fixes aristanetworks/sonic#38

How I did it
Arista merged a few changes in sonic-buildimage which added a logrotate configuration aristanetworks/sonic@e43c797
It is therefore the right path to remove the arista.log line from the logrotate.d/rsyslog configuration.

How to verify it
Logrotate works without any error message, arista log rotation happens and arista daemons still append logs once file was truncated.
2022-03-11 08:09:07 -08:00
Marty Y. Lok
c40f04f0e2
[chassis][supervisor]monit container-checker failed due to unexpected "database-chassis" docker running #9042 (#9043)
Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042

How I did it
Added database-chassis to the always running docker list if platform is supervisor card.

How to verify it
Execute the CLI command "sudo monit status container_checker"


Signed-off-by: mlok <marty.lok@nokia.com>
2022-03-03 17:56:08 -08:00
wenyiz2021
2d0b063191
Update container_checker for multi-asic devices when state is 'always_enabled' (#10067)
* Update container_checker for multi-asic devices 

Update container_checker for multi-asic devices to add database containers in always_running_containers. 
Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db.

* Update container_checker

Update an indent
2022-02-23 18:06:30 -08:00
Stepan Blyshchak
fb752a4ae5
[rsyslog.j2] fix typo in VAR_LOG_SIZE_KB (#9954)
This issue causes negative threshold value and thus deleting log files even when there is enough space.

This issue causes negative threshold value and thus deleting log files even when there is enough space.

- Why I did it
To fix an issue when log files get deleted even if there is enough space.

- How I did it
Fixed an typo.

- How to verify it
Run the portion of the script that calculates threshold, see that the threshold is calculated correctly.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-02-17 10:16:44 +02:00
Prince George
ff14aebef9
Close console session due to user inactivity (#9890)
Signed-off-by: Prince George <prgeor@microsoft.com>
2022-02-02 09:41:21 +05:30
dflynn-Nokia
b6939b9927
[firsttime boot] suppress error message on platforms not supporting kdump (#9521)
Why I did it
Eliminate benign firsttime boot error reported when running on platforms that do not support kdump.

How I did it
Change rc.local to check for presence of the file /etc/default/kdump-tools before referencing it.

How to verify it
Install a new image on an armhf or arm64 platform and check for a failed reference to /etc/default/kdump-tools on firsttime boot.
2022-01-20 18:27:10 -08:00
liuh-80
f166b991a7
[image]: Prevent radius passkey and snmp community string into syslog. (#9727)
[image]: Prevent radius passkey and snmp community string into syslog.  (#9727)

#### Why I did it
    Prevent radius passkey and snmp community string into syslog.

#### How I did it
    Add radius and snmp config command to PASSWD_CMDS

#### How to verify it
    Run and pass all UTs.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106

#### Description for the changelog
    Add radius and snmp config command to PASSWD_CMDS to prevent radius passkey and snmp community string into syslog.

#### A picture of a cute animal (not mandatory but encouraged)
2022-01-17 16:26:22 +08:00
Sudharsan Dhamal Gopalarathnam
bd0a19aa17
[rsyslog]Setting log file size to 16Mb (#9504)
Why I did it
The existing log file size in sonic is 1 Mb. Over a period of time this leads to huge number of log files which becomes difficult for monitoring applications to handle.
Instead of large number of small files, the size of the log file is not set to 16 Mb which reduces the number of files over a period of time.

How I did it
Changed the size parameter and related macros in logrotate config for rsyslog

How to verify it
Execute logrotate manually and verify the limit when the file gets rotated.

Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
2022-01-14 10:24:07 -08:00
noaOrMlnx
0908f9ec49
[CoPP] Add always_enabled field (#9302)
*Add the "always_enabled" field to copp_cfg.j2 file, in order to allow traps without an entry in features table, to be installed automatically.
2021-11-30 11:04:15 -08:00
Brian O'Connor
002827f08e
[PINS] Add APPL_STATE_DB and response path log (#9082)
- Add APPL_STATE_DB to database_config.json
- Clear APPL_STATE_DB during SwSS container restarts
- Add response path log file to logrotate config: responsepublisher.rec

Co-authored-by: PINS Working Group <sonic-pins-subgroup@googlegroups.com>
2021-11-24 10:31:06 -08:00
Renuka Manavalan
a685fe1765
add arista.log to logrotate (#9245) 2021-11-15 07:29:30 -08:00
Guohan Lu
5f11eb320e Revert "sysready (#8889)"
This reverts commit d7e5372e54.
2021-11-10 15:36:20 -08:00
LuiSzee
5b284767f6 Update Centec platform support for Bullseye and 5.10 kernel (#7)
1. Fix build for armhf and arm64
2. upgrade centec tsingma bsp support to 5.10 kernel
3. modify centec platform driver for linux 5.10

Co-authored-by: Shi Lei <shil@centecnetworks.com>
2021-11-10 15:27:22 -08:00
Saikrishna Arcot
b8a7a6355b Update the base Debian system installation script to get Bullseye
Python 2 is no longer available, so remove those packages, and remove
the pip2 commands. For picocom and systemd, just install from the
regular repo, since there's no backports yet.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-11-10 15:27:22 -08:00
Senthil Kumar Guruswamy
d7e5372e54
sysready (#8889) 2021-11-10 14:52:52 -08:00
abdosi
ea91a72b79
[multi-asic] fix syslog not getting generated. (#9160)
Fixes #9159
2021-11-03 18:29:09 -07:00
Maxime Lorrillere
81f4fca3dc
Allow database instances on multi-asic linecards to connect to chassis DB (#8583)
Add code to interfaces-config.sh to configure eth1 in multi-asic
containers so that they can access midplane subnet.

Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
2021-10-26 18:27:09 -07:00
Ying Xie
638c287837
[copp] bind copp-config.service to sonic.target (#8969)
copp-config service needs to be started after sonic.target so that it could
render the copp-config with the latest information.

It also needs to be restarted when config reload or load_minigraph is invoked.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2021-10-13 21:07:44 -07:00
abdosi
13ec43bc68
[baseimage]: Logrotate for wtmp and btmp files. (#8743)
Added logrotate file for wtmp and btmp to override default conf and set size cap as 100K as done in 
PR: #865. For buster this is control by separate file wtmp and btmp.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2021-09-15 23:28:27 -07:00
Sudharsan Dhamal Gopalarathnam
db529af203
Removing execute permission from copp config file (#8680)
*Removed execute permissions from the systemd copp-config.service file. 
Without this we will get a warning: "Configuration file /lib/systemd/system/copp-config.service is marked executable. Please remove executable permission bits. Proceeding anyway."
2021-09-13 09:10:21 -07:00
Ying Xie
41643a9729
[202012][fstrim] delay fstrim timer after sonic.target (#8737)
Why I did it
fstrim has dependency on pmon docker.

How I did it
start fstrim timer after sonic.target.

How to verify it
local test and PR test.

Signed-off-by: Ying Xie ying.xie@microsoft.com
2021-09-13 07:37:46 -07:00
dflynn-Nokia
7bae388e2f
[Nokia ixs7215] Add support for changing the console baud rate (#8595)
This commit adds support for changing the default console baud rate configured
within the U-Boot bootloader. That default baud rate is exposed via the value
of the U-Boot 'baudrate' environment variable. This commit removes logic that
hardcoded the console baud rate to 115200 and instead ensures that the U-Boot
'baudrate' variable is always used when constructing the Linux kernel boot
arguments used when booting Sonic.

A change is also made to rc.local to ensure that the specified baud rate is set
correctly in the serial getty service.
2021-08-26 07:14:34 -07:00
byu343
cdfb4855dc
[macsec] Add eapol to copp config (#8416)
This change enables the control packets of MACsec to be processed by CPU.
2021-08-23 18:56:23 -07:00
Volodymyr Samotiy
e3a30deea9
[monit] Periodically monitor VNET route consistency (#8266)
*To run VNET route consistency check periodically.
*For any failure, the monit will raise alert based on return code.
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-08-19 16:29:25 -07:00
abdosi
2348794ef0
Enable sysctl fib_multipath_use_neigh (#8502)
Enable fib_multipath_use_neigh for v4
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Why I did:
This is helpful if the neighbor are not directly connected then Kernel forward to unreachable neighbor option. With this option forwarding using neighbor state to be valid.
2021-08-18 15:53:17 -07:00
Saikrishna Arcot
c8b5daed27 Upgrade to ifupdown2 3.0.0 with a patch to fix using broadcast addresses
In version 3.0.0, If a broadcast address is specified in
/etc/network/interfaces, then when ifup is run, it will fail with an
error saying `'str' object has no attribute 'packed'`. This appears to
be because it expects all attributes for an interface to be "packable"
into a compact binary representation. However, it doesn't actually
convert the broadcast address into an IPNetwork object (other addresses
are handled).

Therefore, convert the broadcast address it reads in from a str to an
IPNetwork object.

Also explicitly specify the scope of the loopback address in
/etc/network/interfaces as host scope. Otherwise, it will get added as
global scope by default. As part of this, use JSON to parse ip's output
instead of text, for robustness.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2021-08-12 23:18:01 -07:00
Blueve
3da6f12b0b
[port_config] Introduce ad-hoc mport_config.json file (#8066)
Signed-off-by: Jing Kan jika@microsoft.com
2021-07-15 08:56:35 +08:00
Stepan Blyshchak
3a2b8c6ba5
[SONiC Application Extension] support warm/fast reboot for extension packages (#7286)
#### Why I did it

I made this change to support warm/fast reboot for SONiC extension packages as per HLD Azure/SONiC#682.

#### How I did it

I extended manifest.json.j2 with new warm/fast reboot related fields and also extended sonic_debian_extension.j2 script template to generate the shutdown order files for warm and fast reboot.
2021-07-11 06:58:05 -07:00
shlomibitton
776a446d76
[dhcp_relay] Disable dhcp_relay for ToRRouter switches type by the feature manager (#7789)
- Why I did it
Currently dhcp packets are disabled by the COPP manager for non ToRRouter type switches.
Even if the feature is enabled, DHCP packets wont hook to the CPU since the COPP manager will not trap this packets.
This change is to disable dhcp_relay by default for non ToRRouter switches from init_cfg.json.
With this approach, if the user want to enable the feature for non ToRRouter switches, manual enablement is required by the 'feature' configuration.
This is to keep the current approach for MSFT production issue with dhcp relay for non ToRRouter switched and allow the user to decide if to use it or not.

- How I did it
Configure dhcp_relay 'disabled' by default on init_cfg.json for non ToRRouter switches.
Remove the exclusion of dhcp packets on copp_cfg.json

- How to verify it
Enable dhcp_relay feature on a non ToRRouter switch.
Unit-tests modified so the default values on mocked CONFIG DB in 'test_vectors.py' for dhcp_relay will be 'disabled'.
This is by the change for 'init_cfg.json.j2'.
For ToRRouter the state will change from 'disabled' to 'enabled'.
Another test case added for a 'ToR' switch type, this is to test the state is 'enabled' if the user configured it to be so.
2021-07-08 09:10:46 +03:00
rajendra-dendukuri
f4b0c8fe4e
[kdump] Fix kdump error message when a reboot is issued (#7985)
dash doesn't support += operation to append to a variable's value. Use KDUMP_CMDLINE_APPEND="${KDUMP_CMDLINE_APPEND} " instead

The below error message is seen when a reboot is issued.

[ 342.439096] kdump-tools[13655]: /etc/init.d/kdump-tools: 117: /etc/default/kdump-tools: KDUMP_CMDLINE_APPEND+= panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=x86_64-accton_as7326_56x-r0: not found
2021-07-01 11:52:38 -07:00
xumia
5c503b81ae
Fix vtysh shell-ingestion security issue (#7759)
Fix vtysh shell-ingestion security issue
Only expose the limited parameters of the command vtysh show.
2021-06-28 09:57:08 +08:00
Sujin Kang
ecc5073731
Support multiple pcie configuration file and change the pcie status table name to match with pcied changes (#7886)
Why I did it
Support multiple pcie configuration file and change the pcie status table name
This is to match with below two PRs.
Azure/sonic-platform-common#195
Azure/sonic-platform-daemons#189

How I did it
Check pcie configuration file with wild card and change the device status table name

How to verify it
Restart with changes and see if the pcie check works as expected.
2021-06-16 16:05:48 -07:00
yozhao101
1a3cab43ac
[Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.

How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.

How to verify it
I verified this on the device str-7260cx3-acs-1.
2021-06-04 10:16:53 -07:00
Renuka Manavalan
73447efc31
Add service to restore TACACS from old config (#7560)
Why I did it
In upgrade scenarios, where config_db.json is not carry forwarded to new image, it could be left w/o TACACS credentials.
Added a service to trigger 5 minutes after boot and restore TACACS, if /etc/sonic/old_config/tacacs.json is present.

How I did it
By adding a service, that would fire 5 mins after boot.
This service apply tacacs if available.

How to verify it
Upgrade and watch status of tacacs.timer & tacacs.service
You may create /etc/sonic/old_config/tacacs.json, with updated credentials
(before 5mins after boot) and see that appears in config & persisted too.

Which release branch to backport (provide reason below if selected)
 201911
 202006
 202012
2021-06-03 20:07:17 -07:00
Andriy Kokhan
6931a45ecf
Fixed typos in config-setup (#7754)
Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com>
2021-06-03 08:59:38 -07:00
yozhao101
37863ac854
[Monit] Restart telemetry container if memory usage is beyond the threshold (#7645)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.

How I did it
I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container.

How to verify it
I verified this implementation on device str-7260cx3-acs-1.
2021-05-28 11:13:44 -07:00
Renuka Manavalan
2cd61bc136
Invoke disk check periodically. (#7374)
Why I did it
Helps with periodic scan of disk for RO state.
If found, this script makes transient fix and raise error message.
2021-05-26 17:59:08 -07:00
shlomibitton
9930e738fe
Remove 'vm.panic_on_oom=1' (#7678)
#### Why I did it
If a process limits using nodes by mempolicy/cpusets, and those nodes become memory exhaustion status, one process may be killed by oom-killer.
No panic occurs in this case, because other node's memory may be free.
This means system total status may be not fatal yet.

#### How I did it
Remove 'vm.panic_on_oom=1' kernel flag from 'vmcore-sysctl.conf '
2021-05-24 17:23:49 -07:00
Alexander Allen
da7533aad4
[ntp] Fix ntp.conf template to allow setting of source port in CONFIG_DB (#7586)
Why I did it
Currently, there is a bug in the ntp.conf jinja2 template where it will ignore the src_intf directive in CONFIG_DB if there are multiple IP addresses associated with an interface. This code change fixes that bug and allows the template to select the correct source interface for NTP.

How I did it
I did this by modifying the macro in ntp.conf.j2 which determines if there is an ip address associated with an interface to set a state variable when it detects a valid interface entry in CONFIG_DB instead of outputting "true" directly (which could result in multiple "trues" outputted for interfaces with multiple valid IP addresses).

How to verify it
Add two ipv4 addresses to an interface in SONiC

Add the following configuration to config_db.json

{
"NTP": {
    "global": {
        "src_intf": "Ethernet1"
        }
    }
}
Replace Ethernet1 with the interface name of the one you assigned the IP addresses to.

Run sudo config reload -y

Open /etc/ntp.conf and verify that the following line exists

...
interface listen Ethernet1
...
The interface specified should be the one set in the previous steps.

Description for the changelog
[ntp] Fix ntp.conf template to allow setting of source port in CONFIG_DB
2021-05-23 13:40:43 -07:00
Sujin Kang
c6462577a9
add config-setup.service as dependency for pcie-check.service (#7599)
Why I did it
start pcie-check.service after config-setup.service since pcie_util depends on device_info which is available with config db metadata.

How I did it
Add config-setup.service as a dependency of pcie-check.service

How to verify it
Upon reboot, check if the pcie-check.sh throws the platform api error which is dependent on DEVICE_METADATA
2021-05-18 14:19:02 -07:00
Renuka Manavalan
7a575b3d00
[container_checker] Use Feature table to get running containers (#7474)
Why I did it
Finding running containers through "docker ps" breaks when kubernetes deploys container, as the names are mangled.

How I did it
The data is is available from FEATURE table, which takes care of kubernetes deployment too.

How to verify it
Deploy a feature via kubernetes and don't expect error from container_check.
2021-05-07 08:42:15 -07:00
xumia
56bdd750ab
Support readonly vtysh for sudoers (#7383)
Why I did it
Support readonly version of the command vtysh

How I did it
Check if the command starting with "show", and verify only contains single command in script.
2021-04-25 16:32:02 +08:00
Stepan Blyshchak
ae339c95d2
[systemd] disable default systemd udev rules for interfaces (#7369)
Fix #7364

99-default.link - was always in SONiC, but previous systemd (<247) had an issue and it did not work due to issue systemd/systemd#3374. Now systemd 247 works.

However, such policy overrides teamd provided mac address which causes teamd netdev to use a random mac
address. Therefore, needs to be disabled.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-04-21 17:50:00 -07:00
Kuanyu Chen
01f2b5f250
[config-setup]: Fix a bug in checking if updategraph is enabled (#7093)
Encounter error during "config-setup boot" if the updategraph is enabled.

How I did it
Correct the code inside the config-setup script.
Remove the space between the assignment operator.

How to verify it
Remove the /etc/sonic/config_db.json and reboot the device.
Originally, it will return following error after boot up.
rv: command not found
After modification, it can correctly parse the status of updategraph without error.
2021-04-19 11:40:52 -07:00
jmmikkel
43342b33b8
[chassis] Add templates and code to support VoQ chassis iBGP peers (#5622)
This commit has following changes:

* Add templates and code to support VoQ chassis iBGP peers

* Add support to convert a new VoQChassisInternal element in the
   BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR 
   table in CONFIG_DB.
* Add a new set of "voq_chassis" templates to docker-fpm-frr
* Add a new BGP peer manager to bgpcfgd to add neighbors from the
  BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates.
* Add a test case for minigraph.py, making sure the VoQChassisInternal
  element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its
  value is "false".
* Add a set of test cases for the new voq_chassis templates in
  sonic-bgpcfgd tests.

Note that the templates expect the new
"bgp bestpath peer-type multipath-relax" bgpd configuration to be
available.

Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>
2021-04-16 11:11:32 -07:00
yozhao101
2737c9681f
[container_checker] Exclude the 'always_disabled' container from expected running container list (#7217)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Since we introduced a new value always_disabled for the state field in FEATURE table, the expected running container list
should exclude the always_diabled containers. This bug was found by nightly test and posted at here: issue. This PR fixes #7210.

How I did it
I added a logic condition to decide whether the value of state field of a container was always_disabled or not.

How to verify it
I verified this on the device str-dx010-acs-1.

Which release branch to backport (provide reason below if selected)
 201811
 201911
 202006
[ x] 202012
2021-04-02 08:05:46 -07:00
Stepan Blyshchak
e179ec2fae
[services] introduce sonic.target (#5705)
- Why I did it
Group all SONiC services together and able to manage them together. Will be used in config reload command as much simpler and generic way to restart services.

- How I did it
Add services to sonic.target

- How to verify it
Together with Azure/sonic-utilities#1199
config reload -y

Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
2021-02-25 14:26:24 +02:00
arlakshm
f77157f09d
[baseimage] add ipintutil in sudoer file (#6845)
show ip interfaces is enhanced recently to support multi ASIC platforms in this PR- https://github.com/Azure/sonic-utilities/pull/1396 .
The ipintutil script as to run as sudo user, to get the ip interface from each namespace.
Add this script to the sudoer file so that show ip interface command is available for user with read-only permissions

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-02-22 23:34:28 -08:00
Sujin Kang
d5238ae8dd
[pcie.yaml] Move pcie configuration file path to platform directory (#6475)
- Why I did it
The pcie configuration file location is under plugin directory not under platform directory.
#6437

- How I did it

Move all pcie.yaml configuration file from plugin to platform directory.
Remove unnecessary timer to start pcie-check.service
Move pcie-check.service to sonic-host-services
- How to verify it
Verify on the device
2021-02-21 08:27:37 -08:00
SuvarnaMeenakshi
5a49a0f499
[multi-asic][vs]: Update topology script to retrieve hwsku from minigraph (#6219)
Update topology script to retrieve hwsku from minigraph
if hwsku information is not available in config_db.
Fix clean up of interfaces in msft_multi_asic_vs hwsku
topology script.
- Why I did it
When bringing up multi-asic VS switch, topology service is started during boot up.
Topology service starts a shell script which runs the topology script present in /usr/share/sonic/device// directory. To invoke hwsku specific script, the topology script tries to retrieve hwsku information from config_db.
During initial boot up config_db might not be populated. In order to start topology service before config_db is updated,
update topology script to get hwsku information from minigraph.xml if it is available.
This will be helpful to bring up multi-asic VS testbed by loading minigraph and starting topology service.
- How I did it
Update topology.sh script to retrieve hwsku information from minigraph.xml.
Fix clean up function on msft_multi_asic_vs toplogy script.
- How to verify it
single-asic VS - no change; topology service is only enabled for multi-asic VS.
multi-asic VS - Bring up multi-asic VS image, copy minigraph to vs image, start topology service. Topology service should be successful.
to test clean up function fix, start topology service - make sure interfaces are created and moved to the right namespaces.
stop topology service - make sure namespace do not have any interface and all front end interfaces are present in default namespace.
2021-02-18 22:02:29 -08:00
Joe LeVeque
820d350301
[pcie-check] Update underlying pcieutil command and add to sudoers file (#6682)
- Why I did it

As of Azure/sonic-utilities#1297, subcommands of pcieutil have changed to remove the redundant pcie- prefix. This PR adapts calling applications (pcie-check) to the new syntax.

Resolves #6676

- How I did it

Remove pcie- prefix from pcieutil subcommands in calling applications
Also add pcieutil * to sudoers file, as pcieutil requires elevated permissions
2021-02-04 12:14:08 -08:00
Samuel Angebault
0c4d4ace76
[kdump] Fix OOM events in crashkernel (#6447)
A few issues where discovered with crashkernel on Arista platforms.

1) platforms using `docker_inram=on` would end up OOM in kdump environment.
This happens because the same initramfs is used by SONiC and the crashkernel.
With `docker_inram=on` the `dockerfs.tar.gz` is extracted in a `tmpfs` created for the occasion.
Since `dockerfs.tar.gz` weights more than 1.5G, it doesn't fit into the kdump environment and ends up OOM.
This OOM event can in turn trigger a panic.

2) Arista platforms with `secureboot` enabled would fail to load the crashkernel because the kernel parameter would be discarded on boot.
This happens because the `boot0` in secureboot mode is strict about kernel parameter injection.

3) The secureboot path allowlist would remove kernel crash reports.

4) The kdump service would fail on Arista products since `/boot/` is empty in `secureboot`

**- How I did it**

1) To prevent an OOM event in the crashkernel the fix is to avoid the codepaths in `union-mount` that create tmpfs and populate them. Some more codepath specific to Arista devices are also skipped to make the kdump process faster.
This relies on detecting that the initramfs is starting in a kdump environment and skipping some initialization.
The `/usr/sbin/kdump-config` tool appends a few kernel cmdline arguments when loading the crashkernel.
The most unique one is `systemd.unit=kdump-tools.service` which is used in a few initramfs hooks to set `in_kdump`.

2) To allow `kdump` to work in `secureboot` environment the cmdline generation in boot0 was slightly modified.
The codepath to load kernel parameters changed by SONiC is now running for booting in secure mode.
It was altered to prevent an append only behavior which would grow the `kernel-cmdline` at every reboot.
This ever growing behavior would lead `kexec` to fail to load the kernel due to a too long cmdline.

3) To get the kernel crash under /var/crash this path has to be added to `allowlist_paths`

4) The `/host/image-XXX/boot` folder is now populated in `secureboot` mode but not used.

**- How to verify it**

Regular boot:
 - enable kdump
 - enable docker_inram=on via kernel-params
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness OOM events on the console
 - after: crash kernel works and crash available under /var/crash

Secure boot:
 - enable kdump
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness no kdump
 - after: crash kernel works and crash available under /var/crash


Co-authored-by: Boyang Yu <byu@arista.com>
2021-02-02 01:55:09 -08:00
arlakshm
b5225407ef
[baseimage]: add docker ps to the sudoer file (#6604)
fixes Azure/sonic-utilities#1389

With the recent changes in sudoer files. The  show commands fails for the read-only users. 
The problem here is the 'docker ps' is failing in the function [get_routing_stack()](8a1109ed30/show/main.py (L54)) therefore all the CLI commands are failing.

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-01-29 08:16:32 -08:00
arlakshm
ff8cc49b18
[multi asic] add ip netns identify command to sudoer (#6591)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>

- Why I did it
The command sudo ip netns identify <pid> is used in function get_current_namespace
to check in the cli command is running in host context or within a namespace.

This function is used for every CLI command and command sudo ip netns identify <pid> needs to be added in sudoer files to allow users with RO access to run show cli commands

This problem is not there on single asic platforms.

- How I did it
Add ip netns identify [0-9]* to sudoers file.
2021-01-28 23:12:01 -08:00